Operations Overview

Runbooks

Document               Description
Deployment Guide       How to deploy changes to infrastructure
Emergency Access       SSM/IAP backdoor procedures
OOB & Physical Access  Serial console and physical access for Owl
Troubleshooting        Common issues and solutions
Disaster Recovery      Owl gateway DR procedures and drill checklist

Quick Reference

SSH Access

# Cloud instances
ssh joe@pluto       # AWS production
ssh joe@dumbo       # GCE general
ssh joe@bogart      # GCE PowerDNS
ssh joe@mickey      # AWS dev (ephemeral)

# Gateways (via ZeroTier)
ssh joe@192.168.194.10  # Owl
ssh joe@10.15.0.1       # Blue (from Blue site)

Emergency Access

# AWS (SSM)
aws ssm start-session --target i-05e7dd5e009d6d766 --region us-west-2

# GCE (IAP)
gcloud compute ssh dumbo --zone=us-central1-a --tunnel-through-iap

Ansible

# Full deployment
ansible-playbook -i inventory/production.yml playbooks/site.yml --limit HOST

# Specific role (run its playbook, e.g. base)
ansible-playbook -i inventory/production.yml playbooks/base.yml --limit HOST
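
Both commands above rely on `--limit HOST` to keep a run scoped to one machine. A minimal sketch of a wrapper that refuses to run without a host is shown here. The function name `deploy_host` and the `DRY_RUN` variable are hypothetical, not part of the repository. The inventory path and `playbooks/` directory are taken from the commands above.

```shell
# Hypothetical helper: never run a playbook without an explicit host limit.
# Set DRY_RUN=1 to print the command instead of executing it.
deploy_host() {
    playbook="$1"
    host="$2"
    if [ -z "$playbook" ] || [ -z "$host" ]; then
        echo "usage: deploy_host PLAYBOOK HOST [extra ansible-playbook args]" >&2
        return 2
    fi
    shift 2
    # Rebuild the argument list as the full ansible-playbook command line.
    set -- ansible-playbook -i inventory/production.yml \
        "playbooks/$playbook" --limit "$host" "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, `deploy_host site.yml mickey --check --diff` would preview changes on mickey using Ansible's check mode before a real run.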

Terraform

# Plan changes (always use -target)
terraform plan -target=aws_instance.pluto

# Apply changes
terraform apply -target=aws_instance.pluto
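
Since the rule is to always pass `-target`, one way to enforce it is a small shell guard in front of `terraform`. This is a sketch only: the function name `tf` and the `DRY_RUN` variable are assumptions, and it checks only for the presence of a `-target` flag, not whether the address is valid.

```shell
# Hypothetical guard: reject plan/apply invocations that lack -target.
# Other subcommands pass through unchanged. DRY_RUN=1 prints instead of running.
tf() {
    case "$1" in
        plan|apply)
            has_target=0
            for arg in "$@"; do
                case "$arg" in
                    -target=*|-target) has_target=1 ;;
                esac
            done
            if [ "$has_target" -ne 1 ]; then
                echo "refusing to run 'terraform $1' without -target" >&2
                return 1
            fi
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "terraform $*"
    else
        terraform "$@"
    fi
}
```

With this in place, `tf apply` fails fast, while `tf apply -target=aws_instance.pluto` behaves like the command above.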

Daily Operations

Check Host Status

# ZeroTier connectivity
zerotier-cli listnetworks
zerotier-cli listpeers

# Service status (Linux)
sudo systemctl status zerotier-one
sudo systemctl status fail2ban
sudo systemctl status cloudflared

# Service status (OPNsense)
sudo configctl service status
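
The Linux service checks above can be rolled into one pass. The sketch below assumes a systemd host and uses `systemctl is-active --quiet`, which exits non-zero when a unit is not active. The function name `check_services` is hypothetical.

```shell
# Report each unit as ok or DOWN; return non-zero if any unit is not active.
check_services() {
    failed=0
    for svc in "$@"; do
        if systemctl is-active --quiet "$svc"; then
            echo "ok   $svc"
        else
            echo "DOWN $svc"
            failed=1
        fi
    done
    return "$failed"
}
```

A typical call would be `check_services zerotier-one fail2ban cloudflared`. This does not cover OPNsense, which is checked with `configctl` as shown above.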

View Logs

# fail2ban
sudo journalctl -u fail2ban -f

# ZeroTier
sudo journalctl -u zerotier-one -f

# SSH auth (the unit is named ssh on Debian/Ubuntu, sshd on RHEL-family)
sudo journalctl -u sshd -f

DNS Operations

# Test internal DNS
dig @10.10.10.10 owl.scandora.net

# Test external DNS
dig @1.1.1.1 owl.scandora.net

# Update PowerDNS record
curl -X PATCH "http://10.10.10.10:8081/api/v1/servers/localhost/zones/scandora.net." \
  -H "X-API-Key: $KEY" \
  -d '{"rrsets":[...]}'
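
After updating a record, it can help to compare what the internal and external resolvers return. The sketch below assumes both views are expected to serve the same records. If the zone is split-horizon, a mismatch may be intentional. The function name `dns_compare` and the `DIG` override are hypothetical; the resolver addresses come from the commands above.

```shell
# Compare answers from the internal resolver and a public resolver.
# DIG can be overridden so the check can be exercised without network access.
dns_compare() {
    name="$1"
    internal="$(${DIG:-dig} +short @10.10.10.10 "$name" | sort)"
    external="$(${DIG:-dig} +short @1.1.1.1 "$name" | sort)"
    if [ -z "$internal" ]; then
        echo "NO ANSWER: $name from internal resolver"
        return 1
    fi
    if [ "$internal" = "$external" ]; then
        echo "match: $name -> $internal"
    else
        echo "MISMATCH: $name internal=[$internal] external=[$external]"
        return 1
    fi
}
```

For example, `dns_compare owl.scandora.net` returns non-zero when the two answers differ, so it can gate a follow-up step.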

Change Management

Before Making Changes

  1. Read existing code - Understand what you're modifying
  2. Test in dev - Use mickey or OPNsense dev VM
  3. Backup - Snapshot the AMI (AWS hosts) or export config.xml (OPNsense)
  4. Document - Update relevant docs
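
For the AMI backup step, a minimal sketch using `aws ec2 create-image` follows. The function name `snapshot_ami`, the naming scheme, and the `DRY_RUN` variable are hypothetical. The region is assumed to match the one used for SSM above. Note that `--no-reboot` avoids downtime but does not guarantee filesystem consistency of the image.

```shell
# Hypothetical helper: snapshot an instance to an AMI before a change.
# Usage: snapshot_ami INSTANCE_ID LABEL [STAMP]
# STAMP defaults to the current UTC time. DRY_RUN=1 prints the command only.
snapshot_ami() {
    instance_id="$1"
    label="$2"
    stamp="${3:-$(date -u +%Y%m%dT%H%M%SZ)}"
    set -- aws ec2 create-image --region us-west-2 \
        --instance-id "$instance_id" \
        --name "${label}-pre-change-${stamp}" \
        --no-reboot
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```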

After Making Changes

  1. Verify - Test the change works
  2. Commit - Git commit with descriptive message
  3. Push - Push to remote after milestones
  4. Reboot test - Where appropriate, reboot to confirm the change persists

Maintenance Windows

No formal maintenance windows - changes are made as needed with appropriate testing.

For multi-host changes, roll out in this order, lowest risk first:

  1. mickey (dev) - Test first
  2. bogart (untrusted) - Low-risk
  3. dumbo (GCE) - Secondary production
  4. pluto (AWS) - Primary production
  5. Gateways (last) - Most impact
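
The ordering above can be sketched as a loop that stops at the first failure, so a bad change never reaches a later host. The names `ROLLOUT_ORDER` and `rollout` are hypothetical. The gateways are left out of the list because their inventory names are not given here; they would be handled last, by hand.

```shell
# Cloud hosts in rollout order: dev first, primary production last.
ROLLOUT_ORDER="mickey bogart dumbo pluto"

# Run the given command once per host, appending the host name as the
# final argument. Stop at the first failure and leave later hosts untouched.
rollout() {
    if [ "$#" -eq 0 ]; then
        echo "usage: rollout COMMAND [ARGS...]" >&2
        return 2
    fi
    for host in $ROLLOUT_ORDER; do
        echo "deploying to $host"
        if ! "$@" "$host"; then
            echo "stopped: $host failed, later hosts untouched" >&2
            return 1
        fi
    done
}
```

For example, `rollout ansible-playbook -i inventory/production.yml playbooks/site.yml --limit` would run the full deployment against each host in turn.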