Disaster Recovery — Owl Gateway

Procedures for recovering the production Owl gateway (DEC700, Urbandale, Iowa) from partial or total failure.

Last tested: Not yet drilled — schedule first drill by 2026-03-15.

Severity Levels

| Level | Symptoms | Response |
|-------|----------|----------|
| S1 — Service degraded | Slow DNS, partial packet loss, monitoring alerts | Diagnose remotely, restart services |
| S2 — Remote access lost | ZeroTier down, SSH timeout | WAN SSH fallback, API checks |
| S3 — Gateway unresponsive | No SSH (ZeroTier or WAN), no web UI, LAN clients offline | Physical console at Iowa site |
| S4 — Hardware failure | S3 + physical console unresponsive, POST failure, power loss | Replace hardware, restore config |
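The severity decision reduces to which remote path still answers. As a minimal sketch (illustrative helper, not an existing repo script), a triage pass from luna could map reachability of the ZeroTier and WAN addresses to a starting level:

```shell
# Illustrative triage helper (not an existing repo script): map
# reachability of the ZeroTier and WAN addresses to a starting severity.
classify() {
  zt_ok=$1   # 0 = ZeroTier IP (192.168.194.10) reachable
  wan_ok=$2  # 0 = WAN IP (46.110.77.34) reachable
  if [ "$zt_ok" -eq 0 ]; then
    echo "S1: remote access intact; diagnose over ZeroTier"
  elif [ "$wan_ok" -eq 0 ]; then
    echo "S2: overlay down; fall back to WAN SSH"
  else
    echo "S3/S4: no remote path; physical console required"
  fi
}

# Example probes (run from luna):
#   ping -c 1 192.168.194.10 >/dev/null 2>&1; zt=$?
#   ping -c 1 46.110.77.34  >/dev/null 2>&1; wan=$?
#   classify "$zt" "$wan"
```

S3 vs. S4 still requires eyes on the hardware; the helper only tells you when to stop trying remotely.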

Impact Assessment

When Owl goes down, the following are affected:

| Dependent | Impact |
|-----------|--------|
| Owl LAN clients (10.7.0.0/16) | No internet, no DHCP, no DNS |
| 28 static DHCP devices | APs, switches, Home Assistant, etc. |
| Triton (10.7.1.20, RPi 5) | Docker host loses connectivity |
| pdns-dhcp-watcher | New DHCP leases stop registering in PowerDNS |
| ZeroTier overlay | Other nodes lose path to Owl LAN |
| Monitoring (SNMP + node_exporter) | Prometheus on Dumbo loses Owl metrics |
| Remote syslog | Owl log stream stops |
| HE IPv6 tunnel | IPv6 for Owl LAN lost |
| Site-to-site | Iowa isolated from Colorado (Blue) and cloud |

Not affected: Cloud instances (pluto, dumbo, bogart, mickey, rocky), Blue gateway, PowerDNS — all operate independently.


S1 — Service Degraded

1. Connect and assess

# Preferred — ZeroTier
ssh joe@192.168.194.10

# Fallback — WAN
ssh joe@46.110.77.34

2. Check service health

# On Owl:
configctl zerotier status
configctl firmware status
pfctl -s info | head -5
drill @127.0.0.1 google.com   # base install ships drill, not dig
ping -c 3 8.8.8.8

3. Restart specific services

# DNS (Unbound)
configctl dns restart

# ZeroTier
configctl zerotier restart

# DHCP (Kea)
configctl kea restart

# Suricata IDS
configctl ids restart

# Full service reload (non-disruptive)
configctl service reload all

4. Check recent config changes

# From luna:
./scripts/backup/check-config-drift.sh owl

If drift is detected, review and either accept or restore via Ansible:

cd cloud/ansible && ./scripts/run-opnsense.sh owl

S2 — Remote Access Lost

ZeroTier is down or SSH is timing out on the overlay.

1. Try WAN SSH

ssh joe@46.110.77.34

If WAN SSH works, fix ZeroTier:

configctl zerotier restart
zerotier-cli listnetworks

2. Try API connectivity

curl -sk -u "API_KEY:API_SECRET" \
  https://46.110.77.34/api/diagnostics/interface/getInterfaceNames

3. Deploy via WAN

cd cloud/ansible && ./scripts/run-opnsense.sh owl --wan

The --wan flag overrides both ansible_host (API) and opn_ssh_host (SSH play) to use 46.110.77.34.

4. Check if banned

If SSH connects but hangs or resets:

# From a different IP or device:
ssh joe@46.110.77.34

# Once in, check fail2ban / sshlockout
pfctl -t sshlockout -T show

S3 — Gateway Unresponsive

No response on ZeroTier, WAN SSH, or API. Requires physical console at the Iowa site.

1. Physical console access

  1. Connect to the DEC700 via serial console (or monitor + keyboard)
  2. You will see the OPNsense menu:
  0) Logout                              7) Ping host
  1) Assign interfaces                   8) Shell
  2) Set interface(s) IP address         9) pfTop
  3) Reset the root password            10) Firewall log
  4) Reset to factory defaults          11) Reboot
  5) Power off system                   12) Upgrade from console
  6) Restore a configuration

2. Basic diagnostics

Select 8 (Shell) and check:

# Network
ifconfig igb1        # WAN — should have 46.110.77.34
ifconfig igb0        # LAN — should have 10.7.0.1
ping 8.8.8.8         # Internet connectivity

# Disk
df -h                # Check disk space
mount                # Verify /conf is mounted rw

# Processes
top -b | head -20    # FreeBSD top: batch mode runs a single display

3. Restore config if corrupted

If the config is corrupted or the gateway won't boot properly, restore from backup.

Choose a restore method based on available connectivity:

Option A — Web UI (if accessible on LAN)

  1. Browse to https://10.7.0.1
  2. System → Configuration → Backups → Restore Configuration
  3. Upload a config from gateways/owl/emergency-restore/
  4. Reboot

Option B — SCP (if LAN IP works)

# From serial console:
# 1) Assign interfaces: igb1=WAN, igb0=LAN
# 2) Set LAN IP: 10.7.0.1/16

# From a machine on the LAN:
scp gateways/owl/emergency-restore/restore-2025-09-20-stable-pre-gap.xml \
  root@10.7.0.1:/conf/config.xml

# Serial console → option 11 (Reboot)

Option C — USB drive

# Prep: copy config to FAT32 USB drive
# Insert USB into DEC700

# Serial console → option 8 (Shell)
mount -t msdosfs /dev/da0s1 /mnt
cp /mnt/restore-2025-09-20-stable-pre-gap.xml /conf/config.xml
umount /mnt
reboot

Option D — HTTP fetch (if any network path exists)

# On a reachable host, serve the config file:
cd gateways/owl/emergency-restore && python3 -m http.server 8080

# Serial console → option 8 (Shell)
fetch http://<server-ip>:8080/restore-2025-09-20-stable-pre-gap.xml \
  -o /conf/config.xml
reboot

4. Emergency restore configs (in priority order)

| Priority | File | Date | Notes |
|----------|------|------|-------|
| 1 | restore-2025-09-20-stable-pre-gap.xml | 2025-09-20 | Longest stable period |
| 2 | restore-2025-11-23-stable.xml | 2025-11-23 | Stable for 5 weeks afterward |
| 3 | restore-2025-07-20-pre-he-tunnel.xml | 2025-07-20 | Before HE tunnel config |
| 4 | restore-2026-01-02-last-backup.xml | 2026-01-02 | Last automated backup |
| 5 | restore-2025-06-23-earliest-stable.xml | 2025-06-23 | Earliest stable config |

These are in gateways/owl/emergency-restore/. Try in priority order.
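Before pushing any of these onto the gateway, it is worth confirming the file is still well-formed XML. A minimal sketch (the function name is illustrative; it uses python3's stdlib parser, which the HTTP fetch option already assumes is available on the admin host — `xmllint --noout` works equally well where libxml2 is installed):

```shell
# Illustrative pre-flight check: flag any emergency-restore config that is
# not well-formed XML before it gets copied to /conf/config.xml.
validate_restore_configs() {
  for f in "$1"/restore-*.xml; do
    [ -e "$f" ] || { echo "no restore configs in $1" >&2; return 1; }
    if python3 -c 'import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])' "$f" 2>/dev/null; then
      echo "OK   $(basename "$f")"
    else
      echo "BAD  $(basename "$f")"
    fi
  done
}

# Usage:
#   validate_restore_configs gateways/owl/emergency-restore
```

Well-formedness is a necessary check, not a sufficient one; it will not catch a semantically bad config, only truncation or corruption.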


S4 — Hardware Failure

DEC700 is dead (no POST, power supply failure, disk failure).

1. Obtain replacement hardware

  • DEC700 or an equivalent mini-PC with three Intel (igb) NICs
  • Install OPNsense from USB installer (download from opnsense.org)
  • Perform base installation accepting defaults

2. Restore from Ansible IaC

After a fresh OPNsense install with basic network connectivity:

# 1. Confirm SSH access to fresh install
ssh root@<new-ip>

# 2. Create joe user + SSH key via serial console or web UI

# 3. Run full Ansible deployment
cd cloud/ansible && ./scripts/run-opnsense.sh owl --wan

Ansible covers ~78% of customized config. Remaining manual steps:

  • DNSBL configuration (OPNsense 26.1 model change, not yet in IaC)
  • pdns-dhcp-watcher service setup
  • os-git-backup SSH deploy key
  • ZeroTier network authorization (via my.zerotier.com)

3. Alternative: restore config.xml directly

If time-critical, skip Ansible and restore a full config.xml:

# From the git-backup repo (most recent):
git clone git@github.com:scandora/opnsense-owl.git
cd opnsense-owl
scp config.xml root@<new-ip>:/conf/config.xml
# Reboot the gateway

Or use the curated emergency-restore configs from gateways/owl/emergency-restore/.

4. Post-restore verification

# From the gateway:
ping 8.8.8.8                              # Internet
drill google.com                          # DNS
zerotier-cli listnetworks                 # ZeroTier overlay

# From luna:
ssh joe@46.110.77.34                      # WAN SSH
ssh joe@192.168.194.10                    # ZeroTier SSH
curl -sk https://192.168.194.10/api/diagnostics/interface/getInterfaceNames \
  -u "API_KEY:API_SECRET"                 # API access
./scripts/backup/pull-config.sh owl       # Backup verification
./scripts/backup/check-config-drift.sh owl  # Drift baseline
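The verification commands above can be wrapped in a small pass/fail runner so results are easy to record (useful for the DR drill log too). A sketch — the runner and the check list are illustrative, not an existing repo script; substitute the exact commands above:

```shell
# Illustrative pass/fail runner: reads "name|command" lines on stdin,
# runs each command, and prints PASS/FAIL per check plus a summary count.
run_checks() {
  fails=0
  while IFS='|' read -r name cmd; do
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "PASS $name"
    else
      echo "FAIL $name"
      fails=$((fails + 1))
    fi
  done
  echo "$fails check(s) failed"
}

# Usage (commands abbreviated; use the full checks above):
#   run_checks <<'EOF'
#   wan-ssh|ssh joe@46.110.77.34 echo OK
#   zt-ssh|ssh joe@192.168.194.10 echo OK
#   EOF
```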

Backup Inventory

All available backup sources, from freshest to oldest:

| Source | Location | Frequency |
|--------|----------|-----------|
| Pre-run snapshot | ~/.config/scandora/backups/owl/config-pre-run-*.xml | Every Ansible run |
| pull-config.sh | ~/.config/scandora/backups/owl/config-*.xml | Daily (cron 03:00), 90-day retention |
| os-git-backup | github.com/scandora/opnsense-owl.git | Every config change |
| Google Drive | OPNsense built-in | Automatic, 3 retained |
| Milestone configs | gateways/owl/configs/ | Manual, in git repo |
| Emergency restore | gateways/owl/emergency-restore/ | Curated, 5 configs |

Listing local backups

./scripts/backup/pull-config.sh owl --list

Comparing backups

./scripts/backup/pull-config.sh owl --diff
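compare-configs.sh is the supported diff tool; as a hedged illustration of the underlying idea, a raw diff can ignore the revision timestamp so that a re-save with no real changes compares clean (assumes bash for process substitution, and that the timestamp lives on its own `<time>` line inside the `<revision>` block of config.xml):

```shell
#!/usr/bin/env bash
# Illustrative config diff: compare two config.xml files while filtering
# out <time> lines, so differing revision timestamps alone diff clean.
config_diff() {
  diff <(grep -v '<time>' "$1") <(grep -v '<time>' "$2")
}

# Usage:
#   config_diff config-2026-01-01.xml config-2026-01-02.xml
```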

Recovery Scripts Reference

| Script | Purpose | Key flags |
|--------|---------|-----------|
| scripts/backup/pull-config.sh | SSH-based config backup | --list, --diff, --wan, --quiet |
| scripts/backup/check-config-drift.sh | Detect manual changes | --quiet, --save, --wan |
| cloud/ansible/scripts/run-opnsense.sh | Full Ansible deployment | --wan, --tags, --check |
| gateways/owl/get-production-config.sh | Pull current production config | — |
| gateways/owl/compare-configs.sh | Diff two config files | — |

DR Drill Checklist

Schedule quarterly. Target: full recovery in under 60 minutes.

Preparation

  • Verify emergency-restore configs exist in gateways/owl/emergency-restore/
  • Verify pull-config.sh owl --list shows recent backups
  • Verify check-config-drift.sh owl returns exit 0 (no drift)
  • Verify 1Password contains OPNsense API - Owl credentials
  • Verify git clone git@github.com:scandora/opnsense-owl.git succeeds

Drill Procedure (non-destructive)

  1. Backup test: Run pull-config.sh owl — confirm config downloaded and valid
  2. Drift test: Run check-config-drift.sh owl — confirm clean or review drift
  3. Dry-run deploy: Run run-opnsense.sh owl --check — confirm no errors
  4. WAN access test: Run ssh joe@46.110.77.34 "echo OK" — confirm WAN SSH works
  5. API fallback test: Run run-opnsense.sh owl --wan --check — confirm WAN deployment works
  6. Restore verify: Confirm at least 3 backup sources have configs < 24 hours old
  7. Document results: Record drill date, duration, and any issues found
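Step 6 can be checked mechanically for the local sources. A sketch (helper name is illustrative; path per the Backup Inventory above; `find -mtime -1` matches files modified within the last 24 hours):

```shell
# Illustrative freshness check: count config backups in a directory that
# were modified within the last 24 hours (-mtime -1).
fresh_count() {
  find "$1" -name 'config-*.xml' -mtime -1 2>/dev/null | wc -l
}

# Usage:
#   fresh_count ~/.config/scandora/backups/owl
```

The git and Google Drive sources still need a manual look; this only covers directories reachable from luna.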

After Drill

  • Update "Last tested" date at the top of this page
  • File issues for any gaps discovered
  • Update emergency-restore configs if current ones are > 3 months old