Disaster Recovery — Owl Gateway¶
Procedures for recovering the production Owl gateway (DEC700, Urbandale, Iowa) from partial or total failure.
Last tested: Not yet drilled — schedule first drill by 2026-03-15.
Severity Levels¶
| Level | Symptoms | Response |
|---|---|---|
| S1 — Service degraded | Slow DNS, partial packet loss, monitoring alerts | Diagnose remotely, restart services |
| S2 — Remote access lost | ZeroTier down, SSH timeout | WAN SSH fallback, API checks |
| S3 — Gateway unresponsive | No SSH (ZeroTier or WAN), no web UI, LAN clients offline | Physical console at Iowa site |
| S4 — Hardware failure | S3 + physical console unresponsive, POST failure, power loss | Replace hardware, restore config |
Impact Assessment¶
When Owl goes down, the following are affected:
| Dependent | Impact |
|---|---|
| Owl LAN clients (10.7.0.0/16) | No internet, no DHCP, no DNS |
| 28 static DHCP devices | APs, switches, Home Assistant, etc. |
| Triton (10.7.1.20, RPi 5) | Docker host loses connectivity |
| pdns-dhcp-watcher | New DHCP leases stop registering in PowerDNS |
| ZeroTier overlay | Other nodes lose path to Owl LAN |
| Monitoring (SNMP + node_exporter) | Prometheus on Dumbo loses Owl metrics |
| Remote syslog | Owl log stream stops |
| HE IPv6 tunnel | IPv6 for Owl LAN lost |
| Site-to-site | Iowa isolated from Colorado (Blue) and cloud |
Not affected: Cloud instances (pluto, dumbo, bogart, mickey, rocky), Blue gateway, PowerDNS — all operate independently.
S1 — Service Degraded¶
1. Connect and assess¶
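The commands for this step were not preserved; a minimal sketch, assuming the ZeroTier SSH address (192.168.194.10) and user (joe) used elsewhere in this runbook:

```shell
# Connect over the ZeroTier overlay (primary management path)
ssh joe@192.168.194.10

# Quick health picture once connected
uptime              # load averages and how long the box has been up
dmesg | tail -20    # recent kernel messages
```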
2. Check service health¶
# On Owl:
configctl zerotier status
configctl firmware status
pfctl -s info | head -5
dig @127.0.0.1 google.com
ping -c 3 8.8.8.8
3. Restart specific services¶
# DNS (Unbound)
configctl dns restart
# ZeroTier
configctl zerotier restart
# DHCP (Kea)
configctl kea restart
# Suricata IDS
configctl ids restart
# Full service reload (non-disruptive)
configctl service reload all
4. Check recent config changes¶
If drift is detected, review and either accept or restore via Ansible:
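The concrete commands were not preserved here; a sketch using the scripts and flags listed in the Recovery Scripts Reference:

```shell
# Detect manual changes against the saved baseline
./scripts/backup/check-config-drift.sh owl

# Accept the drift as the new baseline...
./scripts/backup/check-config-drift.sh owl --save

# ...or push the known-good IaC state back to the gateway
cd cloud/ansible && ./scripts/run-opnsense.sh owl
```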
S2 — Remote Access Lost¶
ZeroTier is down or SSH is timing out on the overlay.
1. Try WAN SSH¶
If WAN SSH works, fix ZeroTier:
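A sketch of the WAN fallback and the ZeroTier fix, using the WAN address and configctl commands that appear elsewhere in this runbook:

```shell
# SSH directly to the WAN interface, bypassing the overlay
ssh joe@46.110.77.34

# On Owl: check and restart the ZeroTier service
configctl zerotier status
configctl zerotier restart
```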
2. Try API connectivity¶
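This check can be sketched with the same API endpoint used in post-restore verification, assuming the API also answers on the WAN address (credentials from the OPNsense API - Owl 1Password item):

```shell
# Expect a JSON list of interface names if the API is reachable
curl -sk https://46.110.77.34/api/diagnostics/interface/getInterfaceNames \
  -u "API_KEY:API_SECRET"
```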
3. Deploy via WAN¶
The --wan flag overrides both ansible_host (API) and opn_ssh_host (SSH play) to use 46.110.77.34.
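For example (command form taken from the S4 section of this runbook):

```shell
cd cloud/ansible && ./scripts/run-opnsense.sh owl --wan
```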
4. Check if banned¶
If SSH connects but hangs or resets:
# From a different IP or device:
ssh joe@46.110.77.34
# Once in, check fail2ban / sshlockout
pfctl -t sshlockout -T show
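If your source IP shows up in the table, it can be removed with standard pf table syntax (the address below is a placeholder):

```shell
# Remove a banned address from the sshlockout table
pfctl -t sshlockout -T delete 203.0.113.7
```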
S3 — Gateway Unresponsive¶
No response on ZeroTier, WAN SSH, or API. Requires physical console at the Iowa site.
1. Physical console access¶
- Connect to the DEC700 via serial console (or monitor + keyboard)
- You will see the OPNsense menu:
0) Logout 7) Ping host
1) Assign interfaces 8) Shell
2) Set interface(s) IP address 9) pfTop
3) Reset the root password 10) Firewall log
4) Reset to factory defaults 11) Reboot
5) Power off system 12) Upgrade from console
6) Restore a configuration
2. Basic diagnostics¶
Select 8 (Shell) and check:
# Network
ifconfig igb1 # WAN — should have 46.110.77.34
ifconfig igb0 # LAN — should have 10.7.0.1
ping 8.8.8.8 # Internet connectivity
# Disk
df -h # Check disk space
mount # Verify /conf is mounted rw
# Processes
top -b -d 1 | head -20    # one batch-mode display (FreeBSD top syntax)
3. Restore config if corrupted¶
If the config is corrupted or the gateway won't boot properly, restore from backup.
Choose a restore method based on available connectivity:
Option A — Web UI (if accessible on LAN)¶
- Browse to https://10.7.0.1
- System → Configuration → Backups → Restore Configuration
- Upload a config from gateways/owl/emergency-restore/
- Reboot
Option B — SCP (if LAN IP works)¶
# From serial console:
# 1) Assign interfaces: igb1=WAN, igb0=LAN
# 2) Set LAN IP: 10.7.0.1/16
# From a machine on the LAN:
scp gateways/owl/emergency-restore/restore-2025-09-20-stable-pre-gap.xml \
root@10.7.0.1:/conf/config.xml
# Serial console → option 11 (Reboot)
Option C — USB drive¶
# Prep: copy config to FAT32 USB drive
# Insert USB into DEC700
# Serial console → option 8 (Shell)
mount -t msdosfs /dev/da0s1 /mnt
cp /mnt/restore-2025-09-20-stable-pre-gap.xml /conf/config.xml
umount /mnt
reboot
Option D — HTTP fetch (if any network path exists)¶
# On a reachable host, serve the config file:
cd gateways/owl/emergency-restore && python3 -m http.server 8080
# Serial console → option 8 (Shell)
fetch http://<server-ip>:8080/restore-2025-09-20-stable-pre-gap.xml \
-o /conf/config.xml
reboot
4. Emergency restore configs (in priority order)¶
| Priority | File | Date | Notes |
|---|---|---|---|
| 1 | restore-2025-09-20-stable-pre-gap.xml | 2025-09-20 | Longest stable period |
| 2 | restore-2025-11-23-stable.xml | 2025-11-23 | Stable for 5 weeks after |
| 3 | restore-2025-07-20-pre-he-tunnel.xml | 2025-07-20 | Before HE tunnel config |
| 4 | restore-2026-01-02-last-backup.xml | 2026-01-02 | Last automated backup |
| 5 | restore-2025-06-23-earliest-stable.xml | 2025-06-23 | Earliest stable config |
These are in gateways/owl/emergency-restore/. Try in priority order.
S4 — Hardware Failure¶
DEC700 is dead (no POST, power supply failure, disk failure).
1. Obtain replacement hardware¶
- DEC700 or an equivalent mini-PC with 3 Intel (igb) NIC ports
- Install OPNsense from USB installer (download from opnsense.org)
- Perform base installation accepting defaults
2. Restore from Ansible IaC¶
After a fresh OPNsense install with basic network connectivity:
# 1. Confirm SSH access to fresh install
ssh root@<new-ip>
# 2. Create joe user + SSH key via serial console or web UI
# 3. Run full Ansible deployment
cd cloud/ansible && ./scripts/run-opnsense.sh owl --wan
Ansible covers ~78% of customized config. Remaining manual steps:
- DNSBL configuration (OPNsense 26.1 model change, not yet in IaC)
- pdns-dhcp-watcher service setup
- os-git-backup SSH deploy key
- ZeroTier network authorization (via my.zerotier.com)
3. Alternative: restore config.xml directly¶
If time-critical, skip Ansible and restore a full config.xml:
# From the git-backup repo (most recent):
git clone git@github.com:scandora/opnsense-owl.git
cd opnsense-owl
scp config.xml root@<new-ip>:/conf/config.xml
# Reboot the gateway
Or use the curated emergency-restore configs from gateways/owl/emergency-restore/.
4. Post-restore verification¶
# From the gateway:
ping 8.8.8.8 # Internet
drill google.com # DNS
zerotier-cli listnetworks # ZeroTier overlay
# From luna:
ssh joe@46.110.77.34 # WAN SSH
ssh joe@192.168.194.10 # ZeroTier SSH
curl -sk https://192.168.194.10/api/diagnostics/interface/getInterfaceNames \
-u "API_KEY:API_SECRET" # API access
./scripts/backup/pull-config.sh owl # Backup verification
./scripts/backup/check-config-drift.sh owl # Drift baseline
Backup Inventory¶
All available backup sources, from freshest to oldest:
| Source | Location | Frequency |
|---|---|---|
| Pre-run snapshot | ~/.config/scandora/backups/owl/config-pre-run-*.xml | Every Ansible run |
| pull-config.sh | ~/.config/scandora/backups/owl/config-*.xml | Daily (cron 03:00), 90-day retention |
| os-git-backup | github.com/scandora/opnsense-owl.git | Every config change |
| Google Drive | OPNsense built-in | Automatic, 3 retained |
| Milestone configs | gateways/owl/configs/ | Manual, in git repo |
| Emergency restore | gateways/owl/emergency-restore/ | Curated, 5 configs |
Listing local backups¶
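The listing commands were not preserved here; a sketch using the backup paths and script flags named on this page:

```shell
# Via the backup script
./scripts/backup/pull-config.sh owl --list

# Or directly on disk, newest first
ls -lt ~/.config/scandora/backups/owl/config-*.xml | head
```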
Comparing backups¶
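A sketch using the helper scripts listed below; the argument forms and filenames are assumptions, so check each script's usage first:

```shell
# Diff the live config against the latest pulled backup
./scripts/backup/pull-config.sh owl --diff

# Diff two specific config files (filenames are examples)
./gateways/owl/compare-configs.sh \
  ~/.config/scandora/backups/owl/config-2026-01-01.xml \
  ~/.config/scandora/backups/owl/config-2026-01-02.xml
```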
Recovery Scripts Reference¶
| Script | Purpose | Key flags |
|---|---|---|
| scripts/backup/pull-config.sh | SSH-based config backup | --list, --diff, --wan, --quiet |
| scripts/backup/check-config-drift.sh | Detect manual changes | --quiet, --save, --wan |
| cloud/ansible/scripts/run-opnsense.sh | Full Ansible deployment | --wan, --tags, --check |
| gateways/owl/get-production-config.sh | Pull current production config | — |
| gateways/owl/compare-configs.sh | Diff two config files | — |
DR Drill Checklist¶
Schedule quarterly. Target: full recovery in under 60 minutes.
Preparation¶
- Verify emergency-restore configs exist in gateways/owl/emergency-restore/
- Verify pull-config.sh owl --list shows recent backups
- Verify check-config-drift.sh owl returns exit 0 (no drift)
- Verify 1Password contains OPNsense API - Owl credentials
- Verify git clone git@github.com:scandora/opnsense-owl.git succeeds
Drill Procedure (non-destructive)¶
- Backup test: Run pull-config.sh owl; confirm the config downloads and is valid
- Drift test: Run check-config-drift.sh owl; confirm clean, or review the drift
- Dry-run deploy: Run run-opnsense.sh owl --check; confirm no errors
- WAN access test: Run ssh joe@46.110.77.34 "echo OK"; confirm WAN SSH works
- API fallback test: Run run-opnsense.sh owl --wan --check; confirm WAN deployment works
- Restore verify: Confirm at least 3 backup sources have configs < 24 hours old
- Document results: Record drill date, duration, and any issues found
After Drill¶
- Update "Last tested" date at the top of this page
- File issues for any gaps discovered
- Update emergency-restore configs if current ones are > 3 months old