Deployment Guide¶
Overview¶
Infrastructure changes use:
- Terraform for provisioning (instances, networks, IPs)
- Ansible for configuration (packages, services, files)
Workstation Prerequisites¶
Before deploying, ensure credentials are loaded on luna.
AWS (Automatic)¶
AWS CLI uses credential_process for automatic 1Password integration:
# Credentials load automatically - just run commands
aws ec2 describe-instances
terraform plan # For AWS resources
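Under the hood, credential_process in ~/.aws/config points the CLI at a helper that prints keys in the JSON shape AWS expects. A sketch of such an entry (the helper path is illustrative, not this repo's actual script):

```ini
# ~/.aws/config (sketch; helper name is hypothetical)
[default]
region = us-east-1
# The helper must print JSON like:
#   {"Version": 1, "AccessKeyId": "...", "SecretAccessKey": "..."}
# Ours would fetch the keys from 1Password via the `op` CLI.
credential_process = /usr/local/bin/op-aws-credentials
```

Because the helper runs on demand, no long-lived keys sit in ~/.aws/credentials.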
GCP (Session-Based)¶
GCP requires sourcing credentials once per shell session:
# Load credentials (Touch ID prompt)
source scripts/gcp/gcp-env.sh
# Or use alias (add to ~/.bashrc or ~/.zshrc)
alias gcp-auth='source ~/src/scandora.net/scripts/gcp/gcp-env.sh'
gcp-auth
# Credentials auto-cleanup on shell exit
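Conceptually, a session-based loader like gcp-env.sh exports the key for the current shell and registers an exit trap to remove it. A self-contained sketch (the 1Password step is stubbed; the real script reads the service-account key with `op`):

```shell
# Sketch of a session-based credential loader in the spirit of gcp-env.sh.
load_gcp_creds() {
  GOOGLE_APPLICATION_CREDENTIALS="$(mktemp)"
  # Stub: write a placeholder key instead of the real 1Password secret
  printf '{"type":"service_account"}\n' > "$GOOGLE_APPLICATION_CREDENTIALS"
  export GOOGLE_APPLICATION_CREDENTIALS
  # Delete the key file when the shell session ends
  trap 'rm -f "$GOOGLE_APPLICATION_CREDENTIALS"' EXIT
}

load_gcp_creds
echo "credentials at: $GOOGLE_APPLICATION_CREDENTIALS"
```

The EXIT trap is what gives the "auto-cleanup on shell exit" behavior noted above.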
See Secrets Management for details.
Linux Host Deployment¶
Linux cloud hosts (pluto, dumbo, bogart, rocky, and their -dev mirrors) use
per-host run scripts in cloud/ansible/scripts/. These handle 1Password credential
loading, state detection, and the correct Ansible invocation automatically.
Run Scripts¶
cd /Users/joe/src/scandora.net/cloud/ansible
# Full deployment (detect state automatically)
./scripts/run-pluto.sh --prod
./scripts/run-dumbo.sh --prod
./scripts/run-bogart.sh --prod
./scripts/run-rocky.sh --prod
# Dev mirrors (no --prod flag needed — dev is the default)
./scripts/run-pluto.sh
./scripts/run-dumbo.sh
./scripts/run-bogart.sh
# With Ansible tags
./scripts/run-pluto.sh --prod --tags base,zerotier
# Factory-reset (State C): first boot with root password, ansible user not yet created
./scripts/run-rocky.sh --password 'root-password-here' --prod
State Detection¶
Run scripts detect the host's current state and adapt automatically:
| State | Condition | Action |
|---|---|---|
| A | ZeroTier reachable, ansible user exists | Full deploy via ZeroTier IP |
| B | Public IP reachable, ansible user exists | Full deploy via public IP |
| C | Only root access (factory reset) | Bootstrap (creates ansible user) → full deploy |
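The detection order in the table can be sketched as a simple fallback chain (the IPs and the reachable() stub are hypothetical; the real scripts probe the host over SSH):

```shell
# Illustrative sketch of the A/B/C state-detection order.
ZT_IP="10.7.0.2"
PUB_IP="203.0.113.10"

reachable() {
  # Stub: pretend only the public IP answers
  [ "$1" = "$PUB_IP" ]
}

if reachable "$ZT_IP"; then
  echo "State A: full deploy via ZeroTier IP $ZT_IP"
elif reachable "$PUB_IP"; then
  echo "State B: full deploy via public IP $PUB_IP"
else
  echo "State C: bootstrap as root, then full deploy"
fi
```

With the stub above, the chain falls through to State B.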
Shared Library¶
All run scripts source cloud/ansible/scripts/lib/run-script-common.sh for:
- load_op_credential — retrieves secrets from 1Password via dynamic variable assignment
- run_script_init — sets up colors, cleanup trap, prerequisite checks
- load_prd_token — switches to the scandora-prd-automation service account
First-Boot Behavior (Base Role)¶
On first boot, Ubuntu starts unattended-upgrades immediately. The base role cooperates:
- lock_timeout: 300 waits up to 5 min for the dpkg lock (no killing/racing)
- NEEDRESTART_SUSPEND=1 skips needrestart's process scan (saves 30–120s per apt call)
- dpkg_options: "force-confdef,force-confold" auto-resolves config file conflicts
- unattended-upgrades is stopped and disabled after our upgrade completes
Total first-boot time, including the package upgrade, is ~8–12 minutes on a bare-metal host.
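These settings map onto parameters of Ansible's apt module; a sketch of what the base role's upgrade task could look like (task name and surrounding play are illustrative):

```yaml
- name: Dist-upgrade, cooperating with first-boot unattended-upgrades
  ansible.builtin.apt:
    upgrade: dist
    update_cache: true
    lock_timeout: 300                            # wait up to 5 min for the dpkg lock
    dpkg_options: "force-confdef,force-confold"  # keep existing config files on conflict
  environment:
    NEEDRESTART_SUSPEND: "1"                     # skip needrestart's process scan
```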
Ansible Deployment¶
Full Site Deployment¶
cd /Users/joe/src/scandora.net/cloud/ansible
# Deploy everything to a host
ansible-playbook -i inventory/production.yml playbooks/site.yml --limit pluto
# Deploy to multiple hosts
ansible-playbook -i inventory/production.yml playbooks/site.yml --limit "pluto,dumbo"
Specific Role¶
# Base configuration only
ansible-playbook -i inventory/production.yml playbooks/base.yml --limit pluto
# With tags
ansible-playbook -i inventory/production.yml playbooks/site.yml \
--limit pluto --tags cloudflared
With Secrets¶
# Get token from 1Password
TOKEN=$(op item get "Cloudflare Tunnel Token - pluto" --fields credential --reveal)
# Pass as extra var
ansible-playbook -i inventory/production.yml playbooks/site.yml \
--limit pluto --tags cloudflared \
-e cloudflared_tunnel_token="$TOKEN"
Dry Run¶
# Check what would change
ansible-playbook -i inventory/production.yml playbooks/site.yml \
--limit pluto --check --diff
Terraform Deployment¶
Plan Changes¶
cd /Users/joe/src/scandora.net/cloud/terraform/environments/production/aws/pluto
# Always use -target for instance changes
terraform plan -target=aws_instance.pluto
Apply Changes¶
Never run without -target
Running terraform apply without -target could accidentally affect IP associations or other critical resources.
# Apply the planned change to the targeted instance only
terraform apply -target=aws_instance.pluto
Static IPs¶
Static IPs are in separate directories and should rarely be touched:
# AWS
cd cloud/terraform/environments/production/aws/static-ips/
# GCE
cd cloud/terraform/environments/production/gce/static-ips/
Gateway Deployment¶
OPNsense gateways are managed via Ansible (REST API) with config.xml as fallback.
Ansible IaC (Preferred)¶
The opnsense Ansible role manages gateway configuration via the OPNsense REST API:
cd /Users/joe/src/scandora.net/cloud/ansible
# Full deployment (all subsystems)
./scripts/run-opnsense.sh owl --prod
# Specific subsystems
./scripts/run-opnsense.sh owl --prod --tags dhcp,dns
# Dry run (check mode)
./scripts/run-opnsense.sh owl --prod --check
Available tags (API play): system, interfaces, packages, firewall, dhcp, dns, zerotier, ipv6-tunnel, ids, syslog, monitoring, gateways, users, monit
Available tags (SSH play): system-identity, sysctl, ssh-hardening, sudo, webgui, gdrive-cleanup, fw-cleanup, git-backup, ids-cron-cleanup
The run-opnsense.sh script retrieves API credentials from 1Password, validates API connectivity, and takes a pre-run config.xml snapshot before launching Ansible.
Dnsmasq DHCP (Standalone Playbook)¶
The dnsmasq-dhcp role manages DHCP ranges and static reservations via idempotent API calls:
cd /Users/joe/src/scandora.net/cloud/ansible
# Deploy DHCP configuration
ansible-playbook -i inventory/owl.yml playbooks/dnsmasq-dhcp.yml \
-e "opn_api_key=$(op item get 'OPNsense API - Owl' --vault Private --fields 'api key' --reveal)" \
-e "opn_api_secret=$(op item get 'OPNsense API - Owl' --vault Private --fields 'api secret' --reveal)" \
-e "opn_firewall=192.168.194.10"
Idempotency: The role checks for existing entries before creating, preventing duplicates on re-runs.
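The check-then-create pattern behind that idempotency can be sketched as follows (the existing() lookup is a stub; the real role queries the dnsmasq API for current reservations):

```shell
# Stub: MACs that already have reservations
existing() { printf 'aa:bb:cc:dd:ee:ff\n'; }

add_reservation() {
  mac="$1"; ip="$2"
  if existing | grep -qi "^${mac}\$"; then
    echo "skip: reservation for $mac already exists"
  else
    # The real role would POST the new host entry to the API here
    echo "create: $mac -> $ip"
  fi
}

add_reservation "aa:bb:cc:dd:ee:ff" "192.168.194.50"
add_reservation "11:22:33:44:55:66" "192.168.194.51"
```

Re-running the play replays the same checks, so existing entries are skipped rather than duplicated.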
Deployment record: See gateways/owl/DNSMASQ-DEPLOYMENT-RECORD.md for rollback procedures.
Note: This playbook is separate from the main opnsense.yml playbook and must be run independently.
WAN Fallback¶
If ZeroTier is down, use the --wan flag to deploy via the WAN IP:
./scripts/run-opnsense.sh owl --prod --wan
This overrides both ansible_host (API) and opn_ssh_host (SSH play) with the WAN IP.
Config Backup & Drift Detection¶
Independent of Ansible, backup and drift tools run via SSH:
# Pull config.xml backup (daily cron at 03:00, 90-day retention)
./scripts/backup/pull-config.sh owl
# List existing backups
./scripts/backup/pull-config.sh owl --list
# Check for manual changes made outside Ansible
./scripts/backup/check-config-drift.sh owl
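At its core, drift detection is a diff between the live config and the newest backup. A sketch with stand-in files (the real script compares /conf/config.xml on the gateway against the local backup directory):

```shell
backup=$(mktemp)
live=$(mktemp)
printf '<opnsense><hostname>owl</hostname></opnsense>\n' > "$backup"
printf '<opnsense><hostname>owl-edited</hostname></opnsense>\n' > "$live"  # a manual change

if diff -q "$backup" "$live" >/dev/null; then
  drift="no"
  echo "no drift: live config matches last backup"
else
  drift="yes"
  echo "drift detected; review before the next Ansible run:"
  diff -u "$backup" "$live" | tail -n +4   # skip the diff header lines
fi
rm -f "$backup" "$live"
```

Reviewing drift before an Ansible run avoids silently reverting a deliberate manual change.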
Dev VM Testing¶
Test changes against a 4-NIC GCE/KVM dev VM before production:
# One command: provision → wait → tunnels → validate (~3-5 min)
./scripts/opnsense-dev/dev-up.sh
# Test Ansible changes (creds read directly from GCE host, no 1Password needed)
cd cloud/ansible
./scripts/run-opnsense.sh opnsense-dev --tags dhcp,dns,ids
# Validate idempotency (rerun — expect changed=12, all oxlorg.raw false-positives)
./scripts/run-opnsense.sh opnsense-dev
# One command: kill tunnels + terraform destroy
./scripts/opnsense-dev/dev-down.sh
What dev-up.sh automates:
- terraform apply — provisions GCE VM from golden image (~90s)
- Waits for SSH (IAP) and OPNsense boot
- Waits for API key auto-creation (Terraform startup script runs virsh-create-apikey.py)
- Opens SSH tunnels: localhost:8443 → HTTPS, localhost:2222 → SSH
- Validates API and SSH connectivity through tunnels
- Prints summary with next-step commands
No 1Password dependency for dev — API credentials are auto-created on the GCE host at boot and read directly via gcloud ssh.
See gateways/owl/docs/DEV-WORKFLOW.md for architecture details.
Manual Changes (Fallback)¶
For quick changes via SSH:
# SSH to gateway
ssh joe@192.168.194.10
# Edit configuration
sudo vi /conf/config.xml
# Restart affected service
sudo configctl unbound restart # DNS
sudo configctl zerotier restart # VPN
Export Config¶
# Pull current config
ssh joe@192.168.194.10 "cat /conf/config.xml" > owl-config-$(date +%Y%m%d).xml
Dynamic DNS¶
DDNS updates run automatically via cron. Manual trigger:
# On any host with cf-ddns.sh
sudo /usr/local/bin/cf-ddns.sh
# View logs
sudo journalctl -u cron -f | grep cf-ddns
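The updater's core logic is a compare-then-update loop. A hypothetical sketch (the record name and all three commands are stubs; the real script queries the public IP, compares it to the DNS record, and calls the Cloudflare API):

```shell
current_ip() { echo "203.0.113.7"; }   # real script: query a what-is-my-IP service
dns_ip() { echo "203.0.113.6"; }       # real script: look up the DDNS record
update_record() { echo "updated example record -> $1"; }  # real script: Cloudflare API call

if [ "$(current_ip)" != "$(dns_ip)" ]; then
  update_record "$(current_ip)"
else
  echo "record already up to date"
fi
```

Running the comparison first keeps the cron job quiet (and API-call free) when nothing has changed.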
Verification¶
After Ansible¶
# Verify services running
ssh joe@pluto "sudo systemctl status zerotier-one fail2ban"
# Verify configuration
ssh joe@pluto "grep -E 'PasswordAuth|PermitRoot' /etc/ssh/sshd_config"
After Terraform¶
# Verify instance running
aws ec2 describe-instances --instance-ids i-xxx --query 'Reservations[].Instances[].State'
# Verify IP association
aws ec2 describe-addresses --allocation-ids eipalloc-xxx
After Gateway Changes¶
# Test DNS resolution
dig @10.7.0.1 owl.scandora.net
# Test ZeroTier
zerotier-cli listnetworks
# Test cross-site connectivity
ping 192.168.194.x
Rollback¶
Ansible¶
Re-run previous version:
# Check out previous commit
git checkout HEAD~1 cloud/ansible/
# Run playbook
ansible-playbook -i inventory/production.yml playbooks/site.yml --limit pluto
# Return to current
git checkout HEAD cloud/ansible/
Terraform¶
# Destroy and recreate instance
terraform destroy -target=aws_instance.pluto
terraform apply -target=aws_instance.pluto
Gateway¶
Restore from backup:
# Option 1: Restore from local backup (most recent pull-config.sh snapshot)
./scripts/backup/pull-config.sh owl --list
scp ~/.config/scandora/backups/owl/config-YYYYMMDD-HHMMSS.xml \
joe@192.168.194.10:/tmp/config.xml
ssh joe@192.168.194.10 "sudo cp /tmp/config.xml /conf/config.xml && sudo reboot"
# Option 2: Restore from git-backup repo
git clone git@github.com:scandora/opnsense-owl.git
cd opnsense-owl && git log --oneline
scp config.xml joe@192.168.194.10:/tmp/config.xml
ssh joe@192.168.194.10 "sudo cp /tmp/config.xml /conf/config.xml && sudo reboot"
# Option 3: Curated emergency configs (if nothing else works)
# See gateways/owl/emergency-restore/README.md
Fast DHCP Rollback (No Reboot)¶
For DHCP-only changes (Dnsmasq), use the pre-deployment backup:
# Find the backup (timestamped pre-dnsmasq)
BACKUP="config.xml.backup-YYYYMMDD-HHMMSS-pre-dnsmasq"
# Restore without reboot (2 minutes)
scp ~/backups/owl/$BACKUP joe@192.168.194.10:/tmp/restore.xml
ssh joe@192.168.194.10 << 'REMOTE'
sudo cp /tmp/restore.xml /conf/config.xml
sudo pluginctl -s dnsmasq restart
sudo pluginctl -s unbound restart
echo "✅ DHCP config restored"
REMOTE
For complete disaster recovery procedures, see the DR Runbook.
Reboot Testing¶
Always test configuration changes with a reboot:
Verify services come back up: