Deployment Guide

Overview

Infrastructure changes use:

  • Terraform for provisioning (instances, networks, IPs)
  • Ansible for configuration (packages, services, files)

Workstation Prerequisites

Before deploying, ensure credentials are loaded on luna.

AWS (Automatic)

The AWS CLI uses credential_process for automatic 1Password integration:

# Credentials load automatically - just run commands
aws ec2 describe-instances
terraform plan  # For AWS resources
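Under the hood, credential_process points the AWS config at a helper script that prints credentials as JSON on stdout. A minimal sketch of such a helper — the script path, profile name, and stub values are assumptions; the real helper would fetch the values from 1Password with `op item get ... --reveal`:

```shell
#!/bin/sh
# Hypothetical credential_process helper; ~/.aws/config would reference it:
#   [default]
#   credential_process = /Users/joe/bin/aws-op-creds.sh
# The real helper fetches these from 1Password; stub values keep the
# sketch runnable without `op`.
KEY_ID="${AWS_OP_KEY_ID:-AKIAEXAMPLE}"
SECRET="${AWS_OP_SECRET:-example-secret}"
# The AWS CLI requires exactly this JSON shape on stdout
printf '{"Version": 1, "AccessKeyId": "%s", "SecretAccessKey": "%s"}\n' \
  "$KEY_ID" "$SECRET"
```

Because the AWS CLI invokes the helper itself on every call, no explicit `source` step is needed — which is why AWS credentials are "automatic" while GCP's are session-based.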

GCP (Session-Based)

GCP requires sourcing credentials once per shell session:

# Load credentials (Touch ID prompt)
source scripts/gcp/gcp-env.sh

# Or use alias (add to ~/.bashrc or ~/.zshrc)
alias gcp-auth='source ~/src/scandora.net/scripts/gcp/gcp-env.sh'
gcp-auth

# Credentials auto-cleanup on shell exit

See Secrets Management for details.

Linux Host Deployment

Linux cloud hosts (pluto, dumbo, bogart, rocky, and their -dev mirrors) use per-host run scripts in cloud/ansible/scripts/. These handle 1Password credential loading, state detection, and the correct Ansible invocation automatically.

Run Scripts

cd /Users/joe/src/scandora.net/cloud/ansible

# Full deployment (detect state automatically)
./scripts/run-pluto.sh --prod
./scripts/run-dumbo.sh --prod
./scripts/run-bogart.sh --prod
./scripts/run-rocky.sh --prod

# Dev mirrors (no --prod flag needed — dev is the default)
./scripts/run-pluto.sh
./scripts/run-dumbo.sh
./scripts/run-bogart.sh

# With Ansible tags
./scripts/run-pluto.sh --prod --tags base,zerotier

# Factory-reset (State C): first boot with root password, ansible user not yet created
./scripts/run-rocky.sh --password 'root-password-here' --prod

State Detection

Run scripts detect the host's current state and adapt automatically:

State | Condition                                | Action
------|------------------------------------------|-----------------------------------------------
A     | ZeroTier reachable, ansible user exists  | Full deploy via ZeroTier IP
B     | Public IP reachable, ansible user exists | Full deploy via public IP
C     | Only root access (factory reset)         | Bootstrap (creates ansible user) → full deploy
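The detection logic amounts to an SSH-reachability cascade. A simplified sketch — the real checks live in the run scripts and do more (credential loading, port probes); `can_ssh` and the IP arguments here are illustrative:

```shell
#!/bin/sh
# Illustrative state-detection cascade; the decision order matches the
# table above, but the actual run scripts implement it more robustly.
can_ssh() {  # true if an SSH session opens as user $1 on host $2
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1@$2" true 2>/dev/null
}

detect_state() {
  zt_ip="$1" public_ip="$2"
  if can_ssh ansible "$zt_ip"; then
    echo "A"   # ZeroTier reachable, ansible user exists
  elif can_ssh ansible "$public_ip"; then
    echo "B"   # public IP reachable, ansible user exists
  else
    echo "C"   # only root access: bootstrap, then full deploy
  fi
}
```

State C is why factory-reset hosts need the `--password` flag: the bootstrap step connects as root before the ansible user exists.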

Shared Library

All run scripts source cloud/ansible/scripts/lib/run-script-common.sh for:

  • load_op_credential — retrieves secrets from 1Password via dynamic variable assignment
  • run_script_init — sets up colors, cleanup trap, prerequisite checks
  • load_prd_token — switches to scandora-prd-automation service account
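The dynamic-assignment pattern behind load_op_credential can be sketched as follows — a simplification; the real function in run-script-common.sh also handles vault selection and error reporting, and the item/field names in the usage comment are placeholders:

```shell
#!/bin/sh
# Sketch of the load_op_credential pattern: fetch a secret once from
# 1Password and assign it into a caller-named variable without echoing it.
load_op_credential() {
  var_name="$1" item="$2" field="$3"
  value=$(op item get "$item" --fields "$field" --reveal) || return 1
  eval "$var_name=\$value"   # dynamic variable assignment
}

# Usage (item/field names are placeholders):
#   load_op_credential API_KEY "OPNsense API - Owl" "api key"
```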

First-Boot Behavior (Base Role)

On first boot, Ubuntu starts unattended-upgrades immediately. The base role cooperates:

  1. lock_timeout: 300 waits up to 5 min for the dpkg lock (no killing/racing)
  2. NEEDRESTART_SUSPEND=1 skips needrestart's process scan (saves 30–120s per apt call)
  3. dpkg_options: "force-confdef,force-confold" auto-resolves config file conflicts
  4. unattended-upgrades is stopped and disabled after our upgrade completes

Total first-boot time including package upgrade: ~8–12 minutes on a bare metal host.
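In plain shell terms, the role's apt handling is roughly equivalent to the following. This is a sketch, not the actual tasks — the real role sets these as Ansible apt-module options (lock_timeout, dpkg_options), and the upgrade is guarded behind RUN_UPGRADE here so the snippet is safe to source for reading:

```shell
#!/bin/sh
# Approximate shell equivalent of the base role's first-boot apt behavior.
export NEEDRESTART_SUSPEND=1    # step 2: skip needrestart's process scan
if [ "${RUN_UPGRADE:-0}" = "1" ]; then
  # step 1: DPkg::Lock::Timeout=300 waits up to 5 min for the dpkg lock
  # step 3: confdef/confold auto-resolve config file conflicts
  apt-get -o DPkg::Lock::Timeout=300 \
    -o Dpkg::Options::=--force-confdef \
    -o Dpkg::Options::=--force-confold \
    -y dist-upgrade
  # step 4: stop unattended-upgrades once our upgrade has completed
  systemctl disable --now unattended-upgrades
fi
```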

Ansible Deployment

Full Site Deployment

cd /Users/joe/src/scandora.net/cloud/ansible

# Deploy everything to a host
ansible-playbook -i inventory/production.yml playbooks/site.yml --limit pluto

# Deploy to multiple hosts
ansible-playbook -i inventory/production.yml playbooks/site.yml --limit "pluto,dumbo"

Specific Role

# Base configuration only
ansible-playbook -i inventory/production.yml playbooks/base.yml --limit pluto

# With tags
ansible-playbook -i inventory/production.yml playbooks/site.yml \
  --limit pluto --tags cloudflared

With Secrets

# Get token from 1Password
TOKEN=$(op item get "Cloudflare Tunnel Token - pluto" --fields credential --reveal)

# Pass as extra var
ansible-playbook -i inventory/production.yml playbooks/site.yml \
  --limit pluto --tags cloudflared \
  -e cloudflared_tunnel_token="$TOKEN"

Dry Run

# Check what would change
ansible-playbook -i inventory/production.yml playbooks/site.yml \
  --limit pluto --check --diff

Terraform Deployment

Plan Changes

cd /Users/joe/src/scandora.net/cloud/terraform/environments/production/aws/pluto

# Always use -target for instance changes
terraform plan -target=aws_instance.pluto

Apply Changes

terraform apply -target=aws_instance.pluto

Never run without -target

Running terraform apply without -target could accidentally affect IP associations or other critical resources.

Static IPs

Static IPs are in separate directories and should rarely be touched:

# AWS
cd cloud/terraform/environments/production/aws/static-ips/

# GCE
cd cloud/terraform/environments/production/gce/static-ips/

Gateway Deployment

OPNsense gateways are managed via Ansible (REST API), with manual config.xml edits as a fallback.

Ansible IaC (Preferred)

The opnsense Ansible role manages gateway configuration via the OPNsense REST API:

cd /Users/joe/src/scandora.net/cloud/ansible

# Full deployment (all subsystems)
./scripts/run-opnsense.sh owl --prod

# Specific subsystems
./scripts/run-opnsense.sh owl --prod --tags dhcp,dns

# Dry run (check mode)
./scripts/run-opnsense.sh owl --prod --check

Available tags (API play): system, interfaces, packages, firewall, dhcp, dns, zerotier, ipv6-tunnel, ids, syslog, monitoring, gateways, users, monit

Available tags (SSH play): system-identity, sysctl, ssh-hardening, sudo, webgui, gdrive-cleanup, fw-cleanup, git-backup, ids-cron-cleanup

The run-opnsense.sh script retrieves API credentials from 1Password, validates API connectivity, and takes a pre-run config.xml snapshot before launching Ansible.
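The pre-run snapshot step can be sketched like this — the backup directory layout and filename convention below are assumptions, not necessarily what run-opnsense.sh uses:

```shell
#!/bin/sh
# Sketch of the pre-run config.xml snapshot taken before Ansible runs.
snapshot_config() {
  gateway="$1" backup_dir="$2"
  mkdir -p "$backup_dir"
  ts=$(date +%Y%m%d-%H%M%S)
  # Pull the live config over SSH before any changes are applied
  ssh "joe@$gateway" "cat /conf/config.xml" \
    > "$backup_dir/config-$ts-pre-run.xml"
}

# snapshot_config 192.168.194.10 ~/backups/owl
```

Having a snapshot from immediately before each run gives a known-good restore point even if the scheduled daily backup is hours old.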

Dnsmasq DHCP (Standalone Playbook)

The dnsmasq-dhcp role manages DHCP ranges and static reservations via idempotent API calls:

cd /Users/joe/src/scandora.net/cloud/ansible

# Deploy DHCP configuration
ansible-playbook -i inventory/owl.yml playbooks/dnsmasq-dhcp.yml \
  -e "opn_api_key=$(op item get 'OPNsense API - Owl' --vault Private --fields 'api key' --reveal)" \
  -e "opn_api_secret=$(op item get 'OPNsense API - Owl' --vault Private --fields 'api secret' --reveal)" \
  -e "opn_firewall=192.168.194.10"

Idempotency: The role checks for existing entries before creating them, preventing duplicates on re-runs.
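The check-before-create idiom looks roughly like this — the endpoint path and response fields are assumptions for illustration, not the actual dnsmasq API, and the real role issues these calls from Ansible rather than curl:

```shell
#!/bin/sh
# Illustrative check-before-create against the OPNsense API.
host_exists() {
  hostname="$1"
  curl -sk -u "$OPN_API_KEY:$OPN_API_SECRET" \
    "https://$OPN_FIREWALL/api/dnsmasq/settings/search_host" |
    grep -q "\"$hostname\""
}

# Only create the reservation if it is not already present:
#   if ! host_exists pluto; then ... POST the new entry ... fi
```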

Deployment record: See gateways/owl/DNSMASQ-DEPLOYMENT-RECORD.md for rollback procedures.

Note: This playbook is separate from the main opnsense.yml playbook and must be run independently.

WAN Fallback

If ZeroTier is down, use the --wan flag to deploy via the WAN IP:

./scripts/run-opnsense.sh owl --prod --wan

This overrides both ansible_host (API) and opn_ssh_host (SSH play) with the WAN IP.

Config Backup & Drift Detection

Independently of Ansible, backup and drift-detection tools run over SSH:

# Pull config.xml backup (daily cron at 03:00, 90-day retention)
./scripts/backup/pull-config.sh owl

# List existing backups
./scripts/backup/pull-config.sh owl --list

# Check for manual changes made outside Ansible
./scripts/backup/check-config-drift.sh owl

Dev VM Testing

Test changes against a 4-NIC GCE/KVM dev VM before production:

# One command: provision → wait → tunnels → validate (~3-5 min)
./scripts/opnsense-dev/dev-up.sh

# Test Ansible changes (creds read directly from GCE host, no 1Password needed)
cd cloud/ansible
./scripts/run-opnsense.sh opnsense-dev --tags dhcp,dns,ids

# Validate idempotency (rerun — expect changed=12, all oxlorg.raw false-positives)
./scripts/run-opnsense.sh opnsense-dev

# One command: kill tunnels + terraform destroy
./scripts/opnsense-dev/dev-down.sh

What dev-up.sh automates:

  1. terraform apply — provisions GCE VM from golden image (~90s)
  2. Waits for SSH (IAP) and OPNsense boot
  3. Waits for API key auto-creation (Terraform startup script runs virsh-create-apikey.py)
  4. Opens SSH tunnels: localhost:8443 → HTTPS, localhost:2222 → SSH
  5. Validates API and SSH connectivity through tunnels
  6. Prints summary with next-step commands

No 1Password dependency for dev — API credentials are auto-created on the GCE host at boot and read directly via gcloud ssh.

See gateways/owl/docs/DEV-WORKFLOW.md for architecture details.

Manual Changes (Fallback)

For quick changes via SSH:

# SSH to gateway
ssh joe@192.168.194.10

# Edit configuration
sudo vi /conf/config.xml

# Restart affected service
sudo configctl unbound restart    # DNS
sudo configctl zerotier restart   # VPN

Export Config

# Pull current config
ssh joe@192.168.194.10 "cat /conf/config.xml" > owl-config-$(date +%Y%m%d).xml

Dynamic DNS

DDNS updates run automatically via cron. Manual trigger:

# On any host with cf-ddns.sh
sudo /usr/local/bin/cf-ddns.sh

# View logs
sudo journalctl -u cron -f | grep cf-ddns

Verification

After Ansible

# Verify services running
ssh joe@pluto "sudo systemctl status zerotier-one fail2ban"

# Verify configuration
ssh joe@pluto "grep -E 'PasswordAuth|PermitRoot' /etc/ssh/sshd_config"

After Terraform

# Verify instance running
aws ec2 describe-instances --instance-ids i-xxx --query 'Reservations[].Instances[].State'

# Verify IP association
aws ec2 describe-addresses --allocation-ids eipalloc-xxx

After Gateway Changes

# Test DNS resolution
dig @10.7.0.1 owl.scandora.net

# Test ZeroTier
zerotier-cli listnetworks

# Test cross-site connectivity
ping 192.168.194.x

Rollback

Ansible

Re-run the previous version:

# Check out previous commit
git checkout HEAD~1 cloud/ansible/

# Run playbook
ansible-playbook -i inventory/production.yml playbooks/site.yml --limit pluto

# Return to current
git checkout HEAD cloud/ansible/

Terraform

# Destroy and recreate instance
terraform destroy -target=aws_instance.pluto
terraform apply -target=aws_instance.pluto

Gateway

Restore from backup:

# Option 1: Restore from local backup (most recent pull-config.sh snapshot)
./scripts/backup/pull-config.sh owl --list
scp ~/.config/scandora/backups/owl/config-YYYYMMDD-HHMMSS.xml \
  joe@192.168.194.10:/tmp/config.xml
ssh joe@192.168.194.10 "sudo cp /tmp/config.xml /conf/config.xml && sudo reboot"

# Option 2: Restore from git-backup repo
git clone git@github.com:scandora/opnsense-owl.git
cd opnsense-owl && git log --oneline
scp config.xml joe@192.168.194.10:/tmp/config.xml
ssh joe@192.168.194.10 "sudo cp /tmp/config.xml /conf/config.xml && sudo reboot"

# Option 3: Curated emergency configs (if nothing else works)
# See gateways/owl/emergency-restore/README.md

Fast DHCP Rollback (No Reboot)

For DHCP-only changes (Dnsmasq), use the pre-deployment backup:

# Find the backup (timestamped pre-dnsmasq)
BACKUP="config.xml.backup-YYYYMMDD-HHMMSS-pre-dnsmasq"

# Restore without reboot (2 minutes)
scp ~/backups/owl/$BACKUP joe@192.168.194.10:/tmp/restore.xml
ssh joe@192.168.194.10 << 'REMOTE'
sudo cp /tmp/restore.xml /conf/config.xml
sudo pluginctl -s dnsmasq restart
sudo pluginctl -s unbound restart
echo "✅ DHCP config restored"
REMOTE

For complete disaster recovery procedures, see the DR Runbook.

Reboot Testing

Always test configuration changes with a reboot:

# Cloud instances
sudo reboot

# Gateway
sudo reboot
# or via OPNsense web UI

Verify services come back up:

# After reboot
ssh joe@pluto
sudo systemctl status zerotier-one fail2ban