Rocky Deployment Runbook¶
Rocky is a bare-metal VPS at Meanservers (Chicago data center). It runs as the WireGuard VPN exit node for the Denver region, plus standard monitoring and DNS infrastructure.
Validated: 2026-03-01 — Full factory-reset → fully-configured deployment confirmed.
Quick Reference¶
| Property | Value |
|---|---|
| Public IP | 193.8.172.100 |
| ZeroTier IP | 192.168.194.103 |
| SSH user | ansible (automation) / joe (human) |
| Roles | base, dotfiles, zerotier, internal-dns, wireguard, ddns, node-exporter, blackbox-exporter |
| Docker | No (lightweight server) |
| Cloud SQL | No |
| Monitoring stack | No (scrape target only; stack lives on dumbo) |
| WireGuard | Yes — exit node for 10.99.0.0/24 (Denver) |
Script: run-rocky.sh¶
The deployment entry point is `cloud/ansible/scripts/run-rocky.sh`.
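A usage synopsis assembled from the examples later in this runbook (the script may accept additional flags):

```shell
# --prod selects the production inventory; --password is only
# needed for a fresh OS (State C, see below).
./cloud/ansible/scripts/run-rocky.sh [--prod] [--check] [--tags TAG] [--password 'ROOT_PASSWORD']
```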
State Machine¶
The script auto-detects the server's current state and drives it to a fully configured end state with no manual intervention. It probes connection paths in order:
| State | Condition | Action |
|---|---|---|
| A — Subsequent run | ZeroTier reachable at 192.168.194.103 | Full deployment via ZeroTier |
| B — First full deploy | ansible SSH works on public IP; ZeroTier not yet installed | Full deployment via public IP |
| C — Fresh OS | No ansible SSH access anywhere | Bootstrap (create user + key), then full deploy |
State C requires --password 'ROOT_PASSWORD'. States A and B need no extra flags.
Detecting State C takes ~60 seconds: the SSH probes must fail over from ZeroTier to the public IP before giving up, so the delay is expected.
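The probe order above can be sketched as follows (this is an illustration, not the script's literal code; timeouts are assumptions):

```shell
# Hedged sketch of state detection — actual script logic may differ.
if nc -z -w 5 192.168.194.103 22; then
  STATE=A   # ZeroTier reachable: deploy over ZeroTier
elif ssh -o BatchMode=yes -o ConnectTimeout=10 ansible@193.8.172.100 true 2>/dev/null; then
  STATE=B   # public IP answers for the ansible user: deploy over public IP
else
  STATE=C   # fresh OS: bootstrap with --password, then full deploy
fi
```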
Example Commands¶
# Fresh OS (State C) — factory reset or new VPS
./cloud/ansible/scripts/run-rocky.sh --prod --password 'VPS_ROOT_PASSWORD'
# Subsequent run (State A/B) — idempotent config refresh
./cloud/ansible/scripts/run-rocky.sh --prod
# Single role re-run
./cloud/ansible/scripts/run-rocky.sh --prod --tags blackbox-exporter
# Dry run
./cloud/ansible/scripts/run-rocky.sh --check
Full Factory-Reset Procedure¶
This is the validated end-to-end sequence for deploying rocky from a freshly reinstalled OS.
Prerequisites¶
- `op` CLI authenticated (1Password desktop app running, Touch ID available)
- SSH known_hosts not pre-populated for rocky's IPs (or auto-cleaned by the script)
- Root password for the fresh OS install
Step 1: Verify Connectivity¶
# Check public IP is reachable
ping -c 3 193.8.172.100
# Check SSH port is open
nc -zv 193.8.172.100 22
Step 2: Run the Deployment¶
The script will:
- Inject secrets from 1Password (`.vars.rocky.yml`) — Touch ID may fire here
- Detect State C (no `ansible` SSH) — auto-clean known_hosts for 193.8.172.100 and 192.168.194.103
- Bootstrap: run `migrate-to-ansible-user.yml` as root with password auth
- Full deployment: run `site.yml` as the `ansible` user
Expected duration: ~5–10 minutes for a fresh deploy.
Step 3: Handle Transient Failures¶
If a role fails due to a transient network issue (e.g., GitHub download timeout), re-run just that tag:
The blackbox-exporter role downloads from GitHub releases. Transient connection resets from GitHub are common — simply re-run.
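For example, after a blackbox-exporter download failure:

```shell
# Re-run only the failed role; all roles are idempotent
./cloud/ansible/scripts/run-rocky.sh --prod --tags blackbox-exporter
```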
Step 4: Verify Deployment¶
After a successful run, the script prints verification commands. Run them:
# 1. Node Exporter metrics (use ZeroTier IP after State A)
curl -s http://192.168.194.103:9100/metrics | head
# 2. WireGuard VPN
ssh rocky 'sudo wg show'
# 3. Internal DNS
ssh rocky 'dig bogart.scandora.net'
# 4. ZeroTier network membership
ssh rocky 'sudo zerotier-cli listnetworks'
Expected ZeroTier output: OK PRIVATE with IP 192.168.194.103.
Secrets¶
All credentials are injected at deploy time via op inject from:
| Variable | 1Password Item | Vault |
|---|---|---|
| `zerotier_api_token` | `zerotier_api_token_network_management` | scandora-automation |
| `cf_api_key` | `cloudflare_api_token_dns_automation` | scandora-automation |
| `pdns_api_key` | `powerdns_api_key_bogart_production` | scandora-prd-automation |
| `ansible_authorized_key` | `ssh_key_ansible` (public_key field) | scandora-automation |
The pdns_api_key is in scandora-prd-automation — Touch ID required (production vault).
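To spot-check that a reference resolves before a deploy, `op read` works against the same items. The `credential` field name below is an assumption; use whatever field the item actually defines:

```shell
# Field name 'credential' is assumed — adjust to the item's real field
op read 'op://scandora-automation/cloudflare_api_token_dns_automation/credential'
```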
Known Issues and Gotchas¶
op:// in YAML Comments Breaks op inject¶
Bug discovered: 2026-03-01 during rocky production deployment.
op inject scans the entire file for op:// patterns — including YAML comments.
Any comment containing op:// will be parsed as a secret reference and fail.
# BROKEN — op inject tries to parse this comment:
# DO NOT put actual secret values here — op:// references only.
# CORRECT:
# DO NOT put actual secret values here — secret references only.
All .vars.*.yml files were updated to remove op:// from their comments (commit e177e80).
op inject --force Required When Output File Pre-Exists¶
Bug discovered: 2026-03-01 during rocky production deployment.
mktemp creates an empty file. When op inject then tries to write to it, it prompts
interactively to confirm overwriting the existing file. A background subprocess has
no TTY, so the prompt hangs or the command fails.
Fix: Always use op inject --force when writing to a pre-created temp file.
All six run-*.sh scripts now use --force (commit e177e80).
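The pattern in the run-*.sh scripts looks roughly like this (variable names are illustrative):

```shell
# mktemp pre-creates an empty file, so --force is required to skip
# the interactive overwrite prompt (no TTY in a background subprocess)
VARS_FILE=$(mktemp)
op inject --force -i scripts/env-files/.vars.rocky.yml -o "$VARS_FILE"
```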
Stale known_hosts After Server Reset¶
When rocky is reinstalled, its SSH host key changes, and the client refuses to connect with "REMOTE HOST IDENTIFICATION HAS CHANGED". The script auto-clears known_hosts entries for both the public IP and the ZeroTier IP when it detects State C.
This is handled automatically — no manual intervention needed.
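The manual equivalent, should the auto-clean ever miss an entry:

```shell
# Remove stale host keys by hand (normally unnecessary)
ssh-keygen -R 193.8.172.100
ssh-keygen -R 192.168.194.103
```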
ZeroTier Static IP Assignment¶
Rocky's ZeroTier node ID changes after a factory reset. The ZeroTier network assigns
IPs per node ID, so a new node ID gets a new IP unless a static ipAssignments entry
is configured for the member in ZeroTier Central.
The zerotier Ansible role auto-authorizes the node using the API token. After
authorization, the IP assignment 192.168.194.103 is enforced via ZeroTier's
static IP assignments in the network config (configured in ZeroTier Central,
not in this IaC repo).
If rocky does not come up with 192.168.194.103, open ZeroTier Central → Members → rocky
and verify the managed IP assignment is set to 192.168.194.103.
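After a reset, the new node ID for the Central lookup can be read off the node itself:

```shell
# 'info' prints the node's 10-digit ZeroTier address (the node ID)
ssh rocky 'sudo zerotier-cli info'
```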
Roles Deployed to Rocky¶
From cloud/ansible/playbooks/site.yml (with rocky's inventory vars):
| Role | Tag | Purpose | Condition |
|---|---|---|---|
| `base` | `base` | Packages, users, SSH hardening, fail2ban | Always |
| `dotfiles` | `dotfiles` | Shell config, aliases | Always |
| `docker` | `docker` | Docker CE | Skipped (`docker_enabled: false`) |
| `zerotier` | `zerotier` | Overlay network, auto-authorize | Always |
| `internal-dns` | `internal-dns` | Routes `*.scandora.net` to bogart | Always |
| `wireguard` | `wireguard` | VPN server — Denver exit node, 10.99.0.0/24 | `wireguard_enabled: true` |
| `ddns` | `ddns` | `rocky.scandora.net` → public IP via Cloudflare | `cf_api_key` is defined |
| `node-exporter` | `node-exporter` | Prometheus metrics, binds to ZeroTier IP | `zerotier_ip` is defined |
| `blackbox-exporter` | `blackbox-exporter` | ICMP/TCP probing mesh | `zerotier_ip` is defined |
Skipped roles (not applicable to rocky): home-disk, docker, cloudsql-client,
powerdns, monitoring-stack, iac-tools, github-runner.
WireGuard Configuration¶
Rocky runs the WireGuard server for the Denver exit node:
| Property | Value |
|---|---|
| Server address | 10.99.0.1/24 |
| Listen port | 51820 |
| NAT interface | eth0 |
| Luna peer | 10.99.0.2/32 — key z2IPPElnNhvnBXH0GzYZrtPAUMWv+78xVtidfxYIEXs= |
| Rocky WireGuard public key | Y2G2hNJNj0XckPdVuxDMRYoEZpV97wAMcnGRAPE710w= |
The server's private key is generated by the wireguard Ansible role and stored in
/etc/wireguard/wg0.conf on rocky (not in this repo — regenerated on fresh deploy).
After a factory reset, the server's key pair is regenerated. Follow the Post-Reset: 1Password Key Sync procedure below to update 1Password and luna's client config before testing the tunnel.
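The rendered server config follows the values in the table above. A sketch of what the role produces (the NAT PostUp/PostDown rules are assumptions about the role's template, not confirmed contents):

```ini
; /etc/wireguard/wg0.conf — sketch, not the literal template output
[Interface]
Address    = 10.99.0.1/24
ListenPort = 51820
PrivateKey = <generated by the role on deploy>
; NAT out eth0 (assumed iptables rules)
PostUp   = iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
PostDown = iptables -t nat -D POSTROUTING -o eth0 -j MASQUERADE

[Peer]
; luna
PublicKey  = z2IPPElnNhvnBXH0GzYZrtPAUMWv+78xVtidfxYIEXs=
AllowedIPs = 10.99.0.2/32
```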
Prometheus Monitoring¶
Rocky is a scrape target of the Prometheus stack on dumbo:
| Target | URL |
|---|---|
| Node Exporter | http://192.168.194.103:9100/metrics |
| Blackbox Exporter | http://192.168.194.103:9115/metrics |
After a factory reset, re-run the monitoring stack deployment to restore scraping:
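Assuming dumbo's entry point follows the same run-<host>.sh convention as rocky's (the script name below is an assumption), that would be:

```shell
# Script name assumed from the run-*.sh naming convention
./cloud/ansible/scripts/run-dumbo.sh --prod --tags monitoring-stack
```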
Post-Reset Checklist¶
After a full factory reset deployment, verify:
- ZeroTier: `ssh rocky 'sudo zerotier-cli listnetworks'` → `OK PRIVATE 192.168.194.103`
- SSH via ZeroTier: `ssh ansible@192.168.194.103 echo ok`
- Node Exporter: `curl -s http://192.168.194.103:9100/metrics | grep node_uname_info`
- Blackbox Exporter: `curl -s http://192.168.194.103:9115/metrics | grep blackbox_exporter_build_info`
- WireGuard: `ssh rocky 'sudo wg show'` → shows peers and latest handshake
- Internal DNS: `ssh rocky 'dig bogart.scandora.net +short'` → `192.168.194.133`
- DDNS: `dig rocky.scandora.net +short` → `193.8.172.100`
- Prometheus scraping: check the dumbo Grafana targets page for rocky
- WireGuard keys synced to 1Password (see Post-Reset: 1Password Key Sync below)
- Luna tunnel: `wg-quick down rocky-denver && wg-quick up rocky-denver` → no errors
- Tunnel handshake: `sudo wg show` on luna shows a recent handshake with the rocky peer
Post-Reset: 1Password Key Sync¶
After every factory reset, the wireguard Ansible role generates a fresh keypair.
Run these commands to sync the new keys into 1Password and update luna's client config.
Step 1: Retrieve the new server keypair¶
# Get the new private key
NEW_PRIVKEY=$(ssh rocky 'sudo cat /etc/wireguard/wg0.conf' | grep PrivateKey | awk '{print $3}')
# Get the new public key (derived from the private key by WireGuard)
NEW_PUBKEY=$(ssh rocky 'sudo wg show wg0 public-key')
echo "New public key: $NEW_PUBKEY"
Step 2: Update 1Password¶
# Update the server item (Touch ID not required — scandora-automation vault)
op item edit wireguard_rocky_denver_server \
--vault scandora-automation \
"private_key=$NEW_PRIVKEY" \
"public_key=$NEW_PUBKEY"
# Update the client item's cross-reference
op item edit wireguard_luna_denver_client \
--vault scandora-automation \
"server_public_key=$NEW_PUBKEY"
Step 3: Update luna's client config¶
# Edit the WireGuard client config on luna
# (BSD/macOS sed syntax; on Linux, drop the '' after -i)
# Replace the [Peer] PublicKey line with the new value
sed -i '' "s|^PublicKey = .*|PublicKey = $NEW_PUBKEY|" ~/.wireguard/rocky-denver.conf
# Verify
grep PublicKey ~/.wireguard/rocky-denver.conf
Step 4: Restart the tunnel and verify¶
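On luna, bounce the tunnel and watch for a handshake (same commands as in the Post-Reset Checklist above):

```shell
# Restart the tunnel with the updated config
wg-quick down rocky-denver && wg-quick up rocky-denver
# Look for 'latest handshake' on the rocky peer
sudo wg show
```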
Expected: a handshake with the rocky peer within ~30 seconds.
Related Files¶
| File | Purpose |
|---|---|
| `cloud/ansible/scripts/run-rocky.sh` | Deployment entry point |
| `cloud/ansible/inventory/rocky.yml` | Host variables |
| `cloud/ansible/playbooks/site.yml` | Site playbook (shared, all hosts) |
| `cloud/ansible/playbooks/migrate-to-ansible-user.yml` | Bootstrap playbook |
| `scripts/env-files/.vars.rocky.yml` | Secret references (1Password) |
| `docs/operations/REBOOT-CHECKLIST.md` | Post-reboot validation steps |