Monitoring Stack¶
The scandora.net infrastructure is monitored using a Prometheus-based stack running on Dumbo.
Architecture¶
```
┌─────────────────────────────────────────────────────┐
│ Dumbo (GCE) │
│ 192.168.194.131 │
│ ┌─────────────────────────────────────────────────┐│
│ │ Docker Compose Stack ││
│ │ ┌───────────┐ ┌─────────────┐ ┌─────────────┐ ││
│ │ │Prometheus │ │AlertManager │ │SNMP Exporter│ ││
│ │ │ :9090 │ │ :9093 │ │ :9116 │ ││
│ │ └─────┬─────┘ └──────┬──────┘ └──────┬──────┘ ││
│ │ │ │ │ ││
│ │ ┌─────▼──────────────▼───────────────▼──────┐ ││
│ │ │ Grafana :3000 │ ││
│ │ └──────────────────┬────────────────────────┘ ││
│ └────────────────────┬┼────────────────────────┘ │
│ ││ │
│ Cloud SQL Proxy :5432 │
│ │└─► scandora-postgres │
└───────────────────────┼───────────────────────────┘
│
┌────────────ZeroTier (192.168.194.0/24)────────────┐
│ │ │
┌─────┴─────┐ ┌──────────┐ ┌──┴───────┐ ┌──────────┐ ┌┴─────────┐
│ Pluto │ │ Bogart │ │ Rocky │ │ Blue │ │ Owl │
│ .6 │ │ .133 │ │ .132 │ │ .205 │ │ .10 │
│node_exp │ │node_exp │ │node_exp │ │ SNMP │ │ SNMP │
│ :9100 │ │ :9100 │ │ :9100 │ │ :161 │ │ :161 │
└───────────┘ └──────────┘ └──────────┘ └────┬─────┘ └─────┬────┘
AWS GCE Meanservers OPNsense OPNsense
│ │
Blue LAN (10.15.x.x) │
│ Owl LAN (10.7.x.x)
┌──────┴──────┐ │
│ Rpios │┌──────┴──────┐
│ 10.15.1.50 ││ Triton │
│ node_exp ││ 10.7.1.20 │
│ :9100 ││ node_exp │
└─────────────┘│ :9100 │
Raspberry Pi └─────────────┘
Raspberry Pi
```
Components¶
| Component | Purpose | Port | URL |
|---|---|---|---|
| Prometheus | Metrics collection & storage | 9090 | http://192.168.194.131:9090 |
| Grafana | Dashboards & visualization | 3000 | http://192.168.194.131:3000 |
| AlertManager | Alert routing & notifications | 9093 | http://192.168.194.131:9093 |
| SNMP Exporter | OPNsense interface stats (legacy) | 9116 | http://192.168.194.131:9116 |
| OPNsense Exporter | Rich OPNsense metrics via API | 8080 | Per-gateway containers |
| ZeroTier Exporter | ZeroTier network member status | 9811 | http://192.168.194.131:9811 |
| node_exporter | Linux host metrics | 9100 | On each host's ZeroTier IP |
ZeroTier Metrics (Dual Approach)¶
ZeroTier is monitored via two complementary methods:
| Method | What it shows | How it works |
|---|---|---|
| Central API exporter (port 9811) | Network-wide membership status (who's online/offline) | Polls ZeroTier Central API from Dumbo |
| Built-in agent metrics (via node_exporter) | Per-node connectivity health (packet counts, peer latency, path status) | ZeroTier writes metrics.prom; node_exporter textfile collector picks it up |
Built-in agent metrics are enabled on all IaC-managed Linux hosts (dumbo, pluto, bogart, rocky). They are not available on OPNsense gateways (different OS/plugin) or LAN-only hosts without ZeroTier.
How it works:
- ZeroTier's `local.conf` has `enableMetrics: true`, which causes it to write `/var/lib/zerotier-one/metrics.prom`
- A symlink connects this to node_exporter's textfile collector directory: `/var/lib/node_exporter/textfile_collector/zerotier.prom`
- node_exporter serves the `zt_*` metrics alongside standard `node_*` metrics on `:9100`
- No new Prometheus scrape jobs needed — metrics appear in the existing `node` job
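The plumbing above can be sketched with plain commands. This simulates it in a scratch directory so it is safe to run anywhere; the real paths (`/var/lib/zerotier-one/metrics.prom`, `/var/lib/node_exporter/textfile_collector/`) are noted in comments, and the metric line is an invented sample:

```shell
# Simulate the symlink plumbing in a scratch directory (real paths in comments)
tmp=$(mktemp -d)
mkdir -p "$tmp/zerotier-one" "$tmp/textfile_collector"
# ZeroTier writes metrics.prom when enableMetrics is true (sample line, made up)
echo 'zt_peer_path_count{peer="abc123"} 2' > "$tmp/zerotier-one/metrics.prom"
# The role creates the equivalent of /var/lib/node_exporter/textfile_collector/zerotier.prom
ln -sf "$tmp/zerotier-one/metrics.prom" "$tmp/textfile_collector/zerotier.prom"
# node_exporter reads through the symlink via --collector.textfile.directory
cat "$tmp/textfile_collector/zerotier.prom"
# -> zt_peer_path_count{peer="abc123"} 2
```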
Key `zt_*` metrics:
| Metric | Type | Description |
|---|---|---|
| `zt_packet_incoming_count` | Counter | Packets received by the ZeroTier engine |
| `zt_packet_outgoing_count` | Counter | Packets sent by the ZeroTier engine |
| `zt_packet_error_count` | Counter | Packet errors |
| `zt_peer_latency_count` | Counter | Peer latency measurement count |
| `zt_peer_latency_sum` | Counter | Peer latency sum (for computing averages) |
| `zt_peer_path_count` | Gauge | Number of active paths to each peer |
Querying agent metrics in Prometheus:
```
# Average peer latency over 5 minutes
rate(zt_peer_latency_sum[5m]) / rate(zt_peer_latency_count[5m])

# Packet error rate
rate(zt_packet_error_count[5m])
```
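As a sanity check on the rate-over-rate pattern, the same ratio can be computed by hand from two counter samples taken 60 s apart (all numbers invented):

```shell
# Two scrapes of the latency counters, 60s apart (hypothetical values)
awk 'BEGIN {
  sum0 = 120000; sum1 = 126000;   # zt_peer_latency_sum (cumulative ms)
  cnt0 = 4000;   cnt1 = 4100;     # zt_peer_latency_count (cumulative samples)
  w = 60;                          # window in seconds
  # rate(sum)/rate(cnt): the per-second windows cancel, leaving avg ms per sample
  printf "%.1f\n", ((sum1 - sum0) / w) / ((cnt1 - cnt0) / w)
}'
# -> 60.0 (ms average peer latency over the window)
```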
Application Exporters¶
| Service | Exporter | Endpoint | Metrics |
|---|---|---|---|
| PowerDNS | Native (built-in) | http://192.168.194.133:8081/metrics | DNS queries, cache, latency |
| OPNsense (Owl) | opnsense-exporter | http://dumbo:8080/metrics | Firewall, WireGuard, services |
| OPNsense (Blue) | opnsense-exporter | http://dumbo:8080/metrics | Firewall, WireGuard, services |
Access¶
All monitoring services are bound to ZeroTier IPs only—not accessible from the public internet.
Grafana:
URL: http://192.168.194.131:3000
Username: admin
Password: (from 1Password: "Monitoring - Grafana Admin")
Monitored Hosts¶
Linux Hosts (via node_exporter)¶
| Host | ZeroTier IP | Metrics Endpoint |
|---|---|---|
| Dumbo | 192.168.194.131 | http://192.168.194.131:9100/metrics |
| Pluto | 192.168.194.6 | http://192.168.194.6:9100/metrics |
| Bogart | 192.168.194.133 | http://192.168.194.133:9100/metrics |
| Rocky | 192.168.194.132 | http://192.168.194.132:9100/metrics |
OPNsense Gateways (via node_exporter + SNMP)¶
| Gateway | ZeroTier IP | node_exporter | SNMP |
|---|---|---|---|
| Blue | 192.168.194.205 | :9100 ✓ | UDP 161 ✓ |
| Owl | 192.168.194.10 | :9100 ✓ | UDP 161 ✓ |
Note: OPNsense gateways have both node_exporter (full system metrics) and SNMP (interface stats). node_exporter provides richer data and is preferred for dashboards.
LAN Hosts (via gateway routing)¶
These hosts don't have ZeroTier directly installed. Prometheus reaches them via gateway routing through the ZeroTier overlay.
| Host | LAN IP | Gateway | Metrics Endpoint | Notes |
|---|---|---|---|---|
| Triton | 10.7.1.20 | Owl | http://10.7.1.20:9100/metrics | Raspberry Pi 5, manual node_exporter install |
| Rpios | 10.15.1.50 | Blue | http://10.15.1.50:9100/metrics | Raspberry Pi, manual node_exporter install |
Routing paths:
- Triton: Dumbo → ZeroTier → Owl (192.168.194.10) → Owl LAN (10.7.x.x) → Triton
- Rpios: Dumbo → ZeroTier → Blue (192.168.194.205) → Blue LAN (10.15.x.x) → Rpios
Note: These LAN hosts are not under Ansible IaC management. node_exporter was installed manually following the same patterns as Ansible-managed hosts (v1.8.2, systemd service, binds to specific IP).
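To confirm the overlay routing from Dumbo, the kernel's route choice for each LAN host should point at the corresponding gateway's ZeroTier IP (interface names will vary per host):

```shell
# Should report "via 192.168.194.10" (Owl) on the ZeroTier interface
ip route get 10.7.1.20
# Should report "via 192.168.194.205" (Blue)
ip route get 10.15.1.50
```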
Ephemeral Dev Environments (Cost Control)¶
Ephemeral development VMs like opnsense-dev have special monitoring requirements to prevent runaway costs.
| Host | ZeroTier IP | Metrics Endpoint | Cost | Status |
|---|---|---|---|---|
| opnsense-dev | 192.168.194.199 | http://192.168.194.199:9100/metrics | ~$0.14/hr | Commented out when not running |
Monitoring Pattern:
Ephemeral instances follow the standard cloud instance monitoring policy but with special alert rules for cost control:
- When provisioning: Uncomment the target in `cloud/ansible/roles/monitoring-stack/defaults/main.yml`
- When running: Alerts fire to prevent extended runtime costs
- When torn down: Comment out the target to avoid false "InstanceDown" alerts
Cost Control Alerts:
| Alert | Threshold | Severity | Action |
|---|---|---|---|
| `DevVMRunningTooLong` | 1 hour uptime | Warning | Consider tearing down with `dev-down.sh` |
| `DevVMRunningCriticallyLong` | 4 hours uptime | Critical | URGENT: Run `dev-down.sh` immediately |
Configuration Location:
Target: `cloud/ansible/roles/monitoring-stack/defaults/main.yml`
```yaml
# Ephemeral dev environments (via ZeroTier)
# Uncomment when dev VM is running to enable monitoring
# - name: opnsense-dev
#   address: "192.168.194.199:9100"
```
Alerts: `cloud/ansible/roles/monitoring-stack/templates/alert-rules.yml.j2`
```yaml
- name: dev_environment_alerts
  rules:
    - alert: DevVMRunningTooLong
      expr: node_boot_time_seconds{instance="opnsense-dev"} > 0 and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 3600
      ...
```
Workflow:
```bash
# 1. Provision dev VM
./scripts/opnsense-dev/dev-up.sh

# 2. Uncomment opnsense-dev target in defaults/main.yml
cd cloud/ansible
vim roles/monitoring-stack/defaults/main.yml

# 3. Deploy monitoring configuration
./scripts/run-monitoring.sh --prod dumbo deploy

# 4. Work with dev VM (alerts fire after 1hr/4hr)

# 5. Tear down dev VM when done
./scripts/opnsense-dev/dev-down.sh

# 6. Comment out opnsense-dev target in defaults/main.yml

# 7. Redeploy monitoring to stop false "InstanceDown" alerts
./scripts/run-monitoring.sh --prod dumbo deploy
```
Why Comment Out When Not Running?
When a target is configured in Prometheus but the host is unreachable:
- `InstanceDown` alerts fire immediately (critical severity)
- Creates alert noise and fatigue
- Wastes monitoring resources on non-existent targets
- Commenting out the target prevents scraping entirely
Instance Label Matching:
Prometheus uses explicit instance labels for alert targeting:
```yaml
# In prometheus.yml.j2 (generated)
- targets: ['192.168.194.199:9100']
  labels:
    instance: 'opnsense-dev'  # Explicit label, not derived from address
```
This allows alerts to use instance="opnsense-dev" regardless of the IP:port address.
Alert Rules¶
Instance Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| InstanceDown | Target unreachable for 2 minutes | Critical |
| GatewayDown | SNMP unreachable for 2 minutes | Critical |
Host Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| HighCpuUsage | CPU > 85% for 5 minutes | Warning |
| HighMemoryUsage | Memory > 85% for 5 minutes | Warning |
| DiskSpaceLow | Disk < 15% free for 5 minutes | Warning |
| DiskSpaceCritical | Disk < 5% free for 2 minutes | Critical |
| SystemdServiceFailed | Systemd unit in failed state | Warning |
| HighLoadAverage | Load > 2x CPU cores for 15 minutes | Warning |
ZeroTier Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| ZeroTierAPIDown | API unreachable for 5 minutes | Warning |
| ZeroTierMemberOffline | Authorized member offline for 5 minutes | Warning |
| ZeroTierMemberOfflineLong | Member offline for over 1 hour | Critical |
PowerDNS Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| PowerDNSDown | Server unreachable for 2 minutes | Critical |
| PowerDNSHighBackendLatency | Query latency > 100ms for 5 minutes | Warning |
| PowerDNSHighServfail | SERVFAIL rate > 1/sec for 5 minutes | Warning |
OPNsense Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| OPNsenseExporterDown | API unreachable for 2 minutes | Critical |
| OPNsenseHighStateCount | Firewall states > 50,000 for 5 minutes | Warning |
| OPNsenseGatewayDown | Gateway status check failed for 2 minutes | Critical |
| OPNsenseWireguardPeerOffline | WG peer handshake > 180s for 5 minutes | Warning |
Dev Environment Alerts (Cost Control)¶
| Alert | Condition | Severity |
|---|---|---|
| DevVMRunningTooLong | opnsense-dev uptime > 1 hour | Warning |
| DevVMRunningCriticallyLong | opnsense-dev uptime > 4 hours | Critical |
Purpose: Prevent runaway costs on ephemeral development VMs (~$0.14/hr).
Alert Expression:
```
# Warning after 1 hour
node_boot_time_seconds{instance="opnsense-dev"} > 0
  and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 3600

# Critical after 4 hours
node_boot_time_seconds{instance="opnsense-dev"} > 0
  and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 14400
```
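The threshold logic can be checked locally. This evaluates the same comparison the expressions make, with an invented boot time 90 minutes in the past:

```shell
now=$(date +%s)
boot=$((now - 5400))   # pretend node_boot_time_seconds was 90 minutes ago
up=$((now - boot))     # the time() - node_boot_time_seconds term
[ "$up" -gt 3600 ]  && echo "warning: would fire"
# -> warning: would fire
[ "$up" -gt 14400 ] && echo "critical: would fire" || echo "critical: not yet"
# -> critical: not yet
```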
Action Items:
- Warning: Consider tearing down with `./scripts/opnsense-dev/dev-down.sh`
- Critical: URGENT, terminate immediately to avoid excessive costs
Notification Channels¶
| Severity | Channels |
|---|---|
| Critical | Email + Pushover (high priority) + Slack |
| Warning | Email + Slack |
Deployment¶
Prerequisites¶
- Create Grafana database:
```
ssh dumbo
psql -h 127.0.0.1 -U joe -d postgres

CREATE DATABASE grafana;
CREATE USER grafana WITH ENCRYPTED PASSWORD '<from-1password>';
GRANT ALL PRIVILEGES ON DATABASE grafana TO grafana;
\c grafana
GRANT ALL ON SCHEMA public TO grafana;
```
- Create secrets in 1Password:
  - `gcp_postgres_grafana_password` (database password in scandora-automation)
  - `grafana_admin_password` (admin password in scandora-automation)
  - `snmp_community_monitoring` (OPNsense SNMP in scandora-automation)
  - `OPNsense API - Owl` (api key + api secret fields)
  - `OPNsense API - Blue` (api key + api secret fields)
  - `Monitoring - Slack Webhook` (optional)
  - `Monitoring - Pushover` (optional, with user key and token)
- Configure SNMP on OPNsense gateways (legacy, optional):
  - Services → Net-SNMP → Enable
  - Listen Interface: ZeroTier only
  - Community String: (from 1Password)
  - Firewall: Allow UDP 161 from 192.168.194.131
- Configure OPNsense API for monitoring:
The opnsense-exporter provides much richer metrics than SNMP. Enable the API on each gateway:
a. Create API user (Web UI):
- System → Access → Users → Add
- Username: monitoring
- Generate a scrambled password (not used for API)
- Save, then edit the user
- Under "API keys", click the + button to generate a key/secret pair
- Download/copy both the key and secret immediately
b. Set permissions (Web UI):
- System → Access → Groups → Add
- Group name: monitoring
- Add the monitoring user to this group
- Under Privileges, add:
- Diagnostics: ARP Table, Firewall Statistics, Netstat
- Services: Unbound (MVC)
- Status: DHCP Leases, DNS Overview, IPsec, OpenVPN, Services
- System: Firmware, Gateways, Settings (Cron), Status
- VPN: OpenVPN Instances, WireGuard
c. Store in 1Password:
```bash
op item create --category="API Credential" \
--title="OPNsense API - Owl" \
--vault="Private" \
"api key=<key-from-opnsense>" \
"api secret=<secret-from-opnsense>" \
"hostname=192.168.194.10"
```
d. Enable Extended Statistics (optional):
- Services → Unbound DNS → Advanced → Enable "Extended Statistics"
- This provides DNS query metrics in the exporter
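Before wiring up the exporter, the new key/secret pair can be smoke-tested against the gateway API. The firmware-status endpoint is a common read-only check; substitute real values for the placeholders:

```shell
# -k because gateways typically present self-signed certs on the ZeroTier interface
curl -s -k -u "<api-key>:<api-secret>" \
  https://192.168.194.10/api/core/firmware/status | head -c 200
```

A JSON body indicates the credentials and permissions are working; an HTML login page or 401 means they are not.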
- Install node_exporter on OPNsense gateways:
```bash
# SSH to gateway (e.g., Blue)
ssh 10.15.0.1

# Install the plugin
sudo pkg install -y os-node_exporter

# Enable and configure (bind to ZeroTier IP only)
sudo sysrc node_exporter_enable=YES
sudo sysrc node_exporter_listen_address="192.168.194.205:9100"  # Use gateway's ZT IP

# Start the service
sudo service node_exporter start

# Verify
curl http://192.168.194.205:9100/metrics | head -5
```
Repeat for Owl using ZeroTier IP 192.168.194.10.
Deploy ZeroTier Agent Metrics¶
To enable ZeroTier built-in metrics on a host, deploy both the zerotier and node-exporter tags:
```bash
cd cloud/ansible

# Deploy to all IaC hosts (dumbo, pluto, bogart, rocky)
ansible-playbook -i inventory/dumbo.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/pluto.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/bogart.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/rocky.yml playbooks/site.yml --tags zerotier,node-exporter
```
What happens:
- ZeroTier role deploys `local.conf` with `enableMetrics: true` and restarts ZeroTier (~2s blip)
- node-exporter role creates the textfile collector directory and symlinks `metrics.prom`
- node-exporter service is restarted with the `--collector.textfile.directory` flag
Verify:
```bash
# Check metrics file exists on host
ssh <host> ls -la /var/lib/zerotier-one/metrics.prom

# Check symlink
ssh <host> ls -la /var/lib/node_exporter/textfile_collector/zerotier.prom

# Check metrics appear in node_exporter
curl http://<zerotier_ip>:9100/metrics | grep zt_

# Check from Prometheus
curl -s 'http://192.168.194.131:9090/api/v1/query?query=zt_peer_latency_count' | python3 -m json.tool
```
Deploy with Helper Script¶
```bash
cd cloud/ansible

# Deploy node_exporter to all hosts
./scripts/run-monitoring.sh --prod all node-exporter

# Deploy full stack to Dumbo
./scripts/run-monitoring.sh --prod dumbo deploy

# Dry-run to see changes
./scripts/run-monitoring.sh --prod dumbo check
```
Deploy Manually¶
```bash
cd cloud/ansible

# Deploy node_exporter to a specific host
ansible-playbook -i inventory/pluto.yml playbooks/site.yml --tags node-exporter

# Deploy monitoring stack to Dumbo (run-monitoring.sh retrieves secrets automatically)
./scripts/run-monitoring.sh --prod dumbo deploy
```
Grafana Dashboards¶
Installed Dashboards¶
| Dashboard | URL | Purpose |
|---|---|---|
| Node Exporter Full | /d/rYdddlPWk | All hosts including OPNsense gateways |
| OPNsense Gateways | /d/opnsense-gw | Gateway traffic and interface stats |
| GCP Cost Estimates | /d/gcp-cost-estimates | GCP billing breakdown by service |
| ZeroTier Agent Metrics | /d/zerotier-agent | Peer latency, traffic, and health for ZT agent |
Importing Additional Dashboards¶
Import from Grafana.com by ID:
| Dashboard | ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive Linux host metrics |
| PowerDNS Authoritative | 14768 | DNS queries, cache, backend latency |
| OPNsense (AthennaMind) | 21113 | Firewall, WireGuard, services, traffic |
How to Import¶
- Go to Grafana → Dashboards → Import
- Enter the dashboard ID
- Select "Prometheus" as the data source
- Click Import
Troubleshooting¶
Check Container Status¶
```bash
ssh dumbo
cd /opt/monitoring
docker compose ps
docker compose logs prometheus
docker compose logs grafana
```
Check Prometheus Targets¶
Visit http://192.168.194.131:9090/targets or:
```bash
curl -s http://192.168.194.131:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'
```
Test node_exporter¶
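From Dumbo, each host's endpoint can be probed directly, using the ZeroTier IPs from the tables above (a plausible loop; adjust the list to match the current fleet):

```shell
for ip in 192.168.194.131 192.168.194.6 192.168.194.133 192.168.194.132 \
          192.168.194.205 192.168.194.10; do
  if curl -s -m 5 "http://${ip}:9100/metrics" > /dev/null; then
    echo "OK   ${ip}"
  else
    echo "FAIL ${ip}"
  fi
done
```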
Test SNMP¶
```bash
# From Dumbo
snmpwalk -v2c -c <community> 192.168.194.205 sysDescr
snmpwalk -v2c -c <community> 192.168.194.10 sysDescr
```
Reload Configuration¶
Prometheus and AlertManager support hot reload:
```bash
# Prometheus
curl -X POST http://192.168.194.131:9090/-/reload

# AlertManager
curl -X POST http://192.168.194.131:9093/-/reload
```
Common Issues¶
"Target is down" in Prometheus:
- Check if node_exporter is running: `ssh <host> systemctl status node-exporter`
- Check ZeroTier connectivity: `ping 192.168.194.x`
- Verify firewall allows port 9100 from Dumbo
Grafana database connection failed:
- Verify Cloud SQL proxy is running: `ssh dumbo systemctl status cloud-sql-proxy`
- Check database exists: `psql -h 127.0.0.1 -U joe -d postgres -c "\l"`
- Verify grafana user permissions
AlertManager not sending notifications:
- Check AlertManager logs: `docker logs alertmanager`
- Verify webhook URLs are correct
- Test with a manual alert in Prometheus
Target reachable but scrape times out (MTU issue):
Symptoms: Small responses work (e.g., `curl http://host:9100/` returns OK), but large responses time out (e.g., `/metrics` hangs). TCP handshake succeeds but data packets are dropped.
This indicates an MTU fragmentation issue, especially over IPv6 paths between GCP projects.
- Test with small vs large response:
```bash
# Small response - works
curl -m 5 http://192.168.194.133:9100/

# Large response - times out
curl -m 5 http://192.168.194.133:9100/metrics
```
- Check effective MTU on the host: the scandora.net ZeroTier network enforces MTU 1320 network-wide (see Network Configuration below). If a host still has issues, verify ZeroTier has applied the policy.
Cloud SQL Proxy port conflict with Prometheus:
If Prometheus fails to start on Dumbo with "port 9090 already in use", the Cloud SQL proxy health check is conflicting; the proxy is configured to use port 9091 for health checks instead. If you see this issue, redeploy the `cloudsql-client` role.
Network Configuration¶
ZeroTier MTU Policy¶
The scandora.net ZeroTier network (6ab565387a4b9177) enforces MTU 1320 network-wide.
Why 1320?
- Blue site uses Starlink, which has lower MTU (~1420-1480 bytes)
- ZeroTier adds ~32 bytes of encapsulation overhead
- Combined with Starlink's MTU, the effective path MTU is ~1370 bytes
- MTU 1320 provides headroom for all paths including satellite links
Configuration: The MTU is set via ZeroTier Central API as a network-wide policy. Individual hosts do not need local configuration—they receive the MTU setting when they join the network.
To verify current MTU on a host:
```bash
zerotier-cli listnetworks
# Look for the MTU column

# Or check the interface directly (ZeroTier interfaces are named zt*;
# the matching line includes "mtu 1320")
ip link show | grep zt
```
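The path MTU itself can also be probed end-to-end with don't-fragment pings (Linux `ping -M do`; the 28 bytes are the ICMP + IPv4 headers), which is how the symptom under Troubleshooting shows up:

```shell
# 1292-byte payload + 28 header bytes = 1320 on the wire; should succeed
ping -M do -s 1292 -c 3 192.168.194.133

# 1400 + 28 = 1428; should fail with "message too long" if MTU 1320 is enforced
ping -M do -s 1400 -c 3 192.168.194.133
```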
History: This was discovered when monitoring Bogart (coop-389306) from Dumbo (scandoraproject). Small HTTP responses worked but large metrics payloads timed out due to TCP packets exceeding the path MTU being silently dropped.
Security Considerations¶
| Component | Security Measure |
|---|---|
| node_exporter | Binds to ZeroTier IP only (not 0.0.0.0) |
| Grafana | Admin password required, no public signup |
| SNMP | Community string from 1Password, ZeroTier interface only |
| Prometheus/AlertManager | Listen on ZeroTier network only |
| Secrets | Passed via extra-vars, never committed |
Data Retention¶
- Prometheus: 30 days or 1GB (whichever comes first)
- Grafana: PostgreSQL backend on Cloud SQL (backed up)
- AlertManager: In-memory only (silences are not persisted)
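The 30-day/1GB policy maps onto Prometheus's standard retention flags; a sketch of the relevant server arguments (the actual docker-compose command line may differ):

```shell
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=1GB
```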