
Monitoring Stack

The scandora.net infrastructure is monitored using a Prometheus-based stack running on Dumbo.

Architecture

                    ┌─────────────────────────────────────────────────────┐
                    │                    Dumbo (GCE)                       │
                    │                 192.168.194.131                      │
                    │  ┌─────────────────────────────────────────────────┐│
                    │  │              Docker Compose Stack               ││
                    │  │  ┌───────────┐ ┌─────────────┐ ┌─────────────┐ ││
                    │  │  │Prometheus │ │AlertManager │ │SNMP Exporter│ ││
                    │  │  │  :9090    │ │   :9093     │ │   :9116     │ ││
                    │  │  └─────┬─────┘ └──────┬──────┘ └──────┬──────┘ ││
                    │  │        │              │               │        ││
                    │  │  ┌─────▼──────────────▼───────────────▼──────┐ ││
                    │  │  │               Grafana :3000               │ ││
                    │  │  └──────────────────┬────────────────────────┘ ││
                    │  └────────────────────┬┼────────────────────────┘ │
                    │                       ││                          │
                    │              Cloud SQL Proxy :5432                │
                    │                       │└─► scandora-postgres      │
                    └───────────────────────┼───────────────────────────┘
          ┌────────────ZeroTier (192.168.194.0/24)────────────┐
          │                         │                          │
    ┌─────┴─────┐  ┌──────────┐  ┌──┴───────┐  ┌──────────┐  ┌┴─────────┐
    │   Pluto   │  │  Bogart  │  │  Rocky   │  │   Blue   │  │   Owl    │
    │   .6      │  │   .133   │  │   .132   │  │   .205   │  │   .10    │
    │node_exp   │  │node_exp  │  │node_exp  │  │  SNMP    │  │  SNMP    │
    │  :9100    │  │  :9100   │  │  :9100   │  │  :161    │  │  :161    │
    └───────────┘  └──────────┘  └──────────┘  └────┬─────┘  └─────┬────┘
       AWS            GCE        Meanservers     OPNsense      OPNsense
                                                    │              │
                                           Blue LAN (10.15.x.x)   │
                                                    │     Owl LAN (10.7.x.x)
                                             ┌──────┴──────┐       │
                                             │    Rpios    │┌──────┴──────┐
                                             │ 10.15.1.50  ││   Triton    │
                                             │  node_exp   ││  10.7.1.20  │
                                             │   :9100     ││  node_exp   │
                                             └─────────────┘│   :9100     │
                                              Raspberry Pi  └─────────────┘
                                                             Raspberry Pi

Components

| Component | Purpose | Port | URL |
|---|---|---|---|
| Prometheus | Metrics collection & storage | 9090 | http://192.168.194.131:9090 |
| Grafana | Dashboards & visualization | 3000 | http://192.168.194.131:3000 |
| AlertManager | Alert routing & notifications | 9093 | http://192.168.194.131:9093 |
| SNMP Exporter | OPNsense interface stats (legacy) | 9116 | http://192.168.194.131:9116 |
| OPNsense Exporter | Rich OPNsense metrics via API | 8080 | Per-gateway containers |
| ZeroTier Exporter | ZeroTier network member status | 9811 | http://192.168.194.131:9811 |
| node_exporter | Linux host metrics | 9100 | On each host's ZeroTier IP |

ZeroTier Metrics (Dual Approach)

ZeroTier is monitored via two complementary methods:

| Method | What it shows | How it works |
|---|---|---|
| Central API exporter (port 9811) | Network-wide membership status (who's online/offline) | Polls ZeroTier Central API from Dumbo |
| Built-in agent metrics (via node_exporter) | Per-node connectivity health (packet counts, peer latency, path status) | ZeroTier writes metrics.prom; node_exporter textfile collector picks it up |

Built-in agent metrics are enabled on all IaC-managed Linux hosts (dumbo, pluto, bogart, rocky). They are not available on OPNsense gateways (different OS/plugin) or LAN-only hosts without ZeroTier.

How it works:

  1. ZeroTier's local.conf has enableMetrics: true, which causes it to write /var/lib/zerotier-one/metrics.prom
  2. A symlink connects this to node_exporter's textfile collector directory: /var/lib/node_exporter/textfile_collector/zerotier.prom
  3. node_exporter serves the zt_* metrics alongside standard node_* metrics on :9100
  4. No new Prometheus scrape jobs needed — metrics appear in the existing node job
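The first step above corresponds to a one-line setting in ZeroTier's local.conf. A minimal sketch, assuming the key lives under the top-level settings object (exact placement may vary by ZeroTier version):

```json
{
  "settings": {
    "enableMetrics": true
  }
}
```

After restarting the zerotier-one service, the daemon writes /var/lib/zerotier-one/metrics.prom, which the symlink in step 2 exposes to the textfile collector.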

Key zt_* metrics:

| Metric | Type | Description |
|---|---|---|
| zt_packet_incoming_count | Counter | Packets received by the ZeroTier engine |
| zt_packet_outgoing_count | Counter | Packets sent by the ZeroTier engine |
| zt_packet_error_count | Counter | Packet errors |
| zt_peer_latency_count | Counter | Peer latency measurement count |
| zt_peer_latency_sum | Counter | Peer latency sum (for computing averages) |
| zt_peer_path_count | Gauge | Number of active paths to each peer |

Querying agent metrics in Prometheus:

# Average peer latency over 5 minutes
rate(zt_peer_latency_sum[5m]) / rate(zt_peer_latency_count[5m])

# Packet error rate
rate(zt_packet_error_count[5m])

Application Exporters

| Service | Exporter | Endpoint | Metrics |
|---|---|---|---|
| PowerDNS | Native (built-in) | http://192.168.194.133:8081/metrics | DNS queries, cache, latency |
| OPNsense (Owl) | opnsense-exporter | http://dumbo:8080/metrics | Firewall, WireGuard, services |
| OPNsense (Blue) | opnsense-exporter | http://dumbo:8080/metrics | Firewall, WireGuard, services |

Access

All monitoring services are bound to ZeroTier IPs only—not accessible from the public internet.
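For node_exporter, this binding is done with its --web.listen-address flag. A sketch of the relevant systemd unit fragment (unit name, binary path, and the example IP are assumptions; each host uses its own ZeroTier IP):

```ini
# /etc/systemd/system/node-exporter.service (fragment, illustrative)
[Service]
ExecStart=/usr/local/bin/node_exporter \
    --web.listen-address=192.168.194.6:9100 \
    --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
```

Binding to the ZeroTier IP rather than 0.0.0.0 means the metrics port is simply not reachable from any other interface, so no additional firewalling is needed for public interfaces.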

Grafana:

URL: http://192.168.194.131:3000
Username: admin
Password: (from 1Password: "Monitoring - Grafana Admin")

Monitored Hosts

Linux Hosts (via node_exporter)

| Host | ZeroTier IP | Metrics Endpoint |
|---|---|---|
| Dumbo | 192.168.194.131 | http://192.168.194.131:9100/metrics |
| Pluto | 192.168.194.6 | http://192.168.194.6:9100/metrics |
| Bogart | 192.168.194.133 | http://192.168.194.133:9100/metrics |
| Rocky | 192.168.194.132 | http://192.168.194.132:9100/metrics |

OPNsense Gateways (via node_exporter + SNMP)

| Gateway | ZeroTier IP | node_exporter | SNMP |
|---|---|---|---|
| Blue | 192.168.194.205 | :9100 ✓ | UDP 161 ✓ |
| Owl | 192.168.194.10 | :9100 ✓ | UDP 161 ✓ |

Note: OPNsense gateways have both node_exporter (full system metrics) and SNMP (interface stats). node_exporter provides richer data and is preferred for dashboards.

LAN Hosts (via gateway routing)

These hosts don't have ZeroTier directly installed. Prometheus reaches them via gateway routing through the ZeroTier overlay.

| Host | LAN IP | Gateway | Metrics Endpoint | Notes |
|---|---|---|---|---|
| Triton | 10.7.1.20 | Owl | http://10.7.1.20:9100/metrics | Raspberry Pi 5, manual node_exporter install |
| Rpios | 10.15.1.50 | Blue | http://10.15.1.50:9100/metrics | Raspberry Pi, manual node_exporter install |

Routing paths:

  • Triton: Dumbo → ZeroTier → Owl (192.168.194.10) → Owl LAN (10.7.x.x) → Triton
  • Rpios: Dumbo → ZeroTier → Blue (192.168.194.205) → Blue LAN (10.15.x.x) → Rpios

Note: These LAN hosts are not under Ansible IaC management. node_exporter was installed manually following the same patterns as Ansible-managed hosts (v1.8.2, systemd service, binds to specific IP).
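From Prometheus's point of view, the LAN hosts are just additional static targets reached over the routed path. A sketch of what the generated scrape config might contain (job name and instance labels are assumptions):

```yaml
# prometheus.yml (generated fragment, illustrative)
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['10.7.1.20:9100']   # Triton, via Owl
        labels:
          instance: 'triton'
      - targets: ['10.15.1.50:9100']  # Rpios, via Blue
        labels:
          instance: 'rpios'
```

No special scrape settings are needed as long as the ZeroTier-to-LAN routes exist on Dumbo and the gateways permit the traffic.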

Ephemeral Dev Environments (Cost Control)

Ephemeral development VMs like opnsense-dev have special monitoring requirements to prevent runaway costs.

| Host | ZeroTier IP | Metrics Endpoint | Cost | Status |
|---|---|---|---|---|
| opnsense-dev | 192.168.194.199 | http://192.168.194.199:9100/metrics | ~$0.14/hr | Commented out when not running |

Monitoring Pattern:

Ephemeral instances follow the standard cloud instance monitoring policy but with special alert rules for cost control:

  1. When provisioning: Uncomment target in cloud/ansible/roles/monitoring-stack/defaults/main.yml
  2. When running: Alerts fire to prevent extended runtime costs
  3. When torn down: Comment out target to avoid false "InstanceDown" alerts

Cost Control Alerts:

| Alert | Threshold | Severity | Action |
|---|---|---|---|
| DevVMRunningTooLong | 1 hour uptime | Warning | Consider tearing down with dev-down.sh |
| DevVMRunningCriticallyLong | 4 hours uptime | Critical | URGENT: Run dev-down.sh immediately |

Configuration Location:

Target: cloud/ansible/roles/monitoring-stack/defaults/main.yml

# Ephemeral dev environments (via ZeroTier)
# Uncomment when dev VM is running to enable monitoring
# - name: opnsense-dev
#   address: "192.168.194.199:9100"

Alerts: cloud/ansible/roles/monitoring-stack/templates/alert-rules.yml.j2

- name: dev_environment_alerts
  rules:
    - alert: DevVMRunningTooLong
      expr: node_boot_time_seconds{instance="opnsense-dev"} > 0 and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 3600
      ...

Workflow:

# 1. Provision dev VM
./scripts/opnsense-dev/dev-up.sh

# 2. Uncomment opnsense-dev target in defaults/main.yml
cd cloud/ansible
vim roles/monitoring-stack/defaults/main.yml

# 3. Deploy monitoring configuration
./scripts/run-monitoring.sh --prod dumbo deploy

# 4. Work with dev VM (alerts fire after 1hr/4hr)

# 5. Tear down dev VM when done
./scripts/opnsense-dev/dev-down.sh

# 6. Comment out opnsense-dev target in defaults/main.yml
# 7. Redeploy monitoring to stop false "InstanceDown" alerts
./scripts/run-monitoring.sh --prod dumbo deploy

Why Comment Out When Not Running?

When a target is configured in Prometheus but the host is unreachable:

  • InstanceDown alerts fire immediately (critical severity)
  • Creates alert noise and fatigue
  • Wastes monitoring resources on non-existent targets
  • Commenting out the target prevents scraping entirely

Instance Label Matching:

Prometheus uses explicit instance labels for alert targeting:

# In prometheus.yml.j2 (generated)
- targets: ['192.168.194.199:9100']
  labels:
    instance: 'opnsense-dev'  # Explicit label, not derived from address

This allows alerts to use instance="opnsense-dev" regardless of the IP:port address.
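For example, a liveness check can key off the label rather than the address (sketch):

```promql
# True only while the dev VM target is configured and unreachable
up{instance="opnsense-dev"} == 0
```

If the VM is later recreated with a different ZeroTier IP, the alert rules keep working unchanged.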

Alert Rules

Instance Alerts

| Alert | Condition | Severity |
|---|---|---|
| InstanceDown | Target unreachable for 2 minutes | Critical |
| GatewayDown | SNMP unreachable for 2 minutes | Critical |

Host Alerts

| Alert | Condition | Severity |
|---|---|---|
| HighCpuUsage | CPU > 85% for 5 minutes | Warning |
| HighMemoryUsage | Memory > 85% for 5 minutes | Warning |
| DiskSpaceLow | Disk < 15% free for 5 minutes | Warning |
| DiskSpaceCritical | Disk < 5% free for 2 minutes | Critical |
| SystemdServiceFailed | Systemd unit in failed state | Warning |
| HighLoadAverage | Load > 2x CPU cores for 15 minutes | Warning |
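As a concrete illustration, the CPU and memory thresholds above are typically expressed against standard node_exporter metrics like this (a sketch; the deployed rules may differ in detail):

```promql
# HighCpuUsage: average non-idle CPU > 85% over 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85

# HighMemoryUsage: memory in use > 85%
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
```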

ZeroTier Alerts

| Alert | Condition | Severity |
|---|---|---|
| ZeroTierAPIDown | API unreachable for 5 minutes | Warning |
| ZeroTierMemberOffline | Authorized member offline for 5 minutes | Warning |
| ZeroTierMemberOfflineLong | Member offline for over 1 hour | Critical |

PowerDNS Alerts

| Alert | Condition | Severity |
|---|---|---|
| PowerDNSDown | Server unreachable for 2 minutes | Critical |
| PowerDNSHighBackendLatency | Query latency > 100ms for 5 minutes | Warning |
| PowerDNSHighServfail | SERVFAIL rate > 1/sec for 5 minutes | Warning |

OPNsense Alerts

| Alert | Condition | Severity |
|---|---|---|
| OPNsenseExporterDown | API unreachable for 2 minutes | Critical |
| OPNsenseHighStateCount | Firewall states > 50,000 for 5 minutes | Warning |
| OPNsenseGatewayDown | Gateway status check failed for 2 minutes | Critical |
| OPNsenseWireguardPeerOffline | WG peer handshake > 180s for 5 minutes | Warning |

Dev Environment Alerts (Cost Control)

| Alert | Condition | Severity |
|---|---|---|
| DevVMRunningTooLong | opnsense-dev uptime > 1 hour | Warning |
| DevVMRunningCriticallyLong | opnsense-dev uptime > 4 hours | Critical |

Purpose: Prevent runaway costs on ephemeral development VMs (~$0.14/hr).

Alert Expression:

# Warning after 1 hour
node_boot_time_seconds{instance="opnsense-dev"} > 0
  and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 3600

# Critical after 4 hours
node_boot_time_seconds{instance="opnsense-dev"} > 0
  and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 14400

Action Items:

  • Warning: Consider tearing down with ./scripts/opnsense-dev/dev-down.sh
  • Critical: URGENT - terminate immediately to avoid excessive costs

Notification Channels

| Severity | Channels |
|---|---|
| Critical | Email + Pushover (high priority) + Slack |
| Warning | Email + Slack |
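A routing tree implementing this severity split might look like the following AlertManager sketch (receiver names are assumptions; the real config lives in the monitoring-stack role templates):

```yaml
# alertmanager.yml (fragment, illustrative)
route:
  receiver: email-slack            # default: Warning and anything unmatched
  routes:
    - matchers:
        - severity = critical
      receiver: email-pushover-slack
```

Critical alerts match the first child route and fan out to all three channels; everything else falls through to the default receiver.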

Deployment

Prerequisites

  1. Create Grafana database:
ssh dumbo
psql -h 127.0.0.1 -U joe -d postgres
CREATE DATABASE grafana;
CREATE USER grafana WITH ENCRYPTED PASSWORD '<from-1password>';
GRANT ALL PRIVILEGES ON DATABASE grafana TO grafana;
\c grafana
GRANT ALL ON SCHEMA public TO grafana;
  2. Create secrets in 1Password:

       • gcp_postgres_grafana_password (database password in scandora-automation)
       • grafana_admin_password (admin password in scandora-automation)
       • snmp_community_monitoring (OPNsense SNMP in scandora-automation)
       • OPNsense API - Owl (api key + api secret fields)
       • OPNsense API - Blue (api key + api secret fields)
       • Monitoring - Slack Webhook (optional)
       • Monitoring - Pushover (optional, with user key and token)

  3. Configure SNMP on OPNsense gateways (legacy, optional):

       • Services → Net-SNMP → Enable
       • Listen Interface: ZeroTier only
       • Community String: (from 1Password)
       • Firewall: Allow UDP 161 from 192.168.194.131

  4. Configure OPNsense API for monitoring:

The opnsense-exporter provides much richer metrics than SNMP. Enable the API on each gateway:

a. Create API user (Web UI):

  • System → Access → Users → Add
  • Username: monitoring
  • Generate a scrambled password (not used for API)
  • Save, then edit the user
  • Under "API keys", click the + button to generate a key/secret pair
  • Download/copy both the key and secret immediately

b. Set permissions (Web UI):

  • System → Access → Groups → Add
  • Group name: monitoring
  • Add the monitoring user to this group
  • Under Privileges, add:
      • Diagnostics: ARP Table, Firewall Statistics, Netstat
      • Services: Unbound (MVC)
      • Status: DHCP Leases, DNS Overview, IPsec, OpenVPN, Services
      • System: Firmware, Gateways, Settings (Cron), Status
      • VPN: OpenVPN Instances, WireGuard

c. Store in 1Password:

  ```bash
  op item create --category="API Credential" \
    --title="OPNsense API - Owl" \
    --vault="Private" \
    "api key=<key-from-opnsense>" \
    "api secret=<secret-from-opnsense>" \
    "hostname=192.168.194.10"
  ```

d. Enable Extended Statistics (optional):

  • Services → Unbound DNS → Advanced → Enable "Extended Statistics"
  • This provides DNS query metrics in the exporter

  5. Install node_exporter on OPNsense gateways:
# SSH to gateway (e.g., Blue)
ssh 10.15.0.1

# Install the plugin
sudo pkg install -y os-node_exporter

# Enable and configure (bind to ZeroTier IP only)
sudo sysrc node_exporter_enable=YES
sudo sysrc node_exporter_listen_address="192.168.194.205:9100"  # Use gateway's ZT IP

# Start the service
sudo service node_exporter start

# Verify
curl http://192.168.194.205:9100/metrics | head -5

Repeat for Owl using ZeroTier IP 192.168.194.10.

Deploy ZeroTier Agent Metrics

To enable ZeroTier built-in metrics on a host, deploy both the zerotier and node-exporter tags:

cd cloud/ansible

# Deploy to all IaC hosts (dumbo, pluto, bogart, rocky)
ansible-playbook -i inventory/dumbo.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/pluto.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/bogart.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/rocky.yml playbooks/site.yml --tags zerotier,node-exporter

What happens:

  1. ZeroTier role deploys local.conf with enableMetrics: true and restarts ZeroTier (~2s blip)
  2. node-exporter role creates textfile collector directory and symlinks metrics.prom
  3. node-exporter service is restarted with --collector.textfile.directory flag

Verify:

# Check metrics file exists on host
ssh <host> ls -la /var/lib/zerotier-one/metrics.prom

# Check symlink
ssh <host> ls -la /var/lib/node_exporter/textfile_collector/zerotier.prom

# Check metrics appear in node_exporter
curl http://<zerotier_ip>:9100/metrics | grep zt_

# Check from Prometheus
curl -s 'http://192.168.194.131:9090/api/v1/query?query=zt_peer_latency_count' | python3 -m json.tool

Deploy with Helper Script

cd cloud/ansible

# Deploy node_exporter to all hosts
./scripts/run-monitoring.sh --prod all node-exporter

# Deploy full stack to Dumbo
./scripts/run-monitoring.sh --prod dumbo deploy

# Dry-run to see changes
./scripts/run-monitoring.sh --prod dumbo check

Deploy Manually

cd cloud/ansible

# Deploy node_exporter to a specific host
ansible-playbook -i inventory/pluto.yml playbooks/site.yml --tags node-exporter

# Deploy monitoring stack to Dumbo (uses run-monitoring.sh which retrieves secrets automatically)
cd cloud/ansible
./scripts/run-monitoring.sh --prod dumbo deploy

Grafana Dashboards

Installed Dashboards

| Dashboard | URL | Purpose |
|---|---|---|
| Node Exporter Full | /d/rYdddlPWk | All hosts including OPNsense gateways |
| OPNsense Gateways | /d/opnsense-gw | Gateway traffic and interface stats |
| GCP Cost Estimates | /d/gcp-cost-estimates | GCP billing breakdown by service |
| ZeroTier Agent Metrics | /d/zerotier-agent | Peer latency, traffic, and health for ZT agent |


Importing Additional Dashboards

Import from Grafana.com by ID:

| Dashboard | ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive Linux host metrics |
| PowerDNS Authoritative | 14768 | DNS queries, cache, backend latency |
| OPNsense (AthennaMind) | 21113 | Firewall, WireGuard, services, traffic |

How to Import

  1. Go to Grafana → Dashboards → Import
  2. Enter the dashboard ID
  3. Select "Prometheus" as the data source
  4. Click Import

Troubleshooting

Check Container Status

ssh dumbo
cd /opt/monitoring
docker compose ps
docker compose logs prometheus
docker compose logs grafana

Check Prometheus Targets

Visit http://192.168.194.131:9090/targets or:

curl -s http://192.168.194.131:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'

Test node_exporter

# From any ZeroTier-connected host
curl http://192.168.194.131:9100/metrics | head -20

Test SNMP

# From Dumbo
snmpwalk -v2c -c <community> 192.168.194.205 sysDescr
snmpwalk -v2c -c <community> 192.168.194.10 sysDescr

Reload Configuration

Prometheus and AlertManager support hot reload:

# Prometheus
curl -X POST http://192.168.194.131:9090/-/reload

# AlertManager
curl -X POST http://192.168.194.131:9093/-/reload

Common Issues

"Target is down" in Prometheus:

  1. Check if node_exporter is running: ssh <host> systemctl status node-exporter
  2. Check ZeroTier connectivity: ping 192.168.194.x
  3. Verify firewall allows port 9100 from Dumbo

Grafana database connection failed:

  1. Verify Cloud SQL proxy is running: ssh dumbo systemctl status cloud-sql-proxy
  2. Check database exists: psql -h 127.0.0.1 -U joe -d postgres -c "\\l"
  3. Verify grafana user permissions

AlertManager not sending notifications:

  1. Check AlertManager logs: docker logs alertmanager
  2. Verify webhook URLs are correct
  3. Test with a manual alert in Prometheus

Target reachable but scrape times out (MTU issue):

Symptoms: Small responses work (e.g., curl http://host:9100/ returns OK), but large responses time out (e.g., /metrics hangs). TCP handshake succeeds but data packets are dropped.

This indicates an MTU fragmentation issue, especially over IPv6 paths between GCP projects.

  1. Test with small vs large response:
# Small response - works
curl -m 5 http://192.168.194.133:9100/

# Large response - times out
curl -m 5 http://192.168.194.133:9100/metrics
  2. Check effective MTU on the host:
ip link show zt+ | grep mtu
  3. The scandora.net ZeroTier network enforces MTU 1320 network-wide (see Network Configuration below). If a host still has issues, verify ZeroTier has applied the policy.

Cloud SQL Proxy port conflict with Prometheus:

If Prometheus fails to start on Dumbo with "port 9090 already in use", the Cloud SQL proxy health check is conflicting. The proxy is configured to use port 9091 for health checks instead:

# In inventory/dumbo.yml
cloudsql_proxy_health_port: 9091

If you see this issue, redeploy the cloudsql-client role:

ansible-playbook -i inventory/dumbo.yml playbooks/site.yml --tags cloudsql

Network Configuration

ZeroTier MTU Policy

The scandora.net ZeroTier network (6ab565387a4b9177) enforces MTU 1320 network-wide.

Why 1320?

  • Blue site uses Starlink, which has a lower path MTU (~1420-1480 bytes)
  • ZeroTier adds roughly 32 bytes of encapsulation overhead
  • Combined with Starlink's lower MTU, the worst-case usable payload is roughly 1370-1390 bytes
  • MTU 1320 provides comfortable headroom for all paths, including satellite links

Configuration: The MTU is set via ZeroTier Central API as a network-wide policy. Individual hosts do not need local configuration—they receive the MTU setting when they join the network.

To verify current MTU on a host:

zerotier-cli listnetworks
# Look for the MTU column

# Or check interface directly
ip link show zt+ | grep mtu

History: This was discovered when monitoring Bogart (coop-389306) from Dumbo (scandoraproject). Small HTTP responses worked but large metrics payloads timed out due to TCP packets exceeding the path MTU being silently dropped.

Security Considerations

| Component | Security Measure |
|---|---|
| node_exporter | Binds to ZeroTier IP only (not 0.0.0.0) |
| Grafana | Admin password required, no public signup |
| SNMP | Community string from 1Password, ZeroTier interface only |
| Prometheus/AlertManager | Listen on ZeroTier network only |
| Secrets | Passed via extra-vars, never committed |

Data Retention

  • Prometheus: 30 days or 1GB (whichever comes first)
  • Grafana: PostgreSQL backend on Cloud SQL (backed up)
  • AlertManager: In-memory only (silences are not persisted)
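The Prometheus retention policy maps directly to two startup flags; a docker-compose sketch (the service layout shown is an assumption):

```yaml
# docker-compose.yml (fragment, illustrative)
services:
  prometheus:
    image: prom/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=1GB'
```

Whichever limit is hit first wins: Prometheus deletes the oldest TSDB blocks once either the time or size bound is exceeded.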