Monitoring Stack¶
The scandora.net infrastructure is monitored using a Prometheus-based stack running on Dumbo.
Architecture¶
```
┌─────────────────────────────────────────────────────┐
│ Dumbo (GCE) │
│ 192.168.194.131 │
│ ┌─────────────────────────────────────────────────┐│
│ │ Docker Compose Stack ││
│ │ ┌───────────┐ ┌─────────────┐ ┌─────────────┐ ││
│ │ │Prometheus │ │AlertManager │ │SNMP Exporter│ ││
│ │ │ :9090 │ │ :9093 │ │ :9116 │ ││
│ │ └─────┬─────┘ └──────┬──────┘ └──────┬──────┘ ││
│ │ │ │ │ ││
│ │ ┌─────▼──────────────▼───────────────▼──────┐ ││
│ │ │ Grafana :3000 │ ││
│ │ └──────────────────┬────────────────────────┘ ││
│ └────────────────────┬┼────────────────────────┘ │
│ ││ │
│ Cloud SQL Proxy :5432 │
│ │└─► scandora-postgres │
└───────────────────────┼───────────────────────────┘
│
┌────────────ZeroTier (192.168.194.0/24)────────────┐
│ │ │
┌─────┴─────┐ ┌──────────┐ ┌──┴───────┐ ┌──────────┐ ┌┴─────────┐
│ Pluto │ │ Bogart │ │ Rocky │ │ Blue │ │ Owl │
│ .6 │ │ .133 │ │ .132 │ │ .205 │ │ .10 │
│node_exp │ │node_exp │ │node_exp │ │ SNMP │ │ SNMP │
│ :9100 │ │ :9100 │ │ :9100 │ │ :161 │ │ :161 │
└───────────┘ └──────────┘ └──────────┘ └────┬─────┘ └─────┬────┘
AWS GCE Meanservers OPNsense OPNsense
│ │
Blue LAN (10.15.x.x) │
│ Owl LAN (10.7.x.x)
┌──────┴──────┐ │
│ Rpios │┌──────┴──────┐
│ 10.15.1.50 ││ Triton │
│ node_exp ││ 10.7.1.20 │
│ :9100 ││ node_exp │
└─────────────┘│ :9100 │
Raspberry Pi └─────────────┘
Raspberry Pi
```
Components¶
| Component | Purpose | Port | URL |
|---|---|---|---|
| Prometheus | Metrics collection & storage | 9090 | http://192.168.194.131:9090 |
| Grafana | Dashboards & visualization | 3000 | http://192.168.194.131:3000 |
| AlertManager | Alert routing & notifications | 9093 | http://192.168.194.131:9093 |
| SNMP Exporter | OPNsense interface stats (legacy) | 9116 | http://192.168.194.131:9116 |
| OPNsense Exporter | Rich OPNsense metrics via API | 8080 | Per-gateway containers |
| ZeroTier Exporter | ZeroTier network member status | 9811 | http://192.168.194.131:9811 |
| node_exporter | Linux host metrics | 9100 | On each host's ZeroTier IP |
ZeroTier Metrics (Dual Approach)¶
ZeroTier is monitored via two complementary methods:
| Method | What it shows | How it works |
|---|---|---|
| Central API exporter (port 9811) | Network-wide membership status (who's online/offline) | Polls ZeroTier Central API from Dumbo |
| Built-in agent metrics (via node_exporter) | Per-node connectivity health (packet counts, peer latency, path status) | ZeroTier writes metrics.prom; node_exporter textfile collector picks it up |
Built-in agent metrics are enabled on all IaC-managed Linux hosts (dumbo, pluto, bogart, rocky). They are not available on OPNsense gateways (different OS/plugin) or LAN-only hosts without ZeroTier.
How it works:
- ZeroTier's `local.conf` has `enableMetrics: true`, which causes it to write `/var/lib/zerotier-one/metrics.prom`
- A symlink connects this to node_exporter's textfile collector directory: `/var/lib/node_exporter/textfile_collector/zerotier.prom`
- node_exporter serves the `zt_*` metrics alongside standard `node_*` metrics on `:9100`
- No new Prometheus scrape jobs needed — metrics appear in the existing `node` job
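The plumbing above can be sketched with plain commands. This simulates it in a scratch directory so it is safe to run anywhere; the real paths (`/var/lib/zerotier-one/metrics.prom`, `/var/lib/node_exporter/textfile_collector/`) are noted in comments, and the metric line is an invented sample:

```shell
# Simulate the symlink plumbing in a scratch directory (real paths in comments)
tmp=$(mktemp -d)
mkdir -p "$tmp/zerotier-one" "$tmp/textfile_collector"
# ZeroTier writes metrics.prom when enableMetrics is true (sample line, made up)
echo 'zt_peer_path_count{peer="abc123"} 2' > "$tmp/zerotier-one/metrics.prom"
# The role creates the equivalent of /var/lib/node_exporter/textfile_collector/zerotier.prom
ln -sf "$tmp/zerotier-one/metrics.prom" "$tmp/textfile_collector/zerotier.prom"
# node_exporter reads through the symlink via --collector.textfile.directory
cat "$tmp/textfile_collector/zerotier.prom"
# -> zt_peer_path_count{peer="abc123"} 2
```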
Key `zt_*` metrics:
| Metric | Type | Description |
|---|---|---|
| `zt_packet_incoming_count` | Counter | Packets received by the ZeroTier engine |
| `zt_packet_outgoing_count` | Counter | Packets sent by the ZeroTier engine |
| `zt_packet_error_count` | Counter | Packet errors |
| `zt_peer_latency_count` | Counter | Peer latency measurement count |
| `zt_peer_latency_sum` | Counter | Peer latency sum (for computing averages) |
| `zt_peer_path_count` | Gauge | Number of active paths to each peer |
Querying agent metrics in Prometheus:
```
# Average peer latency over 5 minutes
rate(zt_peer_latency_sum[5m]) / rate(zt_peer_latency_count[5m])

# Packet error rate
rate(zt_packet_error_count[5m])
```
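As a sanity check on the rate-over-rate pattern, the same ratio can be computed by hand from two counter samples taken 60 s apart (all numbers invented):

```shell
# Two scrapes of the latency counters, 60s apart (hypothetical values)
awk 'BEGIN {
  sum0 = 120000; sum1 = 126000;   # zt_peer_latency_sum (cumulative ms)
  cnt0 = 4000;   cnt1 = 4100;     # zt_peer_latency_count (cumulative samples)
  w = 60;                          # window in seconds
  # rate(sum)/rate(cnt): the per-second windows cancel, leaving avg ms per sample
  printf "%.1f\n", ((sum1 - sum0) / w) / ((cnt1 - cnt0) / w)
}'
# -> 60.0 (ms average peer latency over the window)
```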
Application Exporters¶
| Service | Exporter | Endpoint | Metrics |
|---|---|---|---|
| PowerDNS | Native (built-in) | http://192.168.194.133:8081/metrics | DNS queries, cache, latency |
| OPNsense (Owl) | opnsense-exporter | http://dumbo:8080/metrics | Firewall, WireGuard, services |
| OPNsense (Blue) | opnsense-exporter | http://dumbo:8080/metrics | Firewall, WireGuard, services |
Access¶
All monitoring services are bound to ZeroTier IPs only—not accessible from the public internet.
Grafana:
URL: http://192.168.194.131:3000
Username: admin
Password: (from 1Password: "Monitoring - Grafana Admin")
Monitored Hosts¶
Linux Hosts (via node_exporter)¶
| Host | ZeroTier IP | Metrics Endpoint |
|---|---|---|
| Dumbo | 192.168.194.131 | http://192.168.194.131:9100/metrics |
| Pluto | 192.168.194.6 | http://192.168.194.6:9100/metrics |
| Bogart | 192.168.194.133 | http://192.168.194.133:9100/metrics |
| Rocky | 192.168.194.132 | http://192.168.194.132:9100/metrics |
OPNsense Gateways (via node_exporter + SNMP)¶
| Gateway | ZeroTier IP | node_exporter | SNMP |
|---|---|---|---|
| Blue | 192.168.194.205 | :9100 ✓ | UDP 161 ✓ |
| Owl | 192.168.194.10 | :9100 ✓ | UDP 161 ✓ |
Note: OPNsense gateways have both node_exporter (full system metrics) and SNMP (interface stats). node_exporter provides richer data and is preferred for dashboards.
LAN Hosts (via gateway routing)¶
These hosts don't have ZeroTier directly installed. Prometheus reaches them via gateway routing through the ZeroTier overlay.
| Host | LAN IP | Gateway | Metrics Endpoint | Notes |
|---|---|---|---|---|
| Triton | 10.7.1.20 | Owl | http://10.7.1.20:9100/metrics | Raspberry Pi 5, manual node_exporter install |
| Rpios | 10.15.1.50 | Blue | http://10.15.1.50:9100/metrics | Raspberry Pi, manual node_exporter install |
Routing paths:
- Triton: Dumbo → ZeroTier → Owl (192.168.194.10) → Owl LAN (10.7.x.x) → Triton
- Rpios: Dumbo → ZeroTier → Blue (192.168.194.205) → Blue LAN (10.15.x.x) → Rpios
Note: These LAN hosts are not under Ansible IaC management. node_exporter was installed manually following the same patterns as Ansible-managed hosts (v1.8.2, systemd service, binds to specific IP).
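To confirm the overlay routing from Dumbo, the kernel's route choice for each LAN host should point at the corresponding gateway's ZeroTier IP (interface names will vary per host):

```shell
# Should report "via 192.168.194.10" (Owl) on the ZeroTier interface
ip route get 10.7.1.20
# Should report "via 192.168.194.205" (Blue)
ip route get 10.15.1.50
```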
Ephemeral Dev Environments (Cost Control)¶
Ephemeral development VMs like opnsense-dev have special monitoring requirements to prevent runaway costs.
| Host | ZeroTier IP | Metrics Endpoint | Cost | Status |
|---|---|---|---|---|
| opnsense-dev | 192.168.194.199 | http://192.168.194.199:9100/metrics | ~$0.14/hr | Commented out when not running |
Monitoring Pattern:
Ephemeral instances follow the standard cloud instance monitoring policy but with special alert rules for cost control:
- When provisioning: Uncomment the target in `cloud/ansible/roles/monitoring-stack/defaults/main.yml`
- When running: Alerts fire to prevent extended runtime costs
- When torn down: Comment out the target to avoid false "InstanceDown" alerts
Cost Control Alerts:
| Alert | Threshold | Severity | Action |
|---|---|---|---|
| `DevVMRunningTooLong` | 1 hour uptime | Warning | Consider tearing down with `dev-down.sh` |
| `DevVMRunningCriticallyLong` | 4 hours uptime | Critical | URGENT: Run `dev-down.sh` immediately |
Configuration Location:
Target: `cloud/ansible/roles/monitoring-stack/defaults/main.yml`
```yaml
# Ephemeral dev environments (via ZeroTier)
# Uncomment when dev VM is running to enable monitoring
# - name: opnsense-dev
#   address: "192.168.194.199:9100"
```
Alerts: `cloud/ansible/roles/monitoring-stack/templates/alert-rules.yml.j2`
```yaml
- name: dev_environment_alerts
  rules:
    - alert: DevVMRunningTooLong
      expr: node_boot_time_seconds{instance="opnsense-dev"} > 0 and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 3600
      ...
```
Workflow:
```bash
# 1. Provision dev VM
./scripts/opnsense-dev/dev-up.sh

# 2. Uncomment opnsense-dev target in defaults/main.yml
cd cloud/ansible
vim roles/monitoring-stack/defaults/main.yml

# 3. Deploy monitoring configuration
./scripts/run-monitoring.sh --prod dumbo deploy

# 4. Work with dev VM (alerts fire after 1hr/4hr)

# 5. Tear down dev VM when done
./scripts/opnsense-dev/dev-down.sh

# 6. Comment out opnsense-dev target in defaults/main.yml

# 7. Redeploy monitoring to stop false "InstanceDown" alerts
./scripts/run-monitoring.sh --prod dumbo deploy
```
Why Comment Out When Not Running?
When a target is configured in Prometheus but the host is unreachable:
- `InstanceDown` alerts fire immediately (critical severity)
- Creates alert noise and fatigue
- Wastes monitoring resources on non-existent targets
- Commenting out the target prevents scraping entirely
Instance Label Matching:
Prometheus uses explicit instance labels for alert targeting:
```yaml
# In prometheus.yml.j2 (generated)
- targets: ['192.168.194.199:9100']
  labels:
    instance: 'opnsense-dev'  # Explicit label, not derived from address
```
This allows alerts to use instance="opnsense-dev" regardless of the IP:port address.
Alert Rules¶
Instance Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| InstanceDown | Target unreachable for 2 minutes | Critical |
| GatewayDown | SNMP unreachable for 2 minutes | Critical |
Host Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| HighCpuUsage | CPU > 85% for 5 minutes | Warning |
| HighMemoryUsage | Memory > 85% for 5 minutes | Warning |
| DiskSpaceLow | Disk < 15% free for 5 minutes | Warning |
| DiskSpaceCritical | Disk < 5% free for 2 minutes | Critical |
| SystemdServiceFailed | Systemd unit in failed state | Warning |
| HighLoadAverage | Load > 2x CPU cores for 15 minutes | Warning |
ZeroTier Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| ZeroTierAPIDown | API unreachable for 5 minutes | Warning |
| ZeroTierMemberOffline | Authorized member offline for 5 minutes | Warning |
| ZeroTierMemberOfflineLong | Member offline for over 1 hour | Critical |
PowerDNS Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| PowerDNSDown | Server unreachable for 2 minutes | Critical |
| PowerDNSHighBackendLatency | Query latency > 100ms for 5 minutes | Warning |
| PowerDNSHighServfail | SERVFAIL rate > 1/sec for 5 minutes | Warning |
OPNsense Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| OPNsenseExporterDown | API unreachable for 2 minutes | Critical |
| OPNsenseHighStateCount | Firewall states > 50,000 for 5 minutes | Warning |
| OPNsenseGatewayDown | Gateway status check failed for 2 minutes | Critical |
| OPNsenseWireguardPeerOffline | WG peer handshake > 180s for 5 minutes | Warning |
Dev Environment Alerts (Cost Control)¶
| Alert | Condition | Severity |
|---|---|---|
| DevVMRunningTooLong | opnsense-dev uptime > 1 hour | Warning |
| DevVMRunningCriticallyLong | opnsense-dev uptime > 4 hours | Critical |
Purpose: Prevent runaway costs on ephemeral development VMs (~$0.14/hr).
Alert Expression:
```
# Warning after 1 hour
node_boot_time_seconds{instance="opnsense-dev"} > 0
  and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 3600

# Critical after 4 hours
node_boot_time_seconds{instance="opnsense-dev"} > 0
  and (time() - node_boot_time_seconds{instance="opnsense-dev"}) > 14400
```
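The threshold logic can be checked locally. This evaluates the same comparison the expressions make, with an invented boot time 90 minutes in the past:

```shell
now=$(date +%s)
boot=$((now - 5400))   # pretend node_boot_time_seconds was 90 minutes ago
up=$((now - boot))     # the time() - node_boot_time_seconds term
[ "$up" -gt 3600 ]  && echo "warning: would fire"
# -> warning: would fire
[ "$up" -gt 14400 ] && echo "critical: would fire" || echo "critical: not yet"
# -> critical: not yet
```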
Action Items:
- Warning: Consider tearing down with `./scripts/opnsense-dev/dev-down.sh`
- Critical: URGENT, terminate immediately to avoid excessive costs
Notification Channels¶
| Severity | Channels |
|---|---|
| Critical | Email + Pushover (high priority) + Slack |
| Warning | Email + Slack |
Deployment¶
Prerequisites¶
- Create Grafana database:
```
ssh dumbo
psql -h 127.0.0.1 -U joe -d postgres

CREATE DATABASE grafana;
CREATE USER grafana WITH ENCRYPTED PASSWORD '<from-1password>';
GRANT ALL PRIVILEGES ON DATABASE grafana TO grafana;
\c grafana
GRANT ALL ON SCHEMA public TO grafana;
```
- Create secrets in 1Password:
  - `gcp_postgres_grafana_password` (database password in scandora-automation)
  - `grafana_admin_password` (admin password in scandora-automation)
  - `snmp_community_monitoring` (OPNsense SNMP in scandora-automation)
  - `OPNsense API - Owl` (api key + api secret fields)
  - `OPNsense API - Blue` (api key + api secret fields)
  - `Monitoring - Slack Webhook` (optional)
  - `Monitoring - Pushover` (optional, with user key and token)
- Configure SNMP on OPNsense gateways (legacy, optional):
  - Services → Net-SNMP → Enable
  - Listen Interface: ZeroTier only
  - Community String: (from 1Password)
  - Firewall: Allow UDP 161 from 192.168.194.131
- Configure OPNsense API for monitoring:
The opnsense-exporter provides much richer metrics than SNMP. Enable the API on each gateway:
a. Create API user (Web UI):
- System → Access → Users → Add
- Username: monitoring
- Generate a scrambled password (not used for API)
- Save, then edit the user
- Under "API keys", click the + button to generate a key/secret pair
- Download/copy both the key and secret immediately
b. Set permissions (Web UI):
- System → Access → Groups → Add
- Group name: monitoring
- Add the monitoring user to this group
- Under Privileges, add:
- Diagnostics: ARP Table, Firewall Statistics, Netstat
- Services: Unbound (MVC)
- Status: DHCP Leases, DNS Overview, IPsec, OpenVPN, Services
- System: Firmware, Gateways, Settings (Cron), Status
- VPN: OpenVPN Instances, WireGuard
c. Store in 1Password:
```bash
op item create --category="API Credential" \
--title="OPNsense API - Owl" \
--vault="Private" \
"api key=<key-from-opnsense>" \
"api secret=<secret-from-opnsense>" \
"hostname=192.168.194.10"
```
d. Enable Extended Statistics (optional):
- Services → Unbound DNS → Advanced → Enable "Extended Statistics"
- This provides DNS query metrics in the exporter
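Before wiring up the exporter, the new key/secret pair can be smoke-tested against the gateway API. The firmware-status endpoint is a common read-only check; substitute real values for the placeholders:

```shell
# -k because gateways typically present self-signed certs on the ZeroTier interface
curl -s -k -u "<api-key>:<api-secret>" \
  https://192.168.194.10/api/core/firmware/status | head -c 200
```

A JSON body indicates the credentials and permissions are working; an HTML login page or 401 means they are not.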
- Install node_exporter on OPNsense gateways:
```bash
# SSH to gateway (e.g., Blue)
ssh 10.15.0.1

# Install the plugin
sudo pkg install -y os-node_exporter

# Enable and configure (bind to ZeroTier IP only)
sudo sysrc node_exporter_enable=YES
sudo sysrc node_exporter_listen_address="192.168.194.205:9100"  # Use gateway's ZT IP

# Start the service
sudo service node_exporter start

# Verify
curl http://192.168.194.205:9100/metrics | head -5
```
Repeat for Owl using ZeroTier IP 192.168.194.10.
Deploy ZeroTier Agent Metrics¶
To enable ZeroTier built-in metrics on a host, deploy both the zerotier and node-exporter tags:
```bash
cd cloud/ansible

# Deploy to all IaC hosts (dumbo, pluto, bogart, rocky)
ansible-playbook -i inventory/dumbo.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/pluto.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/bogart.yml playbooks/site.yml --tags zerotier,node-exporter
ansible-playbook -i inventory/rocky.yml playbooks/site.yml --tags zerotier,node-exporter
```
What happens:
- ZeroTier role deploys `local.conf` with `enableMetrics: true` and restarts ZeroTier (~2s blip)
- node-exporter role creates the textfile collector directory and symlinks `metrics.prom`
- node-exporter service is restarted with the `--collector.textfile.directory` flag
Verify:
```bash
# Check metrics file exists on host
ssh <host> ls -la /var/lib/zerotier-one/metrics.prom

# Check symlink
ssh <host> ls -la /var/lib/node_exporter/textfile_collector/zerotier.prom

# Check metrics appear in node_exporter
curl http://<zerotier_ip>:9100/metrics | grep zt_

# Check from Prometheus
curl -s 'http://192.168.194.131:9090/api/v1/query?query=zt_peer_latency_count' | python3 -m json.tool
```
Deploy with Helper Script¶
```bash
cd cloud/ansible

# Deploy node_exporter to all hosts
./scripts/run-monitoring.sh --prod all node-exporter

# Deploy full stack to Dumbo
./scripts/run-monitoring.sh --prod dumbo deploy

# Dry-run to see changes
./scripts/run-monitoring.sh --prod dumbo check
```
Deploy Manually¶
```bash
cd cloud/ansible

# Deploy node_exporter to a specific host
ansible-playbook -i inventory/pluto.yml playbooks/site.yml --tags node-exporter

# Deploy monitoring stack to Dumbo (run-monitoring.sh retrieves secrets automatically)
./scripts/run-monitoring.sh --prod dumbo deploy
```
Grafana Dashboards¶
Installed Dashboards¶
| Dashboard | URL | Purpose |
|---|---|---|
| Node Exporter Full | /d/rYdddlPWk | All hosts including OPNsense gateways |
| OPNsense Gateways | /d/opnsense-gw | Gateway traffic and interface stats |
| GCP Cost Estimates | /d/gcp-cost-estimates | GCP billing breakdown by service |
| ZeroTier Agent Metrics | /d/zerotier-agent | Peer latency, traffic, and health for ZT agent |
Importing Additional Dashboards¶
Import from Grafana.com by ID:
| Dashboard | ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive Linux host metrics |
| PowerDNS Authoritative | 14768 | DNS queries, cache, backend latency |
| OPNsense (AthennaMind) | 21113 | Firewall, WireGuard, services, traffic |
How to Import¶
- Go to Grafana → Dashboards → Import
- Enter the dashboard ID
- Select "Prometheus" as the data source
- Click Import
Troubleshooting¶
Check Container Status¶
```bash
ssh dumbo
cd /opt/monitoring
docker compose ps
docker compose logs prometheus
docker compose logs grafana
```
Check Prometheus Targets¶
Visit http://192.168.194.131:9090/targets or:
```bash
curl -s http://192.168.194.131:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'
```
Test node_exporter¶
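From Dumbo, each host's endpoint can be probed directly, using the ZeroTier IPs from the tables above (a plausible loop; adjust the list to match the current fleet):

```shell
for ip in 192.168.194.131 192.168.194.6 192.168.194.133 192.168.194.132 \
          192.168.194.205 192.168.194.10; do
  if curl -s -m 5 "http://${ip}:9100/metrics" > /dev/null; then
    echo "OK   ${ip}"
  else
    echo "FAIL ${ip}"
  fi
done
```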
Test SNMP¶
```bash
# From Dumbo
snmpwalk -v2c -c <community> 192.168.194.205 sysDescr
snmpwalk -v2c -c <community> 192.168.194.10 sysDescr
```
Reload Configuration¶
Prometheus and AlertManager support hot reload:
```bash
# Prometheus
curl -X POST http://192.168.194.131:9090/-/reload

# AlertManager
curl -X POST http://192.168.194.131:9093/-/reload
```
Common Issues¶
"Target is down" in Prometheus:
- Check if node_exporter is running: `ssh <host> systemctl status node-exporter`
- Check ZeroTier connectivity: `ping 192.168.194.x`
- Verify firewall allows port 9100 from Dumbo
Grafana database connection failed:
- Verify Cloud SQL proxy is running: `ssh dumbo systemctl status cloud-sql-proxy`
- Check database exists: `psql -h 127.0.0.1 -U joe -d postgres -c "\l"`
- Verify grafana user permissions
AlertManager not sending notifications:
- Check AlertManager logs: `docker logs alertmanager`
- Verify webhook URLs are correct
- Test with a manual alert in Prometheus
Target reachable but scrape times out (MTU issue):
Symptoms: Small responses work (e.g., `curl http://host:9100/` returns OK), but large responses time out (e.g., `/metrics` hangs). TCP handshake succeeds but data packets are dropped.
This indicates an MTU fragmentation issue, especially over IPv6 paths between GCP projects.
- Test with small vs large response:
```bash
# Small response - works
curl -m 5 http://192.168.194.133:9100/

# Large response - times out
curl -m 5 http://192.168.194.133:9100/metrics
```
- Check effective MTU on the host: the scandora.net ZeroTier network enforces MTU 1320 network-wide (see Network Configuration below). If a host still has issues, verify ZeroTier has applied the policy.
Cloud SQL Proxy port conflict with Prometheus:
If Prometheus fails to start on Dumbo with "port 9090 already in use", the Cloud SQL proxy health check is conflicting; the proxy is configured to use port 9091 for health checks instead. If you see this issue, redeploy the `cloudsql-client` role.
Network Configuration¶
ZeroTier MTU Policy¶
The scandora.net ZeroTier network (6ab565387a4b9177) enforces MTU 1320 network-wide.
Why 1320?
- Blue site uses Starlink, which has lower MTU (~1420-1480 bytes)
- ZeroTier adds ~32 bytes of encapsulation overhead
- Combined with Starlink's MTU, the effective path MTU is ~1370 bytes
- MTU 1320 provides headroom for all paths including satellite links
Configuration: The MTU is set via ZeroTier Central API as a network-wide policy. Individual hosts do not need local configuration—they receive the MTU setting when they join the network.
To verify current MTU on a host:
```bash
zerotier-cli listnetworks
# Look for the MTU column

# Or check the interface directly (ZeroTier interfaces are named zt*;
# the matching line includes "mtu 1320")
ip link show | grep zt
```
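The path MTU itself can also be probed end-to-end with don't-fragment pings (Linux `ping -M do`; the 28 bytes are the ICMP + IPv4 headers), which is how the symptom under Troubleshooting shows up:

```shell
# 1292-byte payload + 28 header bytes = 1320 on the wire; should succeed
ping -M do -s 1292 -c 3 192.168.194.133

# 1400 + 28 = 1428; should fail with "message too long" if MTU 1320 is enforced
ping -M do -s 1400 -c 3 192.168.194.133
```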
History: This was discovered when monitoring Bogart (coop-389306) from Dumbo (scandoraproject). Small HTTP responses worked but large metrics payloads timed out due to TCP packets exceeding the path MTU being silently dropped.
Security Considerations¶
| Component | Security Measure |
|---|---|
| node_exporter | Binds to ZeroTier IP only (not 0.0.0.0) |
| Grafana | Admin password required, no public signup |
| SNMP | Community string from 1Password, ZeroTier interface only |
| Prometheus/AlertManager | Listen on ZeroTier network only |
| Secrets | Passed via extra-vars, never committed |
Data Retention¶
- Prometheus: 30 days or 1GB (whichever comes first)
- Grafana: PostgreSQL backend on Cloud SQL (backed up)
- AlertManager: In-memory only (silences are not persisted)
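The 30-day/1GB policy maps onto Prometheus's standard retention flags; a sketch of the relevant server arguments (the actual docker-compose command line may differ):

```shell
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=1GB
```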