Monitoring and Observability
A comprehensive monitoring setup that provides visibility into infrastructure, application, and network performance.
Architecture Overview
The monitoring stack is built around the kube-prometheus-stack, which provides a complete observability solution for the Kubernetes cluster.
┌─────────────────────────────────────────────────────────┐
│                Metrics Collection Layer                 │
├─────────────────────────────────────────────────────────┤
│ • Node Exporter (all 5 Pi nodes)                        │
│ • kube-state-metrics (K8s objects)                      │
│ • UniFi Poller (network metrics)                        │
│ • SNMP Exporter (Synology NAS)                          │
│ • Kubelet (container metrics)                           │
│ • Control Plane: API Server, etcd, Controller Manager,  │
│   Scheduler, kube-proxy                                 │
│ • CoreDNS                                               │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│               Prometheus (Time-Series DB)               │
├─────────────────────────────────────────────────────────┤
│ • 50Gi persistent storage (Synology iSCSI)              │
│ • 20-30s scrape intervals                               │
│ • Long-term retention                                   │
│ • Recording rules for efficiency                        │
└─────────────────────────────────────────────────────────┘
                             ↓
┌─────────────────────────────────────────────────────────┐
│             Visualization & Alerting Layer              │
├─────────────────────────────────────────────────────────┤
│ • Grafana (dashboards and visualization)                │
│ • AlertManager (alert routing)                          │
└─────────────────────────────────────────────────────────┘
Current Monitoring Stack
Core Components
Prometheus
Purpose: Central metrics collection and storage
- Version: v3.9.1
- Storage: 50Gi PVC on Synology NAS
- Deployment: ArgoCD-managed (sync-wave: -15)
- Namespace: default
Capabilities:
- Time-series metric storage
- Powerful query language (PromQL)
- Service discovery
- Alert rule evaluation
Documentation: kube-prometheus-stack Guide
Grafana
Purpose: Metrics visualization and dashboarding
- Pre-loaded Dashboards: 49 dashboards (10 custom, 13 community, 26 from kube-prometheus-stack)
- Custom Dashboards: Support for user-created visualizations
- Datasources: Pre-configured Prometheus connection
- Authentication: Admin credentials stored in a Kubernetes Secret (see Access Methods below)
Access:
kubectl port-forward -n default svc/kube-prometheus-stack-grafana 3000:80
AlertManager
Purpose: Alert aggregation and routing
- Integration: Prometheus alert rules
- Routing: Configurable notification channels
- Deduplication: Intelligent alert grouping
- Silencing: Temporary alert suppression (see the amtool sketch below)
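Silences can also be created from the CLI with amtool, which ships with AlertManager. A minimal sketch, assuming the port-forward from the Access Methods section is running and using the example HighPodMemory alert defined later in this page:

# Silence one alert for 2 hours while doing maintenance
amtool silence add alertname=HighPodMemory \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h --comment="planned node maintenance"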
Metric Exporters
Node Exporter
Deployment: DaemonSet on all 5 Raspberry Pi nodes
Metrics Collected:
- CPU usage and temperature
- Memory utilization
- Disk I/O and space
- Network interface statistics
- System load
- Hardware sensors
Why It Matters: Critical for monitoring Pi cluster health, especially temperature and throttling.
kube-state-metrics
Purpose: Kubernetes object state metrics
Metrics Collected:
- Pod status and resource usage
- Deployment health and replicas
- Node conditions and capacity
- PersistentVolume status
- ConfigMap and Secret counts
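These surface in Prometheus as kube_* series. For example, a quick check for pods stuck in Pending (the same signal behind the "pod pending too long" alerts mentioned later):

count(kube_pod_status_phase{phase="Pending"} == 1)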
UniFi Poller
Purpose: Network infrastructure monitoring
- Version: v2.33.0
- Deployment: ArgoCD-managed (sync-wave: -20)
- Namespace: unipoller
- Controller: 10.0.1.1
Network Metrics:
- Device status and uptime
- Port statistics and errors
- Wireless client connections
- Bandwidth utilization
- PoE power consumption
- Signal strength and interference
Documentation: UniFi Poller Guide
SNMP Exporter
Purpose: Synology NAS monitoring
- Version: v0.26.0
- Deployment: ArgoCD-managed (part of kube-prometheus-stack)
- Namespace: default
- Target: Synology DS925+ at 10.0.1.204
Storage Metrics:
- Disk health and temperature
- Volume capacity and usage
- RAID status
- iSCSI target statistics
- Network interface statistics
- System resource utilization
Documentation: SNMP Exporter Guide
Metrics Collection
Scrape Configuration
Prometheus is configured to scrape metrics from multiple sources:
| Target | Interval | Purpose |
|---|---|---|
| UniFi Poller | 20s | Network metrics |
| SNMP Exporter | 30s | NAS storage metrics |
| Node Exporter | 30s | Hardware metrics |
| kubelet | 30s | Container metrics |
| API Server | 30s | Control plane API |
| etcd | 30s | Control plane datastore |
| Controller Manager | 30s | Control plane controllers |
| Scheduler | 30s | Control plane scheduling |
| kube-proxy | 30s | Network proxy |
| kube-state-metrics | 30s | K8s objects |
| CoreDNS | 30s | DNS metrics |
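Per-target intervals are set on the corresponding ServiceMonitor. A minimal sketch for the UniFi Poller target; the selector labels and port name here are assumptions, not the repo's actual manifests:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: unpoller
  namespace: unipoller
  labels:
    release: kube-prometheus-stack  # so the Prometheus operator's selector picks it up
spec:
  selector:
    matchLabels:
      app: unpoller                 # hypothetical; must match the poller Service's labels
  endpoints:
    - port: metrics                 # hypothetical named port on the Service
      interval: 20s                 # matches the table above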
Storage and Retention
Prometheus Storage:
- Size: 50Gi
- Backend: Synology NAS via iSCSI
- Storage Class: synology-iscsi-retain
- Retention Policy: Configured for long-term storage
PVC Details:
kubectl get pvc -n default | grep prometheus
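Size and retention both come from the kube-prometheus-stack Helm values. A sketch of the relevant keys (the retention value here is illustrative, not the cluster's actual setting):

prometheus:
  prometheusSpec:
    retention: 90d                  # illustrative; choose to fit within 50Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: synology-iscsi-retain
          resources:
            requests:
              storage: 50Gi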
Dashboards
Pre-Installed Grafana Dashboards
The kube-prometheus-stack includes comprehensive dashboards:
Cluster-Level:
- Kubernetes Cluster Overview
- Cluster Resource Usage
- Namespace Resource Usage
- Persistent Volumes
Node-Level:
- Node Exporter Full
- Node Resource Usage per Namespace
- Nodes Dashboard
- Node Temperature (critical for Pis!)
Application-Level:
- Deployment Status
- StatefulSet Status
- Pod Resource Usage
- Container Resource Usage
Infrastructure:
- API Server Performance
- etcd Metrics
- Controller Manager Metrics
- Scheduler Metrics
- kube-proxy Metrics
- CoreDNS Metrics
Network:
- Network I/O Pressure
- UniFi Network Performance (custom)
Creating Custom Dashboards
1. Access the Grafana UI
2. Create a new dashboard
3. Add panels with PromQL queries
4. Save the dashboard
5. Export it as JSON
6. Commit the JSON to git for version control (see the ConfigMap sketch below)
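The kube-prometheus-stack Grafana runs a sidecar that auto-loads any ConfigMap labeled grafana_dashboard (the chart's default label), so committed dashboards appear without manual import. A sketch with hypothetical names and a truncated JSON payload:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard                # hypothetical
  namespace: default
  labels:
    grafana_dashboard: "1"          # the sidecar watches for this label
data:
  my-dashboard.json: |
    { "title": "My Dashboard", "panels": [] }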
Alerting
Prometheus Alert Rules
Alert rules are defined using PrometheusRule CRDs:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert
spec:
  groups:
    - name: example
      rules:
        - alert: HighPodMemory
          expr: container_memory_usage_bytes > 1e9
          for: 5m
Custom PrometheusRule Alerts
In addition to the 100+ default alerts from kube-prometheus-stack, the following custom PrometheusRules are deployed:
| PrometheusRule | Alerts | Scope |
|---|---|---|
| argo-workflows-alerts | 8 | Workflow failures, stuck workflows, queue backlog, controller errors |
| blackbox-exporter-alerts | 12 | Endpoint availability, SSL expiry, latency |
| storage-alerts | 7 | Disk space prediction, PVC capacity, NAS health |
| velero-alerts | 7 | Backup failures, schedule misses, restore errors |
| istio-alerts | 9 | Control plane health, XDS convergence, DNS failures |
| apm-alerts | 8 | OOMKill, CrashLoop, node CPU/memory/disk, API server |
| ingress-nginx-alerts | 7 | 5xx/4xx rates, p95 latency, config reload, controller down |
| trivy-operator-alerts | 12 | Critical CVEs, RBAC issues, compliance failures, exposed secrets |
Total Custom Alerts: 70 across 8 PrometheusRules
Common Alert Categories
Node Alerts:
- Node down or unreachable
- High CPU usage (>80%)
- High memory usage (>90%)
- Disk space low (<10%)
- Node temperature critical (>75°C for Pi)
Pod Alerts:
- Pod crash looping
- Pod restart count high
- Container OOM killed
- Pod pending too long
Cluster Alerts:
- API server errors
- etcd performance degradation
- Persistent volume filling up
- Excessive pod evictions
AlertManager Configuration
Notification Channels:
- Slack (recommended for homelab)
- Discord
- Webhook
- PagerDuty
Configuration:
Edit AlertManager config in values.yaml and apply via ArgoCD.
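A minimal sketch of that values.yaml section, assuming a Slack receiver (the webhook URL is a placeholder):

alertmanager:
  config:
    route:
      receiver: slack
      group_by: ['alertname', 'namespace']
    receivers:
      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX  # placeholder
            channel: '#homelab-alerts'
            send_resolved: true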
Key Metrics for Raspberry Pi Cluster
Critical Metrics to Monitor
Temperature
Why: Raspberry Pis throttle at high temperatures
node_hwmon_temp_celsius
Alert Threshold: > 70°C (warning), > 75°C (critical)
CPU Throttling
Why: Indicates thermal or power issues
node_cpu_scaling_frequency_hertz / node_cpu_scaling_frequency_max_hertz < 0.9
Memory Pressure
Why: 16GB RAM per node can fill up quickly
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Alert Threshold: > 85% (warning), > 90% (critical)
NVMe Health
Why: Monitor SSD wear and health
node_disk_io_time_seconds_total
node_disk_read_bytes_total
node_disk_write_bytes_total
Network Performance
Why: Ensure cluster communication is healthy
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
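The thresholds above translate directly into PrometheusRule alerts. A sketch for the two most Pi-critical ones (rule names and severities are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pi-node-alerts              # hypothetical
spec:
  groups:
    - name: raspberry-pi
      rules:
        - alert: PiTemperatureCritical
          expr: node_hwmon_temp_celsius > 75
          for: 5m
          labels:
            severity: critical
        - alert: NodeMemoryCritical
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
          for: 10m
          labels:
            severity: critical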
Cluster-Wide Metrics
Pod Distribution
count by (node) (kube_pod_info)
Purpose: Ensure even workload distribution
Resource Requests vs Limits
sum(kube_pod_container_resource_requests{resource="cpu"}) / sum(kube_node_status_allocatable{resource="cpu"}) * 100
Purpose: Monitor cluster capacity utilization
Storage Usage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100
Purpose: Prevent PVC from filling up
Access Methods
Prometheus UI
Port Forward:
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
URL: http://localhost:9090
Features:
- PromQL query interface
- Target status page
- Alert rules viewer
- Configuration viewer
Grafana UI
Port Forward:
kubectl port-forward -n default svc/kube-prometheus-stack-grafana 3000:80
URL: http://localhost:3000
Login:
# Get admin password
kubectl get secret kube-prometheus-stack-grafana -n default \
-o jsonpath="{.data.admin-password}" | base64 -d
AlertManager UI
Port Forward:
kubectl port-forward -n default svc/kube-prometheus-stack-alertmanager 9093:9093
URL: http://localhost:9093
GitOps Management
All monitoring components are managed via ArgoCD:
Sync Waves:
-20: UniFi Poller (metrics collection)
-15: kube-prometheus-stack (monitoring stack)
Auto-Sync: Enabled with prune and self-heal
Configuration Changes:
1. Edit values in homelab/manifests/base/kube-prometheus-stack/values.yaml
2. Commit and push
3. ArgoCD automatically syncs within ~3 minutes
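The sync-wave ordering above is expressed as an annotation on each ArgoCD Application. A sketch with a placeholder repo URL and an assumed path:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-15"  # UniFi Poller syncs earlier at -20
spec:
  project: default
  source:
    repoURL: https://github.com/example/homelab  # placeholder
    path: manifests/base/kube-prometheus-stack   # assumed path per the steps above
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true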
Performance Tuning
For Raspberry Pi Cluster
Scrape Intervals:
- Balance between metric resolution and resource usage
- 20-30s intervals are appropriate for homelab
Retention:
- 50Gi provides months of retention
- Adjust based on growth rate
Cardinality:
- Be mindful of high-cardinality metrics
- Use recording rules for expensive queries (see the sketch below)
- Regularly review series count
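A recording rule precomputes an expensive expression at evaluation time so dashboards query the stored series instead. A sketch using the memory-pressure expression from earlier (the record name follows Prometheus's level:metric:operation convention):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-recording-rules        # hypothetical
spec:
  groups:
    - name: node.rules
      rules:
        - record: instance:node_memory_utilisation:ratio
          expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)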
Resource Limits:
- Set appropriate limits for 16GB node memory (illustrative values below)
- Monitor Prometheus memory usage
- Adjust if OOM occurs
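Those limits live in the same Helm values as the storage settings. The numbers below are illustrative for a 16GB node, not the cluster's actual configuration:

prometheus:
  prometheusSpec:
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi                 # leaves headroom on a 16GB node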
Implemented Enhancements
Recently Added
- ✅ SNMP Monitoring for Synology NAS (December 2025)
- Disk health and temperature monitoring
- Volume capacity and usage tracking
- RAID status monitoring
- iSCSI target statistics
- Network interface statistics
- Comprehensive Grafana dashboard
Recently Completed
- ✅ Blackbox Exporter (January 2026)
- HTTP/HTTPS endpoint monitoring
- SSL certificate expiration tracking
- Response time monitoring
- Grafana dashboard for probe status
Recently Completed (March 2026)
- ✅ Log Collector Migration: Migrated from Promtail (EOL March 2026) to Grafana Alloy v1.13.0
- ✅ Ingress NGINX Monitoring: 7 PrometheusRule alerts + dedicated Grafana dashboard (PR #498)
- ✅ Trivy Compliance Reporting: Weekly CronJob posting compliance summaries to AlertManager (PR #494)
- ✅ Loki Log-Based Alerting: 9 LogQL alert rules via embedded ruler (PR #489)
Planned Enhancements
- Alert Notification Setup
- Slack/Discord integration for critical alerts
- Daily health report summaries
Troubleshooting
Common Issues
Prometheus High Memory:
- Check cardinality: curl localhost:9090/api/v1/status/tsdb
- Review scrape configuration
- Adjust retention settings
Missing Metrics:
- Verify ServiceMonitor exists
- Check Prometheus targets page
- Review pod logs
Grafana Connection Issues:
- Verify datasource configuration
- Check Prometheus service endpoint
- Review Grafana logs
Useful Commands
# Check monitoring pods
kubectl get pods -n default | grep prometheus
# View Prometheus config
kubectl get prometheus -o yaml
# List all ServiceMonitors
kubectl get servicemonitor -A
# Check PrometheusRules
kubectl get prometheusrule -A
# Verify metrics endpoint
kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
wget -qO- http://TARGET:PORT/metrics | head -20