# Monitoring Troubleshooting
This guide covers common issues with the monitoring stack (Prometheus, Grafana, AlertManager, UniFi Poller) and their solutions.
## Prometheus Issues

### Prometheus Pod Stuck in Pending
**Symptoms:**

- Prometheus pod shows `Pending` status
- No metrics being collected
**Diagnosis:**

```bash
kubectl describe pod prometheus-kube-prometheus-stack-prometheus-0 -n default
```
**Common Causes:**

1. **PVC Not Bound:**
   - Check if PVC is in `Pending` state: `kubectl get pvc -n default | grep prometheus`
   - Verify Synology CSI driver is running: `kubectl get pods -n synology-csi`
   - Ensure storage class exists: `kubectl get storageclass`
   - Check Synology NAS storage capacity

2. **Insufficient Node Resources:**
   - Check node capacity: `kubectl describe nodes`
   - Prometheus requests 2Gi memory; ensure nodes have available capacity
   - Review resource requests vs limits
**Solution:**

```bash
# If CSI driver issue
kubectl rollout restart deployment/synology-csi-controller -n synology-csi

# If resource issue, check pod distribution
kubectl get pods -o wide -A | grep prometheus
```
### Prometheus High Memory Usage

**Symptoms:**
- Prometheus pod using excessive memory
- Pod getting OOMKilled
- Slow query performance
**Diagnosis:**

```bash
# Check current memory usage
kubectl top pod prometheus-kube-prometheus-stack-prometheus-0 -n default

# Check cardinality
kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  wget -qO- http://localhost:9090/api/v1/status/tsdb | grep numSeries
```
**Common Causes:**
- High-cardinality metrics
- Too many active series
- Aggressive scrape intervals
- Long retention period
**Solution:**

1. **Review Scrape Intervals:**

   ```bash
   # Check prometheus configuration
   kubectl get prometheus -n default -o yaml | grep interval
   ```

2. **Reduce Cardinality:**
   - Identify high-cardinality metrics
   - Use recording rules for expensive queries
   - Drop unnecessary labels

3. **Adjust Resource Limits:** Edit `manifests/base/kube-prometheus-stack/values.yaml`:

   ```yaml
   prometheus:
     prometheusSpec:
       resources:
         limits:
           memory: 6Gi  # Increase from 4Gi
   ```

4. **Reduce Retention:**

   ```yaml
   prometheus:
     prometheusSpec:
       retention: 15d  # Reduce from 30d
   ```
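To identify high-cardinality metrics concretely, standard PromQL patterns like these can be pasted into the Prometheus UI (`<metric-name>` is a placeholder):

```promql
# Top 10 metric names by active series count
topk(10, count by (__name__) ({__name__=~".+"}))

# Series count for one suspect metric, broken down by job
count by (job) ({__name__="<metric-name>"})
```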
### No Metrics from Specific Target

**Symptoms:**
- Missing metrics in Prometheus
- Scrape target showing as down
**Diagnosis:**

```bash
# Port-forward to Prometheus UI
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets and check target status
```
**Common Targets and Checks:**

UniFi Poller:

```bash
# Check pod running
kubectl get pods -n unipoller

# Test metrics endpoint
kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  wget -qO- http://unifi-poller.unipoller:9130/metrics | head -20

# Check UniFi Poller logs
kubectl logs -n unipoller deployment/unifi-poller
```

Node Exporter:

```bash
# Check DaemonSet (should have 5 pods for 5 nodes)
kubectl get ds -n default | grep node-exporter

# Verify all pods running
kubectl get pods -n default -l app.kubernetes.io/name=prometheus-node-exporter
```
**Solution:**

- Verify ServiceMonitor exists: `kubectl get servicemonitor -A`
- Check service endpoints: `kubectl get endpoints <service-name>`
- Review Prometheus logs: `kubectl logs -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus`
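If no ServiceMonitor exists for the missing target, one can be added. A minimal sketch; `example-app`, its labels, and the port name are placeholders, and the `release` label must match whatever your Prometheus `serviceMonitorSelector` selects on:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app                  # hypothetical name
  namespace: default
  labels:
    release: kube-prometheus-stack   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app               # must match the target Service's labels
  endpoints:
    - port: metrics                  # named port on the Service
      interval: 30s
```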
### Prometheus Disk Full

**Symptoms:**
- Prometheus logs show disk space errors
- Metrics collection stops
- Write failures
**Diagnosis:**

```bash
# Check PVC size
kubectl get pvc -n default | grep prometheus

# Check actual usage (if metrics available)
kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  df -h /prometheus
```
**Solution:**

1. **Expand PVC:**

   ```bash
   kubectl edit pvc prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0 -n default
   ```

   ```yaml
   # Change:
   spec:
     resources:
       requests:
         storage: 100Gi  # Increase from 50Gi
   ```

2. **Reduce Retention:** Edit values to reduce the data retention period.

3. **Clean Old Data (Emergency):**

   ```bash
   # Delete old time series (use with caution)
   kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
     rm -rf /prometheus/chunks_head/*

   # Restart Prometheus
   kubectl delete pod prometheus-kube-prometheus-stack-prometheus-0 -n default
   ```
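When choosing between expanding the PVC and cutting retention, a rough sizing estimate helps. The formula follows the Prometheus storage docs (retention × ingest rate × bytes per sample); the ingest rate below is an assumed example, so measure yours from the `prometheus_tsdb_head_samples_appended_total` rate first:

```bash
# Rough Prometheus disk sizing:
#   disk_bytes ≈ retention_seconds * samples_per_second * bytes_per_sample
retention_days=15
samples_per_sec=10000   # assumed example; measure via rate(prometheus_tsdb_head_samples_appended_total[5m])
bytes_per_sample=2      # conservative upper bound; Prometheus averages roughly 1-2 bytes/sample
echo "$(( retention_days * 86400 * samples_per_sec * bytes_per_sample / 1073741824 ))GiB"
```

At these example numbers the estimate comes out well under the 50Gi PVC, which suggests cardinality or retention growth rather than normal ingest when the disk still fills.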
## Grafana Issues

### Cannot Access Grafana UI

**Symptoms:**
- Cannot connect to Grafana via port-forward
- Grafana dashboard not loading
**Diagnosis:**

```bash
# Check Grafana pod status
kubectl get pods -n default | grep grafana

# Check pod logs
kubectl logs -n default deployment/kube-prometheus-stack-grafana

# Verify service
kubectl get svc -n default | grep grafana
```
**Solution:**

```bash
# Port-forward to Grafana
kubectl port-forward -n default svc/kube-prometheus-stack-grafana 3000:80
# Open http://localhost:3000

# If pod crashed, check logs and restart
kubectl rollout restart deployment/kube-prometheus-stack-grafana -n default
```
### Forgot Grafana Admin Password

**Solution:**

```bash
# Retrieve password from secret
kubectl get secret kube-prometheus-stack-grafana -n default \
  -o jsonpath="{.data.admin-password}" | base64 -d
echo  # Add newline after password
```
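The `base64 -d` step is needed because Kubernetes stores Secret values base64-encoded. A quick local sanity check of that decode step, using the chart's default admin password `prom-operator` as sample data (yours may be overridden):

```bash
# Kubernetes stores Secret values base64-encoded; base64 -d reverses it.
# 'prom-operator' is kube-prometheus-stack's default admin password unless overridden.
encoded=$(printf 'prom-operator' | base64)
printf '%s' "$encoded" | base64 -d   # recovers the original value
echo
```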
### Grafana Shows "No Data" or "N/A"

**Symptoms:**
- Dashboards show no data
- Queries return empty results
- All panels show "N/A"
**Diagnosis:**

1. **Check Prometheus Datasource:**
   - Go to Configuration > Data Sources in Grafana
   - Click on the Prometheus datasource
   - Click the "Test" button

2. **Verify Prometheus Running:**

   ```bash
   kubectl get pods -n default | grep prometheus
   ```

3. **Check Time Range:**
   - Ensure the selected time range has data
   - Try "Last 5 minutes" for recent data
**Solution:**

1. **Fix Datasource URL:** Should be `http://kube-prometheus-stack-prometheus.default:9090`

2. **Restart Grafana:**

   ```bash
   kubectl rollout restart deployment/kube-prometheus-stack-grafana -n default
   ```

3. **Verify Metrics Exist:** Query Prometheus directly:

   ```bash
   kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
   # Open http://localhost:9090 and run the query: up
   ```
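Rather than fixing the URL in the UI (where it can drift), an explicit datasource can be provisioned declaratively. A sketch using the Grafana subchart's `additionalDataSources` value; the datasource name is hypothetical, and the chart normally provisions its own default Prometheus datasource already, so only add this if that one was removed or misconfigured:

```yaml
grafana:
  additionalDataSources:
    - name: Prometheus-explicit   # hypothetical name, avoids clashing with the built-in datasource
      type: prometheus
      url: http://kube-prometheus-stack-prometheus.default:9090
      access: proxy
      isDefault: false
```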
### Dashboard Import Fails

**Symptoms:**
- Cannot import dashboard JSON
- Dashboard shows errors after import
**Common Causes:**
- Incompatible Grafana version
- Missing data sources
- Incorrect variable definitions
**Solution:**
- Verify Grafana version compatibility
- Check datasource name matches
- Review dashboard variables
- Import dashboards from grafana.com
## AlertManager Issues

### Alerts Not Firing

**Symptoms:**
- Expected alerts not appearing
- No notifications received
Diagnosis:
# Check AlertManager pod
kubectl get pods -n default | grep alertmanager
# View firing alerts
kubectl port-forward -n default svc/kube-prometheus-stack-alertmanager 9093:9093
# Open http://localhost:9093
# Check PrometheusRules
kubectl get prometheusrule -A
**Solution:**

1. **Verify Alert Rules:**

   ```bash
   # Check rules loaded in Prometheus
   # Port-forward to Prometheus and open http://localhost:9090/rules
   ```

2. **Check Alert State:**
   - Pending: alert condition met, but not for long enough
   - Firing: alert actively firing
   - Review the `for` duration in alert rules

3. **Review AlertManager Config:**

   ```bash
   kubectl get secret alertmanager-kube-prometheus-stack-alertmanager -n default -o yaml
   ```
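The `for` duration is what separates Pending from Firing. A minimal PrometheusRule sketch showing the pattern; the rule name, expression, and 5m duration are illustrative, and the `release` label must match your Prometheus rule selector:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts              # hypothetical
  namespace: default
  labels:
    release: kube-prometheus-stack  # must match the Prometheus ruleSelector
spec:
  groups:
    - name: example.rules
      rules:
        - alert: TargetDown
          expr: up == 0
          for: 5m                   # condition must hold this long before Pending becomes Firing
          labels:
            severity: warning
          annotations:
            summary: "Target {{ $labels.instance }} is down"
```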
### Notification Not Sent

**Symptoms:**
- Alerts firing but no notifications
- Slack/email not received
Diagnosis:
# Check AlertManager logs
kubectl logs -n default alertmanager-kube-prometheus-stack-alertmanager-0
# View AlertManager config
kubectl port-forward -n default svc/kube-prometheus-stack-alertmanager 9093:9093
# Open http://localhost:9093/#/status
**Solution:**

1. **Configure Notification Channel:** Edit `manifests/base/kube-prometheus-stack/values.yaml`:

   ```yaml
   alertmanager:
     config:
       receivers:
         - name: 'slack'
           slack_configs:
             - api_url: 'YOUR_SLACK_WEBHOOK'
               channel: '#alerts'
       route:
         receiver: 'slack'
   ```

2. **Test Configuration:**
   - Send a test alert
   - Check webhook/SMTP settings
   - Verify network connectivity
## UniFi Poller Issues

### UniFi Poller Not Collecting Metrics

**Symptoms:**
- No UniFi metrics in Prometheus
- UniFi dashboards empty
Diagnosis:
# Check UniFi Poller pod
kubectl get pods -n unipoller
# View logs
kubectl logs -n unipoller deployment/unifi-poller
# Test metrics endpoint
kubectl exec -n unipoller deployment/unifi-poller -- \
wget -qO- http://localhost:9130/metrics | grep unifi
**Common Causes:**

1. **UniFi Controller Connection Failed:**
   - Verify controller URL: `https://10.0.1.1`
   - Check credentials in secret
   - Verify SSL verification is disabled (controller has a self-signed cert)

2. **Wrong Site Name:**
   - Verify the site name in config (should be `n37-gw`)
**Solution:**

1. **Check Configuration:**

   ```bash
   kubectl get configmap -n unipoller unifi-poller-config -o yaml
   ```

2. **Verify Secret:**

   ```bash
   kubectl get secret -n unipoller unifi-poller-secret -o yaml
   ```

3. **Test Controller Connection:**

   ```bash
   kubectl exec -n unipoller deployment/unifi-poller -- \
     curl -k https://10.0.1.1
   ```

4. **Restart UniFi Poller:**

   ```bash
   kubectl rollout restart deployment/unifi-poller -n unipoller
   ```
### UniFi Poller High Error Rate

**Symptoms:**
- Logs show connection errors
- Intermittent metric collection
Diagnosis:
kubectl logs -n unipoller deployment/unifi-poller | grep -i error
**Common Issues:**
- Network connectivity to controller
- Invalid credentials
- Controller API changes after upgrade
**Solution:**
- Verify network path to 10.0.1.1
- Update UniFi Poller to latest version
- Check UniFi Controller version compatibility
## Node Exporter Issues

### Missing Node Metrics

**Symptoms:**
- Some nodes missing from metrics
- Incomplete node exporter data
Diagnosis:
# Check DaemonSet status (should show 5/5 for 5 nodes)
kubectl get daemonset -n default prometheus-node-exporter
# Check pods
kubectl get pods -n default -l app.kubernetes.io/name=prometheus-node-exporter -o wide
# Verify ServiceMonitor
kubectl get servicemonitor -n default | grep node-exporter
**Solution:**

1. **Check Pod Status:**

   ```bash
   kubectl describe pod -n default -l app.kubernetes.io/name=prometheus-node-exporter
   ```

2. **Node Taint Issues:**

   ```bash
   # Check node taints
   kubectl describe nodes | grep -i taint
   # The DaemonSet might need tolerations
   ```

3. **Restart DaemonSet:**

   ```bash
   kubectl rollout restart daemonset/prometheus-node-exporter -n default
   ```
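If taints are the blocker, tolerations can be set through the chart rather than by editing the DaemonSet directly. A sketch for `values.yaml`, assuming the stock `prometheus-node-exporter` subchart key; the blanket toleration is a common choice for node-exporter since it should run on every node, but tighten it if you prefer:

```yaml
prometheus-node-exporter:
  tolerations:
    - operator: Exists   # tolerate all taints so the exporter runs on every node
```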
### Temperature Metrics Missing

**Symptoms:**

- No `node_hwmon_temp_celsius` metrics
- Cannot monitor Pi temperature
**Diagnosis:**

```bash
# Check if hwmon is exposed on nodes
kubectl exec -n default <node-exporter-pod> -- ls -la /host/sys/class/hwmon
```
**Solution:**
- Raspberry Pi temperature sensors should be auto-detected
- Verify node-exporter has access to host filesystem
- Check hostPath mounts in DaemonSet spec
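For reference, the hostPath wiring that the last check refers to typically looks like this in a node-exporter DaemonSet spec (the exporter then reads sensors from the mounted host sysfs via `--path.sysfs`). This is a sketch of the relevant fragment, not this chart's exact manifest:

```yaml
containers:
  - name: node-exporter
    args:
      - --path.sysfs=/host/sys     # read hwmon sensors from the host, not the container
      - --path.rootfs=/host/root
    volumeMounts:
      - name: sys
        mountPath: /host/sys
        readOnly: true
volumes:
  - name: sys
    hostPath:
      path: /sys                   # host sysfs, where /sys/class/hwmon lives
```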
## Control Plane Component Monitoring Issues

### Control Plane Targets Showing Down

**Symptoms:**
- kube-controller-manager, etcd, kube-scheduler, or kube-proxy showing as DOWN in Prometheus
- Missing control plane metrics
- Scrape errors in Prometheus logs
Diagnosis:
# Check Prometheus targets
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets
# Check ServiceMonitor status
kubectl get servicemonitor -n default | grep -E "controller|etcd|scheduler|proxy"
# Verify control plane component endpoints
kubectl get endpoints -n kube-system kube-controller-manager
kubectl get endpoints -n kube-system kube-scheduler
**Common Causes:**

1. **Components Binding to Localhost:**
   - Problem: the default kubeadm configuration binds components to 127.0.0.1
   - Symptom: connection refused errors when Prometheus tries to scrape
   - Solution: update the kubeadm configuration to bind to 0.0.0.0

2. **Certificate/TLS Issues:**
   - Problem: self-signed certificates or TLS verification failures
   - Symptom: TLS handshake errors in Prometheus logs
   - Solution: configure `insecureSkipVerify` in the ServiceMonitor (use cautiously)

3. **Firewall/Network Policies:**
   - Problem: network policies blocking scrape traffic
   - Symptom: timeouts or connection refused
   - Solution: verify network policies allow Prometheus → control plane traffic
**Solution - Verify Bind Addresses:**

Check the kubeadm configuration for control plane components:

```bash
# Controller Manager
sudo cat /etc/kubernetes/manifests/kube-controller-manager.yaml | grep bind-address

# Scheduler
sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml | grep bind-address

# Expected output: --bind-address=0.0.0.0
```
**Solution - Update kubeadm Config (if binding to localhost):**

1. **Edit Controller Manager:**

   ```bash
   sudo vim /etc/kubernetes/manifests/kube-controller-manager.yaml
   ```

   ```yaml
   # Change:
   - --bind-address=127.0.0.1
   # To:
   - --bind-address=0.0.0.0
   ```

2. **Edit Scheduler:**

   ```bash
   sudo vim /etc/kubernetes/manifests/kube-scheduler.yaml
   ```

   ```yaml
   # Change:
   - --bind-address=127.0.0.1
   # To:
   - --bind-address=0.0.0.0
   ```

3. **Edit kube-proxy ConfigMap:**

   ```bash
   kubectl edit configmap kube-proxy -n kube-system
   ```

   ```yaml
   # In the config.conf section, change:
   metricsBindAddress: "127.0.0.1:10249"
   # To:
   metricsBindAddress: "0.0.0.0:10249"
   ```

   ```bash
   # Then restart the kube-proxy DaemonSet
   kubectl rollout restart daemonset kube-proxy -n kube-system
   ```

4. **etcd Configuration:**

   ```bash
   sudo vim /etc/kubernetes/manifests/etcd.yaml
   ```

   ```yaml
   # Verify:
   - --listen-metrics-urls=http://0.0.0.0:2381
   ```
**After Configuration Changes:**

Wait for kubelet to automatically restart the static pods (~30-60 seconds), then verify targets:

```bash
# Check if targets are UP
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets
# Look for: kube-controller-manager, etcd, kube-scheduler, kube-proxy
```
### Missing etcd Metrics

**Symptoms:**
- etcd target DOWN or missing
- No etcd performance metrics
Diagnosis:
# Check etcd endpoint
kubectl get endpoints -n kube-system kube-etcd
# Verify etcd is exposing metrics
curl -k http://<node-ip>:2381/metrics
**Solution:**
- Verify etcd is configured to expose metrics on port 2381
- Check ServiceMonitor configuration in values.yaml
- Ensure network connectivity from Prometheus pod to etcd
### Controller Manager Metrics Missing

**Diagnosis:**

```bash
# Test endpoint directly
curl -k https://<node-ip>:10257/metrics

# Check for certificate issues
kubectl logs -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus | grep controller-manager
```
**Solution:**

- Verify port 10257 is accessible
- Check TLS configuration in the ServiceMonitor
- Ensure `insecureSkipVerify: true` if using self-signed certs
### Scheduler Metrics Missing

**Diagnosis:**

```bash
# Test endpoint
curl -k https://<node-ip>:10259/metrics

# Check ServiceMonitor
kubectl get servicemonitor -n default kube-prometheus-stack-kube-scheduler -o yaml
```
**Solution:**

- Verify the scheduler is running: `kubectl get pods -n kube-system | grep scheduler`
- Check the bind address in the scheduler manifest
- Verify port 10259 is accessible
### kube-proxy Metrics Missing

**Diagnosis:**

```bash
# kube-proxy runs on all nodes
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide

# Test metrics endpoint (use any node IP)
curl http://<node-ip>:10249/metrics
```
**Solution:**

- Verify `metricsBindAddress` in the kube-proxy ConfigMap
- Restart kube-proxy after config changes
- Check that port 10249 is accessible from Prometheus
### Calico CNI Limitation with hostNetwork Pods

**IMPORTANT:** This is a known architectural limitation, not a configuration issue.

**Symptoms:**
- Control plane components show mixed UP/DOWN status
- Timeout errors when scraping components on different nodes
- Only components on same node as Prometheus work
**Root Cause:** Calico CNI cannot route traffic from the pod network to hostNetwork pods on different nodes via node IPs.
**Diagnosis:**

```bash
# 1. Check which node Prometheus is running on
kubectl get pod prometheus-kube-prometheus-stack-prometheus-0 -n default -o wide

# 2. Check which nodes control plane components are on
kubectl get pods -n kube-system -o wide | grep -E "proxy|etcd|scheduler|controller"

# 3. Test connectivity from the Prometheus pod
kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  wget -qO- --timeout=2 http://NODE_IP:PORT/metrics

# 4. Check Prometheus target status
kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kube-etcd"}' | python3 -m json.tool
```
**Why This Happens:**

- Control plane components use `hostNetwork: true` (required for their operation)
- hostNetwork pods don't get a pod IP in the Calico network (pod IP = node IP)
- Calico routing rules don't handle pod→host traffic across nodes
- Reverse path filtering blocks the traffic
**Which Components Are Affected:**
- ❌ kube-etcd: Runs only on control plane node (cannot be monitored reliably)
- ❌ kube-proxy: Only instance on same node as Prometheus works (1/5 success rate)
- ✅ kube-scheduler: Works via HTTPS (routed through API server)
- ✅ kube-controller-manager: May work via HTTPS (verify in your setup)
**Solution:** Disable monitoring for components that cannot work reliably:

```yaml
# In manifests/base/kube-prometheus-stack/values.yaml
kubeEtcd:
  enabled: false  # Disabled: Calico CNI cannot route to hostNetwork pods across nodes
kubeProxy:
  enabled: false  # Disabled: Calico CNI cannot route to hostNetwork pods across nodes
```
**Alternative Solutions (Not Recommended for Homelab):**
- Run Prometheus as DaemonSet (one per node) - too complex
- Change CNI to one without this limitation - major infrastructure change
- Deploy metrics proxy on each node - over-engineered
**Verification After Disabling:**

```bash
# Check that the disabled ServiceMonitors are gone
kubectl get servicemonitor -n default | grep -E "etcd|proxy"

# Verify the remaining targets work
kubectl exec -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="kube-scheduler"}'
```
## General Troubleshooting Steps

### Check All Monitoring Pods

```bash
# Check all monitoring components
kubectl get pods -n default | grep -E "prometheus|grafana|alertmanager|node-exporter"
kubectl get pods -n unipoller
```
### View Component Logs

```bash
# Prometheus
kubectl logs -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus

# Grafana
kubectl logs -n default deployment/kube-prometheus-stack-grafana

# AlertManager
kubectl logs -n default alertmanager-kube-prometheus-stack-alertmanager-0

# Node Exporter
kubectl logs -n default daemonset/prometheus-node-exporter

# UniFi Poller
kubectl logs -n unipoller deployment/unifi-poller
```
### Restart Monitoring Stack

**Individual Components:**

```bash
# Restart Grafana
kubectl rollout restart deployment/kube-prometheus-stack-grafana -n default

# Restart Node Exporter
kubectl rollout restart daemonset/prometheus-node-exporter -n default

# Restart UniFi Poller
kubectl rollout restart deployment/unifi-poller -n unipoller

# Restart Prometheus (StatefulSet - delete the pod)
kubectl delete pod prometheus-kube-prometheus-stack-prometheus-0 -n default
```

**Full Stack Restart:**

```bash
# Sync ArgoCD applications
argocd app sync kube-prometheus-stack --grpc-web
argocd app sync unipoller --grpc-web
```
### Verify ArgoCD Sync Status

```bash
# Check if monitoring apps are in sync
kubectl get application -n argocd | grep -E "prometheus|unipoller"

# Get detailed status
argocd app get kube-prometheus-stack --grpc-web
argocd app get unipoller --grpc-web
```
## Performance Issues

### Slow Dashboard Load Times

**Causes:**
- Expensive PromQL queries
- Large time ranges
- High cardinality metrics
**Solutions:**
- Use recording rules for expensive queries
- Limit dashboard time range
- Optimize PromQL queries
- Increase Prometheus resources
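Recording rules precompute expensive expressions at evaluation time so dashboards query a cheap, pre-aggregated series instead. A sketch of one; the rule name follows the common `level:metric:operation` convention, and the expression is an example rather than a rule from this stack:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dashboard-recording-rules   # hypothetical
  namespace: default
  labels:
    release: kube-prometheus-stack  # must match the Prometheus ruleSelector
spec:
  groups:
    - name: dashboard.rules
      rules:
        # Per-instance CPU utilisation, precomputed every evaluation interval
        - record: instance:node_cpu_utilisation:rate5m
          expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Dashboards then query `instance:node_cpu_utilisation:rate5m` directly instead of re-running the `rate`/`avg` over every panel refresh.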
### High CPU Usage on Prometheus

**Diagnosis:**

```bash
kubectl top pod prometheus-kube-prometheus-stack-prometheus-0 -n default
```
**Solutions:**
- Reduce scrape frequency
- Use recording rules
- Reduce retention period
- Optimize alert rule evaluation
## Raspberry Pi Specific Issues

### CPU Throttling Affecting Monitoring

**Symptoms:**
- Monitoring gaps during high load
- Inconsistent metric collection
**Solution:**

- Monitor Pi temperature: `node_hwmon_temp_celsius`
- Improve cooling if temps > 70°C
- Adjust resource limits on monitoring pods
- Distribute workload across nodes
### Network Bandwidth Saturation

**Symptoms:**
- Delayed metric collection
- iSCSI performance issues affecting Prometheus storage
**Solution:**
- Check network utilization metrics
- Reduce scrape intervals if needed
- Monitor Synology NAS network traffic
- Consider metric retention policies
## Recovery Procedures

### Complete Monitoring Stack Failure

1. **Check ArgoCD:**

   ```bash
   kubectl get application kube-prometheus-stack -n argocd
   ```

2. **Force Sync:**

   ```bash
   argocd app sync kube-prometheus-stack --force --grpc-web
   ```

3. **Check PVCs:**

   ```bash
   kubectl get pvc -n default | grep prometheus
   ```

4. **Verify Storage:**

   ```bash
   kubectl get pods -n synology-csi
   ```
### Data Loss Prevention

**Critical PVC:**

- `prometheus-kube-prometheus-stack-prometheus-db-prometheus-kube-prometheus-stack-prometheus-0`
- Uses the `synology-iscsi-retain` storage class
- PV is retained even if the PVC is deleted
**Backup Recommendations:**
- Synology snapshot schedule for Prometheus volume
- Regular testing of restore procedures
- Document alert rules and dashboard configs in git
## Getting Help

### Collect Diagnostic Information

```bash
# Create a diagnostic bundle
kubectl describe pods -n default > monitoring-pods.txt
kubectl logs -n default prometheus-kube-prometheus-stack-prometheus-0 -c prometheus > prometheus.log
kubectl logs -n default deployment/kube-prometheus-stack-grafana > grafana.log
kubectl get events -n default --sort-by='.lastTimestamp' > events.txt
```
### Useful Commands Reference

```bash
# Port-forward to UIs
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
kubectl port-forward -n default svc/kube-prometheus-stack-grafana 3000:80
kubectl port-forward -n default svc/kube-prometheus-stack-alertmanager 9093:9093

# Check resource usage
kubectl top pods -n default | grep prometheus
kubectl top pods -n unipoller

# View CRDs
kubectl get prometheus -A
kubectl get servicemonitor -A
kubectl get prometheusrule -A
```