Custom Grafana Dashboards
This guide documents the custom Grafana dashboards deployed for comprehensive Pi cluster monitoring.
Overview
Four custom dashboards provide visibility into cluster health, node resources, temperatures, and log analytics:
- Pi Cluster Overview - Unified health dashboard
- Node Resource Monitoring - Detailed per-node metrics
- Temperature Monitoring - Thermal analysis and monitoring
- Loki Log Analytics - Log aggregation and error tracking
All dashboards are deployed via ConfigMap sidecar provisioning and auto-discovered by Grafana.
Access
URL: https://grafana.k8s.n37.ca
Credentials: <grafana-username> / <grafana-password> (example only — set your own secure credentials in Grafana or your secret management system)
Dashboard Deployment
Architecture
Dashboards are deployed as Kubernetes ConfigMaps with the label grafana_dashboard: "1", which triggers auto-discovery by the Grafana sidecar container.
Deployment Pattern:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-<name>
  namespace: default
  labels:
    grafana_dashboard: "1" # Auto-discovery label
    app: grafana
data:
  <name>.json: |
    { "dashboard": {...} }
```
Kustomization Structure:
```
manifests/base/
├── grafana/
│   └── dashboards/
│       ├── kustomization.yaml            # Includes all dashboards
│       ├── pi-cluster-overview.yaml
│       ├── node-resource-monitoring.yaml
│       ├── temperature-monitoring.yaml
│       └── loki-log-analytics.yaml
└── kube-prometheus-stack/
    └── kustomization.yaml                # References dashboards via bases
```
Sidecar Auto-Discovery
The Grafana deployment includes a sidecar container (grafana-sc-dashboard) that:
- Watches for ConfigMaps with the label `grafana_dashboard: "1"`
- Extracts dashboard JSON from ConfigMap data
- Writes dashboard files to `/tmp/dashboards/`
- Grafana automatically loads dashboards from this directory
Discovery is automatic - no manual dashboard import required.
Dashboard Provisioning
Provisioned dashboards are read-only in the Grafana UI (`editable: false`); all changes are made through the ConfigMap definitions instead. To modify:
- Edit the dashboard ConfigMap YAML file
- Commit and push changes
- ArgoCD syncs automatically
- Grafana sidecar reloads dashboard (~30s)
Pi Cluster Overview Dashboard
UID: pi-cluster-overview
Refresh Rate: 30 seconds
Default Time Range: Last 6 hours
Purpose
Unified cluster health dashboard combining key metrics across all nodes.
Panels (7 Total)
1. Total Nodes (Stat)
- Query: `count(kube_node_info)`
- Shows: Number of Kubernetes nodes in the cluster
- Expected: 5 (control-plane + 4 worker nodes)
2. Total Pods (Stat)
- Query: `count(kube_pod_info)`
- Shows: Total running pods across all namespaces
- Typical Range: 40-60 pods
3. Cluster CPU Usage (Stat)
- Query: `(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100`
- Shows: Average CPU usage across all nodes
- Thresholds:
- Green: < 70%
- Yellow: 70-90%
- Red: > 90%
4. Cluster Memory Usage (Stat)
- Query: `(1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)) * 100`
- Shows: Total memory usage across cluster
- Cluster Total: ~80GB (5 nodes × 16GB)
- Thresholds: Same as CPU
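The memory formula above can be sanity-checked by hand; a pure-shell sketch of the same arithmetic (the byte counts are illustrative samples, not cluster readings):

```shell
# Mirrors the PromQL (1 - sum(MemAvailable) / sum(MemTotal)) * 100
# for a single sample, using awk for the floating-point math.
mem_usage_pct() {
  # $1 = MemAvailable_bytes, $2 = MemTotal_bytes
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f", (1 - avail / total) * 100 }'
}

mem_usage_pct 4294967296 17179869184   # 4 GiB available of 16 GiB -> 75.0
```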
5. CPU Usage Per Node (Time Series)
- Query: `(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100`
- Shows: Individual CPU usage trends for each node
- Legend: Last, Max values displayed
6. Memory Usage Per Node (Time Series)
- Query: `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100`
- Shows: Individual memory usage trends for each node
7. CPU Temperature Per Node (Time Series)
- Query: `node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"}`
- Shows: Real-time CPU temperatures for all Pi 5 nodes
- Thresholds:
- Green: < 70°C
- Yellow: 70-85°C
- Red: > 85°C
- Normal Range: 45-60°C (idle to moderate load)
Use Cases
- Quick health check - Glance at cluster status
- Capacity planning - Monitor resource utilization trends
- Thermal monitoring - Ensure nodes aren't overheating
- Incident response - Identify which nodes are under stress
Node Resource Monitoring Dashboard
UID: node-resource-monitoring
Refresh Rate: 30 seconds
Default Time Range: Last 6 hours
Purpose
Detailed per-node resource analysis for troubleshooting and capacity planning.
Panels (13 Total)
CPU Metrics
1. CPU Usage Per Node (Time Series)
- Query: `(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100`
- Shows: Individual node CPU % over time
- 5 series - One per node
2. CPU Load Average (Time Series)
- Queries:
  - `node_load1` - 1-minute load average
  - `node_load5` - 5-minute load average
  - `node_load15` - 15-minute load average
- Interpretation:
- Load < 4.0 (# of cores) = healthy
- Load > 4.0 = CPU saturation
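As a rule of thumb, a load average is compared against the node's core count (4 on a Pi 5); a small shell sketch of that check (sample values are illustrative):

```shell
# Classify a load average against the number of CPU cores:
# load below the core count is healthy, at or above it is saturation.
load_status() {
  # $1 = load average, $2 = core count
  awk -v load="$1" -v cores="$2" 'BEGIN {
    s = (load < cores) ? "healthy" : "saturated"
    print s
  }'
}

load_status 2.5 4   # healthy
load_status 5.1 4   # saturated
```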
Memory Metrics
3. Memory Usage Per Node (Time Series)
- Query: `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100`
- Shows: Per-node memory % over time
4. Available Memory Per Node (Time Series)
- Query: `node_memory_MemAvailable_bytes`
- Shows: Free memory in bytes per node
- 16GB per node = 17,179,869,184 bytes total
Disk Metrics
5. Disk I/O Per Node (Time Series)
- Queries:
  - Read: `rate(node_disk_read_bytes_total[5m])`
  - Write: `rate(node_disk_written_bytes_total[5m])`
- Shows: Bytes/sec read and written per disk
- Includes: MicroSD and NVMe devices
6. Disk IOPS Per Node (Time Series)
- Queries:
  - Read: `rate(node_disk_reads_completed_total[5m])`
  - Write: `rate(node_disk_writes_completed_total[5m])`
- Shows: I/O operations per second
Network Metrics
7. Network Receive (RX) Per Node (Time Series)
- Query: `rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[5m])`
- Shows: Inbound network traffic (bytes/sec)
- Excludes: Loopback and virtual interfaces
8. Network Transmit (TX) Per Node (Time Series)
- Query: `rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[5m])`
- Shows: Outbound network traffic (bytes/sec)
Filesystem Metrics
9. Filesystem Usage (Table)
- Queries:
  - Size: `node_filesystem_size_bytes`
  - Available: `node_filesystem_avail_bytes`
  - Used %: `(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100`
- Shows: Per-mountpoint disk usage
- Excludes: tmpfs, fuse, NFS mounts
Node Details
10. Node Boot Time (Stat)
- Query: `node_boot_time_seconds`
- Shows: Timestamp when each node last booted
11. Node Uptime (Stat)
- Query: `time() - node_boot_time_seconds`
- Shows: Seconds since boot per node
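The raw seconds returned by this query can be converted to a friendlier unit; a small shell helper (illustrative only, not part of the dashboard):

```shell
# Convert an uptime in seconds (as returned by time() - node_boot_time_seconds)
# into whole days and remaining hours.
uptime_human() {
  awk -v s="$1" 'BEGIN {
    days  = int(s / 86400)
    hours = int((s % 86400) / 3600)
    printf "%dd %dh", days, hours
  }'
}

uptime_human 432000   # 5d 0h
uptime_human 90000    # 1d 1h
```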
12. Kernel Version (Table)
- Query: `node_uname_info`
- Shows: Linux kernel version per node
- Expected: Debian/Raspberry Pi OS kernel
13. OS Information (Table)
- Query: `node_os_info`
- Shows: Operating system distribution per node
- Expected: Debian GNU/Linux 12 (bookworm)
Use Cases
- Performance troubleshooting - Identify bottlenecks (CPU, memory, disk, network)
- Capacity planning - Track resource trends over time
- Disk space monitoring - Prevent out-of-space conditions
- Network diagnostics - Measure bandwidth utilization
Temperature Monitoring Dashboard
UID: temperature-monitoring
Refresh Rate: 30 seconds
Default Time Range: Last 24 hours
Purpose
Raspberry Pi CPU thermal monitoring and cooling efficiency analysis.
Panels (8 Total)
1. CPU Temperature Per Node (24h) (Time Series)
- Query: `node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"}`
- Shows: Real-time CPU temperature trends for all 5 nodes
- Visualization:
- Smooth line interpolation
- Threshold lines at 70°C (yellow) and 85°C (red)
- Legend shows: Last, Max, Mean, Min values
Raspberry Pi 5 Thermal Characteristics:
- Idle: 40-50°C
- Moderate Load: 50-65°C
- Heavy Load: 65-75°C
- Throttle Point: 85°C (CPU will reduce frequency)
- Critical: 90°C (system protection kicks in)
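For spot checks directly on a node, the same sensor is exposed by the standard Linux thermal sysfs interface in millidegrees Celsius; a shell sketch (the sample reading is illustrative):

```shell
# On a Pi node, the kernel exposes the CPU thermal zone in millidegrees C:
#   cat /sys/class/thermal/thermal_zone0/temp   -> e.g. 52300
# Convert a millidegree reading into degrees Celsius:
millideg_to_c() {
  awk -v t="$1" 'BEGIN { printf "%.1f", t / 1000 }'
}

millideg_to_c 52300   # 52.3
```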
Temperature Statistics
2. Current Max Temperature (Stat)
- Query: `max(node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"})`
- Shows: Hottest node right now
3. Current Avg Temperature (Stat)
- Query: `avg(node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"})`
- Shows: Average CPU temp across cluster
4. Max Temperature (24h) (Stat)
- Query: `max_over_time(node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"}[24h])`
- Shows: Peak temperature in last 24 hours
- Use: Identify thermal spikes during heavy workloads
5. Current Min Temperature (Stat)
- Query: `min(node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"})`
- Shows: Coolest node right now
Advanced Visualizations
6. Temperature Distribution Heatmap
- Query: `node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"}`
- Visualization: Heatmap showing temperature distribution over time
- Color Scheme: Oranges (cooler = lighter, hotter = darker)
- Use: Identify thermal patterns and daily cycles
7. Temperature Delta (Cooling Efficiency) (Gauge)
- Query: `max(node_hwmon_temp_celsius{...}) - min(node_hwmon_temp_celsius{...})`
- Shows: Temperature difference between hottest and coolest nodes
- Thresholds:
- Green: < 5°C (excellent cooling uniformity)
- Yellow: 5-10°C (acceptable variation)
- Red: > 10°C (poor cooling efficiency)
- Ideal: < 5°C delta indicates well-balanced cooling
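The gauge's threshold bands can be expressed directly; a shell sketch mirroring the green/yellow/red logic above (sample temperatures are illustrative):

```shell
# Compute the hottest-to-coolest delta and map it onto the gauge's color bands
# (< 5 C green, 5-10 C yellow, > 10 C red).
temp_delta_band() {
  # $1 = max temp, $2 = min temp (degrees C)
  awk -v max="$1" -v min="$2" 'BEGIN {
    d = max - min
    if (d < 5)        band = "green"
    else if (d <= 10) band = "yellow"
    else              band = "red"
    printf "%.1f %s", d, band
  }'
}

temp_delta_band 58.2 54.1   # 4.1 green
temp_delta_band 67.0 54.0   # 13.0 red
```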
8. Temperature Summary by Node (Table)
- Columns:
- Instance (node IP)
- Current Temp
- Max (24h)
- Avg (24h)
- Sorted by: Current temperature (descending)
- Use: Identify which nodes run hotter consistently
Temperature Sensor Details
Metric: node_hwmon_temp_celsius
Available Chips:
- `thermal_thermal_zone0` - CPU thermal zone (used in dashboard)
- `1000120000_pcie_1f000c8000_adc` - PCIe/RP1 chip sensor
- `nvme_nvme0` - NVMe SSD temperature
Why thermal_thermal_zone0? This is the kernel's thermal zone for the Broadcom BCM2712 SoC (Raspberry Pi 5 CPU), which represents the actual CPU die temperature.
Cooling Recommendations
If temperatures consistently exceed 75°C:
- Verify active cooling (fans) are operational
- Check for dust accumulation on heatsinks
- Ensure adequate airflow in rack/case
- Consider upgrading to Active Cooler or heatsink with larger surface area
- Review pod scheduling - redistribute CPU-heavy workloads
Normal Operation:
- Passive cooling: 55-70°C under load
- Active cooling (fan): 45-55°C under load
- Idle: 40-50°C
Use Cases
- Thermal management - Ensure nodes stay within safe operating temperatures
- Cooling validation - Verify active cooling is effective
- Workload optimization - Identify temperature spikes during specific workloads
- Preventive maintenance - Detect thermal degradation over time
Loki Log Analytics Dashboard
UID: loki-log-analytics
Refresh Rate: 30 seconds
Default Time Range: Last 1 hour
Purpose
Log aggregation monitoring and analysis for cluster-wide log insights.
Panels (10 Total)
Ingestion Metrics
1. Log Ingestion Rate (Lines/sec) (Time Series)
- Query: `sum(rate(loki_distributor_lines_received_total[5m]))`
- Shows: Total log lines ingested per second
- Unit: cps (counts per second)
- Typical Range: 10-100 lines/sec (depends on verbosity)
2. Log Ingestion Rate (Bytes/sec) (Time Series)
- Query: `sum(rate(loki_distributor_bytes_received_total[5m]))`
- Shows: Total log data ingested per second
- Unit: Bytes/sec
- Typical Range: 1-10 KB/sec
Loki Internals
3. Active Streams (Stat)
- Query: `sum(loki_ingester_streams)`
- Shows: Number of active log streams in ingester
- Stream: Unique combination of labels (namespace, pod, container)
4. Chunks Flushed/sec (Stat)
- Query: `sum(rate(loki_ingester_chunks_flushed_total[5m]))`
- Shows: Rate of chunk flushes to storage
- High rate: May indicate high log volume or short retention
5. Ingester Memory Usage (Stat)
- Query: `sum(loki_ingester_memory_chunks_bytes)`
- Shows: Memory used by in-memory log chunks
- Thresholds:
- Green: < 256MB
- Yellow: 256-512MB
- Red: > 512MB
- Tuned for: Pi cluster with limited RAM
Log Volume Analysis
6. Log Volume by Namespace (Time Series)
- Query: `sum by (namespace) (count_over_time({namespace!=""}[5m]))`
- Shows: Log lines per namespace over time
- Visualization: Stacked bars
- Use: Identify which namespaces are most verbose
7. Error Log Volume by Pod (Time Series)
- Query: `sum by (pod) (count_over_time({namespace!=""} |~ "(?i)(error|err|failed|fatal)" [5m]))`
- Shows: Error-level logs per pod
- Pattern: Case-insensitive match for error keywords
- Use: Quickly identify pods with errors
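The LogQL matcher `|~ "(?i)(error|err|failed|fatal)"` behaves like a case-insensitive extended regex; an equivalent grep sketch for trying the pattern against local log text (sample lines are illustrative):

```shell
# Count lines matching the dashboard's error pattern, case-insensitively
# (grep -E = extended regex, -i = ignore case, -c = count matching lines).
count_error_lines() {
  grep -Eic '(error|err|failed|fatal)'
}

printf 'all good\nERROR: disk full\nconnection Failed\n' | count_error_lines   # 2
```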
Query Performance
8. Query Performance (Latency) (Time Series)
- Queries:
  - p95: `histogram_quantile(0.95, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le))`
  - p99: `histogram_quantile(0.99, sum(rate(loki_request_duration_seconds_bucket[5m])) by (le))`
- Shows: Query latency at 95th and 99th percentiles
- Healthy: p95 < 1s, p99 < 2s
- Degraded: p95 > 2s indicates query performance issues
Log Inspection
9. Recent Error Logs (Logs Panel)
- Query: `{cluster="$cluster", namespace=~"$namespace"} |~ "(?i)(error|err|failed|fatal)"`
- Shows: Live stream of error-level logs from all namespaces
- Features:
- Filterable by namespace/pod
- Time-ordered (newest first)
- Syntax highlighting
- Pattern Matching: Case-insensitive regex for common error keywords
10. Top 20 Pods by Log Volume (Time Series)
- Query: `topk(20, sum by (pod) (count_over_time({namespace!=""}[5m])))`
- Shows: Pods generating the most log lines
- Use: Identify chatty applications or potential logging issues
LogQL Query Optimization
Performance Best Practices:
1. Use specific namespace selectors:

   ```logql
   # Good
   {namespace!=""}   # Excludes empty namespace

   # Avoid
   {namespace=~".+"} # Permissive regex (slower)
   ```

2. Limit time ranges for expensive queries:
   - Large scans: Use 5m-15m ranges
   - Error searches: Use 1h-6h ranges
   - Full-text searches: Avoid > 24h ranges

3. Case-insensitive pattern matching:

   ```logql
   # Efficient
   |~ "(?i)(error|err|failed|fatal)"

   # Redundant
   |~ "(?i)(error|ERROR|err|ERR|failed|FAILED)"
   ```
Loki Configuration
Retention: 7 days (configured in Loki values.yaml)
Storage: Local filesystem (PVC-backed)
Limits:
- Max query size: 5000 lines
- Max streams per user: 10000
Use Cases
- Troubleshooting - Search for errors across all pods
- Performance monitoring - Track query latency and ingestion rate
- Capacity planning - Monitor log volume growth
- Incident investigation - Filter logs by time range and keywords
Migrated Community Dashboards
In addition to the 4 custom dashboards above, 13 community/vendor dashboards that were previously manually imported have been migrated to ConfigMaps for GitOps management.
Loki Dashboards (4)
1. Logging Dashboard via Loki
- Source: Community dashboard
- Purpose: Comprehensive log search and visualization
- Features: Multi-namespace log filtering, live log tailing, pattern detection
2. Loki Dashboard
- Source: Grafana Labs
- Purpose: Quick search and log volume overview
- Features: Log volume by namespace, ingestion rate, query performance
3. Loki Stack Monitoring (Alloy, Loki)
- Source: Community dashboard (updated 2026-03-01 for Alloy migration)
- Purpose: Monitor Loki and Alloy log collector components
- Features: Ingester metrics, distributor stats, Alloy scrape status
4. Loki 2.0 Global Metrics
- Source: Grafana Labs
- Purpose: Loki internal metrics and performance
- Features: Memory usage, query duration, chunk operations
Synology NAS Dashboards (2)
1. Synology Dashboard
- Source: Community SNMP dashboard
- Purpose: NAS health and storage monitoring
- Features: Volume capacity, disk temperatures, RAID status, iSCSI targets
2. Synology Dashboard2
- Source: Alternative SNMP dashboard layout
- Purpose: Detailed NAS performance metrics
- Features: Network throughput, CPU/memory, disk I/O, service status
UniFi Network Dashboards (7)
1. UniFi Access Points
- Purpose: Wireless access point monitoring
- Features: AP status, client connections, channel utilization, signal strength
2. UniFi Clients
- Purpose: Client device tracking
- Features: Connected clients, bandwidth usage, connection history
3. UniFi DPI (Deep Packet Inspection)
- Purpose: Application-level traffic analysis
- Features: Traffic by application, protocol distribution, top talkers
4. UniFi Gateway
- Purpose: USG/UDM gateway monitoring
- Features: WAN/LAN throughput, firewall statistics, threat detection
5. UniFi PDU
- Purpose: Power Distribution Unit monitoring (if applicable)
- Features: Power consumption, outlet status, voltage/current
6. UniFi Sites
- Purpose: Multi-site UniFi deployment overview
- Features: Site health, device counts, aggregate statistics
7. UniFi Switches
- Purpose: Switch monitoring
- Features: Port status, PoE usage, bandwidth per port, STP topology
Data Source: All UniFi dashboards use metrics from UniFi Poller (deployed in unipoller namespace)
Ingress Dashboards (1)
1. Ingress NGINX Overview
- Source: Custom dashboard (deployed 2026-03-01)
- Purpose: Comprehensive ingress controller monitoring
- Features:
- Overview stats: RPS, success rate, active connections, config reload, p95 latency
- Request rate by status code (2xx/3xx/4xx/5xx) with per-host breakdown
- Latency percentiles (p50/p90/p95/p99) with per-host p95
- Upstream performance: response time p95, request/response sizes
- Connection states: active/reading/writing/waiting + rate-limited 429s
- Controller health: memory, CPU, config reload timestamps
- Data Source: ingress-nginx controller metrics (ingress-nginx namespace)
- Folder: network
- Template Variable: `host` dropdown for per-Ingress filtering
- Alerts: 7 PrometheusRule alerts for availability, latency, and controller health
Security Dashboards (2)
1. Trivy Security Overview
- Source: Custom dashboard
- Purpose: Kubernetes security scanning and compliance monitoring
- Features: Compliance status (CIS/NSA), RBAC assessments, configuration audits, exposed secrets
- Data Source: Trivy Operator metrics (trivy-system namespace)
- Note: Updated 2026-01-29 to focus on compliance/RBAC metrics due to v0.29.0 vulnerability scanning bug
2. Falco Runtime Security
- Source: Custom dashboard (deployed 2026-01-29)
- Purpose: Runtime threat detection and security event monitoring
- Features:
- Critical/Error/Warning event counts (24h)
- Security events by priority over time
- Events by rule, namespace, and top pods
- Syscall event rate and memory usage
- Event drop rate monitoring
- Data Source: Falco metrics (falco namespace)
- Alerts: 7 PrometheusRule alerts for critical security events
CI/CD Dashboards (1)
1. Argo Workflows
- Source: Custom dashboard (deployed 2026-01-29)
- Purpose: CI/CD pipeline monitoring
- Features: Workflow status, duration, success rate, resource usage
- Data Source: Argo Workflows metrics (argo-workflows namespace)
Dashboard Organization
These dashboards are organized in Grafana folders for easier navigation:
- Folder: Loki - 4 log monitoring dashboards
- Folder: Synology - 2 NAS monitoring dashboards
- Folder: UniFi - 7 network infrastructure dashboards
- Folder: Security - 3 security dashboards (Trivy, Falco, Gatekeeper)
- Folder: Network - 2 network dashboards (Network Utilization, Ingress NGINX)
- Folder: General - 4 custom Pi cluster dashboards + Storage + APM + CI/CD
ConfigMap Label for Folders:
```yaml
labels:
  grafana_dashboard: "1"
  folder: "loki" # or "synology", "unifi"
```
Dashboard Audit (Updated 2026-03-01)
Current State
- Total Dashboards: 49 (all provisioned via ConfigMap) ✅
- Custom Dashboards: 10 (4 original + Trivy + Falco + Argo Workflows + Gatekeeper + Ingress NGINX + APM)
- Migrated Community Dashboards: 13 (Loki: 4, Synology: 2, UniFi: 7)
- Kube-Prometheus-Stack Dashboards: 26
- Uncommitted Dashboards: 0 ✅
All dashboards are managed as code - there are no manually created or uncommitted dashboards in the Grafana UI.
Audit Process
The following audit was performed to verify all dashboards are in GitOps:
1. Verified Dashboard Provisioning Configuration:

   ```sh
   # Check sidecar provisioning config
   kubectl exec -n default deployment/kube-prometheus-stack-grafana \
     -c grafana -- cat /etc/grafana/provisioning/dashboards/sc-dashboardproviders.yaml
   ```

   Key Settings:
   - `allowUiUpdates: false` - UI modifications are disabled
   - `disableDeletion: false` - Dashboards can be deleted but will be recreated by sidecar
   - `path: /tmp/dashboards` - All dashboards loaded from this directory
2. Listed All Provisioned Dashboards:

   ```sh
   # List all dashboard files
   kubectl exec -n default deployment/kube-prometheus-stack-grafana \
     -c grafana -- ls -1 /tmp/dashboards/ | sort
   ```

   Custom Dashboards (4):
   - loki-log-analytics.json
   - node-resource-monitoring.json
   - pi-cluster-overview.json
   - temperature-monitoring.json

   Migrated Community Dashboards (13):
   - logging-dashboard-via-loki.json (Loki folder)
   - loki-dashboard.json (Loki folder)
   - loki-stack-monitoring.json (Loki folder)
   - loki2-global-metrics.json (Loki folder)
   - synology-dashboard.json (Synology folder)
   - synology-dashboard2.json (Synology folder)
   - unifi-access-points.json (UniFi folder)
   - unifi-clients.json (UniFi folder)
   - unifi-dpi.json (UniFi folder)
   - unifi-gateway.json (UniFi folder)
   - unifi-pdu.json (UniFi folder)
   - unifi-sites.json (UniFi folder)
   - unifi-switches.json (UniFi folder)

   Kube-Prometheus-Stack Dashboards (26):
   - alertmanager-overview.json
   - apiserver.json
   - cluster-total.json
   - controller-manager.json
   - grafana-overview.json
   - k8s-coredns.json
   - k8s-resources-cluster.json
   - k8s-resources-multicluster.json
   - k8s-resources-namespace.json
   - k8s-resources-node.json
   - k8s-resources-pod.json
   - k8s-resources-workload.json
   - k8s-resources-workloads-namespace.json
   - kubelet.json
   - namespace-by-pod.json
   - namespace-by-workload.json
   - node-cluster-rsrc-use.json
   - node-rsrc-use.json
   - nodes-aix.json
   - nodes-darwin.json
   - nodes.json
   - persistentvolumesusage.json
   - pod-total.json
   - prometheus.json
   - scheduler.json
   - workload-total.json
3. Verified All Dashboards Have ConfigMap Sources:

   ```sh
   # Count dashboard ConfigMaps
   kubectl get configmap -n default -l grafana_dashboard=1 | wc -l
   ```

   Result: 43 ConfigMaps (matches 43 dashboard files)
Audit Conclusion
✅ All dashboards are managed as code via GitOps
✅ UI dashboard creation is disabled (allowUiUpdates: false)
✅ No manual migrations needed - all existing dashboards already have ConfigMap sources
✅ Sidecar auto-discovery is working - all ConfigMaps are loaded automatically
Recommendation: Maintain this GitOps-only workflow for all future dashboard changes.
Common Tasks
Adding a New Dashboard
Note: Dashboard creation through the Grafana UI is disabled (allowUiUpdates: false). All dashboards must be created as ConfigMaps.
Workflow:
1. Option A: Create the dashboard JSON manually, or Option B: create it in the Grafana UI and export.

   If using Option B:
   - Temporarily enable `allowUiUpdates: true` in values.yaml
   - Create dashboard in Grafana UI
   - Export JSON (Settings → JSON Model)
   - Set `allowUiUpdates: false` again
2. Create ConfigMap YAML:

   ```yaml
   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: grafana-dashboard-<name>
     namespace: default
     labels:
       grafana_dashboard: "1" # Required for sidecar discovery
       app: grafana
   data:
     <name>.json: |
       {
         "editable": false,
         "title": "Dashboard Title",
         "uid": "unique-dashboard-id",
         ...
       }
   ```

3. Add to manifests/base/grafana/dashboards/kustomization.yaml:

   ```yaml
   resources:
     - pi-cluster-overview.yaml
     - node-resource-monitoring.yaml
     - temperature-monitoring.yaml
     - loki-log-analytics.yaml
     - <your-new-dashboard>.yaml # Add here
   ```
4. Commit and push changes
5. ArgoCD syncs automatically (~3 minutes)
6. Grafana sidecar discovers and loads dashboard (~30 seconds)
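Before committing, the embedded dashboard JSON can be validated locally; a hedged sketch (assumes `python3` is on your PATH; the sample document stands in for your exported dashboard):

```shell
# Fail fast on malformed dashboard JSON before it ever reaches ArgoCD:
# json.load() raises on invalid input, so the function only prints "valid"
# when the document parses.
validate_dashboard_json() {
  python3 -c 'import json, sys; json.load(sys.stdin); print("valid")'
}

echo '{"editable": false, "title": "Dashboard Title", "uid": "unique-dashboard-id"}' \
  | validate_dashboard_json   # valid
```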
Modifying an Existing Dashboard
Option 1: Edit YAML directly (recommended)
- Edit the dashboard ConfigMap YAML
- Commit and push changes
- ArgoCD syncs automatically
- Grafana reloads dashboard (~30s)
Option 2: Export from UI
- Make changes in Grafana UI
- Export JSON (Settings → JSON Model)
- Copy JSON into ConfigMap YAML
- Ensure `editable: false` is set
- Commit and push
Note: UI edits are temporary - they will be overwritten on next sync.
Troubleshooting Dashboard Issues
Dashboard not appearing:
```sh
# 1. Verify ConfigMap exists
kubectl get configmap -n default -l grafana_dashboard=1

# 2. Check sidecar logs
kubectl logs -n default deployment/kube-prometheus-stack-grafana \
  -c grafana-sc-dashboard --tail=50

# 3. Verify dashboard was written
kubectl logs -n default deployment/kube-prometheus-stack-grafana \
  -c grafana-sc-dashboard | grep "Writing.*<dashboard-name>"
```
Query not returning data:
```sh
# Test query in Prometheus
kubectl port-forward -n default prometheus-kube-prometheus-stack-prometheus-0 9090:9090
# Open http://localhost:9090 and test PromQL query
```
Temperature metrics missing:
```sh
# With the port-forward from above still running, check available hwmon chips
curl -s 'http://localhost:9090/api/v1/label/chip/values' | jq

# Query all temperature sensors
curl -s 'http://localhost:9090/api/v1/query?query=node_hwmon_temp_celsius' | jq
```
Dashboard Configuration
Datasource References
All dashboards use structured datasource format:
Prometheus:
"datasource": {
"type": "prometheus",
"uid": "prometheus"
}
Loki:
"datasource": {
"type": "loki",
"uid": "loki"
}
Common Thresholds
Resource Usage (CPU/Memory):
- Green: < 70%
- Yellow: 70-90%
- Red: > 90%
Temperature:
- Green: < 70°C
- Yellow: 70-85°C
- Red: > 85°C
Loki Ingester Memory:
- Green: < 256MB
- Yellow: 256-512MB
- Red: > 512MB