Homelab TODO & Improvements

Recently Completed (December 2025 - March 2026)

Infrastructure Fixes (January 2026)

  • Tigera Operator Migration - Migrated Calico CNI to GitOps-managed Tigera operator (PRs #346-352, 2026-01-30)
    • Calico now managed by Tigera operator in calico-system namespace
    • ArgoCD Application with multi-source (operator from GitHub, Installation CR from homelab)
    • Typha topology spread constraints for node distribution
    • Established ignoreDifferences patterns for operator-managed resources
    • ArgoCD repo-server memory increased to 512Mi for large manifest generation
  • External-DNS Domain Filter Fix - Fixed subdomain zone filtering (PRs #295-296, 2026-01-25)
    • Root cause: --domain-filter=k8s.n37.ca rejected the n37.ca Cloudflare zone
    • Solution: Use parent zone as domain-filter; ingresses specify exact hostnames
  • Grafana fsGroup Race Condition - Fixed mount failure with Synology CSI (PR #298, 2026-01-25)
    • Root cause: SQLite journal file deleted during fsGroup recursive application
    • Solution: Added fsGroupChangePolicy: OnRootMismatch to podSecurityContext

Secrets Management (January 2026)

  • Sealed Secrets Migration - Migrated 8 secrets from git-crypt to SealedSecrets (2026-01-14)
  • External Secrets Removed - Evaluation complete, Sealed Secrets chosen for simplicity (2026-01-14)
  • Secrets Directory Cleanup - Removed 15 obsolete files, only ArgoCD bootstrap secret remains

Backup & Disaster Recovery (January 2026)

  • Velero Backblaze B2 Migration - Migrated from LocalStack to Backblaze B2 for production backups (2026-01-14, PR #239)
  • Velero CSI Snapshots - Configured Velero to use CSI snapshots exclusively (2026-01-05)
  • snapshot-controller Fix - Temporarily downgraded from v8.2.0 to v6.3.1 to resolve VolumeSnapshot failures (2026-01-05), then upgraded to v8.2.1 with csi-snapshotter v8.4.0 (2026-01-11)
  • Loki Memory Optimization - Set GOMEMLIMIT and ingestion rate limits, reducing memory usage from 474Mi to 232Mi (2026-01-05)

Monitoring & Observability (December 2025)

  • SNMP Monitoring for Synology - Deployed SNMP exporter, scraping NAS metrics (disk health, temperature, RAID status)
  • Node Exporter for Pi Cluster - DaemonSet running on all 5 nodes, monitoring CPU, memory, disk, network
  • Log Aggregation - Loki + Alloy deployed, 7-day retention, collecting logs from all pods on all nodes (including control-plane)
  • Prometheus Stack Fixes - Fixed node-exporter scraping, Grafana PVC issues, cleaned up control plane monitoring
  • Control Plane Monitoring - Re-enabled kube-scheduler and kube-controller-manager monitoring
  • ServiceMonitor Enablement - Enabled metrics collection for Loki and Alloy

DNS & Service Discovery

  • External-DNS Deployment - Dual provider setup (Cloudflare + UniFi webhook) for split-horizon DNS (2025-12-27)
    • Cloudflare provider for public DNS records
    • kashalls/external-dns-unifi-webhook v0.7.0 for internal DNS
    • Automatic DNS record creation for Ingresses (argocd.k8s.n37.ca, grafana.k8s.n37.ca, localstack.k8s.n37.ca, workflows.k8s.n37.ca)
    • TXT registry for ownership tracking
    • Fixed domain-filter for subdomain zones (PRs #295-296, 2026-01-25) - Use parent zone (n37.ca) as domain-filter
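
The domain-filter fix above is easier to follow with a small model of the matching rule. The sketch below is a simplified illustration, not External-DNS's actual implementation: it assumes a zone is kept only when its name equals the filter or is a subdomain of it. The function name `matches_domain_filter` is hypothetical.

```python
def matches_domain_filter(name: str, domain_filter: str) -> bool:
    """Return True if `name` equals `domain_filter` or is a subdomain of it.

    Simplified model (assumption): suffix matching on a label boundary,
    case-insensitive, ignoring any trailing dot.
    """
    name = name.rstrip(".").lower()
    domain_filter = domain_filter.rstrip(".").lower()
    return name == domain_filter or name.endswith("." + domain_filter)


# The Cloudflare zone is the parent domain, so a subdomain filter rejects it.
zone = "n37.ca"
rejected = matches_domain_filter(zone, "k8s.n37.ca")  # zone is not under the filter
accepted = matches_domain_filter(zone, "n37.ca")      # parent zone as the filter

# Ingress hostnames still fall under the parent-zone filter.
host_ok = matches_domain_filter("argocd.k8s.n37.ca", "n37.ca")
```

Under this model the subdomain filter drops the only zone Cloudflare exposes, which is consistent with the root cause recorded above. With the parent zone as the filter, the exact hostnames on each Ingress decide which records are created.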

Documentation

  • Comprehensive Docs Site - k8s-docs-n37 Docusaurus site with application guides
  • External-DNS Guide - Complete documentation with dual provider setup and troubleshooting
  • Loki Application Guide - Complete documentation for Loki + Alloy deployment
  • SNMP Exporter Guide - Synology monitoring documentation
  • Troubleshooting Guides - Monitoring stack and common issues documented

🎯 High Priority

1. Blackbox Exporter ✅ Complete

  • Blackbox Exporter - Fully operational (deployed 2025-12-27, verified 2025-12-28)
  • Deploy blackbox exporter for endpoint monitoring (v0.25.0, 2 replicas)
  • Monitor external services availability (DNS, HTTP/HTTPS probes configured)
  • SSL certificate expiry monitoring for k8s.n37.ca domain (https_cert_expiry module)
  • Network latency and response time tracking (ICMP ping monitoring)
  • Add alerts for service downtime (12 PrometheusRule alerts configured)
  • Monitor Synology NAS web interface availability (10.0.1.204 monitored)

Documentation: See Blackbox Exporter Application Guide for complete deployment details.

2. Enhanced Alerting ✅ Complete

  • AlertManager SMTP Email - Configured Gmail SMTP for critical alerts (2025-12-27)
  • Alert Routing - Critical → email, warning/info → null (to reduce noise)
  • Velero Backup Alerts - 7 PrometheusRule alerts for backup monitoring
  • HTML Email Templates - Custom-formatted critical alert emails
  • Configure AlertManager webhook to Discord/Slack/Telegram - Not used (email preferred)
  • Implement tiered alerting (warning → suppress, critical → email)
  • Predictive Disk Space Alerts - Node filesystem, PVC, and Synology volume alerts with predict_linear() (2026-01-12)
  • NAS Health Alerts - Disk failures, RAID degradation, temperature, bad sectors, power status (2026-01-12)
  • Alert runbooks - Documented in secrets/SEALED-SECRETS.md and k8s-docs-n37 (2026-01-14)
  • Test alert routing - Verified email delivery (121 sent, 0 failed) (2026-01-14)
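
The predictive disk space alerts above rely on `predict_linear()`, which fits a straight line to recent samples and extrapolates it forward. The following is a conceptual sketch of that idea in plain Python, not the Prometheus implementation. It assumes "now" is the timestamp of the last sample, and the sample data is invented for illustration.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]], seconds_ahead: float) -> float:
    """Least-squares line through (timestamp, value) samples, extrapolated forward.

    Assumption: the prediction is made relative to the last sample's timestamp.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


# Hypothetical data: free space in GiB, sampled hourly, shrinking 1 GiB per hour.
SAMPLES = [(0.0, 10.0), (3600.0, 9.0), (7200.0, 8.0), (10800.0, 7.0)]

free_in_4h = predict_linear(SAMPLES, 4 * 3600)
free_in_8h = predict_linear(SAMPLES, 8 * 3600)
would_alert = free_in_8h < 0  # the volume is projected to fill within 8 hours
```

An alert built this way fires before the volume is full, which is the point of using a prediction instead of a fixed usage threshold.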

3. Backup Strategy ✅ Complete

  • Velero - Deployed for Kubernetes cluster backup (2025-12-27)
  • CSI Snapshots - Configured Velero to use CSI snapshots exclusively (2026-01-05)
  • snapshot-controller - Temporarily deployed v6.3.1 (2026-01-05), then upgraded to v8.2.1 with csi-snapshotter v8.4.0 (2026-01-11)
  • Backup critical PVCs (Prometheus 50Gi, Grafana 5Gi, Loki 20Gi)
  • Daily PVC backups (2 AM, 30-day retention) - CSI snapshots operational
  • Weekly cluster resource backups (3 AM Sunday, 90-day retention)
  • Velero backup monitoring alerts (7 PrometheusRule alerts)
  • Fixed VolumeSnapshot failures - Upgraded snapshot-controller to v8.2.1, csi-snapshotter to v8.4.0 (2026-01-11)
  • LocalStack Sync Wave Fix - LocalStack at wave -7, before Velero (-5) ✓
  • Schedule regular backup testing - Velero B2 restore tested and validated (2026-01-14)
  • Migrate from LocalStack to Backblaze B2 - Production backup storage (2026-01-14, PR #239)
  • Test disaster recovery scenarios - Namespace restore with SealedSecrets validated (2026-01-14)
  • ArgoCD configuration backup automation - Daily backup schedule at 1:30 AM (2026-01-14)
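
The daily and weekly schedules above can be written down as cron expressions plus a retention period. This is a rough sketch under assumptions: the schedule names are hypothetical, the cron strings are a plain reading of "2 AM daily" and "3 AM Sunday", and the time zone they are evaluated in is not stated in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupSchedule:
    name: str            # hypothetical name, not taken from the repository
    cron: str            # standard 5-field cron expression
    retention_days: int  # retention as listed above

    @property
    def ttl(self) -> str:
        """Retention as an hours-based duration string such as 720h0m0s."""
        return f"{self.retention_days * 24}h0m0s"


SCHEDULES = [
    BackupSchedule("daily-pvc", "0 2 * * *", 30),          # 2 AM every day
    BackupSchedule("weekly-resources", "0 3 * * 0", 90),   # 3 AM on Sunday
]

for schedule in SCHEDULES:
    print(f"{schedule.name}: cron={schedule.cron!r} ttl={schedule.ttl}")
```

Expressing retention in hours matches the duration style Velero uses for a backup TTL, so 30 days becomes 720h and 90 days becomes 2160h.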

Note: Kopia file-level backups disabled in favor of CSI snapshots (more efficient for block storage)

Documentation: See Velero Application Guide for complete deployment details and disaster recovery procedures.


🔍 Monitoring & Observability Enhancements

4. Custom Dashboards ✅ Complete

  • Custom Grafana Dashboards - 4 dashboards deployed via ConfigMap provisioning (2025-12-28)
  • Pi cluster temperature monitoring dashboard (per-node CPU temps with Raspberry Pi 5 specifics)
  • Node resource utilization dashboard (CPU, memory, disk per node)
  • Loki log volume and ingestion rate dashboard (log analytics and error tracking)
  • Create unified "cluster health" dashboard (Pi Cluster Overview with 12 panels)
  • Migrate Uncommitted Dashboards to Code - Completed audit, no migration needed (2025-12-28)
    • Audit Grafana UI for any manually created or modified dashboards (30 total, all in ConfigMaps)
    • Export uncommitted dashboards as JSON (N/A - no uncommitted dashboards found)
    • Create ConfigMap manifests for exported dashboards (N/A - all 30 already in code)
    • Add to kustomization and deploy via GitOps (N/A - all already deployed)
    • Verify dashboards load correctly after migration (All 30 dashboards confirmed via sidecar)
    • Document dashboard creation and modification workflow (Added comprehensive audit section)
  • Network utilization dashboard - Deployed initial cluster-wide network utilization view (2026-02-05)
  • Storage performance metrics (iSCSI latency, IOPS, throughput) - Dashboard deployed (PR #383, 2026-02-05)
  • Application performance monitoring (APM) dashboard - 8-row overview with service health, CPU/memory, blackbox endpoints, API server, network I/O, saturation (2026-02-13)

Documentation: See Grafana Dashboards Guide for dashboard details.

5. Metrics Server Deployment ✅ Complete

  • Metrics Server - Deployed for kubectl top and HPA (2025-12-28)
  • Deploy metrics-server for kubectl top commands
  • Enable Horizontal Pod Autoscaler (HPA) capabilities
  • Configure for resource-constrained Pi environment (50m CPU / 100Mi RAM)
  • Prometheus ServiceMonitor integration

6. Log-Based Alerting ✅ ENABLED (2026-03-01)

  • Loki Ruler Alerting - Enabled via structuredConfig, because rulerConfig is ignored when ruler.enabled=false
  • Set up Loki alerting rules for error patterns (HighErrorLogRate, CriticalErrorLogs)
  • Alert on CrashLoopBackOff events (CrashLoopBackOffDetected)
  • Alert on OOMKilled events (OOMKilledDetected)
  • Alert on persistent pod failures (PersistentPodRestarts)
  • Create log-based SLO monitoring (Error rate tracking via HighErrorLogRate)
  • Additional alerts: HTTP 5xx errors, DB connection errors, auth failures, security events

Status: 9 LogQL rules in 4 groups are deployed as a ConfigMap with the loki_rule label. The k8s-sidecar loads the rules into /rules/fake/ ("fake" is the tenant ID Loki uses when multi-tenancy is disabled) for the embedded ruler in singleBinary mode. Alerts route to AlertManager (PR #489, 2026-03-01).

Documentation: See Loki Application Guide for complete deployment details including log-based alerting.


🛡️ Security & Compliance

7. Security Scanning & Runtime Protection ✅ Complete

  • Trivy Operator - Container vulnerability scanning (deployed 2026-01-05, chart 0.31.0)
    • ServiceMonitor configured for Prometheus metrics
    • VulnerabilityReports available via kubectl
    • Scanning all cluster images automatically
    • Node-collector tolerations for control-plane scanning (PR #345, 2026-01-30)
  • Falco - Runtime security monitoring (deployed 2026-01-29, chart 8.0.1)
    • Modern eBPF driver for ARM64 efficiency
    • DaemonSet running on all nodes including control-plane
    • Falcosidekick with AlertManager and Loki integration
    • Web UI at falco.k8s.n37.ca (PR #340)
    • Custom rules for homelab (cryptocurrency mining, reverse shell detection)
    • PrometheusRules for security alerts
    • NetworkPolicy configured (PR #339, #344)
  • OPA Gatekeeper - Policy enforcement and admission control (deployed 2026-02-06, chart 3.21.1)
    • 5 ConstraintTemplates: resource limits, allowed repos, required labels, block NodePort, container limits
    • All constraints switched to deny mode (0 violations, 2026-02-07)
    • Pi-optimized: 1 replica, 100m/256Mi requests, 500m/512Mi limits
    • Prometheus metrics with ServiceMonitor
    • NetworkPolicy configured
    • System namespaces exempted (kube-system, argocd, gatekeeper-system)
  • Security policy definitions for workloads
  • Compliance reporting and alerting (PSS Baseline + Restricted alerts, weekly CronJob summary to AlertManager)
  • Create Grafana dashboard for vulnerability trends (completed 2026-02-08, PRs #410-412: fixed NetworkPolicy HBONE, Gatekeeper exemption, SBOM bug)

Documentation: See Trivy Operator Guide and Vulnerability Remediation Guide for details.

8. Secrets Management ✅ Complete

  • Evaluation Complete - Sealed Secrets recommended for homelab (2026-01-13)
    • Sealed Secrets: 1 pod, 9Mi RAM, simple, GitOps-native
    • External Secrets: 3 pods, 69Mi RAM, complex, requires backend
  • Sealed Secrets Deployed - bitnami-labs/sealed-secrets v2.16.2 (2026-01-13)
  • Secrets Migrated to SealedSecrets (2026-01-14)
    • unipoller-secret, external-dns (cloudflare + unifi), alertmanager-smtp-credentials
    • snmp-exporter-credentials, cert-manager cloudflare token, synology-csi client-info
    • pihole-web-password (8 secrets total)
  • External Secrets Operator Removed - Evaluation complete, not needed (2026-01-14)
  • Secrets Directory Cleaned - Only bootstrap secret (ArgoCD SSH key) remains (2026-01-14)
  • Documentation Updated - CLAUDE_NOTES.md and secrets/README.md updated
  • Set up SealedSecrets sealing key rotation automation - Controller key rotation enabled (30d, 2026-02-05); cert-manager handles TLS certificate renewal separately and automatically
  • Create runbook for adding new SealedSecrets (added to SEALED-SECRETS.md, PR #489, 2026-03-01)

Documentation: See Secrets Management Guide for complete procedures including rotation and disaster recovery.

9. Network Policies ✅ COMPLETE (2026-01-25)

  • Define NetworkPolicies for namespace isolation (18 namespaces)
  • Implement ingress/egress rules for sensitive workloads
    • localstack: Allow velero, ingress-nginx, prometheus; egress DNS only
    • unipoller: Allow prometheus; egress DNS + UniFi controller
    • loki: Allow alloy, prometheus, grafana; egress DNS + alertmanager + K8s API
    • trivy-system: Allow prometheus; egress DNS + K8s API + registries
    • velero: Allow prometheus; egress DNS + localstack + B2 + K8s API
    • argo-workflows: Allow ingress-nginx, prometheus; egress DNS + K8s API + B2 (2026-01-24)
    • cert-manager: Allow webhook validation, prometheus; egress DNS + K8s API + Let's Encrypt + Cloudflare (2026-01-25)
    • external-dns: Allow prometheus, internal webhook; egress DNS + K8s API + Cloudflare + UniFi (2026-01-25)
    • metallb-system: Allow prometheus, memberlist, webhook; egress DNS + K8s API (2026-01-25)
    • ingress-nginx: Allow external traffic, prometheus; egress DNS + K8s API
    • istio-system: Allow prometheus, webhook; egress DNS + K8s API + HBONE port 15008
    • gatekeeper-system: Allow prometheus, webhook; egress DNS + K8s API
    • falco: Allow prometheus, alertmanager, loki; egress DNS + K8s API
    • default: Allow ingress-nginx, prometheus; egress DNS + K8s API
    • argocd: Allow ingress-nginx, prometheus; egress DNS + K8s API + GitHub
    • synology-csi: Allow K8s API; egress DNS + NAS iSCSI
    • kube-system: Allow prometheus; egress DNS + K8s API (metrics-server port 10250)
    • tigera-operator: Allow prometheus; egress DNS + K8s API
  • Test policy enforcement (all tests passed)
  • Document network segmentation strategy in k8s-docs-n37 (PR #60, 2026-01-29)
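
The per-namespace rules above amount to a default-deny allow-list. The sketch below models only the ingress side for four of the listed namespaces, to show how such a list is read. It is an illustration, not how Kubernetes evaluates NetworkPolicy: real policies select pods and namespaces by label and also restrict ports. Treating unlisted namespaces as fully denied is an assumption.

```python
# Ingress allow-list copied from a subset of the rules above.
ALLOWED_INGRESS = {
    "localstack": {"velero", "ingress-nginx", "prometheus"},
    "loki": {"alloy", "prometheus", "grafana"},
    "velero": {"prometheus"},
    "argocd": {"ingress-nginx", "prometheus"},
}


def ingress_allowed(namespace: str, source: str) -> bool:
    """Default-deny: traffic passes only if the source is listed for the namespace."""
    return source in ALLOWED_INGRESS.get(namespace, set())


checks = [
    ("loki", "alloy"),        # log shipper may push to Loki
    ("loki", "velero"),       # backup tooling has no reason to reach Loki
    ("velero", "prometheus"), # metrics scraping is allowed
]
for namespace, source in checks:
    verdict = "allow" if ingress_allowed(namespace, source) else "deny"
    print(f"{source} -> {namespace}: {verdict}")
```

Keeping the list explicit per namespace makes it easy to spot which workloads accept traffic only from Prometheus.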

Configuration: See manifests/base/network-policies/ in the homelab repository for all policy definitions.

Documentation: See Network Policies Guide for complete policy definitions and management procedures.


🚀 Platform Enhancements

10. Service Mesh ✅ DEPLOYED (2026-01-28)

  • Research lightweight service mesh options for Pi cluster
  • Evaluate Linkerd (lightweight, Pi-friendly) - Considered but Istio Ambient selected
  • Evaluate Istio (full-featured but resource-intensive) - Istio Ambient mode chosen
  • Proof-of-concept deployment in test namespace
  • Performance impact analysis on Pi 5 cluster (~38m CPU, ~145Mi memory)
  • Document decision and implementation plan
  • Status: Istio Ambient Mesh deployed with mTLS on 29 pods across 6 namespaces
  • Note: All 25 ArgoCD apps Synced and Healthy (OutOfSync resolved 2026-02-05, PRs #379-381)

11. Ingress Enhancements ✅ Complete

  • Document current nginx-ingress configuration (Updated network-info.md with all 5 Ingresses, rate limits, hardening config)
  • Implement rate limiting for public endpoints (Already configured: 50-100 RPS + 20 conn limits on all Ingresses)
  • Add ModSecurity WAF rules - Deferred: 256Mi memory limit insufficient for OWASP CRS (~512-768Mi needed); not justified for private 10.0.10.0/24 network
  • Configure geo-blocking if needed - N/A: All services on private network (MetalLB IP 10.0.10.10 is RFC 1918), no public ingress
  • Monitor ingress performance and errors (Created 7 PrometheusRule alerts + Grafana dashboard with 20 panels)

🏗️ Infrastructure & DevOps

12. GitOps Enhancements

  • Renovate - Automated dependency updates for Helm charts (deployed 2026-01-23)
    • GitHub App installed and configured
    • ArgoCD Application manifest scanning (Helm charts)
    • Docker image tag updates in Kubernetes manifests
    • Grouped updates (ArgoCD, monitoring, networking, security, backup)
    • Weekend schedule (Sat/Sun 6am-9pm) to minimize disruption
  • Pre-commit hooks for Kubernetes manifest validation (kubeval, kustomize)
  • Automated testing pipeline for infrastructure changes
  • Expand GitOps workflow documentation
  • Consider multi-cluster ArgoCD setup for dev/staging
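
The Renovate weekend schedule above can be read as a simple time-window check. This sketch is not Renovate's own schedule syntax or logic. It assumes the window is inclusive of 6am and exclusive of 9pm, evaluated in local time, and the function name `in_renovate_window` is hypothetical.

```python
from datetime import datetime, time

WINDOW_START = time(6, 0)   # 6am
WINDOW_END = time(21, 0)    # 9pm, treated as exclusive (assumption)


def in_renovate_window(moment: datetime) -> bool:
    """Return True on Saturday or Sunday between 6am and 9pm."""
    is_weekend = moment.weekday() >= 5  # Monday=0 ... Saturday=5, Sunday=6
    return is_weekend and WINDOW_START <= moment.time() < WINDOW_END


saturday_morning = in_renovate_window(datetime(2026, 1, 24, 10, 0))  # Saturday
friday_morning = in_renovate_window(datetime(2026, 1, 23, 10, 0))    # Friday
sunday_night = in_renovate_window(datetime(2026, 1, 25, 22, 0))      # Sunday, after 9pm
```

Limiting update PRs to this window keeps dependency changes away from weekdays, in line with the stated goal of minimizing disruption.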

Configuration: See renovate.json in the homelab repository.

13. Development & CI/CD Tools - Argo Workflows ✅ DEPLOYED (2026-01-24)

Phase 1: Argo Workflows Deployment ✅ Complete

  • Deploy Argo Workflows v3.7.8 (Helm chart 0.47.1)
  • Configure sync-wave: -8 (before LocalStack (-7) and Velero (-5))
  • Set up artifact repository (Backblaze B2) ✅ Fixed (PRs #287-289, 2026-01-24)
  • Configure resource limits for Pi cluster constraints:
    • Controller: 50m CPU / 128Mi RAM (request), 100m / 256Mi (limit)
    • Server: 25m CPU / 64Mi RAM (request), 50m / 128Mi (limit)
  • Enable Prometheus ServiceMonitor for workflow metrics
  • NetworkPolicy enabled ✅ Fixed K8s API egress (PR #291, 2026-01-24)
  • Ingress configured at workflows.k8s.n37.ca (PR #293, 2026-01-24)
  • Create Grafana dashboards for workflow monitoring (2026-01-29)
  • Set up AlertManager rules for workflow failures (2026-01-30, PR #354)

Phase 2: Workflow Integration

  • ARM64 container image build workflows
  • Automated testing pipelines for infrastructure changes
  • Monthly backup validation workflows (Velero restore tests)
  • Security vulnerability scanning workflows (Trivy integration)
  • Infrastructure compliance scan workflows

Phase 3: Advanced Features

  • SSO integration via oauth2-proxy
  • Workflow templates library
  • Automated dependency updates (Renovate integration)
  • Multi-cluster workflow support (if dev/staging clusters added)

Alternative Tools Considered:

  • Evaluate Tekton (more complex, higher resource usage)
  • Evaluate Gitea vs GitLab for self-hosted git
  • Harbor - Container registry with vulnerability scanning
  • Build and deployment automation for ARM64 custom containers

🌐 Network & Access Management

14. CoreDNS Customization

  • Document current CoreDNS configuration
  • Custom DNS records for internal services
  • DNS-based service discovery patterns
  • DNS monitoring and troubleshooting tools
  • Consider DNS caching optimizations

15. VPN & Remote Access

  • Evaluate Tailscale vs WireGuard for cluster access
  • Deploy chosen VPN solution
  • oauth2-proxy - Single Sign-On (SSO) integration
  • Multi-factor authentication for critical services
  • Document remote access policies and procedures
  • VPN performance monitoring

🔧 Operational Improvements

16. Documentation Enhancements

  • Create operational runbooks for common tasks (pod restarts, rollbacks, etc.)
  • Document disaster recovery procedures (node failure, control plane failure)
  • Capacity planning documentation with growth projections
  • Create network topology diagrams to complement the existing network-info.md documentation
  • Performance baseline documentation
  • Document on-call procedures and escalation paths
  • Create k8s-docs-n37 guides for: cert-manager, metallb, ingress-nginx, localstack

17. Testing & Validation

  • Chaos engineering with Litmus (lighter than Chaos Monkey)
  • Load testing framework for applications
  • Backup and restore testing automation (monthly validation)
  • Network failure simulation and recovery testing
  • Performance regression testing
  • Test node drain and pod eviction scenarios

18. Resource Optimization

  • Audit resource requests/limits across all workloads (7 workloads adjusted, 2026-02-11)
  • Identify over-provisioned pods (resource right-sizing audit complete, net +928Mi requests)
  • Implement pod resource quotas per namespace
  • Storage capacity planning and alerting
  • Network bandwidth monitoring and optimization
  • Consider implementing Vertical Pod Autoscaler (VPA)

🌟 Nice to Have

19. Pi Cluster Specific Monitoring

  • Power consumption tracking (requires PoE monitoring or UPS integration)
  • Track PoE power draw per node
  • NVMe thermal throttling detection
  • Track undervoltage events
  • ARM64-specific performance optimizations

20. Application Deployments

  • Home Assistant integration
  • Private container registry (Harbor or similar)
  • Internal wiki or knowledge base
  • Status page (Uptime Kuma or similar)
  • Internal chat/collaboration tool

21. Observability Maturity Enhancements

  • Distributed Tracing - Evaluate Jaeger or Tempo for trace collection
  • Continuous Profiling - Pyroscope for application performance profiling
  • Service Level Objectives (SLOs) - Define and monitor SLOs for critical services
  • Error Budget Tracking - Automated SLO/error budget reporting
  • Anomaly Detection - ML-based anomaly detection for metrics (Prometheus AI/ML)
  • Synthetic Monitoring - Automated user journey testing

22. Disaster Recovery Testing

  • Monthly DR Drills - Automated disaster recovery validation
  • Chaos Engineering - Controlled failure injection (Litmus)
  • Velero Restore Testing - Automated monthly PVC restore validation
  • Network Partition Testing - Simulate network failures
  • Node Failure Scenarios - Test cluster resilience to node loss
  • Control Plane Failure - Test etcd backup/restore procedures
  • DR Runbook Automation - Convert manual runbooks to Argo Workflows

23. Cost Optimization & Efficiency

  • Resource Right-Sizing - Analyze actual vs requested resources
  • Spot/Preemptible Instances - Not applicable for bare metal, document for future cloud consideration
  • Storage Optimization - Compress old logs, optimize retention policies
  • Network Egress Optimization - Monitor and optimize outbound traffic
  • Power Consumption Tracking - PoE monitoring and efficiency analysis
  • Carbon Footprint - Calculate and optimize cluster carbon footprint

24. LLM Hosting & AI Infrastructure (Planning)

  • GPU Hardware - Add GPU-capable unit to cluster (planned)
  • Evaluate inference frameworks - vLLM, Ollama, LocalAI, llama.cpp for ARM64/GPU
  • Kubernetes GPU scheduling - NVIDIA device plugin or equivalent
  • Model storage - Plan NFS/iSCSI storage for large model weights (7B-70B+ parameter models)
  • Resource isolation - Dedicated node pool or taints/tolerations for GPU workloads
  • API gateway - OpenAI-compatible API endpoint for model serving
  • Monitoring - GPU utilization, inference latency, token throughput dashboards
  • Model management - Version control and deployment pipeline for models
  • Network considerations - High-bandwidth model loading, inference API exposure

📅 Implementation Priorities

Items are organized by priority, not by timeline. Focus on:

Phase 1: Foundation & Reliability ✅ Complete

  1. ✅ Backup strategy (Velero + critical PVC backups)
  2. ✅ Enhanced alerting (AlertManager notifications)
  3. ✅ Metrics server deployment
  4. ✅ Blackbox exporter for endpoint monitoring

Phase 2: Security & Observability ✅ Complete

  1. ✅ Security scanning (Trivy Operator)
  2. ✅ Secrets management migration (SealedSecrets)
  3. ✅ Blackbox exporter for endpoint monitoring
  4. ✅ Custom Grafana dashboards

Phase 3: Advanced Features ✅ Complete

  1. ✅ GitOps enhancements (Renovate deployed 2026-01-23)
  2. ✅ Network policies implementation (18 namespaces isolated)
  3. ✅ Development tools and CI/CD (Argo Workflows deployed 2026-01-24)
  4. ✅ Service mesh (Istio Ambient deployed 2026-01-28)

Phase 4: Optimization & Expansion

  1. Resource optimization and VPA
  2. Chaos engineering and resilience testing
  3. Advanced networking and VPN
  4. Additional application deployments

🔄 ArgoCD Sync Wave Order

Wave -100: tigera-operator (CNI foundation - ArgoCD-managed)
Wave -50: argocd (self-management)
Wave -35: metal-lb (networking foundation)
Wave -30: synology-csi (storage driver)
Wave -25: sealed-secrets (secrets management)
Wave -20: unipoller (UniFi metrics collection)
Wave -15: kube-prometheus-stack (monitoring stack)
Wave -12: loki (log aggregation)
Wave -11: alloy (log collection, replaced Promtail 2026-03-01)
Wave -10: cert-manager, external-dns, metrics-server (certificates & DNS & metrics)
Wave -8: argo-workflows (CI/CD)
Wave -7: localstack (S3 mock for Velero)
Wave -6: gatekeeper (admission control, policy enforcement)
Wave -5: velero, falco (backup, runtime security)
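
ArgoCD syncs lower wave numbers first, so the list above defines a deployment order. The sketch below groups the listed applications by wave and checks two of the ordering requirements mentioned on this page. The helper names are hypothetical; the wave values are copied from the list.

```python
SYNC_WAVES = {
    "tigera-operator": -100,
    "argocd": -50,
    "metal-lb": -35,
    "synology-csi": -30,
    "sealed-secrets": -25,
    "unipoller": -20,
    "kube-prometheus-stack": -15,
    "loki": -12,
    "alloy": -11,
    "cert-manager": -10,
    "external-dns": -10,
    "metrics-server": -10,
    "argo-workflows": -8,
    "localstack": -7,
    "gatekeeper": -6,
    "velero": -5,
    "falco": -5,
}


def deploy_order(waves: dict) -> list:
    """Group apps by wave, lowest wave first; apps within a wave are sorted by name."""
    grouped = {}
    for app, wave in waves.items():
        grouped.setdefault(wave, []).append(app)
    return [sorted(grouped[wave]) for wave in sorted(grouped)]


def syncs_before(first: str, second: str, waves: dict = SYNC_WAVES) -> bool:
    """True if `first` is in a strictly earlier wave than `second`."""
    return waves[first] < waves[second]


for group in deploy_order(SYNC_WAVES):
    print(", ".join(group))
```

A check such as `syncs_before("localstack", "velero")` could run in CI to catch a wave change that breaks a dependency. Apps that share a wave have no guaranteed order relative to each other.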

📋 Notes

  • Resource Constraints: All implementations must consider the Pi 5 cluster constraints (80GB RAM total, 20 ARM cores)
  • Testing Strategy: Test all implementations in a development namespace before production deployment
  • Documentation First: Document all configurations and procedures for maintainability in this docs site
  • GitOps Workflow: All changes must go through PR workflow in the homelab repository
  • Regular Reviews: Review and update this TODO list monthly based on cluster evolution
  • Monitoring First: Ensure monitoring is in place before deploying new workloads

🔗 References

  • homelab/TODO.md - Infrastructure repository TODO list (source of truth — this page mirrors it)
  • homelab/Hardware.md - Cluster hardware specifications
  • homelab/network-info.md - Comprehensive network configuration
  • k8s-docs-n37 Application Guides - Per-application deployment documentation (see docs/applications/ in this site)