Homelab TODO & Improvements

✅ Recently Completed (December 2025 - March 2026)

Infrastructure Fixes (January 2026)

Tigera Operator Migration - Migrated Calico CNI to GitOps-managed Tigera operator (PRs #346-352, 2026-01-30)
- Calico now managed by Tigera operator in calico-system namespace
- ArgoCD Application with multi-source (operator from GitHub, Installation CR from homelab)
- Typha topology spread constraints for node distribution
- Established ignoreDifferences patterns for operator-managed resources
- ArgoCD repo-server memory increased to 512Mi for large manifest generation
External-DNS Domain Filter Fix - Fixed subdomain zone filtering (PRs #295-296, 2026-01-25)
- Root cause: --domain-filter=k8s.n37.ca rejected the n37.ca Cloudflare zone
- Solution: Use parent zone as domain-filter; ingresses specify exact hostnames
Grafana fsGroup Race Condition - Fixed mount failure with Synology CSI (PR #298, 2026-01-25)
- Root cause: SQLite journal file deleted during fsGroup recursive application
- Solution: Added fsGroupChangePolicy: OnRootMismatch to podSecurityContext

Secrets Management (January 2026)

Sealed Secrets Migration - Migrated 8 secrets from git-crypt to SealedSecrets (2026-01-14)
External Secrets Removed - Evaluation complete, Sealed Secrets chosen for simplicity (2026-01-14)
Secrets Directory Cleanup - Removed 15 obsolete files, only ArgoCD bootstrap secret remains

Backup & Disaster Recovery (January 2026)

Velero Backblaze B2 Migration - Migrated from LocalStack to Backblaze B2 for production backups (2026-01-14, PR #239)
Velero CSI Snapshots - Configured Velero to use CSI snapshots exclusively (2026-01-05)
snapshot-controller Fix - Temporarily downgraded from v8.2.0 → v6.3.1 to resolve VolumeSnapshot failures (2026-01-05), then upgraded to v8.2.1 with csi-snapshotter v8.4.0 (2026-01-11)
Loki Memory Optimization - Implemented GOMEMLIMIT, ingestion rate limits, reduced memory usage from 474Mi → 232Mi (2026-01-05)

Monitoring & Observability (December 2025)

SNMP Monitoring for Synology - Deployed SNMP exporter, scraping NAS metrics (disk health, temperature, RAID status)
Node Exporter for Pi Cluster - DaemonSet running on all 5 nodes, monitoring CPU, memory, disk, network
Log Aggregation - Loki + Alloy deployed, 7-day retention, collecting logs from all pods on all nodes (including control-plane)
Prometheus Stack Fixes - Fixed node-exporter scraping, Grafana PVC issues, cleaned up control plane monitoring
Control Plane Monitoring - Re-enabled kube-scheduler and kube-controller-manager monitoring
ServiceMonitor Enablement - Enabled metrics collection for Loki and Alloy

DNS & Service Discovery

External-DNS Deployment - Dual provider setup (Cloudflare + UniFi webhook) for split-horizon DNS (2025-12-27)
- Cloudflare provider for public DNS records
- kashalls/external-dns-unifi-webhook v0.7.0 for internal DNS
- Automatic DNS record creation for Ingresses (argocd.k8s.n37.ca, grafana.k8s.n37.ca, localstack.k8s.n37.ca, workflows.k8s.n37.ca)
- TXT registry for ownership tracking
- Fixed domain-filter for subdomain zones (PRs #295-296, 2026-01-25) - Use parent zone (n37.ca) as domain-filter

Documentation

Comprehensive Docs Site - k8s-docs-n37 Docusaurus site with application guides
External-DNS Guide - Complete documentation with dual provider setup and troubleshooting
Loki Application Guide - Complete documentation for Loki + Alloy deployment
SNMP Exporter Guide - Synology monitoring documentation
Troubleshooting Guides - Monitoring stack and common issues documented

🎯 High Priority

1. Blackbox Exporter ✅ Complete

Blackbox Exporter - Fully operational (deployed 2025-12-27, verified 2025-12-28)
Deploy blackbox exporter for endpoint monitoring (v0.25.0, 2 replicas)
Monitor external services availability (DNS, HTTP/HTTPS probes configured)
SSL certificate expiry monitoring for k8s.n37.ca domain (https_cert_expiry module)
Network latency and response time tracking (ICMP ping monitoring)
Add alerts for service downtime (12 PrometheusRule alerts configured)
Monitor Synology NAS web interface availability (10.0.1.204 monitored)

Documentation: See Blackbox Exporter Application Guide for complete deployment details.

2. Enhanced Alerting ✅ Complete

AlertManager SMTP Email - Configured Gmail SMTP for critical alerts (2025-12-27)
Alert Routing - Critical → email, warning/info → null (reduce noise)
Velero Backup Alerts - 7 PrometheusRule alerts for backup monitoring
HTML Email Templates - Custom-formatted critical alert emails
~~Configure AlertManager webhook to Discord/Slack/Telegram~~ - Not used (email preferred)
Implement tiered alerting (warning → suppress, critical → email)
Predictive Disk Space Alerts - Node filesystem, PVC, and Synology volume alerts with predict_linear() (2026-01-12)
NAS Health Alerts - Disk failures, RAID degradation, temperature, bad sectors, power status (2026-01-12)
Alert runbooks - Documented in secrets/SEALED-SECRETS.md and k8s-docs-n37 (2026-01-14)
Test alert routing - Verified email delivery (121 sent, 0 failed) (2026-01-14)

3. Backup Strategy ✅ Complete

Note: Kopia file-level backups disabled in favor of CSI snapshots (more efficient for block storage)

Documentation: See Velero Application Guide for complete deployment details and disaster recovery procedures.

🔍 Monitoring & Observability Enhancements

4. Custom Dashboards ✅ Complete

Documentation: See Grafana Dashboards Guide for dashboard details.

5. Metrics Server Deployment ✅ Complete

Metrics Server - Deployed for kubectl top and HPA (2025-12-28)
Deploy metrics-server for kubectl top commands
Enable Horizontal Pod Autoscaler (HPA) capabilities
Configure for resource-constrained Pi environment (50m CPU / 100Mi RAM)
Prometheus ServiceMonitor integration

6. Log-Based Alerting ✅ ENABLED (2026-03-01)

Loki Ruler Alerting - Enabled via structuredConfig (rulerConfig ignored when ruler.enabled=false)
Set up Loki alerting rules for error patterns (HighErrorLogRate, CriticalErrorLogs)
Alert on CrashLoopBackOff events (CrashLoopBackOffDetected)
Alert on OOMKilled events (OOMKilledDetected)
Alert on persistent pod failures (PersistentPodRestarts)
Create log-based SLO monitoring (Error rate tracking via HighErrorLogRate)
Additional alerts: HTTP 5xx errors, DB connection errors, auth failures, security events

Status: 9 LogQL rules in 4 groups deployed as ConfigMap with loki_rule label. k8s-sidecar loads rules to /rules/fake/ for embedded ruler in singleBinary mode. Alerts route to AlertManager (PR #489, 2026-03-01).

Documentation: See Loki Application Guide for complete deployment details including log-based alerting.

🛡️ Security & Compliance

7. Security Scanning & Runtime Protection ✅ Complete

Documentation: See Trivy Operator Guide and Vulnerability Remediation Guide for details.

8. Secrets Management ✅ Complete

Evaluation Complete - Sealed Secrets recommended for homelab (2026-01-13)
- Sealed Secrets: 1 pod, 9Mi RAM, simple, GitOps-native
- External Secrets: 3 pods, 69Mi RAM, complex, requires backend
Sealed Secrets Deployed - bitnami-labs/sealed-secrets v2.16.2 (2026-01-13)
Secrets Migrated to SealedSecrets (2026-01-14)
- unipoller-secret, external-dns (cloudflare + unifi), alertmanager-smtp-credentials
- snmp-exporter-credentials, cert-manager cloudflare token, synology-csi client-info
- pihole-web-password (8 secrets total)
External Secrets Operator Removed - Evaluation complete, not needed (2026-01-14)
Secrets Directory Cleaned - Only bootstrap secret (ArgoCD SSH key) remains (2026-01-14)
Documentation Updated - CLAUDE_NOTES.md and secrets/README.md updated
Set up SealedSecrets sealing key rotation automation - SealedSecrets controller key rotation enabled (30d, 2026-02-05); cert-manager separately handles TLS cert renewal automatically
Create runbook for adding new SealedSecrets (added to SEALED-SECRETS.md, PR #489, 2026-03-01)

Documentation: See Secrets Management Guide for complete procedures including rotation and disaster recovery.

9. Network Policies ✅ COMPLETE (2026-01-25)

Configuration: See manifests/base/network-policies/ in the homelab repository for all policy definitions.

Documentation: See Network Policies Guide for complete policy definitions and management procedures.

🚀 Platform Enhancements

10. Service Mesh ✅ DEPLOYED (2026-01-28)

Research lightweight service mesh options for Pi cluster
Evaluate Linkerd (lightweight, Pi-friendly) - Considered but Istio Ambient selected
Evaluate Istio (full-featured but resource-intensive) - Istio Ambient mode chosen
Proof-of-concept deployment in test namespace
Performance impact analysis on Pi 5 cluster (~38m CPU, ~145Mi memory)
Document decision and implementation plan
Status: Istio Ambient Mesh deployed with mTLS on 29 pods across 6 namespaces
Note: All 25 ArgoCD apps Synced and Healthy (OutOfSync resolved 2026-02-05, PRs #379-381)

11. Ingress Enhancements ✅ Complete

Document current nginx-ingress configuration (Updated network-info.md with all 5 Ingresses, rate limits, hardening config)
Implement rate limiting for public endpoints (Already configured: 50-100 RPS + 20 conn limits on all Ingresses)
~~Add ModSecurity WAF rules~~ Deferred: 256Mi memory limit insufficient for OWASP CRS (~512-768Mi needed); not justified for private 10.0.10.0/24 network
~~Configure geo-blocking if needed~~ N/A: All services on private network (MetalLB IP 10.0.10.10 is RFC 1918), no public ingress
Monitor ingress performance and errors (Created 7 PrometheusRule alerts + Grafana dashboard with 20 panels)

🏗️ Infrastructure & DevOps

12. GitOps Enhancements

Configuration: See renovate.json in the homelab repository.

13. Development & CI/CD Tools - Argo Workflows ✅ DEPLOYED (2026-01-24)

Phase 1: Argo Workflows Deployment ✅ Complete

Deploy Argo Workflows v3.7.8 (Helm chart 0.47.1)
Configure sync-wave: -8 (before LocalStack (-7) and Velero (-5))
Set up artifact repository (Backblaze B2) ✅ Fixed (PRs #287-289, 2026-01-24)
Configure resource limits for Pi cluster constraints:
- Controller: 50m CPU / 128Mi RAM (request), 100m / 256Mi (limit)
- Server: 25m CPU / 64Mi RAM (request), 50m / 128Mi (limit)
Enable Prometheus ServiceMonitor for workflow metrics
NetworkPolicy enabled ✅ Fixed K8s API egress (PR #291, 2026-01-24)
Ingress configured at workflows.k8s.n37.ca (PR #293, 2026-01-24)
Create Grafana dashboards for workflow monitoring (2026-01-29)
Set up AlertManager rules for workflow failures (2026-01-30, PR #354)

Phase 2: Workflow Integration

ARM64 container image build workflows
Automated testing pipelines for infrastructure changes
Monthly backup validation workflows (Velero restore tests)
Security vulnerability scanning workflows (Trivy integration)
Infrastructure compliance scan workflows

Phase 3: Advanced Features

SSO integration via oauth2-proxy
Workflow templates library
Automated dependency updates (Renovate integration)
Multi-cluster workflow support (if dev/staging clusters added)

Alternative Tools Considered:

Evaluate Tekton (more complex, higher resource usage)
Evaluate Gitea vs GitLab for self-hosted git
Harbor - Container registry with vulnerability scanning
Build and deployment automation for ARM64 custom containers

🌐 Network & Access Management

14. CoreDNS Customization

Document current CoreDNS configuration
Custom DNS records for internal services
DNS-based service discovery patterns
DNS monitoring and troubleshooting tools
Consider DNS caching optimizations

15. VPN & Remote Access

Evaluate Tailscale vs WireGuard for cluster access
Deploy chosen VPN solution
oauth2-proxy - Single Sign-On (SSO) integration
Multi-factor authentication for critical services
Document remote access policies and procedures
VPN performance monitoring

🔧 Operational Improvements

16. Documentation Enhancements

Create operational runbooks for common tasks (pod restarts, rollbacks, etc.)
Document disaster recovery procedures (node failure, control plane failure)
Capacity planning documentation with growth projections
Create network topology diagrams to complement the existing network-info.md documentation
Performance baseline documentation
Document on-call procedures and escalation paths
Create k8s-docs-n37 guides for: cert-manager, metallb, ingress-nginx, localstack

17. Testing & Validation

Chaos engineering with Litmus (lighter than Chaos Monkey)
Load testing framework for applications
Backup and restore testing automation (monthly validation)
Network failure simulation and recovery testing
Performance regression testing
Test node drain and pod eviction scenarios

18. Resource Optimization

Audit resource requests/limits across all workloads (7 workloads adjusted, 2026-02-11)
Identify over-provisioned pods (resource right-sizing audit complete, net +928Mi requests)
Implement pod resource quotas per namespace
Storage capacity planning and alerting
Network bandwidth monitoring and optimization
Consider implementing Vertical Pod Autoscaler (VPA)

🌟 Nice to Have

19. Pi Cluster Specific Monitoring

Power consumption tracking (requires PoE monitoring or UPS integration)
Track PoE power draw per node
NVMe thermal throttling detection
Track undervoltage events
ARM64-specific performance optimizations

20. Application Deployments

Home Assistant integration
Private container registry (Harbor or similar)
Internal wiki or knowledge base
Status page (Uptime Kuma or similar)
Internal chat/collaboration tool

21. Observability Maturity Enhancements

Distributed Tracing - Evaluate Jaeger or Tempo for trace collection
Continuous Profiling - Pyroscope for application performance profiling
Service Level Objectives (SLOs) - Define and monitor SLOs for critical services
Error Budget Tracking - Automated SLO/error budget reporting
Anomaly Detection - ML-based anomaly detection for metrics (Prometheus AI/ML)
Synthetic Monitoring - Automated user journey testing

22. Disaster Recovery Testing

Monthly DR Drills - Automated disaster recovery validation
Chaos Engineering - Controlled failure injection (Litmus)
Velero Restore Testing - Automated monthly PVC restore validation
Network Partition Testing - Simulate network failures
Node Failure Scenarios - Test cluster resilience to node loss
Control Plane Failure - Test etcd backup/restore procedures
DR Runbook Automation - Convert manual runbooks to Argo Workflows

23. Cost Optimization & Efficiency

Resource Right-Sizing - Analyze actual vs requested resources
Spot/Preemptible Instances - Not applicable for bare metal, document for future cloud consideration
Storage Optimization - Compress old logs, optimize retention policies
Network Egress Optimization - Monitor and optimize outbound traffic
Power Consumption Tracking - PoE monitoring and efficiency analysis
Carbon Footprint - Calculate and optimize cluster carbon footprint

24. LLM Hosting & AI Infrastructure (Planning)

GPU Hardware - Add GPU-capable unit to cluster (planned)
Evaluate inference frameworks - vLLM, Ollama, LocalAI, llama.cpp for ARM64/GPU
Kubernetes GPU scheduling - NVIDIA device plugin or equivalent
Model storage - Plan NFS/iSCSI storage for large model weights (7B-70B+ parameter models)
Resource isolation - Dedicated node pool or taints/tolerations for GPU workloads
API gateway - OpenAI-compatible API endpoint for model serving
Monitoring - GPU utilization, inference latency, token throughput dashboards
Model management - Version control and deployment pipeline for models
Network considerations - High-bandwidth model loading, inference API exposure

📅 Implementation Priorities

Items are organized by priority, not by timeline. Focus on:

Phase 1: Foundation & Reliability ✅ Complete

✅ Backup strategy (Velero + critical PVC backups)
✅ Enhanced alerting (AlertManager notifications)
✅ Metrics server deployment
✅ Blackbox exporter for endpoint monitoring

Phase 2: Security & Observability ✅ Complete

✅ Security scanning (Trivy Operator)
✅ Secrets management migration (SealedSecrets)
✅ Blackbox exporter for endpoint monitoring
✅ Custom Grafana dashboards

Phase 3: Advanced Features ✅ Complete

✅ GitOps enhancements (Renovate deployed 2026-01-23)
✅ Network policies implementation (18 namespaces isolated)
✅ Development tools and CI/CD (Argo Workflows deployed 2026-01-24)
✅ Service mesh (Istio Ambient deployed 2026-01-28)

Phase 4: Optimization & Expansion

Resource optimization and VPA
Chaos engineering and resilience testing
Advanced networking and VPN
Additional application deployments

🔄 ArgoCD Sync Wave Order

Wave -100: tigera-operator (CNI foundation - ArgoCD-managed)
Wave  -50: argocd (self-management)
Wave  -35: metal-lb (networking foundation)
Wave  -30: synology-csi (storage driver)
Wave  -25: sealed-secrets (secrets management)
Wave  -20: unipoller (UniFi metrics collection)
Wave  -15: kube-prometheus-stack (monitoring stack)
Wave  -12: loki (log aggregation)
Wave  -11: alloy (log collection, replaced Promtail 2026-03-01)
Wave  -10: cert-manager, external-dns, metrics-server (certificates & DNS & metrics)
Wave   -8: argo-workflows (CI/CD)
Wave   -7: localstack (S3 mock for Velero)
Wave   -6: gatekeeper (admission control, policy enforcement)
Wave   -5: velero, falco (backup, runtime security)

📋 Notes

Resource Constraints: All implementations must consider the Pi 5 cluster constraints (80GB RAM total, 20 ARM cores)
Testing Strategy: Test all implementations in a development namespace before production deployment
Documentation First: Document all configurations and procedures for maintainability in this docs site
GitOps Workflow: All changes must go through PR workflow in the homelab repository
Regular Reviews: Review and update this TODO list monthly based on cluster evolution
Monitoring First: Ensure monitoring is in place before deploying new workloads

🔗 References

homelab/TODO.md - Infrastructure repository TODO list (source of truth — this page mirrors it)
homelab/Hardware.md - Cluster hardware specifications
homelab/network-info.md - Comprehensive network configuration
k8s-docs-n37 Application Guides - Per-application deployment documentation (see docs/applications/ in this site)

✅ Recently Completed (December 2025 - March 2026)​

Infrastructure Fixes (January 2026)​

Secrets Management (January 2026)​

Backup & Disaster Recovery (January 2026)​

Monitoring & Observability (December 2025)​

DNS & Service Discovery​

Documentation​

🎯 High Priority​

1. Blackbox Exporter ✅ Complete​

2. Enhanced Alerting ✅ Complete​

3. Backup Strategy ✅ Complete​

🔍 Monitoring & Observability Enhancements​

4. Custom Dashboards ✅ Complete​

5. Metrics Server Deployment ✅ Complete​

6. Log-Based Alerting ✅ ENABLED (2026-03-01)​

🛡️ Security & Compliance​

7. Security Scanning & Runtime Protection ✅ Complete​

8. Secrets Management ✅ Complete​

9. Network Policies ✅ COMPLETE (2026-01-25)​

🚀 Platform Enhancements​

10. Service Mesh ✅ DEPLOYED (2026-01-28)​

11. Ingress Enhancements ✅ Complete​

🏗️ Infrastructure & DevOps​

12. GitOps Enhancements​

13. Development & CI/CD Tools - Argo Workflows ✅ DEPLOYED (2026-01-24)​

🌐 Network & Access Management​

14. CoreDNS Customization​

15. VPN & Remote Access​

🔧 Operational Improvements​

16. Documentation Enhancements​

17. Testing & Validation​

18. Resource Optimization​

🌟 Nice to Have​

19. Pi Cluster Specific Monitoring​

20. Application Deployments​

21. Observability Maturity Enhancements​

22. Disaster Recovery Testing​

23. Cost Optimization & Efficiency​

24. LLM Hosting & AI Infrastructure (Planning)​

📅 Implementation Priorities​

Phase 1: Foundation & Reliability ✅ Complete​

Phase 2: Security & Observability ✅ Complete​

Phase 3: Advanced Features ✅ Complete​

Phase 4: Optimization & Expansion​

🔄 ArgoCD Sync Wave Order​

📋 Notes​

🔗 References​