Homelab TODO & Improvements
✅ Recently Completed (December 2025 - March 2026)
Infrastructure Fixes (January 2026)
- Tigera Operator Migration - Migrated Calico CNI to GitOps-managed Tigera operator (PRs #346-352, 2026-01-30)
- Calico now managed by Tigera operator in calico-system namespace
- ArgoCD Application with multi-source (operator from GitHub, Installation CR from homelab)
- Typha topology spread constraints for node distribution
- Established ignoreDifferences patterns for operator-managed resources
- ArgoCD repo-server memory increased to 512Mi for large manifest generation
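The `ignoreDifferences` pattern for operator-managed resources can be sketched as follows. This is an illustrative ArgoCD Application fragment, not the exact manifest from the homelab repo; the field path shown is an example of the kind of runtime-mutated field the operator owns:

```yaml
# Illustrative ArgoCD Application snippet - the jsonPointer path is an
# example, not necessarily the one used in the homelab repo.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tigera-operator
  namespace: argocd
spec:
  # ...sources and destination omitted...
  ignoreDifferences:
    # Ignore fields the Tigera operator mutates at runtime, so ArgoCD
    # does not report the app as permanently OutOfSync.
    - group: operator.tigera.io
      kind: Installation
      jsonPointers:
        - /spec/calicoNetwork/mtu
```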
- External-DNS Domain Filter Fix - Fixed subdomain zone filtering (PRs #295-296, 2026-01-25)
- Root cause: `--domain-filter=k8s.n37.ca` rejected the `n37.ca` Cloudflare zone (no zone exists for the subdomain)
- Solution: Use the parent zone as the domain-filter; ingresses specify exact hostnames
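The fix amounts to filtering on the parent zone and letting each Ingress pin its exact hostname. A minimal sketch of the relevant container args (these are real external-dns flags; values come from this section):

```yaml
# external-dns container args (sketch): filter on the parent Cloudflare
# zone rather than the k8s.n37.ca subdomain, which has no Cloudflare zone.
args:
  - --provider=cloudflare
  - --domain-filter=n37.ca   # parent zone that actually exists in Cloudflare
  - --registry=txt           # TXT records for ownership tracking
  - --source=ingress         # ingresses specify exact hostnames
```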
- Grafana fsGroup Race Condition - Fixed mount failure with Synology CSI (PR #298, 2026-01-25)
- Root cause: SQLite journal file deleted during fsGroup recursive application
- Solution: Added `fsGroupChangePolicy: OnRootMismatch` to the podSecurityContext
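A minimal sketch of the pod-level securityContext change (these are standard Kubernetes fields; the GID shown is an example):

```yaml
# Pod-level securityContext: only re-chown the volume when the root
# directory's ownership mismatches, instead of recursively on every
# mount - avoids racing with SQLite journal files on Synology CSI.
securityContext:
  fsGroup: 472                       # example Grafana GID; actual value may differ
  fsGroupChangePolicy: OnRootMismatch
```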
Secrets Management (January 2026)
- Sealed Secrets Migration - Migrated 8 secrets from git-crypt to SealedSecrets (2026-01-14)
- External Secrets Removed - Evaluation complete, Sealed Secrets chosen for simplicity (2026-01-14)
- Secrets Directory Cleanup - Removed 15 obsolete files, only ArgoCD bootstrap secret remains
Backup & Disaster Recovery (January 2026)
- Velero Backblaze B2 Migration - Migrated from LocalStack to Backblaze B2 for production backups (2026-01-14, PR #239)
- Velero CSI Snapshots - Configured Velero to use CSI snapshots exclusively (2026-01-05)
- snapshot-controller Fix - Temporarily downgraded from v8.2.0 → v6.3.1 to resolve VolumeSnapshot failures (2026-01-05), then upgraded to v8.2.1 with csi-snapshotter v8.4.0 (2026-01-11)
- Loki Memory Optimization - Implemented GOMEMLIMIT, ingestion rate limits, reduced memory usage from 474Mi → 232Mi (2026-01-05)
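The Loki memory work combines a Go runtime soft memory limit with ingestion rate limits. An illustrative Helm values fragment follows; keys follow the grafana/loki chart layout, but the numbers here are examples rather than the deployed values:

```yaml
# Illustrative grafana/loki chart values - example numbers, not the
# exact deployed configuration.
singleBinary:
  extraEnv:
    - name: GOMEMLIMIT     # Go runtime soft memory limit: triggers GC
      value: 200MiB        # before the container hits its cgroup limit
loki:
  limits_config:
    ingestion_rate_mb: 4   # per-tenant ingestion rate limit
    ingestion_burst_size_mb: 6
```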
Monitoring & Observability (December 2025)
- SNMP Monitoring for Synology - Deployed SNMP exporter, scraping NAS metrics (disk health, temperature, RAID status)
- Node Exporter for Pi Cluster - DaemonSet running on all 5 nodes, monitoring CPU, memory, disk, network
- Log Aggregation - Loki + Alloy deployed, 7-day retention, collecting logs from all pods on all nodes (including control-plane)
- Prometheus Stack Fixes - Fixed node-exporter scraping, Grafana PVC issues, cleaned up control plane monitoring
- Control Plane Monitoring - Re-enabled kube-scheduler and kube-controller-manager monitoring
- ServiceMonitor Enablement - Enabled metrics collection for Loki and Alloy
DNS & Service Discovery
- External-DNS Deployment - Dual provider setup (Cloudflare + UniFi webhook) for split-horizon DNS (2025-12-27)
- Cloudflare provider for public DNS records
- kashalls/external-dns-unifi-webhook v0.7.0 for internal DNS
- Automatic DNS record creation for Ingresses (argocd.k8s.n37.ca, grafana.k8s.n37.ca, localstack.k8s.n37.ca, workflows.k8s.n37.ca)
- TXT registry for ownership tracking
- Fixed domain-filter for subdomain zones (PRs #295-296, 2026-01-25) - Use parent zone (n37.ca) as domain-filter
Documentation
- Comprehensive Docs Site - k8s-docs-n37 Docusaurus site with application guides
- External-DNS Guide - Complete documentation with dual provider setup and troubleshooting
- Loki Application Guide - Complete documentation for Loki + Alloy deployment
- SNMP Exporter Guide - Synology monitoring documentation
- Troubleshooting Guides - Monitoring stack and common issues documented
🎯 High Priority
1. Blackbox Exporter ✅ Complete
- Blackbox Exporter - Fully operational (deployed 2025-12-27, verified 2025-12-28)
- Deploy blackbox exporter for endpoint monitoring (v0.25.0, 2 replicas)
- Monitor external services availability (DNS, HTTP/HTTPS probes configured)
- SSL certificate expiry monitoring for k8s.n37.ca domain (https_cert_expiry module)
- Network latency and response time tracking (ICMP ping monitoring)
- Add alerts for service downtime (12 PrometheusRule alerts configured)
- Monitor Synology NAS web interface availability (10.0.1.204 monitored)
Documentation: See Blackbox Exporter Application Guide for complete deployment details.
2. Enhanced Alerting ✅ Complete
- AlertManager SMTP Email - Configured Gmail SMTP for critical alerts (2025-12-27)
- Alert Routing - Critical → email, warning/info → null (reduce noise)
- Velero Backup Alerts - 7 PrometheusRule alerts for backup monitoring
- HTML Email Templates - Custom-formatted critical alert emails
- Configure AlertManager webhook to Discord/Slack/Telegram - Not used (email preferred)
- Implement tiered alerting (warning → suppress, critical → email)
- Predictive Disk Space Alerts - Node filesystem, PVC, and Synology volume alerts with predict_linear() (2026-01-12)
- NAS Health Alerts - Disk failures, RAID degradation, temperature, bad sectors, power status (2026-01-12)
- Alert runbooks - Documented in secrets/SEALED-SECRETS.md and k8s-docs-n37 (2026-01-14)
- Test alert routing - Verified email delivery (121 sent, 0 failed) (2026-01-14)
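The predictive disk alerts lean on PromQL's `predict_linear()`. A hedged sketch of one such rule, with an illustrative name and thresholds (not the exact deployed rule):

```yaml
# Illustrative PrometheusRule - alert name and thresholds are examples.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-disk-predictive
spec:
  groups:
    - name: disk-capacity
      rules:
        - alert: NodeFilesystemFillingUp
          # Linear extrapolation over the last 6h predicts the
          # filesystem will be full within 4 days.
          expr: |
            predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 4 * 24 * 3600) < 0
          for: 1h
          labels:
            severity: warning
```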
3. Backup Strategy ✅ Complete
- Velero - Deployed for Kubernetes cluster backup (2025-12-27)
- CSI Snapshots - Configured Velero to use CSI snapshots exclusively (2026-01-05)
- snapshot-controller - Temporarily deployed v6.3.1 (2026-01-05), then upgraded to v8.2.1 with csi-snapshotter v8.4.0 (2026-01-11)
- Backup critical PVCs (Prometheus 50Gi, Grafana 5Gi, Loki 20Gi)
- Daily PVC backups (2 AM, 30-day retention) - CSI snapshots operational
- Weekly cluster resource backups (3 AM Sunday, 90-day retention)
- Velero backup monitoring alerts (7 PrometheusRule alerts)
- Fixed VolumeSnapshot failures - Upgraded snapshot-controller to v8.2.1, csi-snapshotter to v8.4.0 (2026-01-11)
- LocalStack Sync Wave Fix - LocalStack at wave -7, before Velero (-5) ✓
- Schedule regular backup testing - Velero B2 restore tested and validated (2026-01-14)
- Migrate from LocalStack to Backblaze B2 - Production backup storage (2026-01-14, PR #239)
- Test disaster recovery scenarios - Namespace restore with SealedSecrets validated (2026-01-14)
- ArgoCD configuration backup automation - Daily backup schedule at 1:30 AM (2026-01-14)
Note: Kopia file-level backups disabled in favor of CSI snapshots (more efficient for block storage)
Documentation: See Velero Application Guide for complete deployment details and disaster recovery procedures.
🔍 Monitoring & Observability Enhancements
4. Custom Dashboards ✅ Complete
- Custom Grafana Dashboards - 4 dashboards deployed via ConfigMap provisioning (2025-12-28)
- Pi cluster temperature monitoring dashboard (per-node CPU temps with Raspberry Pi 5 specifics)
- Node resource utilization dashboard (CPU, memory, disk per node)
- Loki log volume and ingestion rate dashboard (log analytics and error tracking)
- Create unified "cluster health" dashboard (Pi Cluster Overview with 12 panels)
- Migrate Uncommitted Dashboards to Code - Completed audit, no migration needed (2025-12-28)
- Audit Grafana UI for any manually created or modified dashboards (30 total, all in ConfigMaps)
- Export uncommitted dashboards as JSON (N/A - no uncommitted dashboards found)
- Create ConfigMap manifests for exported dashboards (N/A - all 30 already in code)
- Add to kustomization and deploy via GitOps (N/A - all already deployed)
- Verify dashboards load correctly after migration (All 30 dashboards confirmed via sidecar)
- Document dashboard creation and modification workflow (Added comprehensive audit section)
- Network utilization dashboard - Deployed initial cluster-wide network utilization view (2026-02-05)
- Storage performance metrics (iSCSI latency, IOPS, throughput) - Dashboard deployed (PR #383, 2026-02-05)
- Application performance monitoring (APM) dashboard - 8-row overview with service health, CPU/memory, blackbox endpoints, API server, network I/O, saturation (2026-02-13)
Documentation: See Grafana Dashboards Guide for dashboard details.
5. Metrics Server Deployment ✅ Complete
- Metrics Server - Deployed for kubectl top and HPA (2025-12-28)
- Deploy metrics-server for kubectl top commands
- Enable Horizontal Pod Autoscaler (HPA) capabilities
- Configure for resource-constrained Pi environment (50m CPU / 100Mi RAM)
- Prometheus ServiceMonitor integration
6. Log-Based Alerting ✅ ENABLED (2026-03-01)
- Loki Ruler Alerting - Enabled via structuredConfig (rulerConfig ignored when ruler.enabled=false)
- Set up Loki alerting rules for error patterns (HighErrorLogRate, CriticalErrorLogs)
- Alert on CrashLoopBackOff events (CrashLoopBackOffDetected)
- Alert on OOMKilled events (OOMKilledDetected)
- Alert on persistent pod failures (PersistentPodRestarts)
- Create log-based SLO monitoring (Error rate tracking via HighErrorLogRate)
- Additional alerts: HTTP 5xx errors, DB connection errors, auth failures, security events
Status: 9 LogQL rules in 4 groups deployed as ConfigMap with loki_rule label. k8s-sidecar loads rules to /rules/fake/ for embedded ruler in singleBinary mode. Alerts route to AlertManager (PR #489, 2026-03-01).
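The key gotcha noted above is that the chart's `rulerConfig` block is ignored when `ruler.enabled=false`, so the ruler settings go under `loki.structuredConfig` instead. A hedged sketch of that values fragment (paths and the AlertManager service name are assumptions and may differ from the deployed config):

```yaml
# Illustrative values fragment for the grafana/loki chart in singleBinary
# mode - exact keys and service names may differ from the deployed config.
loki:
  structuredConfig:
    ruler:
      storage:
        type: local
        local:
          directory: /rules   # k8s-sidecar drops rule files into /rules/fake/
      # AlertManager service name below is an assumption, not confirmed.
      alertmanager_url: http://kube-prometheus-stack-alertmanager.monitoring:9093
      enable_api: true
```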
Documentation: See Loki Application Guide for complete deployment details including log-based alerting.
🛡️ Security & Compliance
7. Security Scanning & Runtime Protection ✅ Complete
- Trivy Operator - Container vulnerability scanning (deployed 2026-01-05, chart 0.31.0)
- ServiceMonitor configured for Prometheus metrics
- VulnerabilityReports available via kubectl
- Scanning all cluster images automatically
- Node-collector tolerations for control-plane scanning (PR #345, 2026-01-30)
- Falco - Runtime security monitoring (deployed 2026-01-29, chart 8.0.1)
- Modern eBPF driver for ARM64 efficiency
- DaemonSet running on all nodes including control-plane
- Falcosidekick with AlertManager and Loki integration
- Web UI at falco.k8s.n37.ca (PR #340)
- Custom rules for homelab (cryptocurrency mining, reverse shell detection)
- PrometheusRules for security alerts
- NetworkPolicy configured (PR #339, #344)
- OPA Gatekeeper - Policy enforcement and admission control (deployed 2026-02-06, chart 3.21.1)
- 5 ConstraintTemplates: resource limits, allowed repos, required labels, block NodePort, container limits
- All constraints switched to deny mode (0 violations, 2026-02-07)
- Pi-optimized: 1 replica, 100m/256Mi requests, 500m/512Mi limits
- Prometheus metrics with ServiceMonitor
- NetworkPolicy configured
- System namespaces exempted (kube-system, argocd, gatekeeper-system)
- Security policy definitions for workloads
- Compliance reporting and alerting (PSS Baseline + Restricted alerts, weekly CronJob summary to AlertManager)
- Create Grafana dashboard for vulnerability trends (completed 2026-02-08, PRs #410-412: fixed NetworkPolicy HBONE, Gatekeeper exemption, SBOM bug)
Documentation: See Trivy Operator Guide and Vulnerability Remediation Guide for details.
8. Secrets Management ✅ Complete
- Evaluation Complete - Sealed Secrets recommended for homelab (2026-01-13)
- Sealed Secrets: 1 pod, 9Mi RAM, simple, GitOps-native
- External Secrets: 3 pods, 69Mi RAM, complex, requires backend
- Sealed Secrets Deployed - bitnami-labs/sealed-secrets v2.16.2 (2026-01-13)
- Secrets Migrated to SealedSecrets (2026-01-14)
- unipoller-secret, external-dns (cloudflare + unifi), alertmanager-smtp-credentials
- snmp-exporter-credentials, cert-manager cloudflare token, synology-csi client-info
- pihole-web-password (8 secrets total)
- External Secrets Operator Removed - Evaluation complete, not needed (2026-01-14)
- Secrets Directory Cleaned - Only bootstrap secret (ArgoCD SSH key) remains (2026-01-14)
- Documentation Updated - CLAUDE_NOTES.md and secrets/README.md updated
- Set up SealedSecrets sealing key rotation automation - SealedSecrets controller key rotation enabled (30d, 2026-02-05); cert-manager separately handles TLS cert renewal automatically
- Create runbook for adding new SealedSecrets (added to SEALED-SECRETS.md, PR #489, 2026-03-01)
Documentation: See Secrets Management Guide for complete procedures including rotation and disaster recovery.
9. Network Policies ✅ COMPLETE (2026-01-25)
- Define NetworkPolicies for namespace isolation (18 namespaces)
- Implement ingress/egress rules for sensitive workloads
- localstack: Allow velero, ingress-nginx, prometheus; egress DNS only
- unipoller: Allow prometheus; egress DNS + UniFi controller
- loki: Allow alloy, prometheus, grafana; egress DNS + alertmanager + K8s API
- trivy-system: Allow prometheus; egress DNS + K8s API + registries
- velero: Allow prometheus; egress DNS + localstack + B2 + K8s API
- argo-workflows: Allow ingress-nginx, prometheus; egress DNS + K8s API + B2 (2026-01-24)
- cert-manager: Allow webhook validation, prometheus; egress DNS + K8s API + Let's Encrypt + Cloudflare (2026-01-25)
- external-dns: Allow prometheus, internal webhook; egress DNS + K8s API + Cloudflare + UniFi (2026-01-25)
- metallb-system: Allow prometheus, memberlist, webhook; egress DNS + K8s API (2026-01-25)
- ingress-nginx: Allow external traffic, prometheus; egress DNS + K8s API
- istio-system: Allow prometheus, webhook; egress DNS + K8s API + HBONE port 15008
- gatekeeper-system: Allow prometheus, webhook; egress DNS + K8s API
- falco: Allow prometheus, alertmanager, loki; egress DNS + K8s API
- default: Allow ingress-nginx, prometheus; egress DNS + K8s API
- argocd: Allow ingress-nginx, prometheus; egress DNS + K8s API + GitHub
- synology-csi: Allow K8s API; egress DNS + NAS iSCSI
- kube-system: Allow prometheus; egress DNS + K8s API (metrics-server port 10250)
- tigera-operator: Allow prometheus; egress DNS + K8s API
- Test policy enforcement (all tests passed)
- Document network segmentation strategy in k8s-docs-n37 (PR #60, 2026-01-29)
Configuration: See manifests/base/network-policies/ in the homelab repository for all policy definitions.
Documentation: See Network Policies Guide for complete policy definitions and management procedures.
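The per-namespace pattern above mostly reduces to "allow named ingress peers; allow DNS plus a few specific egress targets". A representative sketch in the unipoller style follows; the real policies live in manifests/base/network-policies/, and the controller IP here is illustrative:

```yaml
# Illustrative NetworkPolicy: allow Prometheus scrapes in; allow only
# DNS and the UniFi controller out. Not the exact deployed manifest.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: unipoller
  namespace: unipoller
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
  egress:
    - to:   # cluster DNS in kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    - to:   # UniFi controller - IP and port are illustrative
        - ipBlock:
            cidr: 10.0.1.1/32
      ports:
        - { protocol: TCP, port: 8443 }
```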
🚀 Platform Enhancements
10. Service Mesh ✅ DEPLOYED (2026-01-28)
- Research lightweight service mesh options for Pi cluster
- Evaluate Linkerd (lightweight, Pi-friendly) - Considered but Istio Ambient selected
- Evaluate Istio (full-featured but resource-intensive) - Istio Ambient mode chosen
- Proof-of-concept deployment in test namespace
- Performance impact analysis on Pi 5 cluster (~38m CPU, ~145Mi memory)
- Document decision and implementation plan
- Status: Istio Ambient Mesh deployed with mTLS on 29 pods across 6 namespaces
- Note: All 25 ArgoCD apps Synced and Healthy (OutOfSync resolved 2026-02-05, PRs #379-381)
11. Ingress Enhancements ✅ Complete
- Document current nginx-ingress configuration (Updated network-info.md with all 5 Ingresses, rate limits, hardening config)
- Implement rate limiting for public endpoints (Already configured: 50-100 RPS + 20 conn limits on all Ingresses)
- Add ModSecurity WAF rules - Deferred: 256Mi memory limit insufficient for OWASP CRS (~512-768Mi needed); not justified for private 10.0.10.0/24 network
- Configure geo-blocking if needed - N/A: All services on private network (MetalLB IP 10.0.10.10 is RFC 1918), no public ingress
- Monitor ingress performance and errors (Created 7 PrometheusRule alerts + Grafana dashboard with 20 panels)
🏗️ Infrastructure & DevOps
12. GitOps Enhancements
- Renovate - Automated dependency updates for Helm charts (deployed 2026-01-23)
- GitHub App installed and configured
- ArgoCD Application manifest scanning (Helm charts)
- Docker image tag updates in Kubernetes manifests
- Grouped updates (ArgoCD, monitoring, networking, security, backup)
- Weekend schedule (Sat/Sun 6am-9pm) to minimize disruption
- Pre-commit hooks for Kubernetes manifest validation (kubeval, kustomize)
- Automated testing pipeline for infrastructure changes
- Expand GitOps workflow documentation
- Consider multi-cluster ArgoCD setup for dev/staging
Configuration: See renovate.json in the homelab repository.
13. Development & CI/CD Tools - Argo Workflows ✅ DEPLOYED (2026-01-24)
Phase 1: Argo Workflows Deployment ✅ Complete
- Deploy Argo Workflows v3.7.8 (Helm chart 0.47.1)
- Configure sync-wave: -8 (before LocalStack (-7) and Velero (-5))
- Set up artifact repository (Backblaze B2) ✅ Fixed (PRs #287-289, 2026-01-24)
- Configure resource limits for Pi cluster constraints:
- Controller: 50m CPU / 128Mi RAM (request), 100m / 256Mi (limit)
- Server: 25m CPU / 64Mi RAM (request), 50m / 128Mi (limit)
- Enable Prometheus ServiceMonitor for workflow metrics
- NetworkPolicy enabled ✅ Fixed K8s API egress (PR #291, 2026-01-24)
- Ingress configured at workflows.k8s.n37.ca (PR #293, 2026-01-24)
- Create Grafana dashboards for workflow monitoring (2026-01-29)
- Set up AlertManager rules for workflow failures (2026-01-30, PR #354)
Phase 2: Workflow Integration
- ARM64 container image build workflows
- Automated testing pipelines for infrastructure changes
- Monthly backup validation workflows (Velero restore tests)
- Security vulnerability scanning workflows (Trivy integration)
- Infrastructure compliance scan workflows
Phase 3: Advanced Features
- SSO integration via oauth2-proxy
- Workflow templates library
- Automated dependency updates (Renovate integration)
- Multi-cluster workflow support (if dev/staging clusters added)
Alternative Tools Considered:
- Evaluate Tekton (more complex, higher resource usage)
- Evaluate Gitea vs GitLab for self-hosted git
- Harbor - Container registry with vulnerability scanning
- Build and deployment automation for ARM64 custom containers
🌐 Network & Access Management
14. CoreDNS Customization
- Document current CoreDNS configuration
- Custom DNS records for internal services
- DNS-based service discovery patterns
- DNS monitoring and troubleshooting tools
- Consider DNS caching optimizations
15. VPN & Remote Access
- Evaluate Tailscale vs WireGuard for cluster access
- Deploy chosen VPN solution
- oauth2-proxy - Single Sign-On (SSO) integration
- Multi-factor authentication for critical services
- Document remote access policies and procedures
- VPN performance monitoring
🔧 Operational Improvements
16. Documentation Enhancements
- Create operational runbooks for common tasks (pod restarts, rollbacks, etc.)
- Document disaster recovery procedures (node failure, control plane failure)
- Capacity planning documentation with growth projections
- Create network topology diagrams to complement the existing network-info.md documentation
- Performance baseline documentation
- Document on-call procedures and escalation paths
- Create k8s-docs-n37 guides for: cert-manager, metallb, ingress-nginx, localstack
17. Testing & Validation
- Chaos engineering with Litmus (lighter than Chaos Monkey)
- Load testing framework for applications
- Backup and restore testing automation (monthly validation)
- Network failure simulation and recovery testing
- Performance regression testing
- Test node drain and pod eviction scenarios
18. Resource Optimization
- Audit resource requests/limits across all workloads (7 workloads adjusted, 2026-02-11)
- Identify over-provisioned pods (resource right-sizing audit complete, net +928Mi requests)
- Implement pod resource quotas per namespace
- Storage capacity planning and alerting
- Network bandwidth monitoring and optimization
- Consider implementing Vertical Pod Autoscaler (VPA)
🌟 Nice to Have
19. Pi Cluster Specific Monitoring
- Power consumption tracking (requires PoE monitoring or UPS integration)
- Track PoE power draw per node
- NVMe thermal throttling detection
- Track undervoltage events
- ARM64-specific performance optimizations
20. Application Deployments
- Home Assistant integration
- Private container registry (Harbor or similar)
- Internal wiki or knowledge base
- Status page (Uptime Kuma or similar)
- Internal chat/collaboration tool
21. Observability Maturity Enhancements
- Distributed Tracing - Evaluate Jaeger or Tempo for trace collection
- Continuous Profiling - Pyroscope for application performance profiling
- Service Level Objectives (SLOs) - Define and monitor SLOs for critical services
- Error Budget Tracking - Automated SLO/error budget reporting
- Anomaly Detection - ML-based anomaly detection for metrics (Prometheus AI/ML)
- Synthetic Monitoring - Automated user journey testing
22. Disaster Recovery Testing
- Monthly DR Drills - Automated disaster recovery validation
- Chaos Engineering - Controlled failure injection (Litmus)
- Velero Restore Testing - Automated monthly PVC restore validation
- Network Partition Testing - Simulate network failures
- Node Failure Scenarios - Test cluster resilience to node loss
- Control Plane Failure - Test etcd backup/restore procedures
- DR Runbook Automation - Convert manual runbooks to Argo Workflows
23. Cost Optimization & Efficiency
- Resource Right-Sizing - Analyze actual vs requested resources
- Spot/Preemptible Instances - Not applicable for bare metal, document for future cloud consideration
- Storage Optimization - Compress old logs, optimize retention policies
- Network Egress Optimization - Monitor and optimize outbound traffic
- Power Consumption Tracking - PoE monitoring and efficiency analysis
- Carbon Footprint - Calculate and optimize cluster carbon footprint
24. LLM Hosting & AI Infrastructure (Planning)
- GPU Hardware - Add GPU-capable unit to cluster (planned)
- Evaluate inference frameworks - vLLM, Ollama, LocalAI, llama.cpp for ARM64/GPU
- Kubernetes GPU scheduling - NVIDIA device plugin or equivalent
- Model storage - Plan NFS/iSCSI storage for large model weights (7B-70B+ parameter models)
- Resource isolation - Dedicated node pool or taints/tolerations for GPU workloads
- API gateway - OpenAI-compatible API endpoint for model serving
- Monitoring - GPU utilization, inference latency, token throughput dashboards
- Model management - Version control and deployment pipeline for models
- Network considerations - High-bandwidth model loading, inference API exposure
📅 Implementation Priorities
Items are organized by priority, not by timeline. Focus on:
Phase 1: Foundation & Reliability ✅ Complete
- ✅ Backup strategy (Velero + critical PVC backups)
- ✅ Enhanced alerting (AlertManager notifications)
- ✅ Metrics server deployment
- ✅ Blackbox exporter for endpoint monitoring
Phase 2: Security & Observability ✅ Complete
- ✅ Security scanning (Trivy Operator)
- ✅ Secrets management migration (SealedSecrets)
- ✅ Blackbox exporter for endpoint monitoring
- ✅ Custom Grafana dashboards
Phase 3: Advanced Features ✅ Complete
- ✅ GitOps enhancements (Renovate deployed 2026-01-23)
- ✅ Network policies implementation (18 namespaces isolated)
- ✅ Development tools and CI/CD (Argo Workflows deployed 2026-01-24)
- ✅ Service mesh (Istio Ambient deployed 2026-01-28)
Phase 4: Optimization & Expansion
- Resource optimization and VPA
- Chaos engineering and resilience testing
- Advanced networking and VPN
- Additional application deployments
🔄 ArgoCD Sync Wave Order
Wave -100: tigera-operator (CNI foundation - ArgoCD-managed)
Wave -50: argocd (self-management)
Wave -35: metal-lb (networking foundation)
Wave -30: synology-csi (storage driver)
Wave -25: sealed-secrets (secrets management)
Wave -20: unipoller (UniFi metrics collection)
Wave -15: kube-prometheus-stack (monitoring stack)
Wave -12: loki (log aggregation)
Wave -11: alloy (log collection, replaced Promtail 2026-03-01)
Wave -10: cert-manager, external-dns, metrics-server (certificates & DNS & metrics)
Wave -8: argo-workflows (CI/CD)
Wave -7: localstack (S3 mock for Velero)
Wave -6: gatekeeper (admission control, policy enforcement)
Wave -5: velero, falco (backup, runtime security)
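Wave ordering is driven by the standard ArgoCD sync-wave annotation on each Application. For example, using a value from the table above:

```yaml
# ArgoCD Application annotated with its sync wave - velero syncs at
# wave -5, after its storage dependencies (LocalStack at -7) are healthy.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: velero
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-5"
```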
📋 Notes
- Resource Constraints: All implementations must consider the Pi 5 cluster constraints (80GB RAM total, 20 ARM cores)
- Testing Strategy: Test all implementations in a development namespace before production deployment
- Documentation First: Document all configurations and procedures for maintainability in this docs site
- GitOps Workflow: All changes must go through PR workflow in the homelab repository
- Regular Reviews: Review and update this TODO list monthly based on cluster evolution
- Monitoring First: Ensure monitoring is in place before deploying new workloads
🔗 References
- homelab/TODO.md - Infrastructure repository TODO list (source of truth — this page mirrors it)
- homelab/Hardware.md - Cluster hardware specifications
- homelab/network-info.md - Comprehensive network configuration
- k8s-docs-n37 Application Guides - Per-application deployment documentation (see `docs/applications/` in this site)