
Cluster Maintenance

This guide covers procedures for gracefully shutting down and bringing up the Kubernetes homelab cluster, including NAS maintenance scenarios where iSCSI volumes must be safely detached.

Cluster Topology

Node           IP Address   Role           SSH User
control-plane  10.0.10.214  Control Plane  imcbeth
node01         10.0.10.235  Worker         imcbeth
node02         10.0.10.211  Worker         imcbeth
node03         10.0.10.244  Worker         imcbeth
node04         10.0.10.220  Worker         imcbeth

SSH Key: ~/.ssh/id_ed25519_k8s

NAS-Dependent Workloads

These workloads have PersistentVolumeClaims backed by the Synology NAS via iSCSI (synology-iscsi-retain StorageClass):

Workload      Namespace     Type         PVC Size
Prometheus    default       StatefulSet  50Gi
Grafana       default       Deployment   5Gi
Loki          loki          StatefulSet  20Gi
Trivy Server  trivy-system  StatefulSet  5Gi
Falco Redis   falco         StatefulSet  2Gi

Graceful Shutdown

Use this procedure when the NAS needs maintenance, or for any planned full cluster shutdown.

Critical

Never power off the NAS while iSCSI volumes are attached. This risks filesystem corruption on all persistent volumes.

Step 1: Disable ArgoCD Auto-Sync

With auto-sync enabled, ArgoCD's self-heal will revert any manual scale-down. Disable it on all applications first:

kubectl get applications.argoproj.io -n argocd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
while read app; do
  kubectl patch application "$app" -n argocd --type=merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
done
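
To confirm the patch took effect, you can print each app next to its automated policy and flag any non-empty second field. The check_autosync helper below is our sketch (the function name is not an ArgoCD convention):

```shell
# check_autosync: reads "name automated-policy" lines on stdin and reports
# any app that still has spec.syncPolicy.automated set (non-empty 2nd field).
check_autosync() {
  awk 'NF > 1 {print "still automated: " $1; bad = 1} END {exit bad ? 1 : 0}'
}

# Feed it the live state (assumes the same apps as above):
# kubectl get applications.argoproj.io -n argocd \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.syncPolicy.automated}{"\n"}{end}' \
#   | check_autosync && echo "auto-sync disabled on all apps"
```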

Step 2: Scale Down NAS-Dependent Workloads

Prometheus Operator

The Prometheus Operator continuously reconciles the Prometheus StatefulSet. You must scale the operator to 0 before scaling the StatefulSet, or it will be immediately restored to the desired replica count.

# Scale operator FIRST
kubectl scale deployment kube-prometheus-stack-operator -n default --replicas=0
sleep 5

# Then scale the workloads
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n default --replicas=0
kubectl scale deployment kube-prometheus-stack-grafana -n default --replicas=0
kubectl scale statefulset loki -n loki --replicas=0
kubectl scale statefulset trivy-server -n trivy-system --replicas=0
kubectl scale statefulset falco-falcosidekick-ui-redis -n falco --replicas=0

Step 3: Verify Volume Detachment

Wait for all pods to terminate, then confirm the iSCSI volumes are detached:

# Wait for pods to terminate
kubectl wait --for=delete pod \
  -l app.kubernetes.io/name=prometheus -n default --timeout=120s

# Verify NO VolumeAttachments remain
kubectl get volumeattachments

Danger

Do NOT proceed to node shutdown until this command returns no results. Any remaining csi.san.synology.com VolumeAttachments indicate volumes still mounted on a node.
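
Rather than re-running the command by hand, you can poll until it returns nothing. wait_for_detach and its injectable list command are our naming, not part of the cluster; the default query matches the check above and gives up after roughly two minutes:

```shell
# wait_for_detach: poll until the given command prints no output
# (default: list all VolumeAttachments), up to ~2 minutes.
wait_for_detach() {
  list_cmd="${1:-kubectl get volumeattachments -o name}"
  tries=0
  while [ "$tries" -lt 24 ]; do
    if [ -z "$(eval "$list_cmd" 2>/dev/null)" ]; then
      echo "all volumes detached"
      return 0
    fi
    sleep 5
    tries=$((tries + 1))
  done
  echo "TIMEOUT: volumes still attached" >&2
  return 1
}

# wait_for_detach   # uses the kubectl query by default
```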

Step 4: Cordon and Drain Nodes

# Cordon all nodes
kubectl cordon control-plane node01 node02 node03 node04

# Drain workers
for node in node01 node02 node03 node04; do
  kubectl drain $node \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --force \
    --timeout=120s
done

If a Gatekeeper PodDisruptionBudget blocks the drain, delete it; ArgoCD restores it automatically on the next startup:

kubectl delete pdb -n gatekeeper-system --all

Step 5: Shutdown Nodes

Workers first, control plane last:

SSH_KEY=~/.ssh/id_ed25519_k8s
SSH_OPTS="-i $SSH_KEY -o ConnectTimeout=10"

# Workers (can be parallel)
for ip in 10.0.10.235 10.0.10.211 10.0.10.244 10.0.10.220; do
  ssh $SSH_OPTS imcbeth@$ip "sudo shutdown -h now" &
done
wait

# Verify workers are down
sleep 15
for ip in 10.0.10.235 10.0.10.211 10.0.10.244 10.0.10.220; do
  ping -c 1 -W 2 $ip > /dev/null 2>&1 \
    && echo "$ip: STILL UP" \
    || echo "$ip: DOWN"
done

# Control plane LAST
ssh $SSH_OPTS imcbeth@10.0.10.214 "sudo shutdown -h now"

Shutdown Verification Checklist

  • All 5 nodes unreachable via ping
  • All iSCSI VolumeAttachments released before shutdown
  • ArgoCD auto-sync disabled on all apps
  • NAS is safe for maintenance

Startup and Health Check

Boot Order

Boot Sequence
  1. NAS first — iSCSI target service must be online
  2. Control plane (10.0.10.214) — wait for kubectl get nodes to respond
  3. Workers (node01–04) — can power on simultaneously
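
Step 2 of the boot sequence can be automated with a small wait loop. wait_for_api and its injectable probe command are our sketch (default probe: kubectl get nodes):

```shell
# wait_for_api: block until the API server answers the probe command.
wait_for_api() {
  probe="${1:-kubectl get nodes}"
  until eval "$probe" >/dev/null 2>&1; do
    echo "waiting for API server..."
    sleep 10
  done
  echo "API server up"
}

# wait_for_api   # run after powering on 10.0.10.214
```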

Step 1: Verify All Nodes Ready

kubectl get nodes -o wide

All 5 nodes should show Ready. They will show SchedulingDisabled from the pre-shutdown cordon — this is expected and fixed in the next step.

Kernel Updates

Unattended-upgrades may run during boot and install a new kernel. If node behavior seems off, check the running version with uname -r.
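
To spot a node that booted into a different kernel, compare versions across all five nodes. The kernels_match helper is our naming; the commented loop reuses the SSH key and user from the shutdown steps:

```shell
# kernels_match: reads "host kernel-version" lines on stdin and reports
# whether every node runs the same kernel.
kernels_match() {
  n=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
  if [ "$n" -le 1 ]; then
    echo "kernels consistent"
  else
    echo "MISMATCH: $n distinct kernel versions"
  fi
}

# for ip in 10.0.10.214 10.0.10.235 10.0.10.211 10.0.10.244 10.0.10.220; do
#   echo "$ip $(ssh -i ~/.ssh/id_ed25519_k8s imcbeth@$ip uname -r)"
# done | kernels_match
```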

Step 2: Uncordon All Nodes

kubectl uncordon control-plane node01 node02 node03 node04

Step 3: Re-enable ArgoCD Auto-Sync

kubectl get applications.argoproj.io -n argocd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
while read app; do
  kubectl patch application "$app" -n argocd --type=merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
done

Wait 60–90 seconds for ArgoCD to reconcile, then check:

kubectl get applications.argoproj.io -n argocd \
-o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'
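
Eyeballing that output works for a handful of apps; for a pass/fail answer you can filter it. all_synced is our helper name, parsing the same custom-columns output (header included):

```shell
# all_synced: prints any app that is not both Synced and Healthy;
# exits nonzero if any such app exists.
all_synced() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") {print "not ready: " $1; bad = 1}
       END {exit bad ? 1 : 0}'
}

# kubectl get applications.argoproj.io -n argocd \
#   -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status' \
#   | all_synced && echo "all applications Synced and Healthy"
```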

Step 4: Handle Stuck Syncs

ArgoCD may enter auto-sync backoff after the outage. If apps show OutOfSync but aren't syncing, trigger manual syncs:

kubectl patch application <app-name> -n argocd --type=merge \
-p '{"operation":{"initiatedBy":{"username":"manual"},"sync":{"syncStrategy":{"apply":{"force":false}},"prune":true}}}'

kube-prometheus-stack Hook Issue

The kube-prometheus-stack sync frequently gets stuck waiting on 174+ PreSync admission webhook hooks after a restart. To fix:

# Clear the stuck operation
kubectl patch application kube-prometheus-stack -n argocd --type json \
-p '[{"op":"remove","path":"/status/operationState"}]'
kubectl patch application kube-prometheus-stack -n argocd --type json \
-p '[{"op":"remove","path":"/operation"}]'

# Manually restore the workloads
kubectl scale deployment kube-prometheus-stack-operator -n default --replicas=1
kubectl scale deployment kube-prometheus-stack-grafana -n default --replicas=1
sleep 5
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n default --replicas=1

Step 5: Verify NAS Volumes

# All PVCs should be Bound
kubectl get pvc -A

# VolumeAttachments should exist (one per PVC)
kubectl get volumeattachments

Step 6: Verify Workloads

# NAS-dependent workloads
echo -n "Prometheus Operator: "; kubectl get deploy kube-prometheus-stack-operator -n default -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Prometheus: "; kubectl get sts prometheus-kube-prometheus-stack-prometheus -n default -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Grafana: "; kubectl get deploy kube-prometheus-stack-grafana -n default -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Loki: "; kubectl get sts loki -n loki -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Trivy Server: "; kubectl get sts trivy-server -n trivy-system -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Falco Redis: "; kubectl get sts falco-falcosidekick-ui-redis -n falco -o jsonpath='{.status.readyReplicas}'; echo

All should report 1.

Step 7: DaemonSet Health

kubectl get ds -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady'

Expected ready counts:

  • 5 (all nodes): calico-node, csi-node-driver, node-exporter, falco, istio-cni-node, ztunnel, kube-proxy, promtail, metallb-speaker
  • 4 (workers only): loki-canary, synology-csi-node
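
Comparing DESIRED against READY by eye across ~11 DaemonSets is error-prone; a one-line filter over the same output does it mechanically. ds_mismatch is our helper name:

```shell
# ds_mismatch: reads the NS NAME DESIRED READY rows above (header included)
# and prints any DaemonSet whose ready count lags its desired count.
ds_mismatch() {
  awk 'NR > 1 && $3 != $4 {print $1 "/" $2 ": " $4 "/" $3 " ready"; bad = 1}
       END {exit bad ? 1 : 0}'
}

# kubectl get ds -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady' \
#   | ds_mismatch && echo "all DaemonSets fully scheduled"
```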

Step 8: Metrics API

kubectl top nodes

If unavailable, restart the metrics server:

kubectl rollout restart deployment metrics-server -n kube-system

Step 9: Non-Running Pods

kubectl get pods -A --field-selector='status.phase!=Running,status.phase!=Succeeded'

Expected: No results. If falco-falcosidekick-ui is stuck in Init:Error, delete the pod to reschedule it near its Redis instance:

kubectl delete pod -n falco -l app.kubernetes.io/name=falcosidekick-ui

Step 10: External Access

# Verify LoadBalancer IP
kubectl get svc -A --field-selector=spec.type=LoadBalancer

# Test ingress endpoints
curl -sk https://argocd.k8s.n37.ca/ | head -5
curl -sk https://grafana.k8s.n37.ca/ | head -5
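
head -5 shows the response body but not whether the request actually succeeded; checking status codes is more reliable. check_http is our helper name, flagging anything outside the 200-399 range:

```shell
# check_http: reads "url status-code" lines on stdin and flags any
# code outside 200-399; exits nonzero if any endpoint failed.
check_http() {
  awk '$2 < 200 || $2 >= 400 {print "FAIL: " $1 " returned " $2; bad = 1}
       END {exit bad ? 1 : 0}'
}

# for url in https://argocd.k8s.n37.ca/ https://grafana.k8s.n37.ca/; do
#   printf '%s %s\n' "$url" "$(curl -sk -o /dev/null -w '%{http_code}' "$url")"
# done | check_http && echo "endpoints OK"
```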

Startup Verification Checklist

  • 5/5 nodes Ready and uncordoned
  • ArgoCD auto-sync re-enabled
  • 24+ apps Synced + Healthy
  • 5/5 PVCs Bound with VolumeAttachments
  • All NAS workloads running (Prometheus, Grafana, Loki, Trivy, Falco Redis)
  • DaemonSets at expected counts
  • Metrics API responsive (kubectl top nodes)
  • No stuck pods
  • LoadBalancer IP assigned (10.0.10.10)
  • Gatekeeper PDB restored
  • External endpoints accessible

Quick Reference: Claude Code Skills

These procedures are also available as Claude Code slash commands:

Command               Purpose
/cluster-shutdown     Execute graceful cluster shutdown
/cluster-healthcheck  Run post-maintenance health check