
Cluster Maintenance

This guide covers procedures for gracefully shutting down and bringing up the Kubernetes homelab cluster, including NAS maintenance scenarios where iSCSI volumes must be safely detached.

Cluster Topology

Node           IP Address   Role           SSH User
control-plane  10.0.10.214  Control Plane  imcbeth
node01         10.0.10.235  Worker         imcbeth
node02         10.0.10.211  Worker         imcbeth
node03         10.0.10.244  Worker         imcbeth
node04         10.0.10.220  Worker         imcbeth

SSH Key: ~/.ssh/id_ed25519_k8s

NAS-Dependent Workloads

These workloads have PersistentVolumeClaims backed by the Synology NAS via iSCSI (synology-iscsi-retain StorageClass):

Workload      Namespace     Type         PVC Size
Prometheus    default       StatefulSet  50Gi
Grafana       default       Deployment   5Gi
Loki          loki          StatefulSet  20Gi
Trivy Server  trivy-system  StatefulSet  5Gi
Falco Redis   falco         StatefulSet  2Gi

Graceful Shutdown

Use this procedure when the NAS needs maintenance, or for any planned full cluster shutdown.

Critical

Never power off the NAS while iSCSI volumes are attached. This risks filesystem corruption on all persistent volumes.

Step 1: Disable ArgoCD Auto-Sync

With auto-sync enabled, ArgoCD's self-heal will revert any manual scale-down. Disable it on all applications first:

kubectl get applications.argoproj.io -n argocd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
while read app; do
  kubectl patch application "$app" -n argocd --type=merge \
    -p '{"spec":{"syncPolicy":{"automated":null}}}'
done
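
To confirm the patch took effect, you can print each app next to its automated policy and flag any non-empty second field. The check_autosync helper below is our sketch (the function name is not an ArgoCD convention):

```shell
# check_autosync: reads "name automated-policy" lines on stdin and reports
# any app that still has spec.syncPolicy.automated set (non-empty 2nd field).
check_autosync() {
  awk 'NF > 1 {print "still automated: " $1; bad = 1} END {exit bad ? 1 : 0}'
}

# Feed it the live state (assumes the same apps as above):
# kubectl get applications.argoproj.io -n argocd \
#   -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.syncPolicy.automated}{"\n"}{end}' \
#   | check_autosync && echo "auto-sync disabled on all apps"
```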

Step 2: Scale Down NAS-Dependent Workloads

Prometheus Operator

The Prometheus Operator continuously reconciles the Prometheus StatefulSet. You must scale the operator to 0 before scaling the StatefulSet, or it will be immediately restored to the desired replica count.

# Scale operator FIRST
kubectl scale deployment kube-prometheus-stack-operator -n default --replicas=0
sleep 5

# Then scale the workloads
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n default --replicas=0
kubectl scale deployment kube-prometheus-stack-grafana -n default --replicas=0
kubectl scale statefulset loki -n loki --replicas=0
kubectl scale statefulset trivy-server -n trivy-system --replicas=0
kubectl scale statefulset falco-falcosidekick-ui-redis -n falco --replicas=0

Step 3: Verify Volume Detachment

Wait for all pods to terminate, then confirm the iSCSI volumes are detached:

# Wait for pods to terminate
kubectl wait --for=delete pod \
  -l app.kubernetes.io/name=prometheus -n default --timeout=120s

# Verify NO VolumeAttachments remain
kubectl get volumeattachments

Danger

Do NOT proceed to node shutdown until this command returns no results. Any remaining csi.san.synology.com VolumeAttachments indicate volumes still mounted on a node.
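
Rather than re-running the command by hand, you can poll until it returns nothing. wait_for_detach and its injectable list command are our naming, not part of the cluster; the default query matches the check above and gives up after roughly two minutes:

```shell
# wait_for_detach: poll until the given command prints no output
# (default: list all VolumeAttachments), up to ~2 minutes.
wait_for_detach() {
  list_cmd="${1:-kubectl get volumeattachments -o name}"
  tries=0
  while [ "$tries" -lt 24 ]; do
    if [ -z "$(eval "$list_cmd" 2>/dev/null)" ]; then
      echo "all volumes detached"
      return 0
    fi
    sleep 5
    tries=$((tries + 1))
  done
  echo "TIMEOUT: volumes still attached" >&2
  return 1
}

# wait_for_detach   # uses the kubectl query by default
```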

Step 4: Cordon and Drain Nodes

# Cordon all nodes
kubectl cordon control-plane node01 node02 node03 node04

# Drain workers
for node in node01 node02 node03 node04; do
  kubectl drain $node \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --force \
    --timeout=120s
done

If a Gatekeeper PodDisruptionBudget blocks the drain, delete it; ArgoCD restores it automatically on the next startup:

kubectl delete pdb -n gatekeeper-system --all

Step 5: Shutdown Nodes

Workers first, control plane last:

SSH_KEY=~/.ssh/id_ed25519_k8s
SSH_OPTS="-i $SSH_KEY -o ConnectTimeout=10"

# Workers (can be parallel)
for ip in 10.0.10.235 10.0.10.211 10.0.10.244 10.0.10.220; do
  ssh $SSH_OPTS imcbeth@$ip "sudo shutdown -h now" &
done
wait

# Verify workers are down
sleep 15
for ip in 10.0.10.235 10.0.10.211 10.0.10.244 10.0.10.220; do
  ping -c 1 -W 2 $ip > /dev/null 2>&1 \
    && echo "$ip: STILL UP" \
    || echo "$ip: DOWN"
done

# Control plane LAST
ssh $SSH_OPTS imcbeth@10.0.10.214 "sudo shutdown -h now"

Shutdown Verification Checklist

  • All 5 nodes unreachable via ping
  • All iSCSI VolumeAttachments released before shutdown
  • ArgoCD auto-sync disabled on all apps
  • NAS is safe for maintenance

Startup and Health Check

Boot Order

Boot Sequence
  1. NAS first — iSCSI target service must be online
  2. Control plane (10.0.10.214) — wait for kubectl get nodes to respond
  3. Workers (node01–04) — can power on simultaneously
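
Step 2 of the boot sequence can be automated with a small wait loop. wait_for_api and its injectable probe command are our sketch (default probe: kubectl get nodes):

```shell
# wait_for_api: block until the API server answers the probe command.
wait_for_api() {
  probe="${1:-kubectl get nodes}"
  until eval "$probe" >/dev/null 2>&1; do
    echo "waiting for API server..."
    sleep 10
  done
  echo "API server up"
}

# wait_for_api   # run after powering on 10.0.10.214
```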

Step 1: Verify All Nodes Ready

kubectl get nodes -o wide

All 5 nodes should show Ready. They will show SchedulingDisabled from the pre-shutdown cordon — this is expected and fixed in the next step.

Kernel Updates

Unattended-upgrades may run during boot and install a new kernel. If node behavior seems off, check the running version with uname -r.
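
To spot a node that booted into a different kernel, compare versions across all five nodes. The kernels_match helper is our naming; the commented loop reuses the SSH key and user from the shutdown steps:

```shell
# kernels_match: reads "host kernel-version" lines on stdin and reports
# whether every node runs the same kernel.
kernels_match() {
  n=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
  if [ "$n" -le 1 ]; then
    echo "kernels consistent"
  else
    echo "MISMATCH: $n distinct kernel versions"
  fi
}

# for ip in 10.0.10.214 10.0.10.235 10.0.10.211 10.0.10.244 10.0.10.220; do
#   echo "$ip $(ssh -i ~/.ssh/id_ed25519_k8s imcbeth@$ip uname -r)"
# done | kernels_match
```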

Step 2: Uncordon All Nodes

kubectl uncordon control-plane node01 node02 node03 node04

Step 3: Re-enable ArgoCD Auto-Sync

kubectl get applications.argoproj.io -n argocd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | \
while read app; do
  kubectl patch application "$app" -n argocd --type=merge \
    -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
done

Wait 60–90 seconds for ArgoCD to reconcile, then check:

kubectl get applications.argoproj.io -n argocd \
-o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status'
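
Eyeballing that output works for a handful of apps; for a pass/fail answer you can filter it. all_synced is our helper name, parsing the same custom-columns output (header included):

```shell
# all_synced: prints any app that is not both Synced and Healthy;
# exits nonzero if any such app exists.
all_synced() {
  awk 'NR > 1 && ($2 != "Synced" || $3 != "Healthy") {print "not ready: " $1; bad = 1}
       END {exit bad ? 1 : 0}'
}

# kubectl get applications.argoproj.io -n argocd \
#   -o custom-columns='NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status' \
#   | all_synced && echo "all applications Synced and Healthy"
```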

Step 4: Handle Stuck Syncs

ArgoCD may enter auto-sync backoff after the outage. If apps show OutOfSync but aren't syncing, trigger manual syncs:

kubectl patch application <app-name> -n argocd --type=merge \
-p '{"operation":{"initiatedBy":{"username":"manual"},"sync":{"syncStrategy":{"apply":{"force":false}},"prune":true}}}'

kube-prometheus-stack Hook Issue

The kube-prometheus-stack sync frequently gets stuck waiting on 174+ PreSync admission webhook hooks after a restart. To fix:

# Clear the stuck operation
kubectl patch application kube-prometheus-stack -n argocd --type json \
-p '[{"op":"remove","path":"/status/operationState"}]'
kubectl patch application kube-prometheus-stack -n argocd --type json \
-p '[{"op":"remove","path":"/operation"}]'

# Manually restore the workloads
kubectl scale deployment kube-prometheus-stack-operator -n default --replicas=1
kubectl scale deployment kube-prometheus-stack-grafana -n default --replicas=1
sleep 5
kubectl scale statefulset prometheus-kube-prometheus-stack-prometheus -n default --replicas=1

Step 5: Verify NAS Volumes

# All PVCs should be Bound
kubectl get pvc -A

# VolumeAttachments should exist (one per PVC)
kubectl get volumeattachments

Step 6: Verify Workloads

# NAS-dependent workloads
echo -n "Prometheus Operator: "; kubectl get deploy kube-prometheus-stack-operator -n default -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Prometheus: "; kubectl get sts prometheus-kube-prometheus-stack-prometheus -n default -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Grafana: "; kubectl get deploy kube-prometheus-stack-grafana -n default -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Loki: "; kubectl get sts loki -n loki -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Trivy Server: "; kubectl get sts trivy-server -n trivy-system -o jsonpath='{.status.readyReplicas}'; echo
echo -n "Falco Redis: "; kubectl get sts falco-falcosidekick-ui-redis -n falco -o jsonpath='{.status.readyReplicas}'; echo

All should report 1.

Step 7: DaemonSet Health

kubectl get ds -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady'

Expected ready counts:

  • 5 (all nodes): calico-node, csi-node-driver, node-exporter, falco, istio-cni-node, ztunnel, kube-proxy, promtail, metallb-speaker
  • 4 (workers only): loki-canary, synology-csi-node
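
Comparing DESIRED against READY by eye across ~11 DaemonSets is error-prone; a one-line filter over the same output does it mechanically. ds_mismatch is our helper name:

```shell
# ds_mismatch: reads the NS NAME DESIRED READY rows above (header included)
# and prints any DaemonSet whose ready count lags its desired count.
ds_mismatch() {
  awk 'NR > 1 && $3 != $4 {print $1 "/" $2 ": " $4 "/" $3 " ready"; bad = 1}
       END {exit bad ? 1 : 0}'
}

# kubectl get ds -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,DESIRED:.status.desiredNumberScheduled,READY:.status.numberReady' \
#   | ds_mismatch && echo "all DaemonSets fully scheduled"
```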

Step 8: Metrics API

kubectl top nodes

If unavailable, restart the metrics server:

kubectl rollout restart deployment metrics-server -n kube-system

Step 9: Non-Running Pods

kubectl get pods -A --field-selector='status.phase!=Running,status.phase!=Succeeded'

Expected: No results. If falco-falcosidekick-ui is stuck in Init:Error, delete the pod to reschedule it near its Redis instance:

kubectl delete pod -n falco -l app.kubernetes.io/name=falcosidekick-ui

Step 10: External Access

# Verify LoadBalancer IP
kubectl get svc -A --field-selector=spec.type=LoadBalancer

# Test ingress endpoints
curl -sk https://argocd.k8s.n37.ca/ | head -5
curl -sk https://grafana.k8s.n37.ca/ | head -5
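
head -5 shows the response body but not whether the request actually succeeded; checking status codes is more reliable. check_http is our helper name, flagging anything outside the 200-399 range:

```shell
# check_http: reads "url status-code" lines on stdin and flags any
# code outside 200-399; exits nonzero if any endpoint failed.
check_http() {
  awk '$2 < 200 || $2 >= 400 {print "FAIL: " $1 " returned " $2; bad = 1}
       END {exit bad ? 1 : 0}'
}

# for url in https://argocd.k8s.n37.ca/ https://grafana.k8s.n37.ca/; do
#   printf '%s %s\n' "$url" "$(curl -sk -o /dev/null -w '%{http_code}' "$url")"
# done | check_http && echo "endpoints OK"
```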

Startup Verification Checklist

  • 5/5 nodes Ready and uncordoned
  • ArgoCD auto-sync re-enabled
  • 24+ apps Synced + Healthy
  • 5/5 PVCs Bound with VolumeAttachments
  • All NAS workloads running (Prometheus, Grafana, Loki, Trivy, Falco Redis)
  • DaemonSets at expected counts
  • Metrics API responsive (kubectl top nodes)
  • No stuck pods
  • LoadBalancer IP assigned (10.0.10.10)
  • Gatekeeper PDB restored
  • External endpoints accessible

Quick Reference: Claude Code Skills

These procedures are also available as Claude Code slash commands:

Command               Purpose
/cluster-shutdown     Execute graceful cluster shutdown
/cluster-healthcheck  Run post-maintenance health check