Velero

Velero provides backup and disaster recovery capabilities for the Raspberry Pi 5 Kubernetes homelab cluster, protecting critical persistent volumes and cluster resources.

Overview

  • Namespace: velero
  • Helm Chart: vmware-tanzu/velero
  • Chart Version: 11.3.2
  • App Version: v1.17.2
  • Deployment: Managed by ArgoCD
  • Backup Storage: Backblaze B2 (production)
  • Backup Strategy: Daily PVC backups + Weekly cluster resource backups

Architecture

┌─────────────────────────────────────┐
│ Velero Server (1 pod)               │
│ - 100m CPU / 256Mi RAM              │
│ - Manages backup/restore operations │
│ - CSI snapshot coordination         │
└──────────────────┬──────────────────┘
                   │
                ┌──┴───────────────────────────┐
                ↓                              ↓
┌────────────────────────────────┐ ┌───────────────────────┐
│ CSI Snapshots (Primary Method) │ │ S3 Storage            │
│ - snapshot-controller v8.2.1   │ │ - Backblaze B2 (prod) │
│ - Synology CSI driver          │ │ - 11 nines durability │
│ - Storage-native snapshots     │ │ - Backup metadata     │
│ - Fast backup/restore          │ │ - Offsite DR          │
└────────────────────────────────┘ └───────────────────────┘
                ↓
┌────────────────────────────────┐
│ Synology NAS Storage           │
│ - iSCSI LUN snapshots          │
│ - Hardware-accelerated         │
│ - Instant snapshot creation    │
└────────────────────────────────┘

Components:

  • Velero Server: Manages backup/restore operations, schedules, creates VolumeSnapshot resources
  • snapshot-controller v8.2.1: Kubernetes controller that processes VolumeSnapshot requests
  • Synology CSI Driver: Creates storage-native snapshots on Synology NAS
  • S3 Storage: Object storage for backup metadata (LocalStack for testing, Backblaze B2 for production)

Note: Kopia file-level backups were disabled (2026-01-05) in favor of CSI snapshots, which are more efficient for block storage.

Backup Strategy

Backup Schedule Overview

Schedule                        | Time           | Retention | Scope            | Method
velero-daily-argocd             | 1:30 AM        | 30 days   | argocd namespace | Resources only
velero-daily-critical-pvcs      | 2:00 AM        | 30 days   | default, loki    | CSI snapshots
velero-weekly-cluster-resources | 3:00 AM Sunday | 90 days   | All namespaces   | Resources only
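
The daily critical-PVC schedule could be expressed as a Velero Schedule resource roughly like the following sketch. The cron expression, namespaces, and TTL are derived from the table above; the field values are illustrative, not the exact manifest from the repository:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: velero-daily-critical-pvcs
  namespace: velero
spec:
  schedule: "0 2 * * *"      # every day at 2:00 AM
  template:
    includedNamespaces:
      - default
      - loki
    snapshotVolumes: true    # use CSI snapshots for PVCs
    ttl: 720h0m0s            # 30-day retention
```

The 90-day weekly schedule would use `schedule: "0 3 * * 0"` and `ttl: 2160h0m0s` in the same shape.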

Daily ArgoCD Configuration Backup (1:30 AM)

  • Schedule: Every day at 1:30 AM
  • Retention: 30 days
  • Namespaces: argocd only
  • Method: Kubernetes resources (Applications, AppProjects, ConfigMaps, Secrets)
  • Data: ~50 resources (stateless, no PVCs)
  • Purpose: Daily recovery point for ArgoCD configuration changes

Daily Critical PVC Backup (2:00 AM)

  • Schedule: Every day at 2:00 AM
  • Retention: 30 days
  • Namespaces: default (Prometheus, Grafana), loki
  • Method: CSI snapshots only (storage-native snapshots on Synology NAS)
  • Total Data: ~75Gi (Prometheus 50Gi, Loki 20Gi, Grafana 5Gi)
  • Backup Duration: ~20 seconds (instant snapshot creation)

Weekly Cluster Resource Backup (3:00 AM Sunday)

  • Schedule: Every Sunday at 3:00 AM
  • Retention: 90 days
  • Scope: All cluster resources (ArgoCD apps, ConfigMaps, Secrets, etc.)
  • Method: Kubernetes resource backup only (no PVCs)

Cluster PVCs

All persistent volumes in the cluster. The daily critical PVC backup schedule covers the default and loki namespaces.

Component    | Namespace    | Size | Storage Class         | Data Type                         | Backed Up
Prometheus   | default      | 50Gi | synology-iscsi-retain | Metrics TSDB (10-day retention)   | Yes (daily)
Loki         | loki         | 20Gi | synology-iscsi-retain | Log chunks/TSDB (7-day retention) | Yes (daily)
Grafana      | default      | 5Gi  | synology-iscsi-retain | Dashboards, datasources, plugins  | Yes (daily)
Trivy Server | trivy-system | 5Gi  | synology-iscsi-retain | Vulnerability database            | No (recreatable)
Falco Redis  | falco        | 1Gi  | synology-iscsi-retain | Security event storage            | No (ephemeral)

Storage Backends

Backblaze B2 (Production - Active)

Current Configuration (as of 2026-01-15):

configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero-backups-homelab-n37
      config:
        region: us-west-004
        s3Url: https://s3.us-west-004.backblazeb2.com

credentials:
  useSecret: true
  existingSecret: "velero-b2-credentials" # SealedSecret

Features:

  • ✅ 11 nines (99.999999999%) data durability
  • ✅ Offsite disaster recovery
  • ✅ S3-compatible API
  • ✅ Credentials managed via SealedSecret (GitOps-compatible)

Cost Estimate:

  • Storage: $0.006/GB/month ($6/TB)
  • ~100Gi stored ≈ $0.60/month
  • Egress: $0.01/GB (first 1GB/day free)
  • Total: ~$2-6/month for homelab
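
The storage figure above can be reproduced with quick shell arithmetic. This is a rough sketch assuming ~100 GiB stored at the published $0.006/GiB/month rate; actual B2 billing also includes API calls and any restore egress:

```shell
# Rough B2 storage cost estimate for this cluster's backup footprint.
# Assumptions: ~100 GiB stored, $0.006/GiB/month storage rate.
STORED_GIB=100
STORAGE_RATE=0.006  # $/GiB/month
monthly=$(awk -v g="$STORED_GIB" -v r="$STORAGE_RATE" 'BEGIN { printf "%.2f", g * r }')
echo "Estimated storage cost: \$${monthly}/month"
# → Estimated storage cost: $0.60/month
```

Egress only matters during restores, so the typical month stays near the storage-only number.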

LocalStack (Testing - Available)

LocalStack remains deployed for local testing purposes:

config:
  region: us-east-1
  s3ForcePathStyle: "true"
  s3Url: http://localstack.localstack:4566
  insecureSkipTLSVerify: "true"

credentials:
  aws_access_key_id: test
  aws_secret_access_key: test

Use Case: Testing backup/restore procedures locally

Limitations:

  • ⚠️ Ephemeral storage - backups lost on pod restart
  • ❌ NOT suitable for production disaster recovery

Deployment via ArgoCD

The Velero deployment is managed through GitOps with ArgoCD using a multi-source configuration:

Application Manifest: manifests/applications/velero.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: velero
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-5"
spec:
  project: infrastructure
  sources:
    # Source 1: Helm chart from VMware Tanzu
    - repoURL: https://vmware-tanzu.github.io/helm-charts
      chart: velero
      targetRevision: 11.3.2
      helm:
        releaseName: velero
        valueFiles:
          - $values/manifests/base/velero/values.yaml
    # Source 2: Values file reference
    - repoURL: git@github.com:imcbeth/homelab.git
      path: manifests/base/velero
      targetRevision: HEAD
      ref: values
    # Source 3: Additional resources (SealedSecrets for B2 credentials)
    - repoURL: git@github.com:imcbeth/homelab.git
      path: manifests/base/velero
      targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: velero
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true

Note: The third source deploys the kustomization resources including the SealedSecret for B2 credentials.

Resource Allocation

Velero Server

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 200m
    memory: 512Mi

Total Cluster Overhead:

  • CPU: 100m (~0.5% of 20 cores)
  • Memory: 256Mi (~0.3% of 80GB)

Note: With CSI snapshots, no node-agent DaemonSet is required, significantly reducing resource overhead compared to Kopia file-level backups.
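
The overhead percentages can be sanity-checked with a one-liner, assuming the cluster totals stated above (20 cores, 80GB RAM):

```shell
# Reproduce the stated overhead figures.
# 100m CPU request = 0.1 core; 80GB treated as 80*1024 Mi for the ratio.
cpu_pct=$(awk 'BEGIN { printf "%.1f", 0.1 / 20 * 100 }')
mem_pct=$(awk 'BEGIN { printf "%.2f", 256 / (80 * 1024) * 100 }')
echo "CPU: ${cpu_pct}%  Memory: ${mem_pct}%"
# → CPU: 0.5%  Memory: 0.31%
```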

Manual Backup Commands

Create Backups

# Backup specific namespace with CSI snapshots
velero backup create grafana-manual \
--include-namespaces default \
--selector app.kubernetes.io/name=grafana \
--snapshot-volumes=true

# Backup entire cluster with resources
velero backup create cluster-backup-$(date +%Y%m%d) \
--include-cluster-resources=true \
--snapshot-volumes=true

# Backup namespaces with PVCs (CSI snapshots)
velero backup create critical-pvcs-manual \
--include-namespaces default,loki \
--snapshot-volumes=true \
--wait

# Check backup status
velero backup describe critical-pvcs-manual

CSI Snapshot Configuration:

  • --snapshot-volumes=true: Use CSI snapshots for PVCs
  • --default-volumes-to-fs-backup=false: Disable Kopia file-level backups (default in current config)
  • VolumeSnapshots are created automatically for PVCs with CSI storage class

View Backups

# List all backups
velero backup get

# Describe specific backup
velero backup describe daily-critical-pvcs-20251227020000

# View backup logs
velero backup logs daily-critical-pvcs-20251227020000

# Check backup in S3 (LocalStack)
kubectl -n localstack exec deployment/localstack -- \
awslocal s3 ls s3://velero-backups/backups/

Restore Commands

Finding Available Backups from B2

All backups are stored in Backblaze B2 and can be queried using the Velero CLI:

# List all backups (shows status, age, storage location)
velero backup get

# Filter by schedule name
velero backup get --selector velero.io/schedule-name=velero-daily-critical-pvcs

# Get backup details including CSI snapshot info
velero backup describe <backup-name> --details

# Check backup logs for specific PVC snapshots
velero backup logs <backup-name> | grep -i "volumesnapshot\|storage-loki\|grafana"

# Verify backup phase and items
velero backup describe <backup-name> | grep -E "Phase|Items backed up"

Backup naming convention: <schedule-name>-<YYYYMMDDHHMMSS>

  • Example: velero-daily-critical-pvcs-20260124020024 (Jan 24, 2026 at 02:00:24 UTC)
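
The timestamp suffix can be decoded with plain shell string handling (an illustrative snippet, not a Velero command):

```shell
# Decode the timestamp embedded in a Velero backup name.
backup="velero-daily-critical-pvcs-20260124020024"
ts="${backup##*-}"   # portion after the last dash: YYYYMMDDHHMMSS
y=$(printf '%s' "$ts" | cut -c1-4)
mo=$(printf '%s' "$ts" | cut -c5-6)
d=$(printf '%s' "$ts" | cut -c7-8)
hh=$(printf '%s' "$ts" | cut -c9-10)
mm=$(printf '%s' "$ts" | cut -c11-12)
ss=$(printf '%s' "$ts" | cut -c13-14)
echo "Backup taken: $y-$mo-$d $hh:$mm:$ss UTC"
# → Backup taken: 2026-01-24 02:00:24 UTC
```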

Restore from Backup

# List available backups
velero backup get

# Restore from the most recent successful backup of a schedule
velero restore create --from-schedule velero-daily-critical-pvcs

# Restore specific namespace
velero restore create grafana-restore \
--from-backup grafana-manual \
--include-namespaces default

# Check restore status
velero restore describe grafana-restore
velero restore logs grafana-restore

Disaster Recovery Scenarios

Scenario 1: Single PVC Loss (Grafana)

# 1. Scale down deployment
kubectl -n default scale deployment kube-prometheus-stack-grafana --replicas=0

# 2. Delete PVC
kubectl -n default delete pvc kube-prometheus-stack-grafana

# 3. Find latest backup
LATEST_BACKUP=$(velero backup get | awk '/^velero-daily-critical-pvcs-/ {print $1}' | sort | tail -n 1)

# 4. Restore from backup
velero restore create grafana-pvc-restore \
--from-backup "$LATEST_BACKUP" \
--include-namespaces default \
--include-resources pvc,pv

# 5. Scale up deployment
kubectl -n default scale deployment kube-prometheus-stack-grafana --replicas=1

# Time to recovery: < 15 minutes

Scenario 2: StatefulSet PVC Restore (Loki Example)

Use this procedure when a StatefulSet PVC is lost or corrupted (e.g., Loki logs missing).

# 1. List available backups and find the one with your data
velero backup get --selector velero.io/schedule-name=velero-daily-critical-pvcs

# 2. Verify the backup contains your PVC (check for CSI snapshot)
velero backup logs velero-daily-critical-pvcs-20260124020024 | grep -i "storage-loki"
# Look for: "Created VolumeSnapshot loki/velero-storage-loki-0-xxxxx"

# 3. Disable ArgoCD auto-sync to prevent reconciliation during restore
kubectl patch application loki -n argocd \
--type=merge -p '{"spec":{"syncPolicy":{"automated":null}}}'

# 4. Scale down the StatefulSet
kubectl scale statefulset -n loki loki --replicas=0

# 5. Wait for pod termination
kubectl get pods -n loki -l app.kubernetes.io/name=loki,app.kubernetes.io/component=single-binary -w

# 6. Delete the existing PVC (if it exists)
kubectl delete pvc -n loki storage-loki-0

# 7. Restore PVC from backup (CSI snapshot)
velero restore create loki-restore-$(date +%Y%m%d%H%M) \
--from-backup velero-daily-critical-pvcs-20260124020024 \
--include-namespaces loki \
--include-resources persistentvolumeclaims,volumesnapshots.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io \
--restore-volumes=true

# 8. Monitor restore progress
velero restore describe loki-restore-202601251301

# 9. Verify PVC was restored
kubectl get pvc -n loki

# 10. Scale StatefulSet back up
kubectl scale statefulset -n loki loki --replicas=1

# 11. Wait for pod to be ready
kubectl get pods -n loki -l app.kubernetes.io/name=loki -w

# 12. Re-enable ArgoCD auto-sync
kubectl patch application loki -n argocd \
--type=merge -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'

# 13. Verify data is accessible
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
curl -s 'http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels'

# Time to recovery: ~5-10 minutes

Important Notes:

  • CSI snapshots are point-in-time; data between backup time and restore will be lost
  • Always disable ArgoCD auto-sync first to prevent race conditions
  • The restore creates a new PV from the CSI snapshot on Synology NAS
  • Verify the backup contains a VolumeSnapshot before attempting restore

Scenario 3: Full Cluster Rebuild

# 1. Deploy new Kubernetes cluster (same version)
# 2. Install Velero with same configuration
# 3. Point to same S3 bucket
# 4. Restore all namespaces

velero restore create cluster-restore \
  --from-backup velero-weekly-cluster-resources-<YYYYMMDDHHMMSS>

# Time to recovery: < 4 hours

Monitoring

Prometheus Metrics

Velero exports metrics that are automatically scraped by Prometheus:

# Backup success rate
velero_backup_success_total{schedule="daily-critical-pvcs"}

# Backup failure count
velero_backup_failure_total

# Backup duration
velero_backup_duration_seconds{schedule="daily-critical-pvcs"}

# Last successful backup timestamp
velero_backup_last_successful_timestamp

Velero Backup Alerts

The following PrometheusRule alerts monitor backup health:

Critical Alerts:

  • VeleroBackupFailed: Backup failures detected in last hour
  • VeleroBackupDelayed: No successful backup in 24+ hours
  • VeleroBackupStorageLocationUnavailable: S3 storage unreachable
  • VeleroBackupMetricAbsent: Velero metrics not being scraped

Warning Alerts:

  • VeleroBackupDurationHigh: Backup taking >30 minutes
  • VeleroVolumeSnapshotLocationUnavailable: CSI snapshot location unavailable
  • VeleroPartialBackupFailure: Some resources not backed up
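
A minimal PrometheusRule sketch for one of the critical alerts is shown below. The expression, threshold, and labels are illustrative assumptions; see the actual rule shipped with kube-prometheus-stack for the deployed configuration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts
  namespace: velero
spec:
  groups:
    - name: velero
      rules:
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Velero backup failures detected in the last hour"
```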

See kube-prometheus-stack for alert configuration details.

Check Backup Health

# Pod status
kubectl get pods -n velero

# Backup storage location status
kubectl get backupstoragelocation -n velero

# Recent backups
velero backup get

# Backup schedules
velero schedule get

# Velero server logs
kubectl -n velero logs deployment/velero

Troubleshooting

LocalStack Not Deployed

Symptoms:

BackupStorageLocation "default" is unavailable: rpc error: code = Unknown desc = Get "http://localstack.localstack:4566/": dial tcp: lookup localstack.localstack on 10.96.0.10:53: no such host

Resolution:

Deploy LocalStack first, OR reconfigure Velero for production S3 (see Backblaze B2 section above).

General S3 Connection Issues

# Verify B2 credentials secret exists
kubectl -n velero get secret velero-b2-credentials

# Check SealedSecret status
kubectl -n velero get sealedsecret velero-b2-credentials

# Test S3 connectivity from Velero pod
kubectl -n velero exec deployment/velero -- velero backup-location get

# Check backup storage location status
kubectl get backupstoragelocation -n velero -o yaml

Common B2 Issues:

  • Invalid credentials: Verify keyID and applicationKey are correct
  • Bucket permissions: Ensure the application key has read/write access to the bucket
  • Region mismatch: Check the region matches your B2 bucket location

Backup Failing

# Check backup status
velero backup describe <backup-name> --details

# View backup logs
velero backup logs <backup-name>

# Common issues:
# 1. S3 connectivity - check s3Url and credentials
# 2. CSI snapshot issues - check VolumeSnapshot CRDs
# 3. Kopia timeout - check node-agent logs (only if file-system backup is enabled)

Node-Agent Permission Issues

Note: These checks apply only when Kopia file-system backup (and its node-agent DaemonSet) is enabled; the current CSI-only configuration does not deploy a node-agent.

# Check node-agent pods
kubectl -n velero get pods -l name=node-agent -o wide

# View node-agent logs
kubectl -n velero logs daemonset/node-agent -c node-agent --tail=100

# Verify DAC_READ_SEARCH capability is sufficient
# If permission errors persist, check:
# 1. SELinux/AppArmor policies
# 2. PodSecurityPolicy/PodSecurityStandards
# 3. hostPath mount for /var/lib/kubelet/pods

Security Considerations

Node-Agent Capabilities

The Velero node-agent runs with minimal Linux capabilities instead of privileged mode:

containerSecurityContext:
  privileged: false
  allowPrivilegeEscalation: false
  capabilities:
    add:
      - DAC_READ_SEARCH # Bypass file read permission checks

Why DAC_READ_SEARCH?

  • Allows Kopia to read PVC data from /var/lib/kubelet/pods regardless of file ownership
  • Much safer than privileged: true or SYS_ADMIN capability
  • Sufficient for file-level backup operations

Security Comparison:

Configuration                   | Privileges                     | Security Risk | Recommendation
privileged: true                | All capabilities + host access | Very High     | ❌ Avoid
capabilities: [SYS_ADMIN]       | Broad system admin             | High          | ⚠️ Only if necessary
capabilities: [DAC_READ_SEARCH] | File read bypass only          | Low           | ✅ Recommended

Credential Management

Current Implementation (SealedSecrets):

B2 credentials are managed via SealedSecret for GitOps compatibility:

  • SealedSecret: manifests/base/velero/b2-credentials-sealed.yaml
  • Decrypted Secret: velero-b2-credentials in velero namespace
  • Kustomization: manifests/base/velero/kustomization.yaml

# View credential secret (base64 encoded)
kubectl get secret velero-b2-credentials -n velero -o yaml

# Check SealedSecret status
kubectl get sealedsecret velero-b2-credentials -n velero

Updating B2 Credentials:

# 1. Create temporary secret file (DO NOT COMMIT)
cat > /tmp/velero-b2-credentials.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: velero-b2-credentials
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=<NEW_B2_KEY_ID>
    aws_secret_access_key=<NEW_B2_APPLICATION_KEY>
EOF

# 2. Seal the secret
kubeseal --cert <(kubectl get secret -n kube-system \
-l sealedsecrets.bitnami.com/sealed-secrets-key=active \
-o jsonpath='{.items[0].data.tls\.crt}' | base64 -d) \
--format yaml < /tmp/velero-b2-credentials.yaml > manifests/base/velero/b2-credentials-sealed.yaml

# 3. Delete temporary file and commit
rm /tmp/velero-b2-credentials.yaml
git add manifests/base/velero/b2-credentials-sealed.yaml
git commit -m "feat: Update Velero B2 credentials"
git push

See Secrets Management for more details on SealedSecrets.

Migration from LocalStack to Production S3

Migration Status: ✅ Completed (2026-01-15)

The migration from LocalStack to Backblaze B2 has been completed successfully:

  • PR #239: feat: Migrate Velero backups from LocalStack to Backblaze B2
  • Bucket: velero-backups-homelab-n37
  • Region: us-west-004
  • Credentials: Managed via SealedSecret

Verification Results

# BackupStorageLocation status
$ kubectl get backupstoragelocation -n velero
NAME      PHASE       LAST VALIDATED   AGE   DEFAULT
default   Available   1s               17d   true

# Test backup completed successfully
$ velero backup create test-b2-migration --include-namespaces velero --wait
Backup completed with status: Completed
Items backed up: 54

Migration Reference (For Future Providers)

If you need to migrate to a different S3 provider in the future:

Step 1: Create SealedSecret for new credentials

# Create temporary secret
cat > /tmp/velero-new-credentials.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: velero-new-credentials
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=<NEW_KEY_ID>
    aws_secret_access_key=<NEW_SECRET_KEY>
EOF

# Seal and commit
kubeseal ... < /tmp/velero-new-credentials.yaml > manifests/base/velero/new-credentials-sealed.yaml
rm /tmp/velero-new-credentials.yaml

Step 2: Update values.yaml

configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: <new-bucket-name>
      config:
        region: <new-region>
        s3Url: <new-s3-endpoint>

credentials:
  useSecret: true
  existingSecret: "velero-new-credentials"

Step 3: Update kustomization.yaml and deploy

git add manifests/base/velero/
git commit -m "feat: Migrate Velero to new S3 provider"
git push

Testing Procedures

Test 1: ConfigMap Backup/Restore

# Create test data
kubectl create namespace velero-test
kubectl -n velero-test create configmap test-data --from-literal=foo=bar

# Backup
velero backup create test-configmap \
--include-namespaces velero-test \
--wait

# Delete namespace
kubectl delete namespace velero-test

# Restore
velero restore create test-restore \
--from-backup test-configmap \
--wait

# Verify
kubectl -n velero-test get configmap test-data -o yaml

# Cleanup
kubectl delete namespace velero-test

Test 2: PVC Backup/Restore

For comprehensive PVC testing procedures, see manifests/base/velero/README.md in the homelab repository.

Example PVC Test:

# Create test namespace and PVC
kubectl create namespace velero-test

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: velero-test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: synology-iscsi-retain
  resources:
    requests:
      storage: 1Gi
EOF

# Create pod with data
kubectl run test-pod -n velero-test --image=busybox --restart=Never \
--overrides='{"spec":{"containers":[{"name":"busybox","image":"busybox","command":["/bin/sh","-c","echo test-data > /data/test.txt && sleep 3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"test-pvc"}}]}}'

# Backup with CSI snapshots
velero backup create test-pvc-backup \
--include-namespaces velero-test \
--snapshot-volumes=true \
--wait

# Check VolumeSnapshot was created
kubectl get volumesnapshot -n velero-test

# Delete namespace
kubectl delete namespace velero-test

# Restore
velero restore create test-pvc-restore \
--from-backup test-pvc-backup \
--wait

# Verify data
kubectl -n velero-test exec test-pod -- cat /data/test.txt

# Cleanup
kubectl delete namespace velero-test

Best Practices

  1. Test Restores Regularly: Monthly disaster recovery drills
  2. Monitor Backup Success: Check Prometheus metrics and AlertManager notifications
  3. Verify S3 Storage: Monthly audit of S3 bucket and costs
  4. Update Retention Policies: Adjust based on compliance and storage requirements
  5. Document Procedures: Keep disaster recovery runbooks up-to-date
  6. Plan for Growth: Monitor backup sizes and adjust resources accordingly
  7. Secure Credentials: Use git-crypt, Vault, or external secret management for production
  8. Test Production Migration: Validate S3 migration before relying on it

Known Issues and Solutions

Issue 1: Velero v1.17 Breaking Change - --keep-latest-maintenance-jobs Flag Removed

Date Noted: 2026-01-23
Severity: Critical (pod crash)
Status: Resolved

Symptoms:

  • Velero pod in CrashLoopBackOff after upgrading to chart v11.x
  • Error in logs: Error: unknown flag: --keep-latest-maintenance-jobs

Root Cause:

The --keep-latest-maintenance-jobs CLI flag was deprecated in Velero v1.14 and removed in v1.17. The Helm chart v11.x uses a ConfigMap-based approach instead (--repo-maintenance-job-configmap).

Solution:

If ArgoCD isn't picking up the new chart version from git (still showing old targetRevision), recreate the ArgoCD Application:

# Delete and recreate the ArgoCD Application to force sync
kubectl delete application velero -n argocd
kubectl apply -f manifests/applications/velero.yaml

# Wait for sync and verify
kubectl get application velero -n argocd
velero backup-location get

Configuration Change:

The new Helm chart uses configuration.repositoryMaintenanceJob.repositoryConfigData in values.yaml instead of CLI flags:

configuration:
  repositoryMaintenanceJob:
    repositoryConfigData:
      global:
        keepLatestMaintenanceJobs: 3 # Previously a CLI flag

Related PRs:

  • homelab#271: Velero major update to v11.3.2

Issue 2: snapshot-controller v8.x VolumeSnapshot Failures

Date Noted: 2026-01-05
Severity: Critical (backup failure)
Status: Resolved by downgrading to v7.0.2

Symptoms:

  • All VolumeSnapshots stuck with READYTOUSE: false
  • Velero backups showing PartiallyFailed status
  • Error message: VolumeSnapshotContent is invalid: spec: Invalid value: sourceVolumeMode is required once set
  • VolumeSnapshotContent objects unable to be updated by snapshot-controller

Root Cause:

snapshot-controller v8.2.0 has strict immutability validation on the sourceVolumeMode field. When the controller attempts to add annotations to VolumeSnapshotContent objects during snapshot creation, the Kubernetes API server rejects the updates due to field validation rules that treat any update as potentially modifying the immutable field.

This is a known issue with the v8.x series: kubernetes-csi/external-snapshotter#866

Investigation Commands:

# Check VolumeSnapshot status
kubectl get volumesnapshot -A

# Describe failed snapshot
kubectl describe volumesnapshot -n default <snapshot-name>

# Check snapshot-controller version
kubectl get deployment -n synology-csi snapshot-controller -o yaml | grep "image:"

# View snapshot-controller logs
kubectl logs -n synology-csi deployment/snapshot-controller

Solution:

Downgrade to snapshot-controller v8.2.1 or v7.0.2, which are stable and compatible with Kubernetes 1.35:

Step 1: Clean up stuck VolumeSnapshot resources

# Remove finalizers to allow deletion
kubectl patch volumesnapshot -n <namespace> <snapshot-name> \
-p '{"metadata":{"finalizers":null}}' --type=merge

# Repeat for all stuck VolumeSnapshotContent objects
kubectl patch volumesnapshotcontent <snapcontent-name> \
-p '{"metadata":{"finalizers":null}}' --type=merge

Step 2: Update snapshot-controller version

In manifests/base/synology-csi/kustomization.yaml:

resources:
  - github.com/kubernetes-csi/external-snapshotter/client/config/crd?ref=v7.0.2
  - github.com/kubernetes-csi/external-snapshotter/deploy/kubernetes/snapshot-controller?ref=v7.0.2

Step 3: Deploy and verify

# ArgoCD will auto-sync
argocd app sync synology-csi

# Wait for new snapshot-controller pods
kubectl get pods -n synology-csi -l app.kubernetes.io/name=snapshot-controller

# Test VolumeSnapshot creation
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
  namespace: default
spec:
  volumeSnapshotClassName: synology-snapshot-class
  source:
    persistentVolumeClaimName: <your-pvc-name>
EOF

# Verify snapshot reaches READYTOUSE: true
kubectl get volumesnapshot -n default test-snapshot

Expected Result:

  • VolumeSnapshots reach READYTOUSE: true in 8-10 seconds
  • Velero backups complete with status Completed (not PartiallyFailed)
  • CSI snapshots: csiVolumeSnapshotsCompleted: 3, Errors: 0

Related PRs:

  • homelab#189: Downgrade snapshot-controller to v7.0.2 for stability
  • homelab#188: Add snapshot-controller to Synology CSI deployment (introduced issue)
  • homelab#187: Configure Velero to use CSI snapshots only

Issue 3: LocalStack Connection Required for Initial Deployment

Date Noted: 2025-12-27
Severity: Medium (deployment blocker)

Symptoms:

  • Velero pod fails to start if LocalStack is not deployed first
  • BackupStorageLocation shows "Unavailable"

Root Cause:

  • Default values.yaml is configured for LocalStack testing
  • Velero validates S3 connectivity on startup

Solution:

  • Deploy LocalStack before Velero (for testing), OR
  • Configure production S3 credentials before first deployment

Related PRs:

  • homelab#149: Deploy Velero with Kopia file-level backup support

References