Velero
Velero provides backup and disaster recovery capabilities for the Raspberry Pi 5 Kubernetes homelab cluster, protecting critical persistent volumes and cluster resources.
Overview
- Namespace: velero
- Helm Chart: vmware-tanzu/velero
- Chart Version: 11.3.2
- App Version: v1.17.2
- Deployment: Managed by ArgoCD
- Backup Storage: Backblaze B2 (production)
- Backup Strategy: Daily PVC backups + weekly cluster resource backups
Architecture
┌─────────────────────────────────────────────────────────┐
│ Velero Server (1 pod) │
│ - 100m CPU / 256Mi RAM │
│ - Manages backup/restore operations │
│ - CSI snapshot coordination │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────┴──────────────────┐
↓ ↓
┌──────────────────────────────────┐ ┌──────────────────────┐
│ CSI Snapshots (Primary Method) │ │ S3 Storage │
│ - snapshot-controller v8.2.1 │ │ - Backblaze B2 (prod)│
│ - Synology CSI driver │ │ - 11 nines durability│
│ - Storage-native snapshots │ │ - Backup metadata │
│ - Fast backup/restore │ │ - Offsite DR │
└──────────────────────────────────┘ └──────────────────────┘
↓
┌──────────────────────────────────┐
│ Synology NAS Storage │
│ - iSCSI LUN snapshots │
│ - Hardware-accelerated │
│ - Instant snapshot creation │
└──────────────────────────────────┘
Components:
- Velero Server: Manages backup/restore operations, schedules, creates VolumeSnapshot resources
- snapshot-controller v8.2.1: Kubernetes controller that processes VolumeSnapshot requests
- Synology CSI Driver: Creates storage-native snapshots on Synology NAS
- S3 Storage: Object storage for backup metadata (LocalStack for testing, Backblaze B2 for production)
Note: Kopia file-level backups were disabled (2026-01-05) in favor of CSI snapshots, which are more efficient for block storage.
Backup Strategy
Backup Schedule Overview
| Schedule | Time | Retention | Scope | Method |
|---|---|---|---|---|
| velero-daily-argocd | 1:30 AM | 30 days | argocd namespace | Resources only |
| velero-daily-critical-pvcs | 2:00 AM | 30 days | default, loki | CSI snapshots |
| velero-weekly-cluster-resources | 3:00 AM Sunday | 90 days | All namespaces | Resources only |
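Each row above maps to a Velero `Schedule` resource. A minimal sketch of what the daily critical-PVC schedule might look like (field values inferred from the table; the actual manifests live in the homelab repo, presumably under manifests/base/velero/):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: velero-daily-critical-pvcs
  namespace: velero
spec:
  schedule: "0 2 * * *"          # 2:00 AM daily (cron syntax)
  template:
    includedNamespaces:
      - default
      - loki
    snapshotVolumes: true        # CSI snapshots, no file-level backup
    ttl: 720h0m0s                # 30-day retention
```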
Daily ArgoCD Configuration Backup (1:30 AM)
- Schedule: Every day at 1:30 AM
- Retention: 30 days
- Namespaces: argocd only
- Method: Kubernetes resources (Applications, AppProjects, ConfigMaps, Secrets)
- Data: ~50 resources (stateless, no PVCs)
- Purpose: Daily recovery point for ArgoCD configuration changes
Daily Critical PVC Backup (2:00 AM)
- Schedule: Every day at 2:00 AM
- Retention: 30 days
- Namespaces: default (Prometheus, Grafana), loki
- Method: CSI snapshots only (storage-native snapshots on Synology NAS)
- Total Data: ~75Gi (Prometheus 50Gi, Loki 20Gi, Grafana 5Gi)
- Backup Duration: ~20 seconds (instant snapshot creation)
Weekly Cluster Resource Backup (3:00 AM Sunday)
- Schedule: Every Sunday at 3:00 AM
- Retention: 90 days
- Scope: All cluster resources (ArgoCD apps, ConfigMaps, Secrets, etc.)
- Method: Kubernetes resource backup only (no PVCs)
Cluster PVCs
All persistent volumes in the cluster. The daily critical PVC backup schedule covers the default and loki namespaces.
| Component | Namespace | Size | Storage Class | Data Type | Backed Up |
|---|---|---|---|---|---|
| Prometheus | default | 50Gi | synology-iscsi-retain | Metrics TSDB (10-day retention) | Yes (daily) |
| Loki | loki | 20Gi | synology-iscsi-retain | Log chunks/TSDB (7-day retention) | Yes (daily) |
| Grafana | default | 5Gi | synology-iscsi-retain | Dashboards, datasources, plugins | Yes (daily) |
| Trivy Server | trivy-system | 5Gi | synology-iscsi-retain | Vulnerability database | No (recreatable) |
| Falco Redis | falco | 1Gi | synology-iscsi-retain | Security event storage | No (ephemeral) |
Storage Backends
Backblaze B2 (Production - Active)
Current Configuration (as of 2026-01-15):
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero-backups-homelab-n37
      config:
        region: us-west-004
        s3Url: https://s3.us-west-004.backblazeb2.com
credentials:
  useSecret: true
  existingSecret: "velero-b2-credentials"  # SealedSecret
Features:
- ✅ 11 nines (99.999999999%) data durability
- ✅ Offsite disaster recovery
- ✅ S3-compatible API
- ✅ Credentials managed via SealedSecret (GitOps-compatible)
Cost Estimate:
- Storage: $0.006/GB/month ($6/TB)
- ~100Gi stored ≈ $0.60/month
- Egress: $0.01/GB (first 1GB/day free)
- Total: ~$2-6/month for homelab
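The arithmetic behind the estimate can be checked with a small helper (hypothetical `b2_storage_cost` function, using the $0.006/GB-month rate above and ignoring egress):

```shell
# Estimate monthly Backblaze B2 storage cost for a given amount of data.
# Rate assumption: $0.006 per GB-month (egress excluded).
b2_storage_cost() {
  awk -v gb="$1" 'BEGIN { printf "%.2f", gb * 0.006 }'
}

b2_storage_cost 100    # ~100 GB stored -> 0.60 (dollars/month)
```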
LocalStack (Testing - Available)
LocalStack remains deployed for local testing purposes:
config:
  region: us-east-1
  s3ForcePathStyle: "true"
  s3Url: http://localstack.localstack:4566
  insecureSkipTLSVerify: "true"
credentials:
  aws_access_key_id: test
  aws_secret_access_key: test
Use Case: Testing backup/restore procedures locally
Limitations:
- ⚠️ Ephemeral storage - backups lost on pod restart
- ❌ NOT suitable for production disaster recovery
Deployment via ArgoCD
The Velero deployment is managed through GitOps with ArgoCD using a multi-source configuration:
Application Manifest: manifests/applications/velero.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: velero
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "-5"
spec:
  project: infrastructure
  sources:
    # Source 1: Helm chart from VMware Tanzu
    - repoURL: https://vmware-tanzu.github.io/helm-charts
      chart: velero
      targetRevision: 11.3.2
      helm:
        releaseName: velero
        valueFiles:
          - $values/manifests/base/velero/values.yaml
    # Source 2: Values file reference
    - repoURL: git@github.com:imcbeth/homelab.git
      path: manifests/base/velero
      targetRevision: HEAD
      ref: values
    # Source 3: Additional resources (SealedSecrets for B2 credentials)
    - repoURL: git@github.com:imcbeth/homelab.git
      path: manifests/base/velero
      targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: velero
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
Note: The third source deploys the kustomization resources including the SealedSecret for B2 credentials.
Resource Allocation
Velero Server
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 200m
    memory: 512Mi
Total Cluster Overhead:
- CPU: 100m (~0.5% of 20 cores)
- Memory: 256Mi (~0.3% of 80GB)
Note: With CSI snapshots, no node-agent DaemonSet is required, significantly reducing resource overhead compared to Kopia file-level backups.
Manual Backup Commands
Create Backups
# Backup specific namespace with CSI snapshots
velero backup create grafana-manual \
--include-namespaces default \
--selector app.kubernetes.io/name=grafana \
--snapshot-volumes=true
# Backup entire cluster with resources
velero backup create cluster-backup-$(date +%Y%m%d) \
--include-cluster-resources=true \
--snapshot-volumes=true
# Backup namespaces with PVCs (CSI snapshots)
velero backup create critical-pvcs-manual \
--include-namespaces default,loki \
--snapshot-volumes=true \
--wait
# Check backup status
velero backup describe critical-pvcs-manual
CSI Snapshot Configuration:
- --snapshot-volumes=true: Use CSI snapshots for PVCs
- --default-volumes-to-fs-backup=false: Disable Kopia file-level backups (default in current config)
- VolumeSnapshots are created automatically for PVCs with a CSI storage class
View Backups
# List all backups
velero backup get
# Describe specific backup
velero backup describe daily-critical-pvcs-20251227020000
# View backup logs
velero backup logs daily-critical-pvcs-20251227020000
# Check backup in S3 (LocalStack)
kubectl -n localstack exec deployment/localstack -- \
awslocal s3 ls s3://velero-backups/backups/
Restore Commands
Finding Available Backups from B2
All backups are stored in Backblaze B2 and can be queried using the Velero CLI:
# List all backups (shows status, age, storage location)
velero backup get
# Filter by schedule name
velero backup get --selector velero.io/schedule-name=velero-daily-critical-pvcs
# Get backup details including CSI snapshot info
velero backup describe <backup-name> --details
# Check backup logs for specific PVC snapshots
velero backup logs <backup-name> | grep -i "volumesnapshot\|storage-loki\|grafana"
# Verify backup phase and items
velero backup describe <backup-name> | grep -E "Phase|Items backed up"
Backup naming convention: <schedule-name>-<YYYYMMDDHHMMSS>
- Example: velero-daily-critical-pvcs-20260124020024 (Jan 24, 2026 at 02:00:24 UTC)
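Because the timestamp suffix is fixed-width, a backup name can be split with plain text tools. A sketch (hypothetical helper functions, not part of the Velero CLI):

```shell
# Split <schedule-name>-<YYYYMMDDHHMMSS> into its two parts.
backup_schedule() {
  # Drop the trailing hyphen + 14-digit timestamp
  echo "$1" | sed -E 's/-[0-9]{14}$//'
}
backup_timestamp() {
  # Rewrite the trailing timestamp as "YYYY-MM-DD HH:MM:SS"
  echo "$1" | sed -E 's/.*-([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:\5:\6/'
}

backup_schedule  velero-daily-critical-pvcs-20260124020024   # velero-daily-critical-pvcs
backup_timestamp velero-daily-critical-pvcs-20260124020024   # 2026-01-24 02:00:24
```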
Restore from Backup
# List available backups
velero backup get
# Restore from the most recent successful backup of a schedule
velero restore create --from-schedule velero-daily-critical-pvcs
# Restore specific namespace
velero restore create grafana-restore \
--from-backup grafana-manual \
--include-namespaces default
# Check restore status
velero restore describe grafana-restore
velero restore logs grafana-restore
Disaster Recovery Scenarios
Scenario 1: Single PVC Loss (Grafana)
# 1. Scale down deployment
kubectl -n default scale deployment kube-prometheus-stack-grafana --replicas=0
# 2. Delete PVC
kubectl -n default delete pvc kube-prometheus-stack-grafana
# 3. Find latest backup
LATEST_BACKUP=$(velero backup get | awk '/^velero-daily-critical-pvcs-/ {print $1}' | sort | tail -n 1)
# 4. Restore from backup
velero restore create grafana-pvc-restore \
--from-backup "$LATEST_BACKUP" \
--include-namespaces default \
--include-resources pvc,pv
# 5. Scale up deployment
kubectl -n default scale deployment kube-prometheus-stack-grafana --replicas=1
# Time to recovery: < 15 minutes
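Step 3 above relies on the naming convention: because the timestamp suffix is zero-padded, lexical order equals chronological order, so the last entry after `sort` is the newest backup. Demonstrated on sample names standing in for `velero backup get` output:

```shell
# Pick the newest backup name from a list on stdin.
latest_backup() {
  sort | tail -n 1
}

printf '%s\n' \
  velero-daily-critical-pvcs-20260122020013 \
  velero-daily-critical-pvcs-20260124020024 \
  velero-daily-critical-pvcs-20260123020009 | latest_backup
# -> velero-daily-critical-pvcs-20260124020024
```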
Scenario 2: StatefulSet PVC Restore (Loki Example)
Use this procedure when a StatefulSet PVC is lost or corrupted (e.g., Loki logs missing).
# 1. List available backups and find the one with your data
velero backup get --selector velero.io/schedule-name=velero-daily-critical-pvcs
# 2. Verify the backup contains your PVC (check for CSI snapshot)
velero backup logs velero-daily-critical-pvcs-20260124020024 | grep -i "storage-loki"
# Look for: "Created VolumeSnapshot loki/velero-storage-loki-0-xxxxx"
# 3. Disable ArgoCD auto-sync to prevent reconciliation during restore
kubectl patch application loki -n argocd \
--type=merge -p '{"spec":{"syncPolicy":{"automated":null}}}'
# 4. Scale down the StatefulSet
kubectl scale statefulset -n loki loki --replicas=0
# 5. Wait for pod termination
kubectl get pods -n loki -l app.kubernetes.io/name=loki,app.kubernetes.io/component=single-binary -w
# 6. Delete the existing PVC (if it exists)
kubectl delete pvc -n loki storage-loki-0
# 7. Restore PVC from backup (CSI snapshot)
velero restore create loki-restore-$(date +%Y%m%d%H%M) \
--from-backup velero-daily-critical-pvcs-20260124020024 \
--include-namespaces loki \
--include-resources persistentvolumeclaims,volumesnapshots.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io \
--restore-volumes=true
# 8. Monitor restore progress
velero restore describe loki-restore-202601251301
# 9. Verify PVC was restored
kubectl get pvc -n loki
# 10. Scale StatefulSet back up
kubectl scale statefulset -n loki loki --replicas=1
# 11. Wait for pod to be ready
kubectl get pods -n loki -l app.kubernetes.io/name=loki -w
# 12. Re-enable ArgoCD auto-sync
kubectl patch application loki -n argocd \
--type=merge -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
# 13. Verify data is accessible
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
curl -s 'http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels'
# Time to recovery: ~5-10 minutes
Important Notes:
- CSI snapshots are point-in-time; data between backup time and restore will be lost
- Always disable ArgoCD auto-sync first to prevent race conditions
- The restore creates a new PV from the CSI snapshot on Synology NAS
- Verify the backup contains a VolumeSnapshot before attempting restore
Scenario 3: Full Cluster Rebuild
# 1. Deploy new Kubernetes cluster (same version)
# 2. Install Velero with same configuration
# 3. Point to same S3 bucket
# 4. Restore all namespaces
velero restore create cluster-restore \
  --from-backup velero-weekly-cluster-resources-<YYYYMMDDHHMMSS>
# Time to recovery: < 4 hours
Monitoring
Prometheus Metrics
Velero exports metrics that are automatically scraped by Prometheus:
# Backup success rate
velero_backup_success_total{schedule="daily-critical-pvcs"}
# Backup failure count
velero_backup_failure_total
# Backup duration
velero_backup_duration_seconds{schedule="daily-critical-pvcs"}
# Last successful backup timestamp
velero_backup_last_successful_timestamp
Velero Backup Alerts
The following PrometheusRule alerts monitor backup health:
Critical Alerts:
- VeleroBackupFailed: Backup failures detected in last hour
- VeleroBackupDelayed: No successful backup in 24+ hours
- VeleroBackupStorageLocationUnavailable: S3 storage unreachable
- VeleroBackupMetricAbsent: Velero metrics not being scraped
Warning Alerts:
- VeleroBackupDurationHigh: Backup taking >30 minutes
- VeleroVolumeSnapshotLocationUnavailable: CSI snapshot location unavailable
- VeleroPartialBackupFailure: Some resources not backed up
See kube-prometheus-stack for alert configuration details.
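As an illustration, the VeleroBackupDelayed alert could be expressed as a PrometheusRule along these lines (a sketch built from the metrics above; resource name and threshold are assumptions, the deployed rule lives in the kube-prometheus-stack configuration):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: velero-backup-alerts
  namespace: velero
spec:
  groups:
    - name: velero
      rules:
        - alert: VeleroBackupDelayed
          expr: time() - velero_backup_last_successful_timestamp > 86400
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "No successful Velero backup for schedule {{ $labels.schedule }} in 24+ hours"
```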
Check Backup Health
# Pod status
kubectl get pods -n velero
# Backup storage location status
kubectl get backupstoragelocation -n velero
# Recent backups
velero backup get
# Backup schedules
velero schedule get
# Velero server logs
kubectl -n velero logs deployment/velero
Troubleshooting
LocalStack Not Deployed
Symptoms:
BackupStorageLocation "default" is unavailable: rpc error: code = Unknown desc = Get "http://localstack.localstack:4566/": dial tcp: lookup localstack.localstack on 10.96.0.10:53: no such host
Resolution:
Deploy LocalStack first, OR reconfigure Velero for production S3 (see Backblaze B2 section above).
General S3 Connection Issues
# Verify B2 credentials secret exists
kubectl -n velero get secret velero-b2-credentials
# Check SealedSecret status
kubectl -n velero get sealedsecret velero-b2-credentials
# Test S3 connectivity from Velero pod
kubectl -n velero exec deployment/velero -- velero backup-location get
# Check backup storage location status
kubectl get backupstoragelocation -n velero -o yaml
Common B2 Issues:
- Invalid credentials: Verify keyID and applicationKey are correct
- Bucket permissions: Ensure the application key has read/write access to the bucket
- Region mismatch: Check the region matches your B2 bucket location
Backup Failing
# Check backup status
velero backup describe <backup-name> --details
# View backup logs
velero backup logs <backup-name>
# Common issues:
# 1. S3 connectivity - check s3Url and credentials
# 2. CSI snapshot issues - check VolumeSnapshot CRDs
# 3. Kopia timeout - check node-agent logs (only applies if file-level backups are enabled)
Node-Agent Permission Issues
Note: The node-agent DaemonSet only exists when Kopia file-level backups are enabled; with the current CSI-only configuration these commands will find no pods.
# Check node-agent pods
kubectl -n velero get pods -l name=node-agent -o wide
# View node-agent logs
kubectl -n velero logs daemonset/node-agent -c node-agent --tail=100
# Verify DAC_READ_SEARCH capability is sufficient
# If permission errors persist, check:
# 1. SELinux/AppArmor policies
# 2. PodSecurityPolicy/PodSecurityStandards
# 3. hostPath mount for /var/lib/kubelet/pods
Security Considerations
Node-Agent Capabilities
When Kopia file-level backups are enabled, the Velero node-agent runs with minimal Linux capabilities instead of privileged mode:
containerSecurityContext:
  privileged: false
  allowPrivilegeEscalation: false
  capabilities:
    add:
      - DAC_READ_SEARCH  # Bypass file read permission checks
Why DAC_READ_SEARCH?
- Allows Kopia to read PVC data from /var/lib/kubelet/pods regardless of file ownership
- Much safer than privileged: true or the SYS_ADMIN capability
- Sufficient for file-level backup operations
Security Comparison:
| Configuration | Privileges | Security Risk | Recommendation |
|---|---|---|---|
| privileged: true | All capabilities + host access | Very High | ❌ Avoid |
| capabilities: [SYS_ADMIN] | Broad system admin | High | ⚠️ Only if necessary |
| capabilities: [DAC_READ_SEARCH] | File read bypass only | Low | ✅ Recommended |
Credential Management
Current Implementation (SealedSecrets):
B2 credentials are managed via SealedSecret for GitOps compatibility:
- SealedSecret: manifests/base/velero/b2-credentials-sealed.yaml
- Decrypted Secret: velero-b2-credentials in the velero namespace
- Kustomization: manifests/base/velero/kustomization.yaml
# View credential secret (base64 encoded)
kubectl get secret velero-b2-credentials -n velero -o yaml
# Check SealedSecret status
kubectl get sealedsecret velero-b2-credentials -n velero
Updating B2 Credentials:
# 1. Create temporary secret file (DO NOT COMMIT)
cat > /tmp/velero-b2-credentials.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: velero-b2-credentials
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=<NEW_B2_KEY_ID>
    aws_secret_access_key=<NEW_B2_APPLICATION_KEY>
EOF
# 2. Seal the secret
kubeseal --cert <(kubectl get secret -n kube-system \
-l sealedsecrets.bitnami.com/sealed-secrets-key=active \
-o jsonpath='{.items[0].data.tls\.crt}' | base64 -d) \
--format yaml < /tmp/velero-b2-credentials.yaml > manifests/base/velero/b2-credentials-sealed.yaml
# 3. Delete temporary file and commit
rm /tmp/velero-b2-credentials.yaml
git add manifests/base/velero/b2-credentials-sealed.yaml
git commit -m "feat: Update Velero B2 credentials"
git push
See Secrets Management for more details on SealedSecrets.
Migration from LocalStack to Production S3
Migration Status: ✅ Completed (2026-01-15)
The migration from LocalStack to Backblaze B2 has been completed successfully:
- PR #239: feat: Migrate Velero backups from LocalStack to Backblaze B2
- Bucket: velero-backups-homelab-n37
- Region: us-west-004
- Credentials: Managed via SealedSecret
Verification Results
# BackupStorageLocation status
$ kubectl get backupstoragelocation -n velero
NAME PHASE LAST VALIDATED AGE DEFAULT
default Available 1s 17d true
# Test backup completed successfully
$ velero backup create test-b2-migration --include-namespaces velero --wait
Backup completed with status: Completed
Items backed up: 54
Migration Reference (For Future Providers)
If you need to migrate to a different S3 provider in the future:
Step 1: Create SealedSecret for new credentials
# Create temporary secret
cat > /tmp/velero-new-credentials.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: velero-new-credentials
  namespace: velero
type: Opaque
stringData:
  cloud: |
    [default]
    aws_access_key_id=<NEW_KEY_ID>
    aws_secret_access_key=<NEW_SECRET_KEY>
EOF
# Seal and commit
kubeseal ... < /tmp/velero-new-credentials.yaml > manifests/base/velero/new-credentials-sealed.yaml
rm /tmp/velero-new-credentials.yaml
Step 2: Update values.yaml
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: <new-bucket-name>
      config:
        region: <new-region>
        s3Url: <new-s3-endpoint>
credentials:
  useSecret: true
  existingSecret: "velero-new-credentials"
Step 3: Update kustomization.yaml and deploy
git add manifests/base/velero/
git commit -m "feat: Migrate Velero to new S3 provider"
git push
Testing Procedures
Test 1: ConfigMap Backup/Restore
# Create test data
kubectl create namespace velero-test
kubectl -n velero-test create configmap test-data --from-literal=foo=bar
# Backup
velero backup create test-configmap \
--include-namespaces velero-test \
--wait
# Delete namespace
kubectl delete namespace velero-test
# Restore
velero restore create test-restore \
--from-backup test-configmap \
--wait
# Verify
kubectl -n velero-test get configmap test-data -o yaml
# Cleanup
kubectl delete namespace velero-test
Test 2: PVC Backup/Restore
For comprehensive PVC testing procedures, see manifests/base/velero/README.md in the homelab repository.
Example PVC Test:
# Create test namespace and PVC
kubectl create namespace velero-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: velero-test
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: synology-iscsi-retain
  resources:
    requests:
      storage: 1Gi
EOF
# Create pod with data
kubectl run test-pod -n velero-test --image=busybox --restart=Never \
--overrides='{"spec":{"containers":[{"name":"busybox","image":"busybox","command":["/bin/sh","-c","echo test-data > /data/test.txt && sleep 3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"test-pvc"}}]}}'
# Backup with CSI snapshots
velero backup create test-pvc-backup \
--include-namespaces velero-test \
--snapshot-volumes=true \
--wait
# Check VolumeSnapshot was created
kubectl get volumesnapshot -n velero-test
# Delete namespace
kubectl delete namespace velero-test
# Restore
velero restore create test-pvc-restore \
--from-backup test-pvc-backup \
--wait
# Verify data
kubectl -n velero-test exec test-pod -- cat /data/test.txt
# Cleanup
kubectl delete namespace velero-test
Best Practices
- Test Restores Regularly: Monthly disaster recovery drills
- Monitor Backup Success: Check Prometheus metrics and AlertManager notifications
- Verify S3 Storage: Monthly audit of S3 bucket and costs
- Update Retention Policies: Adjust based on compliance and storage requirements
- Document Procedures: Keep disaster recovery runbooks up-to-date
- Plan for Growth: Monitor backup sizes and adjust resources accordingly
- Secure Credentials: Use git-crypt, Vault, or external secret management for production
- Test Production Migration: Validate S3 migration before relying on it
Known Issues and Solutions
Issue 1: Velero v1.17 Breaking Change - --keep-latest-maintenance-jobs Flag Removed
- Date Noted: 2026-01-23
- Severity: Critical (pod crash)
- Status: Resolved
Symptoms:
- Velero pod in CrashLoopBackOff after upgrading to chart v11.x
- Error in logs: Error: unknown flag: --keep-latest-maintenance-jobs
Root Cause:
The --keep-latest-maintenance-jobs CLI flag was deprecated in Velero v1.14 and removed in v1.17. The Helm chart v11.x uses a ConfigMap-based approach instead (--repo-maintenance-job-configmap).
Solution:
If ArgoCD isn't picking up the new chart version from git (still showing old targetRevision), recreate the ArgoCD Application:
# Delete and recreate the ArgoCD Application to force sync
kubectl delete application velero -n argocd
kubectl apply -f manifests/applications/velero.yaml
# Wait for sync and verify
kubectl get application velero -n argocd
velero backup-location get
Configuration Change:
The new Helm chart uses configuration.repositoryMaintenanceJob.repositoryConfigData in values.yaml instead of CLI flags:
configuration:
  repositoryMaintenanceJob:
    repositoryConfigData:
      global:
        keepLatestMaintenanceJobs: 3  # Previously a CLI flag
Related PRs:
- homelab#271: Velero major update to v11.3.2
Issue 2: snapshot-controller v8.x VolumeSnapshot Failures
- Date Noted: 2026-01-05
- Severity: Critical (backup failure)
- Status: Resolved by downgrading to v7.0.2 (see PR #189 below)
Symptoms:
- All VolumeSnapshots stuck with READYTOUSE: false
- Velero backups showing PartiallyFailed status
- Error message: VolumeSnapshotContent is invalid: spec: Invalid value: sourceVolumeMode is required once set
- VolumeSnapshotContent objects unable to be updated by snapshot-controller
Root Cause:
snapshot-controller v8.2.0 has strict immutability validation on the sourceVolumeMode field. When the controller attempts to add annotations to VolumeSnapshotContent objects during snapshot creation, the Kubernetes API server rejects the updates due to field validation rules that treat any update as potentially modifying the immutable field.
This is a known issue with the v8.x series: kubernetes-csi/external-snapshotter#866
Investigation Commands:
# Check VolumeSnapshot status
kubectl get volumesnapshot -A
# Describe failed snapshot
kubectl describe volumesnapshot -n default <snapshot-name>
# Check snapshot-controller version
kubectl get deployment -n synology-csi snapshot-controller -o yaml | grep "image:"
# View snapshot-controller logs
kubectl logs -n synology-csi deployment/snapshot-controller
Solution:
Switch to snapshot-controller v7.0.2 or v8.2.1, both of which are stable and compatible with Kubernetes 1.35:
Step 1: Clean up stuck VolumeSnapshot resources
# Remove finalizers to allow deletion
kubectl patch volumesnapshot -n <namespace> <snapshot-name> \
-p '{"metadata":{"finalizers":null}}' --type=merge
# Repeat for all stuck VolumeSnapshotContent objects
kubectl patch volumesnapshotcontent <snapcontent-name> \
-p '{"metadata":{"finalizers":null}}' --type=merge
Step 2: Update snapshot-controller version
In manifests/base/synology-csi/kustomization.yaml:
resources:
  - github.com/kubernetes-csi/external-snapshotter/client/config/crd?ref=v7.0.2
  - github.com/kubernetes-csi/external-snapshotter/deploy/kubernetes/snapshot-controller?ref=v7.0.2
Step 3: Deploy and verify
# ArgoCD will auto-sync
argocd app sync synology-csi
# Wait for new snapshot-controller pods
kubectl get pods -n synology-csi -l app.kubernetes.io/name=snapshot-controller
# Test VolumeSnapshot creation
kubectl apply -f - <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
  namespace: default
spec:
  volumeSnapshotClassName: synology-snapshot-class
  source:
    persistentVolumeClaimName: <your-pvc-name>
EOF
# Verify snapshot reaches READYTOUSE: true
kubectl get volumesnapshot -n default test-snapshot
Expected Result:
- VolumeSnapshots reach READYTOUSE: true in 8-10 seconds
- Velero backups complete with status Completed (not PartiallyFailed)
- CSI snapshots: csiVolumeSnapshotsCompleted: 3, Errors: 0
Related PRs:
- homelab#189: Downgrade snapshot-controller to v7.0.2 for stability
- homelab#188: Add snapshot-controller to Synology CSI deployment (introduced issue)
- homelab#187: Configure Velero to use CSI snapshots only
Issue 3: LocalStack Connection Required for Initial Deployment
- Date Noted: 2025-12-27
- Severity: Medium (deployment blocker)
Symptoms:
- Velero pod fails to start if LocalStack is not deployed first
- BackupStorageLocation shows "Unavailable"
Root Cause:
- The default values.yaml is configured for LocalStack testing
- Velero validates S3 connectivity on startup
Solution:
- Deploy LocalStack before Velero (for testing), OR
- Configure production S3 credentials before first deployment
Related PRs:
- homelab#149: Deploy Velero with Kopia file-level backup support
Related Documentation
- kube-prometheus-stack - Velero backup alerts
- Monitoring Overview
- ArgoCD
- Storage Configuration