Trivy Vulnerability Remediation Guide

Production Status

Last Updated: 2026-01-12

Trivy Operator Status: ✅ OPERATIONAL (since 2026-01-05 deployment)

Current Alerts:

  • ⚠️ CriticalVulnerabilitiesDetected: Firing (4 images with CRITICAL CVEs)
  • ⚠️ HighVulnerabilityCount: Firing for 3 images
  • ✅ ExposedSecretsDetected: Not firing (no secrets exposed)
  • ✅ Alert routing to email confirmed working

Current Vulnerability Summary (2026-01-12):

| Severity | Count | Change from Initial (2026-01-05) |
|----------|-------|----------------------------------|
| CRITICAL | 10 | ⬇️ -43 (-81%) |
| HIGH | 332 | ⬇️ -422 (-56%) |
| MEDIUM | 1,074 | ⬇️ -425 (-28%) |
| Reports | 95 | +18 images scanned |

Remaining CRITICAL Vulnerabilities:

| Component | CRITICAL | Blocker |
|-----------|----------|---------|
| Synology CSI (controller) | 3 | Awaiting upstream v1.2.2 |
| Synology CSI (snapshotter) | 3 | Awaiting upstream v1.2.2 |
| Synology CSI (node) | 3 | Awaiting upstream v1.2.2 |
| Trivy Server | 1 | Awaiting Alpine base image fix |

Overview

This guide provides procedures for responding to and remediating vulnerabilities detected by Trivy Operator in the Raspberry Pi 5 Kubernetes homelab cluster.

Monitoring and Alerting

Grafana Dashboard

Access the Trivy Security Dashboard at: https://grafana.k8s.n37.ca

Dashboard Panels:

  • Total Vulnerabilities: Critical, High, and Medium severity counts
  • Images Scanned: Total number of container images monitored
  • Vulnerabilities by Image: Sortable table with severity breakdown
  • Severity Distribution: Pie chart showing vulnerability composition
  • Namespace Breakdown: Critical+High vulnerabilities by namespace

Active Alerts

PrometheusRule alerts configured in manifests/base/trivy-operator/trivy-alerts.yaml:

| Alert Name | Severity | Threshold | Purpose |
|------------|----------|-----------|---------|
| CriticalVulnerabilitiesDetected | Critical | Any image with CRITICAL CVEs | Immediate notification of critical security issues |
| HighVulnerabilityCount | Warning | >20 HIGH vulnerabilities in single image | Warn when vulnerability count is excessive |
| ClusterCriticalVulnerabilityThresholdExceeded | Warning | >100 CRITICAL across cluster | Cluster-wide security posture degradation |
| ExposedSecretsDetected | Critical | Any exposed secrets in images | IMMEDIATE ACTION REQUIRED |
| HighRiskRBACPermissions | Warning | Critical RBAC issues | Overly permissive cluster roles |
| CISKubernetesBenchmarkFailures | Info | CIS compliance failures | Compliance monitoring |
| NSAKubernetesHardeningFailures | Info | NSA hardening failures | Security hardening gaps |
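
A minimal sketch of how one of these rules can be expressed as a PrometheusRule, assuming the `trivy_image_vulnerabilities` metric exported by Trivy Operator (label names and thresholds here are illustrative; verify against your operator version and the actual rule in trivy-alerts.yaml):

```yaml
# Hedged sketch of the CriticalVulnerabilitiesDetected rule; adjust
# metric and label names to match your Trivy Operator version.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trivy-alerts
  namespace: trivy-system
spec:
  groups:
    - name: trivy
      rules:
        - alert: CriticalVulnerabilitiesDetected
          expr: sum by (namespace, resource_name) (trivy_image_vulnerabilities{severity="Critical"}) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Workload in {{ $labels.namespace }} has CRITICAL CVEs"
```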

Vulnerability Response Workflow

1. Alert Triage (within 1 hour)

When receiving a vulnerability alert:

# View all vulnerability reports
kubectl get vulnerabilityreports -A

# Get detailed report for specific workload
kubectl get vulnerabilityreport -n <namespace> <report-name> -o yaml

# Filter for CRITICAL vulnerabilities only
kubectl get vulnerabilityreports -A -o json | \
jq -r '.items[] | select(.report.summary.criticalCount > 0) |
"\(.metadata.namespace)/\(.metadata.labels."trivy-operator.resource.name"): \(.report.summary.criticalCount) CRITICAL"'
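
The CRITICAL filter above can be dry-run offline against a mocked report list (field names follow the Trivy Operator VulnerabilityReport CRD; the data is illustrative):

```shell
# Mock a VulnerabilityReport list to exercise the jq filter without a cluster.
cat > /tmp/reports.json <<'EOF'
{"items":[
  {"metadata":{"namespace":"synology-csi",
               "labels":{"trivy-operator.resource.name":"synology-csi-controller"}},
   "report":{"summary":{"criticalCount":3}}},
  {"metadata":{"namespace":"monitoring",
               "labels":{"trivy-operator.resource.name":"grafana"}},
   "report":{"summary":{"criticalCount":0}}}
]}
EOF
jq -r '.items[] | select(.report.summary.criticalCount > 0) |
  "\(.metadata.namespace)/\(.metadata.labels."trivy-operator.resource.name"): \(.report.summary.criticalCount) CRITICAL"' \
  /tmp/reports.json
# Prints: synology-csi/synology-csi-controller: 3 CRITICAL
```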

2. Vulnerability Assessment

For each CRITICAL vulnerability:

  1. Identify the CVE:

    kubectl get vulnerabilityreport -n <namespace> <report> -o json | \
    jq -r '.report.vulnerabilities[] | select(.severity == "CRITICAL") |
    "\(.vulnerabilityID): \(.title)"'
  2. Check exploitability:

    • Review CVE details at nvd.nist.gov
    • Check if vulnerability is remotely exploitable
    • Determine if affected component is exposed to network
  3. Assess impact:

    • Is the vulnerable package actually used by the application?
    • What's the blast radius if exploited?
    • Are there compensating controls (network policies, RBAC)?
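
Step 2 can be partly scripted: pull the CRITICAL CVE IDs out of a report and emit an NVD link for each, ready for review (mocked report data here for illustration):

```shell
# Mock one VulnerabilityReport and generate an NVD review link per CRITICAL CVE.
cat > /tmp/one-report.json <<'EOF'
{"report":{"vulnerabilities":[
  {"vulnerabilityID":"CVE-2024-45337","severity":"CRITICAL","title":"golang.org/x/crypto SSH"},
  {"vulnerabilityID":"CVE-2024-24790","severity":"MEDIUM","title":"net/netip"}
]}}
EOF
jq -r '.report.vulnerabilities[] | select(.severity == "CRITICAL") |
  "https://nvd.nist.gov/vuln/detail/\(.vulnerabilityID)"' /tmp/one-report.json
# Prints: https://nvd.nist.gov/vuln/detail/CVE-2024-45337
```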

3. Remediation Strategies

Strategy A: Update Container Image (Preferred)

Best for: Vendor-maintained images (ArgoCD, Grafana, Prometheus)

# Check current image version
kubectl get deployment -n <namespace> <name> -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check for newer image versions
# For Helm charts:
helm search repo <chart-name> --versions | head -10

# Update Helm chart version in ArgoCD Application
vim manifests/applications/<app>.yaml
# Update targetRevision to latest stable version

# Create PR and deploy
git add manifests/applications/<app>.yaml
git commit -m "fix: Update <app> to address CVE-YYYY-XXXXX"
git push
gh pr create

Strategy B: Rebuild Custom Images

Best for: Custom applications and images you control

# Update base image in Dockerfile
FROM debian:12-slim # Update to latest stable base image

# Rebuild and push
docker build -t <registry>/<image>:<new-tag> .
docker push <registry>/<image>:<new-tag>

# Update Kubernetes manifest
kubectl set image deployment/<name> <container>=<registry>/<image>:<new-tag> -n <namespace>

Strategy C: Accept Risk (Temporary)

Only when:

  • No patch available from vendor
  • Vulnerability not exploitable in your environment
  • Critical business application cannot be updated immediately

Document in issue tracker:

## Accepted Risk: CVE-YYYY-XXXXX in <component>

**Severity:** CRITICAL
**Affected:** <namespace>/<workload>
**Reason:** [No patch available / Business critical / Isolated environment]
**Compensating Controls:**
- Network policy restricts ingress to pod
- RBAC limits pod permissions
- WAF/ingress filtering applied
**Remediation Plan:** Upgrade to <version> when available (ETA: <date>)
**Review Date:** <date>
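
The "network policy restricts ingress" compensating control can be made concrete. A hedged sketch (namespace, labels, and names are placeholders for your workload):

```yaml
# Placeholder deny-by-default ingress policy for the vulnerable workload;
# substitute real namespace and pod labels before applying.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-vulnerable-workload
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: <workload>
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: <allowed-client>   # only this in-namespace client may connect
```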

4. Post-Remediation Verification

After applying fixes:

# Force Trivy rescan (Trivy rescans every 24h by default)
kubectl delete vulnerabilityreport -n <namespace> <report>
# Trivy will automatically regenerate the report

# Wait 2-3 minutes for scan to complete, then verify
kubectl get vulnerabilityreport -n <namespace> -o json | \
jq '.items[] | {name: .metadata.name, critical: .report.summary.criticalCount, high: .report.summary.highCount}'

# Check Grafana dashboard for updated metrics

Common Vulnerability Scenarios

Scenario 1: Base OS Package Vulnerabilities (Debian/Alpine)

Example: CVE-2024-37371 (Kerberos vulnerability in Debian base image)

Root Cause: Outdated base image layer

Remediation:

# Update base image to latest patch version
FROM debian:12.8-slim # Instead of debian:11-slim

# Or switch to distroless for minimal attack surface
FROM gcr.io/distroless/base-debian12

Scenario 2: Go/Python Library Vulnerabilities

Example: CVE-2024-45337 (golang.org/x/crypto SSH vulnerability)

Root Cause: Outdated Go module dependencies

Remediation:

# Update Go dependencies
go get -u golang.org/x/crypto@latest
go mod tidy

# Rebuild application
docker build -t <image>:<new-tag> .

Scenario 3: Exposed Secrets in Container Images

CRITICAL - IMMEDIATE ACTION REQUIRED

Example: AWS credentials, API keys, passwords in image layers

Remediation:

  1. Immediately rotate exposed credentials

  2. Remove secret from image:

    # Use Kubernetes Secrets instead
    kubectl create secret generic <name> --from-literal=api-key=<value>

    # Mount as environment variable
    # Mount as environment variable
    env:
    - name: API_KEY
      valueFrom:
        secretKeyRef:
          name: <name>
          key: api-key
  3. Rebuild image without secrets

  4. Review git history - ensure secrets never committed to source

Scenario 4: Third-Party Helm Chart Vulnerabilities

Example: Synology CSI driver with 5 CRITICAL vulnerabilities

Remediation:

# Check for chart updates
helm search repo synology-csi --versions

# If no update available, check upstream GitHub
# File issue: https://github.com/SynologyOpenSource/synology-csi/issues

# Temporary mitigation:
# - Apply network policies to restrict CSI pod access
# - Monitor for suspicious activity in CSI pods

Vulnerability Remediation Priorities

Priority 1: CRITICAL - Act within 24 hours

  • Exposed secrets in images
  • Remotely exploitable RCE vulnerabilities
  • Privilege escalation in cluster-facing components (API server, kubelet)

Priority 2: HIGH - Act within 1 week

  • High severity vulnerabilities in internet-facing services
  • Container escape vulnerabilities
  • Authentication bypass in exposed services

Priority 3: MEDIUM - Act within 1 month

  • Medium severity vulnerabilities with no known exploits
  • Vulnerabilities in internal-only services
  • Denial of Service vulnerabilities

Priority 4: LOW - Best effort

  • Low severity vulnerabilities
  • Vulnerabilities in unused code paths
  • Informational findings

Compliance and Reporting

Weekly Vulnerability Review

# Generate weekly vulnerability summary
kubectl get vulnerabilityreports -A -o json | \
jq -r '["NAMESPACE","RESOURCE","CRITICAL","HIGH","MEDIUM"],
(.items[] | [
.metadata.namespace,
.metadata.labels."trivy-operator.resource.name",
(.report.summary.criticalCount // 0),
(.report.summary.highCount // 0),
(.report.summary.mediumCount // 0)
]) | @tsv' | column -t

# Track remediation progress
# Compare against previous week's counts
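
The "compare against previous week" step can be sketched by saving each week's summary to a dated TSV and diffing the totals with awk (the snapshots below are mocked; real files would come from the command above):

```shell
# Mock two weekly TSV snapshots (namespace, resource, critical, high)
printf 'monitoring\tgrafana\t2\t6\n' > /tmp/vulns-prev.tsv
printf 'monitoring\tgrafana\t0\t4\n' > /tmp/vulns-curr.tsv

# Sum the CRITICAL column of each snapshot and report the delta.
prev=$(awk -F'\t' '{s += $3} END {print s + 0}' /tmp/vulns-prev.tsv)
curr=$(awk -F'\t' '{s += $3} END {print s + 0}' /tmp/vulns-curr.tsv)
echo "CRITICAL week-over-week: $prev -> $curr ($((curr - prev)))"
# Prints: CRITICAL week-over-week: 2 -> 0 (-2)
```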

Monthly Compliance Reports

# CIS Kubernetes Benchmark status
kubectl get clustercompliancereport k8s-cis-1.23 -o json | \
  jq '{
    title: .spec.title,
    passCount: (.status.summary.passCount // 0),
    failCount: (.status.summary.failCount // 0)
  }'

# NSA Kubernetes Hardening Guidance
kubectl get clustercompliancereport k8s-nsa-1.0 -o json | \
  jq '{
    title: .spec.title,
    passCount: (.status.summary.passCount // 0),
    failCount: (.status.summary.failCount // 0)
  }'

Preventive Measures

1. Image Selection Best Practices

  • Prefer official images: Use official vendor images (e.g., prom/prometheus, grafana/grafana)
  • Use minimal base images: Prefer alpine or distroless over full Debian/Ubuntu
  • Pin specific versions: Avoid :latest tag, use semantic versioning
  • Verify image signatures: Use Cosign for image signature verification

2. Continuous Scanning

Trivy automatically scans:

  • On deployment: New images scanned within minutes
  • Daily rescans: All images rescanned every 24 hours
  • Compliance checks: Daily CIS/NSA compliance assessment
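
The 24-hour rescan is driven by the report TTL: expired reports are deleted and regenerated on the next reconcile. A hedged values.yaml fragment (the key is `operator.scannerReportTTL` in recent trivy-operator Helm charts, but verify the name against your chart version):

```yaml
# trivy-operator Helm values sketch; confirm the key for your chart version.
operator:
  scannerReportTTL: "24h"   # reports older than this are deleted and rescanned
```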

3. Automated Updates

# Configure Renovate Bot for automated dependency updates
# .github/renovate.json
{
  "extends": ["config:base"],
  "kubernetes": {
    "fileMatch": ["manifests/.+\\.yaml$"]
  },
  "helm-values": {
    "fileMatch": ["manifests/base/.+/values\\.yaml$"]
  }
}

Troubleshooting

Trivy Scan Failures

# Check Trivy Operator logs
kubectl logs -n trivy-system deployment/trivy-operator --tail=100

# Check Trivy server logs
kubectl logs -n trivy-system statefulset/trivy-server --tail=100

# Manually trigger scan
kubectl delete vulnerabilityreport -n <namespace> <report>

False Positives

If Trivy reports a vulnerability that doesn't apply:

  1. Verify the finding:

    kubectl get vulnerabilityreport <name> -o json | \
    jq '.report.vulnerabilities[] | select(.vulnerabilityID == "CVE-YYYY-XXXXX")'
  2. Check if package is actually used:

    # Exec into container and verify
    kubectl exec -n <namespace> <pod> -- dpkg -l | grep <package>
  3. Create exception if confirmed false positive:

    # Add to trivy-operator values.yaml
    trivyOperator:
      ignoreUnfixed: true
      ignoreVulnerabilities:
        - CVE-YYYY-XXXXX  # Document reason

Resources

Current Cluster Status

Production Data (2026-01-12, post-Major Remediation Day):

| Severity | Count | Change from Initial (2026-01-05) | Remediation Impact |
|----------|-------|----------------------------------|--------------------|
| CRITICAL | 10 | ⬇️ -43 (-81%) | Multiple components remediated |
| HIGH | 332 | ⬇️ -422 (-56%) | Major sidecar and app updates |
| MEDIUM | 1,074 | ⬇️ -425 (-28%) | Cluster-wide improvements |
| TOTAL | ~1,416 | ⬇️ -890 (-39%) | Substantial security improvement |

Note: Major vulnerability remediation completed across multiple sessions. Only 10 CRITICAL vulnerabilities remain, all blocked on upstream vendor releases.

Vulnerability Trend: ⬇️ EXCELLENT (81% CRITICAL reduction achieved)

2026-01-11: Major Remediation Day Results

| Component | CRITICAL Before | CRITICAL After | HIGH Before | HIGH After |
|-----------|-----------------|----------------|-------------|------------|
| ArgoCD Redis | 3 | 0 ✅ | 34 | 0 ✅ |
| MetalLB FRR | 8 | 0 ✅ | 84 | 10 |
| Blackbox Exporter | 2 | 0 ✅ | 7 | 0 ✅ |
| SNMP Exporter | 2 | 0 ✅ | 6 | 0 ✅ |
| External-DNS | 1 | 0 ✅ | 7 | 1 |
| Snapshot Controller | 1 | 0 ✅ | 8 | 4 |
| CSI Snapshotter | 1 | 0 ✅ | 6 | 2 |

PRs Merged: #203, #205, #206, #207, #208, #209, #211, #212

2026-01-12: Synology CSI v1.2.1 Fix

  • Issue: v1.2.1 node plugin had iscsiadm mount regression
  • Solution: Added --chroot-dir=/host and --iscsiadm-path=/usr/sbin/iscsiadm flags
  • PR #216: Successfully deployed, all PVC mounts working
  • Impact: Node plugin now running v1.2.1 (it had previously been rolled back to v1.2.0 after the regression)

Note: CRITICAL count unchanged, as v1.2.0 and v1.2.1 share the same base-image vulnerabilities

Recent Remediation Actions:

  1. 2026-01-07 (evening): Promtail upgraded to 6.17.1 (app version 3.0.0 → 3.5.1)

    • CRITICAL: 7 → 0 (100% elimination) ✅
    • HIGH: 34 → 4 (88% reduction) ✅
    • Deployment: Rolling update, 3 minutes, zero downtime
    • Status: All 5 pods running successfully
    • Cluster impact: 43 → 38 CRITICAL
  2. 2026-01-07 (late evening): Synology CSI sidecars upgraded

    • csi-attacher: v4.0.0 → v4.10.0
    • csi-node-driver-registrar: v2.3.0 → v2.15.0
    • csi-snapshotter: v4.2.1 → v7.0.2
    • synology-csi (node): v1.2.0 → v1.2.1
    • Component CRITICAL: 13 → 11 (15% reduction, remaining in vendor base image)
    • Component HIGH: 163 → 49 (70% reduction) ✅
    • Deployment: Rolling updates across 3 StatefulSets/DaemonSets, all nodes updated successfully
    • Verification: All PVCs remain Bound, test volume provisioning successful
    • Cluster impact: 38 → 28 CRITICAL, 600 → 428 HIGH

The dramatic reduction in vulnerabilities demonstrates the effectiveness of targeted remediation of high-priority components.

Key CVEs to address:

  • CVE-2024-37371: Kerberos GSS (affects multiple base images) - High Priority
  • CVE-2024-41110: Docker/Moby authorization bypass - High Priority
  • CVE-2024-45337: Golang SSH vulnerability - Medium Priority
  • CVE-2024-24790: Golang net/netip issue - Medium Priority

Remediation Progress Tracking:

| Component | CRITICAL (Before → After) | HIGH (Before → After) | Remediation Status | Completion Date |
|-----------|---------------------------|------------------------|--------------------|-----------------|
| Promtail | 7 → 0 | 34 → 4 | 🟢 Completed | 2026-01-07 |
| Synology CSI Sidecars | 8 → 0 | 114 → 20 | 🟢 Completed | 2026-01-07 |
| ArgoCD Redis | 3 → 0 | 34 → 0 | 🟢 Completed | 2026-01-11 |
| MetalLB FRR | 8 → 0 | 84 → 10 | 🟢 Completed | 2026-01-11 |
| Blackbox Exporter | 2 → 0 | 7 → 0 | 🟢 Completed | 2026-01-11 |
| SNMP Exporter | 2 → 0 | 6 → 0 | 🟢 Completed | 2026-01-11 |
| External-DNS | 1 → 0 | 7 → 1 | 🟢 Completed | 2026-01-11 |
| Snapshot Controller | 1 → 0 | 8 → 4 | 🟢 Completed | 2026-01-11 |
| CSI Snapshotter | 1 → 0 | 6 → 2 | 🟢 Completed | 2026-01-11 |
| Synology CSI (base image) | 9 | 27 | 🔴 Blocked | Awaiting v1.2.2 |
| Trivy Server | 1 | 7 | 🔴 Blocked | Awaiting Alpine fix |

Remediation Results:

Promtail (2026-01-07 evening):

  • Version Upgrade: Helm chart 6.16.6 → 6.17.1 (app version 3.0.0 → 3.5.1)
  • CRITICAL Reduction: 100% (7 → 0) - All CRITICAL CVEs eliminated ✅
  • HIGH Reduction: 88% (34 → 4) - Reduced from 34 to just 4 HIGH CVEs ✅
  • Deployment: Rolling update completed successfully in 3 minutes
  • Verification: All 5 pods running, logs flowing, metrics available
  • Resource Usage: Memory 26-35Mi (well under 128Mi limit)
  • Cluster Impact: Cluster-wide CRITICAL count reduced from 43 → 38

Synology CSI Sidecars (2026-01-07 late evening):

  • Component Upgrades (Final State):

    • ✅ csi-attacher: v4.0.0 → v4.10.0
    • ✅ csi-node-driver-registrar: v2.3.0 → v2.15.0
    • ✅ csi-snapshotter: v4.2.1 → v7.0.2
    • ⚠️ synology-csi (node): v1.2.0 → v1.2.1 → v1.2.0 (ROLLED BACK)
    • ✅ synology-csi (controller/snapshotter): v1.2.1 (unchanged)
  • Issue Encountered:

    • After upgrading synology-csi node plugin to v1.2.1, new iSCSI volume mounts failed with:

      env: can't execute 'iscsiadm': No such file or directory (exit status 127)
    • Grafana pod unable to start (stuck mounting PVC)

    • Existing PVCs mounted before upgrade remained functional

    • Root cause: v1.2.1 container regression - cannot find iscsiadm on host for new mounts

    • Resolution: Hotfix PR #201 rolled back node plugin to v1.2.0, Grafana restored successfully

  • CRITICAL Reduction: 15% (13 → 11) - Upgraded sidecars now 0 CRITICAL, remaining 11 in vendor base image

  • HIGH Reduction: 70% (163 → 49) - Major reduction in sidecar vulnerabilities ✅

  • Deployment: Partial upgrade successful:

    • ✅ Controller StatefulSet: 3 sidecar containers updated (csi-attacher, csi-provisioner, csi-resizer)
    • ✅ Node DaemonSet: csi-node-driver-registrar v2.15.0, synology-csi v1.2.0 (rolled back)
    • ✅ Snapshotter StatefulSet: csi-snapshotter v7.0.2
  • Verification:

    • All 9 CSI pods Running successfully with rollback
    • All 4 PVCs Bound and accessible (Prometheus 50Gi, Grafana 5Gi, Loki 20Gi, Trivy 5Gi)
    • Grafana pod fully operational after rollback
    • CSI driver registration confirmed
  • Cluster Impact: Cluster-wide CRITICAL reduced 38 → 28, HIGH reduced 600 → 428

  • Resolved Issue (2026-01-12): synology-csi v1.2.1 node plugin iscsiadm regression fixed by adding --chroot-dir=/host and --iscsiadm-path=/usr/sbin/iscsiadm flags (PR #216). All nodes now running v1.2.1 with working iSCSI mounts.

Next Steps:

  • Trivy Operator deployed and operational
  • Monitoring and alerting confirmed working
  • Review and update Promtail to latest version (Priority 1): ✅ COMPLETED 2026-01-07
  • Check for Synology CSI driver updates (Priority 1): ✅ COMPLETED 2026-01-12
    • Sidecars successfully upgraded (csi-attacher, csi-node-driver-registrar, csi-snapshotter)
    • Node plugin upgraded to v1.2.1 with iscsiadm-path fix (PR #216)
    • Remaining 9 CRITICAL in base image - awaiting upstream v1.2.2 release
  • Review and update ArgoCD Redis (Priority 2): ✅ COMPLETED 2026-01-11
    • Chart upgraded 9.0.5 → 9.2.4, Redis 8.2.2-alpine
    • All 3 CRITICAL and 34 HIGH vulnerabilities eliminated
  • Major Remediation Day (2026-01-11): ✅ COMPLETED
    • MetalLB, Blackbox Exporter, SNMP Exporter, External-DNS upgraded
    • Snapshot Controller and CSI Snapshotter upgraded to v8.x
    • 18 CRITICAL vulnerabilities eliminated in single day
  • Monitor upstream releases:
    • Synology CSI v1.2.2 (fixes 9 CRITICAL) - check GitHub releases
    • Trivy Server Alpine base image update (fixes 1 CRITICAL)
  • Implement automated update pipeline (Renovate Bot)