Trivy Vulnerability Remediation Guide

Production Status

Last Updated: 2026-01-12

Trivy Operator Status: ✅ OPERATIONAL (since 2026-01-05 deployment)

Current Alerts:

  • ⚠️ CriticalVulnerabilitiesDetected: Firing (4 images with CRITICAL CVEs)
  • ⚠️ HighVulnerabilityCount: Firing for 3 images
  • ✅ ExposedSecretsDetected: Not firing (no secrets exposed)
  • ✅ Alert routing to email confirmed working

Current Vulnerability Summary (2026-01-12):

| Severity | Count | Change from Initial (2026-01-05) |
|----------|-------|----------------------------------|
| CRITICAL | 10 | ⬇️ -43 (-81%) |
| HIGH | 332 | ⬇️ -422 (-56%) |
| MEDIUM | 1,074 | ⬇️ -425 (-28%) |
| Reports | 95 | +18 images scanned |

Remaining CRITICAL Vulnerabilities:

| Component | CRITICAL | Blocker |
|-----------|----------|---------|
| Synology CSI (controller) | 3 | Awaiting upstream v1.2.2 |
| Synology CSI (snapshotter) | 3 | Awaiting upstream v1.2.2 |
| Synology CSI (node) | 3 | Awaiting upstream v1.2.2 |
| Trivy Server | 1 | Awaiting Alpine base image fix |

Overview

This guide provides procedures for responding to and remediating vulnerabilities detected by Trivy Operator in the Raspberry Pi 5 Kubernetes homelab cluster.

Monitoring and Alerting

Grafana Dashboard

Access the Trivy Security Dashboard at: https://grafana.k8s.n37.ca

Dashboard Panels:

  • Total Vulnerabilities: Critical, High, and Medium severity counts
  • Images Scanned: Total number of container images monitored
  • Vulnerabilities by Image: Sortable table with severity breakdown
  • Severity Distribution: Pie chart showing vulnerability composition
  • Namespace Breakdown: Critical+High vulnerabilities by namespace

Active Alerts

PrometheusRule alerts configured in manifests/base/trivy-operator/trivy-alerts.yaml:

| Alert Name | Severity | Threshold | Purpose |
|------------|----------|-----------|---------|
| CriticalVulnerabilitiesDetected | Critical | Any image with CRITICAL CVEs | Immediate notification of critical security issues |
| HighVulnerabilityCount | Warning | >20 HIGH vulnerabilities in single image | Warn when vulnerability count is excessive |
| ClusterCriticalVulnerabilityThresholdExceeded | Warning | >100 CRITICAL across cluster | Cluster-wide security posture degradation |
| ExposedSecretsDetected | Critical | Any exposed secrets in images | IMMEDIATE ACTION REQUIRED |
| HighRiskRBACPermissions | Warning | Critical RBAC issues | Overly permissive cluster roles |
| CISKubernetesBenchmarkFailures | Info | CIS compliance failures | Compliance monitoring |
| NSAKubernetesHardeningFailures | Info | NSA hardening failures | Security hardening gaps |
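
A minimal sketch of how one of these rules can be expressed as a PrometheusRule, assuming the `trivy_image_vulnerabilities` metric exported by Trivy Operator (label names and thresholds here are illustrative; verify against your operator version and the actual rule in trivy-alerts.yaml):

```yaml
# Hedged sketch of the CriticalVulnerabilitiesDetected rule; adjust
# metric and label names to match your Trivy Operator version.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: trivy-alerts
  namespace: trivy-system
spec:
  groups:
    - name: trivy
      rules:
        - alert: CriticalVulnerabilitiesDetected
          expr: sum by (namespace, resource_name) (trivy_image_vulnerabilities{severity="Critical"}) > 0
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Workload in {{ $labels.namespace }} has CRITICAL CVEs"
```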

Vulnerability Response Workflow

1. Alert Triage (within 1 hour)

When receiving a vulnerability alert:

# View all vulnerability reports
kubectl get vulnerabilityreports -A

# Get detailed report for specific workload
kubectl get vulnerabilityreport -n <namespace> <report-name> -o yaml

# Filter for CRITICAL vulnerabilities only
kubectl get vulnerabilityreports -A -o json | \
jq -r '.items[] | select(.report.summary.criticalCount > 0) |
"\(.metadata.namespace)/\(.metadata.labels."trivy-operator.resource.name"): \(.report.summary.criticalCount) CRITICAL"'
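
The CRITICAL filter above can be dry-run offline against a mocked report list (field names follow the Trivy Operator VulnerabilityReport CRD; the data is illustrative):

```shell
# Mock a VulnerabilityReport list to exercise the jq filter without a cluster.
cat > /tmp/reports.json <<'EOF'
{"items":[
  {"metadata":{"namespace":"synology-csi",
               "labels":{"trivy-operator.resource.name":"synology-csi-controller"}},
   "report":{"summary":{"criticalCount":3}}},
  {"metadata":{"namespace":"monitoring",
               "labels":{"trivy-operator.resource.name":"grafana"}},
   "report":{"summary":{"criticalCount":0}}}
]}
EOF
jq -r '.items[] | select(.report.summary.criticalCount > 0) |
  "\(.metadata.namespace)/\(.metadata.labels."trivy-operator.resource.name"): \(.report.summary.criticalCount) CRITICAL"' \
  /tmp/reports.json
# Prints: synology-csi/synology-csi-controller: 3 CRITICAL
```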

2. Vulnerability Assessment

For each CRITICAL vulnerability:

  1. Identify the CVE:

    kubectl get vulnerabilityreport -n <namespace> <report> -o json | \
    jq -r '.report.vulnerabilities[] | select(.severity == "CRITICAL") |
    "\(.vulnerabilityID): \(.title)"'
  2. Check exploitability:

    • Review CVE details at nvd.nist.gov
    • Check if vulnerability is remotely exploitable
    • Determine if affected component is exposed to network
  3. Assess impact:

    • Is the vulnerable package actually used by the application?
    • What's the blast radius if exploited?
    • Are there compensating controls (network policies, RBAC)?
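
Step 2 can be partly scripted: pull the CRITICAL CVE IDs out of a report and emit an NVD link for each, ready for review (mocked report data here for illustration):

```shell
# Mock one VulnerabilityReport and generate an NVD review link per CRITICAL CVE.
cat > /tmp/one-report.json <<'EOF'
{"report":{"vulnerabilities":[
  {"vulnerabilityID":"CVE-2024-45337","severity":"CRITICAL","title":"golang.org/x/crypto SSH"},
  {"vulnerabilityID":"CVE-2024-24790","severity":"MEDIUM","title":"net/netip"}
]}}
EOF
jq -r '.report.vulnerabilities[] | select(.severity == "CRITICAL") |
  "https://nvd.nist.gov/vuln/detail/\(.vulnerabilityID)"' /tmp/one-report.json
# Prints: https://nvd.nist.gov/vuln/detail/CVE-2024-45337
```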

3. Remediation Strategies

Strategy A: Update Container Image (Preferred)

Best for: Vendor-maintained images (ArgoCD, Grafana, Prometheus)

# Check current image version
kubectl get deployment -n <namespace> <name> -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check for newer image versions
# For Helm charts:
helm search repo <chart-name> --versions | head -10

# Update Helm chart version in ArgoCD Application
vim manifests/applications/<app>.yaml
# Update targetRevision to latest stable version

# Create PR and deploy
git add manifests/applications/<app>.yaml
git commit -m "fix: Update <app> to address CVE-YYYY-XXXXX"
git push
gh pr create

Strategy B: Rebuild Custom Images

Best for: Custom applications and images you control

# Update base image in Dockerfile
FROM debian:12-slim # Update to latest stable base image

# Rebuild and push
docker build -t <registry>/<image>:<new-tag> .
docker push <registry>/<image>:<new-tag>

# Update Kubernetes manifest
kubectl set image deployment/<name> <container>=<registry>/<image>:<new-tag> -n <namespace>

Strategy C: Accept Risk (Temporary)

Only when:

  • No patch available from vendor
  • Vulnerability not exploitable in your environment
  • Critical business application cannot be updated immediately

Document in issue tracker:

## Accepted Risk: CVE-YYYY-XXXXX in <component>

**Severity:** CRITICAL
**Affected:** <namespace>/<workload>
**Reason:** [No patch available / Business critical / Isolated environment]
**Compensating Controls:**
- Network policy restricts ingress to pod
- RBAC limits pod permissions
- WAF/ingress filtering applied
**Remediation Plan:** Upgrade to <version> when available (ETA: <date>)
**Review Date:** <date>
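
The "network policy restricts ingress" compensating control can be made concrete. A hedged sketch (namespace, labels, and names are placeholders for your workload):

```yaml
# Placeholder deny-by-default ingress policy for the vulnerable workload;
# substitute real namespace and pod labels before applying.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-vulnerable-workload
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: <workload>
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: <allowed-client>   # only this in-namespace client may connect
```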

4. Post-Remediation Verification

After applying fixes:

# Force Trivy rescan (Trivy rescans every 24h by default)
kubectl delete vulnerabilityreport -n <namespace> <report>
# Trivy will automatically regenerate the report

# Wait 2-3 minutes for scan to complete, then verify
kubectl get vulnerabilityreport -n <namespace> -o json | \
jq '.items[] | {name: .metadata.name, critical: .report.summary.criticalCount, high: .report.summary.highCount}'

# Check Grafana dashboard for updated metrics

Common Vulnerability Scenarios

Scenario 1: Base OS Package Vulnerabilities (Debian/Alpine)

Example: CVE-2024-37371 (Kerberos vulnerability in Debian base image)

Root Cause: Outdated base image layer

Remediation:

# Update base image to latest patch version
FROM debian:12.8-slim # Instead of debian:11-slim

# Or switch to distroless for minimal attack surface
FROM gcr.io/distroless/base-debian12

Scenario 2: Go/Python Library Vulnerabilities

Example: CVE-2024-45337 (golang.org/x/crypto SSH vulnerability)

Root Cause: Outdated Go module dependencies

Remediation:

# Update Go dependencies
go get -u golang.org/x/crypto@latest
go mod tidy

# Rebuild application
docker build -t <image>:<new-tag> .

Scenario 3: Exposed Secrets in Container Images

CRITICAL - IMMEDIATE ACTION REQUIRED

Example: AWS credentials, API keys, passwords in image layers

Remediation:

  1. Immediately rotate exposed credentials

  2. Remove secret from image:

    # Use Kubernetes Secrets instead
    kubectl create secret generic <name> --from-literal=api-key=<value>

    # Mount as environment variable
    # Mount as environment variable
    env:
    - name: API_KEY
      valueFrom:
        secretKeyRef:
          name: <name>
          key: api-key
  3. Rebuild image without secrets

  4. Review git history - ensure secrets never committed to source

Scenario 4: Third-Party Helm Chart Vulnerabilities

Example: Synology CSI driver with 5 CRITICAL vulnerabilities

Remediation:

# Check for chart updates
helm search repo synology-csi --versions

# If no update available, check upstream GitHub
# File issue: https://github.com/SynologyOpenSource/synology-csi/issues

# Temporary mitigation:
# - Apply network policies to restrict CSI pod access
# - Monitor for suspicious activity in CSI pods

Vulnerability Remediation Priorities

Priority 1: CRITICAL - Act within 24 hours

  • Exposed secrets in images
  • Remotely exploitable RCE vulnerabilities
  • Privilege escalation in cluster-facing components (API server, kubelet)

Priority 2: HIGH - Act within 1 week

  • High severity vulnerabilities in internet-facing services
  • Container escape vulnerabilities
  • Authentication bypass in exposed services

Priority 3: MEDIUM - Act within 1 month

  • Medium severity vulnerabilities with no known exploits
  • Vulnerabilities in internal-only services
  • Denial of Service vulnerabilities

Priority 4: LOW - Best effort

  • Low severity vulnerabilities
  • Vulnerabilities in unused code paths
  • Informational findings

Compliance and Reporting

Weekly Vulnerability Review

# Generate weekly vulnerability summary
kubectl get vulnerabilityreports -A -o json | \
jq -r '["NAMESPACE","RESOURCE","CRITICAL","HIGH","MEDIUM"],
(.items[] | [
.metadata.namespace,
.metadata.labels."trivy-operator.resource.name",
(.report.summary.criticalCount // 0),
(.report.summary.highCount // 0),
(.report.summary.mediumCount // 0)
]) | @tsv' | column -t

# Track remediation progress
# Compare against previous week's counts
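
The "compare against previous week" step can be sketched by saving each week's summary to a dated TSV and diffing the totals with awk (the snapshots below are mocked; real files would come from the command above):

```shell
# Mock two weekly TSV snapshots (namespace, resource, critical, high)
printf 'monitoring\tgrafana\t2\t6\n' > /tmp/vulns-prev.tsv
printf 'monitoring\tgrafana\t0\t4\n' > /tmp/vulns-curr.tsv

# Sum the CRITICAL column of each snapshot and report the delta.
prev=$(awk -F'\t' '{s += $3} END {print s + 0}' /tmp/vulns-prev.tsv)
curr=$(awk -F'\t' '{s += $3} END {print s + 0}' /tmp/vulns-curr.tsv)
echo "CRITICAL week-over-week: $prev -> $curr ($((curr - prev)))"
# Prints: CRITICAL week-over-week: 2 -> 0 (-2)
```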

Monthly Compliance Reports

# CIS Kubernetes Benchmark status
kubectl get clustercompliancereport k8s-cis-1.23 -o json | \
  jq '{
    title: .spec.title,
    passCount: (.status.summary.passCount // 0),
    failCount: (.status.summary.failCount // 0)
  }'

# NSA Kubernetes Hardening Guidance
kubectl get clustercompliancereport k8s-nsa-1.0 -o json | \
  jq '{
    title: .spec.title,
    passCount: (.status.summary.passCount // 0),
    failCount: (.status.summary.failCount // 0)
  }'

Preventive Measures

1. Image Selection Best Practices

  • Prefer official images: Use official vendor images (e.g., prom/prometheus, grafana/grafana)
  • Use minimal base images: Prefer alpine or distroless over full Debian/Ubuntu
  • Pin specific versions: Avoid :latest tag, use semantic versioning
  • Verify image signatures: Use Cosign for image signature verification

2. Continuous Scanning

Trivy automatically scans:

  • On deployment: New images scanned within minutes
  • Daily rescans: All images rescanned every 24 hours
  • Compliance checks: Daily CIS/NSA compliance assessment
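
The 24-hour rescan is driven by the report TTL: expired reports are deleted and regenerated on the next reconcile. A hedged values.yaml fragment (the key is `operator.scannerReportTTL` in recent trivy-operator Helm charts, but verify the name against your chart version):

```yaml
# trivy-operator Helm values sketch; confirm the key for your chart version.
operator:
  scannerReportTTL: "24h"   # reports older than this are deleted and rescanned
```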

3. Automated Updates

# Configure Renovate Bot for automated dependency updates
# .github/renovate.json
{
  "extends": ["config:base"],
  "kubernetes": {
    "fileMatch": ["manifests/.+\\.yaml$"]
  },
  "helm-values": {
    "fileMatch": ["manifests/base/.+/values\\.yaml$"]
  }
}

Troubleshooting

Trivy Scan Failures

# Check Trivy Operator logs
kubectl logs -n trivy-system deployment/trivy-operator --tail=100

# Check Trivy server logs
kubectl logs -n trivy-system statefulset/trivy-server --tail=100

# Manually trigger scan
kubectl delete vulnerabilityreport -n <namespace> <report>

False Positives

If Trivy reports a vulnerability that doesn't apply:

  1. Verify the finding:

    kubectl get vulnerabilityreport <name> -o json | \
    jq '.report.vulnerabilities[] | select(.vulnerabilityID == "CVE-YYYY-XXXXX")'
  2. Check if package is actually used:

    # Exec into container and verify
    kubectl exec -n <namespace> <pod> -- dpkg -l | grep <package>
  3. Create exception if confirmed false positive:

    # Add to trivy-operator values.yaml
    trivyOperator:
      ignoreUnfixed: true
      ignoreVulnerabilities:
        - CVE-YYYY-XXXXX  # Document reason

Resources

Current Cluster Status

Production Data (2026-01-12, post-Major Remediation Day):

| Severity | Count | Change from Initial (2026-01-05) | Remediation Impact |
|----------|-------|----------------------------------|--------------------|
| CRITICAL | 10 | ⬇️ -43 (-81%) | Multiple components remediated |
| HIGH | 332 | ⬇️ -422 (-56%) | Major sidecar and app updates |
| MEDIUM | 1,074 | ⬇️ -425 (-28%) | Cluster-wide improvements |
| TOTAL | ~1,416 | ⬇️ -890 (-39%) | Substantial security improvement |

Note: Major vulnerability remediation completed across multiple sessions. Only 10 CRITICAL vulnerabilities remain, all blocked on upstream vendor releases.

Vulnerability Trend: ⬇️ EXCELLENT (81% CRITICAL reduction achieved)

2026-01-11: Major Remediation Day Results

| Component | CRITICAL Before | CRITICAL After | HIGH Before | HIGH After |
|-----------|-----------------|----------------|-------------|------------|
| ArgoCD Redis | 3 | 0 ✅ | 34 | 0 ✅ |
| MetalLB FRR | 8 | 0 ✅ | 84 | 10 |
| Blackbox Exporter | 2 | 0 ✅ | 7 | 0 ✅ |
| SNMP Exporter | 2 | 0 ✅ | 6 | 0 ✅ |
| External-DNS | 1 | 0 ✅ | 7 | 1 |
| Snapshot Controller | 1 | 0 ✅ | 8 | 4 |
| CSI Snapshotter | 1 | 0 ✅ | 6 | 2 |

PRs Merged: #203, #205, #206, #207, #208, #209, #211, #212

2026-01-12: Synology CSI v1.2.1 Fix

  • Issue: v1.2.1 node plugin had iscsiadm mount regression
  • Solution: Added --chroot-dir=/host and --iscsiadm-path=/usr/sbin/iscsiadm flags
  • PR #216: Successfully deployed, all PVC mounts working
  • Impact: Node plugin now running v1.2.1 (it had previously been rolled back to v1.2.0 after the regression)

Note: CRITICAL count unchanged, as v1.2.0 and v1.2.1 share the same base-image vulnerabilities

Recent Remediation Actions:

  1. 2026-01-07 (evening): Promtail upgraded to 6.17.1 (app version 3.0.0 → 3.5.1)

    • CRITICAL: 7 → 0 (100% elimination) ✅
    • HIGH: 34 → 4 (88% reduction) ✅
    • Deployment: Rolling update, 3 minutes, zero downtime
    • Status: All 5 pods running successfully
    • Cluster impact: 43 → 38 CRITICAL
  2. 2026-01-07 (late evening): Synology CSI sidecars upgraded

    • csi-attacher: v4.0.0 → v4.10.0
    • csi-node-driver-registrar: v2.3.0 → v2.15.0
    • csi-snapshotter: v4.2.1 → v7.0.2
    • synology-csi (node): v1.2.0 → v1.2.1
    • Component CRITICAL: 13 → 11 (15% reduction, remaining in vendor base image)
    • Component HIGH: 163 → 49 (70% reduction) ✅
    • Deployment: Rolling updates across 3 StatefulSets/DaemonSets, all nodes updated successfully
    • Verification: All PVCs remain Bound, test volume provisioning successful
    • Cluster impact: 38 → 28 CRITICAL, 600 → 428 HIGH

The dramatic reduction in vulnerabilities demonstrates the effectiveness of targeted remediation of high-priority components.

Key CVEs to address:

  • CVE-2024-37371: Kerberos GSS (affects multiple base images) - High Priority
  • CVE-2024-41110: Docker/Moby authorization bypass - High Priority
  • CVE-2024-45337: Golang SSH vulnerability - Medium Priority
  • CVE-2024-24790: Golang net/netip issue - Medium Priority

Remediation Progress Tracking:

| Component | CRITICAL (Before → After) | HIGH (Before → After) | Remediation Status | Completion Date |
|-----------|---------------------------|------------------------|--------------------|-----------------|
| Promtail | 7 → 0 | 34 → 4 | 🟢 Completed | 2026-01-07 |
| Synology CSI Sidecars | 8 → 0 | 114 → 20 | 🟢 Completed | 2026-01-07 |
| ArgoCD Redis | 3 → 0 | 34 → 0 | 🟢 Completed | 2026-01-11 |
| MetalLB FRR | 8 → 0 | 84 → 10 | 🟢 Completed | 2026-01-11 |
| Blackbox Exporter | 2 → 0 | 7 → 0 | 🟢 Completed | 2026-01-11 |
| SNMP Exporter | 2 → 0 | 6 → 0 | 🟢 Completed | 2026-01-11 |
| External-DNS | 1 → 0 | 7 → 1 | 🟢 Completed | 2026-01-11 |
| Snapshot Controller | 1 → 0 | 8 → 4 | 🟢 Completed | 2026-01-11 |
| CSI Snapshotter | 1 → 0 | 6 → 2 | 🟢 Completed | 2026-01-11 |
| Synology CSI (base image) | 9 | 27 | 🔴 Blocked | Awaiting v1.2.2 |
| Trivy Server | 1 | 7 | 🔴 Blocked | Awaiting Alpine fix |

Remediation Results:

Promtail (2026-01-07 evening):

  • Version Upgrade: Helm chart 6.16.6 → 6.17.1 (app version 3.0.0 → 3.5.1)
  • CRITICAL Reduction: 100% (7 → 0) - All CRITICAL CVEs eliminated ✅
  • HIGH Reduction: 88% (34 → 4) - Reduced from 34 to just 4 HIGH CVEs ✅
  • Deployment: Rolling update completed successfully in 3 minutes
  • Verification: All 5 pods running, logs flowing, metrics available
  • Resource Usage: Memory 26-35Mi (well under 128Mi limit)
  • Cluster Impact: Cluster-wide CRITICAL count reduced from 43 → 38

Synology CSI Sidecars (2026-01-07 late evening):

  • Component Upgrades (Final State):

    • ✅ csi-attacher: v4.0.0 → v4.10.0
    • ✅ csi-node-driver-registrar: v2.3.0 → v2.15.0
    • ✅ csi-snapshotter: v4.2.1 → v7.0.2
    • ⚠️ synology-csi (node): v1.2.0 → v1.2.1 → v1.2.0 (ROLLED BACK)
    • ✅ synology-csi (controller/snapshotter): v1.2.1 (unchanged)
  • Issue Encountered:

    • After upgrading synology-csi node plugin to v1.2.1, new iSCSI volume mounts failed with:

      env: can't execute 'iscsiadm': No such file or directory (exit status 127)
    • Grafana pod unable to start (stuck mounting PVC)

    • Existing PVCs mounted before upgrade remained functional

    • Root cause: v1.2.1 container regression - cannot find iscsiadm on host for new mounts

    • Resolution: Hotfix PR #201 rolled back node plugin to v1.2.0, Grafana restored successfully

  • CRITICAL Reduction: 15% (13 → 11) - Upgraded sidecars now 0 CRITICAL, remaining 11 in vendor base image

  • HIGH Reduction: 70% (163 → 49) - Major reduction in sidecar vulnerabilities ✅

  • Deployment: Partial upgrade successful:

    • ✅ Controller StatefulSet: 3 sidecar containers updated (csi-attacher, csi-provisioner, csi-resizer)
    • ✅ Node DaemonSet: csi-node-driver-registrar v2.15.0, synology-csi v1.2.0 (rolled back)
    • ✅ Snapshotter StatefulSet: csi-snapshotter v7.0.2
  • Verification:

    • All 9 CSI pods Running successfully with rollback
    • All 4 PVCs Bound and accessible (Prometheus 50Gi, Grafana 5Gi, Loki 20Gi, Trivy 5Gi)
    • Grafana pod fully operational after rollback
    • CSI driver registration confirmed
  • Cluster Impact: Cluster-wide CRITICAL reduced 38 → 28, HIGH reduced 600 → 428

  • Resolved Issue (2026-01-12): synology-csi v1.2.1 node plugin iscsiadm regression fixed by adding --chroot-dir=/host and --iscsiadm-path=/usr/sbin/iscsiadm flags (PR #216). All nodes now running v1.2.1 with working iSCSI mounts.

Next Steps:

  • Trivy Operator deployed and operational
  • Monitoring and alerting confirmed working
  • Review and update Promtail to latest version (Priority 1): ✅ COMPLETED 2026-01-07
  • Check for Synology CSI driver updates (Priority 1): ✅ COMPLETED 2026-01-12
    • Sidecars successfully upgraded (csi-attacher, csi-node-driver-registrar, csi-snapshotter)
    • Node plugin upgraded to v1.2.1 with iscsiadm-path fix (PR #216)
    • Remaining 9 CRITICAL in base image - awaiting upstream v1.2.2 release
  • Review and update ArgoCD Redis (Priority 2): ✅ COMPLETED 2026-01-11
    • Chart upgraded 9.0.5 → 9.2.4, Redis 8.2.2-alpine
    • All 3 CRITICAL and 34 HIGH vulnerabilities eliminated
  • Major Remediation Day (2026-01-11): ✅ COMPLETED
    • MetalLB, Blackbox Exporter, SNMP Exporter, External-DNS upgraded
    • Snapshot Controller and CSI Snapshotter upgraded to v8.x
    • 18 CRITICAL vulnerabilities eliminated in single day
  • Monitor upstream releases:
    • Synology CSI v1.2.2 (fixes 9 CRITICAL) - check GitHub releases
    • Trivy Server Alpine base image update (fixes 1 CRITICAL)
  • Implement automated update pipeline (Renovate Bot)