Blackbox Exporter
Overview
Blackbox Exporter is a Prometheus exporter that allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, and ICMP protocols. It's used for monitoring external service availability, SSL certificate expiry, network latency, and overall connectivity.
Key Features
- Multi-Protocol Probing: HTTP, HTTPS, DNS, TCP, and ICMP
- SSL/TLS Monitoring: Certificate expiry tracking and TLS version validation
- Response Time Metrics: Latency and performance monitoring
- Flexible Configuration: Customizable probe modules for different use cases
- Comprehensive Alerts: Automatic alerting for service downtime, certificate expiry, and degraded performance
Use Cases in This Cluster
- Internal Service Monitoring: HTTP/HTTPS probes for ArgoCD and Grafana
- External Connectivity: Monitoring internet connectivity via Google and Cloudflare
- DNS Health: Query monitoring for gateway, Cloudflare, and Google DNS servers
- Certificate Expiry: SSL certificate monitoring for all external HTTPS endpoints
- Infrastructure Availability: ICMP ping monitoring for NAS, gateway, and critical infrastructure
Architecture
┌─────────────────┐
│ Prometheus │
│ │
│ (Scrapes │
│ Blackbox │
│ Exporter) │
└────────┬────────┘
│
│ HTTP GET /probe?target=X&module=Y
│
▼
┌─────────────────────────────┐
│ Blackbox Exporter │
│ │
│ - HTTP/HTTPS Prober │
│ - DNS Prober │
│ - ICMP Prober │
│ - TCP Prober │
└──────────┬──────────────────┘
│
│ Probe Target
│
▼
┌──────────┐
│ Target │
│ Endpoint │
└──────────┘
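The flow above boils down to URL construction: Prometheus scrapes the exporter's /probe endpoint, passing the real target and module as query parameters. A small illustrative sketch (the exporter address and module name come from this cluster's configuration; the helper function itself is hypothetical):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_probe_url(exporter: str, target: str, module: str) -> str:
    """Build the /probe URL Prometheus uses to ask the exporter to probe a target."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/probe?{query}"

url = build_probe_url("blackbox-exporter:9115",
                      "https://grafana.k8s.n37.ca",
                      "https_cert_expiry")
print(url)
```

The key point is that the exporter, not Prometheus, performs the actual probe; Prometheus only fetches the resulting metrics.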
Deployment Details
Container Image
- Image: prom/blackbox-exporter:v0.28.0
- Port: 9115 (HTTP metrics and probes)
- Probes: Liveness and readiness checks on /health
Upgraded from v0.25.0 to v0.28.0 to address 2 CRITICAL and 7 HIGH vulnerabilities; the current image scan reports 0 known vulnerabilities.
Resource Allocation
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 128Mi
Security Context
- RunAsNonRoot: true
- RunAsUser: 65534 (nobody)
- ReadOnlyRootFilesystem: true
- Capabilities: NET_RAW (required for ICMP probes)
Configuration
Probe Modules
The Blackbox Exporter is configured with multiple probe modules for different monitoring scenarios:
HTTP/HTTPS Probes
http_2xx - Basic HTTP probe
prober: http
timeout: 5s
http:
valid_status_codes: [] # 2xx
method: GET
follow_redirects: true
https_cert_expiry - HTTPS with certificate validation
prober: http
timeout: 5s
http:
method: GET
fail_if_not_ssl: true
tls_config:
insecure_skip_verify: false
DNS Probe
dns_query - DNS resolution monitoring
prober: dns
timeout: 5s
dns:
query_name: "kubernetes.default.svc.cluster.local"
query_type: "A"
ICMP Probe
icmp_ping - Network connectivity via ping
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
TCP Probe
tcp_connect - TCP port connectivity
prober: tcp
timeout: 5s
Monitored Targets
Internal Services (HTTP)
- http://argocd-server.argocd:80 - ArgoCD server
External Services (HTTPS)
- https://argocd.k8s.n37.ca - ArgoCD external access
- https://grafana.k8s.n37.ca - Grafana dashboards
- https://google.com - External connectivity test
- https://cloudflare.com - External connectivity test
DNS Servers
- 10.0.1.1:53 - Gateway DNS
- 1.1.1.1:53 - Cloudflare DNS
- 8.8.8.8:53 - Google DNS
ICMP Targets
- 10.0.1.204 - Synology NAS
- 10.0.1.1 - Network gateway
- 8.8.8.8 - Google DNS (connectivity test)
Prometheus Integration
Scrape Configuration
The Blackbox Exporter is configured in Prometheus via additionalScrapeConfigs with four separate jobs:
- blackbox-http: HTTP endpoint monitoring (30s interval)
- blackbox-https: HTTPS with cert expiry (60s interval)
- blackbox-dns: DNS query monitoring (30s interval)
- blackbox-icmp: ICMP ping monitoring (30s interval)
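A representative job sketch for one of these entries, assuming typical Blackbox Exporter scrape-config conventions (the exact job bodies live in values.yaml; the targets shown here mirror the list above). Note that `metrics_path` and `params` are what select the /probe endpoint and the probe module:

```yaml
- job_name: 'blackbox-https'
  metrics_path: /probe
  scrape_interval: 60s
  params:
    module: [https_cert_expiry]
  static_configs:
    - targets:
        - https://argocd.k8s.n37.ca
        - https://grafana.k8s.n37.ca
```

On its own this job would scrape the targets directly; the relabeling covered in the next subsection is what redirects the scrape to the exporter.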
Relabeling Configuration
Each scrape job uses relabeling to properly set target labels:
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
This configuration:
- Moves the target address to a query parameter
- Sets the instance label to the target
- Directs Prometheus to scrape the Blackbox Exporter
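The three rules above can be modeled with a minimal Python sketch. This implements only the default `replace` action (no regex handling), which is all these rules use; the helper function is illustrative, not part of Prometheus:

```python
def apply_relabel(labels: dict, configs: list) -> dict:
    """Apply a minimal subset of relabel_configs semantics: the default
    'replace' action, with source label values joined by ';'."""
    out = dict(labels)
    for cfg in configs:
        # Without an explicit replacement, the joined source value is used.
        source = ";".join(out.get(name, "") for name in cfg.get("source_labels", []))
        out[cfg["target_label"]] = cfg.get("replacement", source)
    return out

relabel_configs = [
    {"source_labels": ["__address__"], "target_label": "__param_target"},
    {"source_labels": ["__param_target"], "target_label": "instance"},
    {"target_label": "__address__", "replacement": "blackbox-exporter:9115"},
]

result = apply_relabel({"__address__": "https://grafana.k8s.n37.ca"}, relabel_configs)
```

After relabeling, `__address__` points at the exporter while the original target survives as both the `target` query parameter and the `instance` label.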
Alerting
Alert Rules
Comprehensive alerting is configured via PrometheusRule:
Service Availability
| Alert Name | Condition | Duration | Severity |
|---|---|---|---|
| EndpointDown | probe_success == 0 | 5 minutes | Critical |
| EndpointDegraded | probe_success == 0 | 1 minute | Warning |
SSL Certificates
| Alert Name | Condition | Duration | Severity |
|---|---|---|---|
| SSLCertificateExpiresIn30Days | Expires in < 30 days | 1 hour | Warning |
| SSLCertificateExpiresIn7Days | Expires in < 7 days | 1 hour | Critical |
| SSLCertificateExpired | Expired certificate | 5 minutes | Critical |
| TLSVersionTooOld | TLS 1.0 or 1.1 | 1 hour | Warning |
Performance
| Alert Name | Condition | Duration | Severity |
|---|---|---|---|
| HighHTTPResponseTime | Response time > 5s | 5 minutes | Warning |
| VeryHighHTTPResponseTime | Response time > 10s | 2 minutes | Critical |
| HighDNSResponseTime | DNS lookup > 1s | 5 minutes | Warning |
| HighICMPLatency | Ping latency > 100ms | 5 minutes | Warning |
DNS & Network
| Alert Name | Condition | Duration | Severity |
|---|---|---|---|
| DNSQueryFailed | DNS probe fails | 5 minutes | Critical |
| HostUnreachable | ICMP probe fails | 5 minutes | Critical |
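As a sketch of how one of these rules could be expressed in the PrometheusRule resource (the group name and annotation text here are illustrative assumptions; the alert name, expression, duration, and severity match the table above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blackbox-exporter-alerts
spec:
  groups:
    - name: blackbox.availability
      rules:
        - alert: EndpointDown
          expr: probe_success == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Endpoint {{ $labels.instance }} is down"
```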
Key Metrics
Probe Success
- probe_success: 1 if the probe succeeded, 0 if it failed
- probe_duration_seconds: Total probe duration
HTTP Metrics
- probe_http_status_code: HTTP status code returned
- probe_http_duration_seconds: HTTP request duration by phase
- probe_http_redirects: Number of redirects followed
- probe_http_ssl: 1 if SSL was used
SSL/TLS Metrics
- probe_ssl_earliest_cert_expiry: Unix timestamp of certificate expiry
- probe_tls_version_info: TLS version used (1.0, 1.1, 1.2, 1.3)
DNS Metrics
- probe_dns_lookup_time_seconds: DNS query duration
- probe_dns_answer_rrs: Number of DNS answer records
ICMP Metrics
- probe_icmp_duration_seconds: ICMP round-trip time
- probe_icmp_reply_hop_limit: TTL of ICMP reply
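The /probe endpoint returns these metrics in the flat Prometheus text exposition format. A minimal, illustrative parser (it skips HELP/TYPE comments and labeled series; the sample values below are made up):

```python
def parse_probe_metrics(text: str) -> dict:
    """Parse flat Prometheus exposition text into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.partition(" ")
        if "{" not in name:  # ignore labeled series for this sketch
            metrics[name] = float(value)
    return metrics

sample = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_duration_seconds 0.123
probe_http_status_code 200
"""
m = parse_probe_metrics(sample)
```

This is handy for eyeballing a manual probe (see the debug commands below) without waiting for a Prometheus scrape.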
Useful Queries
Service Availability
# Current status of all probes
probe_success
# Failed probes
probe_success == 0
# Uptime percentage (last 24h)
avg_over_time(probe_success[24h]) * 100
SSL Certificate Expiry
# Days until certificate expires
(probe_ssl_earliest_cert_expiry - time()) / 86400
# Certificates expiring in < 30 days
(probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
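The same arithmetic in Python, to make the units explicit: probe_ssl_earliest_cert_expiry is a Unix timestamp, so dividing the difference by 86400 seconds yields days. The helper names and severity thresholds below follow the alert table earlier in this document; the functions themselves are illustrative:

```python
import time

def days_until_expiry(expiry_unix: float, now: float = None) -> float:
    """Mirror the PromQL (probe_ssl_earliest_cert_expiry - time()) / 86400."""
    now = time.time() if now is None else now
    return (expiry_unix - now) / 86400

def severity(days: float) -> str:
    """Classify per the SSL alert thresholds: 30-day warning, 7-day critical."""
    if days < 0:
        return "expired"
    if days < 7:
        return "critical"
    if days < 30:
        return "warning"
    return "ok"

now = 1_700_000_000  # fixed "now" so the example is deterministic
days = days_until_expiry(now + 45 * 86400, now=now)
```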
Response Times
# HTTP response times
probe_http_duration_seconds
# DNS query times
probe_dns_lookup_time_seconds
# ICMP ping latency
probe_icmp_duration_seconds
# 95th percentile of total probe duration (last 1h).
# probe_http_duration_seconds is a per-phase gauge, not a histogram,
# so quantile_over_time is used rather than histogram_quantile.
quantile_over_time(0.95, probe_duration_seconds{job="blackbox-https"}[1h])
Connectivity Health
# ICMP packet loss rate
1 - avg_over_time(probe_success{job="blackbox-icmp"}[5m])
# DNS failure rate
1 - avg_over_time(probe_success{job="blackbox-dns"}[5m])
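Since probe_success is always 0 or 1, these failure-rate queries are just "one minus the mean" over the window. A deterministic Python equivalent over raw samples (the function name is illustrative):

```python
def loss_rate(samples: list) -> float:
    """Equivalent of 1 - avg_over_time(probe_success[window]) over raw 0/1 samples."""
    return 1 - sum(samples) / len(samples)

# One failed probe out of five in the window -> 20% loss
rate = loss_rate([1, 1, 1, 0, 1])
```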
Grafana Dashboards
Recommended Dashboards
- Prometheus Blackbox Exporter (ID: 7587)
  - Overview of all probes
  - Success rates and response times
  - SSL certificate status
- Blackbox Exporter SSL/TLS (ID: 13659)
  - Certificate expiry tracking
  - TLS version distribution
  - Certificate chain details
Custom Dashboard Panels
Service Availability Timeline
probe_success{job=~"blackbox.*"}
Certificate Expiry (Days)
(probe_ssl_earliest_cert_expiry - time()) / 86400
Response Time Heatmap
probe_http_duration_seconds
Troubleshooting
Common Issues
Probes Failing
Problem: probe_success == 0 for a target
Solutions:
1. Check that the target is reachable from the cluster:
   kubectl exec -it deployment/blackbox-exporter -- wget -O- http://target
2. Verify DNS resolution:
   kubectl exec -it deployment/blackbox-exporter -- nslookup target
3. Check the probe configuration and logs:
   kubectl logs deployment/blackbox-exporter
ICMP Probes Not Working
Problem: ICMP probes fail with permission errors or are blocked by network policy
Solutions:
- Verify the deployment has the NET_RAW capability:
  kubectl get deployment blackbox-exporter -o yaml | grep -A5 capabilities
- Verify the Calico NetworkPolicy allows ICMP egress:
kubectl get networkpolicy.p.projectcalico.org -n default allow-egress-icmp-calico -o yaml
Kubernetes NetworkPolicies only support TCP, UDP, and SCTP protocols — not ICMP. A Calico NetworkPolicy (allow-egress-icmp-calico) is required to permit ICMP egress from the blackbox-exporter pod. The policy uses Calico selector syntax (app == 'blackbox-exporter') and restricts ICMP to the exact probe targets: 10.0.1.1/32 (gateway), 10.0.1.204/32 (NAS), 8.8.8.8/32 (Google DNS).
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
name: allow-egress-icmp-calico
namespace: default
spec:
selector: app == 'blackbox-exporter'
types:
- Egress
egress:
- action: Allow
destination:
nets:
- 10.0.1.1/32
- 10.0.1.204/32
- 8.8.8.8/32
protocol: ICMP
SSL Certificate Warnings
Problem: SSL certificate metrics not appearing
Solution: Ensure the https_cert_expiry module is used for HTTPS targets, not http_2xx
High Memory Usage
Problem: Blackbox exporter consuming excessive memory
Solutions:
- Reduce probe frequency in additionalScrapeConfigs
- Limit the number of targets
- Increase memory limits if justified
Debug Commands
# Check deployment status
kubectl get deployment blackbox-exporter
# View logs
kubectl logs deployment/blackbox-exporter
# Test a probe manually
kubectl exec -it deployment/blackbox-exporter -- wget -O- \
'http://localhost:9115/probe?target=https://google.com&module=https_cert_expiry'
# View current configuration
kubectl get configmap blackbox-exporter-config -o yaml
# Check Prometheus targets
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
# Navigate to: http://localhost:9090/targets
Hairpin NAT Issues (Internal HTTPS Probes)
Problem: Probes to internal services via external DNS names (e.g., https://grafana.k8s.n37.ca) fail with "context deadline exceeded"
Symptoms:
Get "https://10.0.10.10": context deadline exceeded
Why it happens: Pods inside the cluster cannot reach external IPs (MetalLB LoadBalancer IPs) that route back to the same cluster. This is called "hairpin NAT" and causes connection timeouts due to NAT asymmetry.
Solution: Use hostAliases in the deployment to resolve external hostnames directly to the ingress ClusterIP:
# In blackbox-exporter-deployment.yaml
spec:
template:
spec:
hostAliases:
- ip: "10.98.168.24" # ingress-nginx ClusterIP
hostnames:
- "argocd.k8s.n37.ca"
- "grafana.k8s.n37.ca"
- "workflows.k8s.n37.ca"
Get the ingress ClusterIP:
kubectl get svc -n ingress-nginx ingress-nginx-controller -o jsonpath='{.spec.clusterIP}'
The ClusterIP is stable unless the ingress-nginx service is deleted and recreated. If probes start failing after ingress-nginx changes, check if the ClusterIP has changed.
This fix was implemented to resolve certificate health check failures for internal HTTPS endpoints.
Maintenance
Adding New Targets
1. Edit manifests/base/kube-prometheus-stack/values.yaml
2. Add the target to the appropriate additionalScrapeConfigs job:
   - job_name: 'blackbox-https'
     static_configs:
       - targets:
           - https://new-service.k8s.n37.ca
3. Commit and let ArgoCD sync
Updating Probe Modules
1. Edit manifests/base/kube-prometheus-stack/blackbox-exporter-configmap.yaml
2. Modify or add probe modules as needed
3. Restart the deployment:
   kubectl rollout restart deployment/blackbox-exporter
Certificate Monitoring Best Practices
- Monitor certificates at least 30 days before expiry
- Set up critical alerts for 7-day threshold
- Verify cert-manager is renewing certificates automatically
- Test alert notifications regularly
References
- Official Documentation: Blackbox Exporter GitHub
- Probe Configuration: Configuration Guide
- Example Configurations: Examples
- Grafana Dashboards: Dashboard Gallery
Related Documentation
- Kube Prometheus Stack - Core monitoring stack
- SNMP Exporter - Synology NAS monitoring
- Cert Manager - Automatic certificate management
- Monitoring Overview - Complete monitoring architecture