Loki + Promtail
Grafana Loki with Promtail provides centralized log aggregation and querying for all Kubernetes pods across the 5-node Raspberry Pi cluster.
Overview
Loki:
- Namespace: loki
- Helm Chart: grafana/loki
- Chart Version: 6.53.0
- Deployment Mode: SingleBinary (monolithic)
- Deployment: Managed by ArgoCD
- Sync Wave: -12 (after kube-prometheus-stack -15, before cert-manager -10)
Promtail:
- Namespace: loki
- Helm Chart: grafana/promtail
- Chart Version: 6.17.1 (app version 3.5.1)
- Deployment: DaemonSet (one pod per node)
- Sync Wave: -11 (after Loki -12)
Note: Promtail upgraded from 6.16.6 → 6.17.1 on 2026-01-07, eliminating 7 CRITICAL vulnerabilities.
Architecture
┌─────────────────────────────────────────┐
│ Promtail DaemonSet (5 pods) │
│ - One pod per Pi node │
│ - Collects logs from /var/log/pods/ │
│ - 50m/100m CPU, 64Mi/128Mi memory each │
└────────────────┬────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Loki SingleBinary (1 pod) │
│ - Storage: 20Gi PVC on Synology │
│ - Retention: 7 days (168h) │
│ - 200m/500m CPU, 384Mi/768Mi memory │
└────────────────┬────────────────────────┘
↓
┌─────────────────────────────────────────┐
│ Grafana (auto-discovered datasource) │
│ - Query logs via LogQL │
│ - Dashboards and exploration │
└─────────────────────────────────────────┘
Components
Loki SingleBinary
Purpose: Centralized log storage, querying, and compaction
- Replicas: 1
- Storage: 20Gi PVC on synology-iscsi-retain
- Retention: 7 days (168h) with automatic compaction
- Schema: v13 (TSDB store with filesystem backend)
- Resources:
  - Requests: 200m CPU, 512Mi memory
  - Limits: 500m CPU, 768Mi memory
- Memory Management:
  - GOMEMLIMIT: 700MiB (Go runtime memory cap)
  - Ingestion rate limits: 10MB/sec, 20MB burst
  - Internal caching only (no external memcached)
- Sidecar (k8s-sidecar): 10m/64Mi request, 100m/256Mi limit
- Loki Canary: 10m/32Mi request, 50m/64Mi limit
Note: Memory configuration optimized on 2026-01-05 for singleBinary mode stability. Uses internal caching instead of external memcached to prevent distributed mode activation conflicts.
The k8s-sidecar container (loki-sc-rules) watches ConfigMaps across many namespaces and needs 256Mi+ memory. The original 64Mi limit caused OOMKills. Loki canary resources are set via the top-level lokiCanary.resources key (not under monitoring.selfMonitoring.lokiCanary).
Service Endpoints:
- Internal: loki.loki.svc.cluster.local:3100
- Push API: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push
- Query API: http://loki.loki.svc.cluster.local:3100/loki/api/v1/query
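A quick way to confirm these endpoints respond is to hit them from a throwaway pod (a minimal sketch; the pod name is arbitrary and any image with wget works):

# List all label names Loki has indexed
kubectl run loki-api-test --rm -it --image=busybox --restart=Never -- \
  wget -qO- http://loki.loki.svc.cluster.local:3100/loki/api/v1/labels

# List the values seen for the namespace label
kubectl run loki-api-test --rm -it --image=busybox --restart=Never -- \
  wget -qO- http://loki.loki.svc.cluster.local:3100/loki/api/v1/label/namespace/values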
Key Features:
- TSDB Index: Modern time-series database index for efficient queries
- Compaction: Runs every 10 minutes to reduce storage usage
- Retention: Automatically deletes logs older than 7 days
- Filesystem Storage: Uses PVC (no object storage required)
Promtail DaemonSet
Purpose: Log collection agent running on all nodes
- Replicas: 5 (one DaemonSet pod per node, including control-plane)
- Host Access: hostPID: true, hostNetwork: false
- Log Sources: /var/log/pods/, /var/log/containers/
- Tolerations: Configured to run on control-plane node
- Resources per pod:
  - Requests: 50m CPU, 64Mi memory
  - Limits: 100m CPU, 128Mi memory
- Metrics: ServiceMonitor enabled for Prometheus scraping
Total Resource Impact:
- CPU Requests: 450m total (Loki 200m + 5 × Promtail 50m)
- Memory Requests: 704Mi total (Loki 384Mi + 5 × Promtail 64Mi)
- CPU Limits: 1000m total (Loki 500m + 5 × Promtail 100m)
- Memory Limits: ~1.4Gi total (Loki 768Mi + 5 × Promtail 128Mi)
Log Labeling: Promtail automatically adds these labels to all logs:
- namespace - Kubernetes namespace
- pod - Pod name
- container - Container name
- node - Node name (useful for Pi cluster debugging)
Loki Label Limit (15 Labels Maximum)
Loki has a default limit of 15 labels per log stream. Kubernetes pods with many labels (especially Istio, Helm-deployed apps) can exceed this limit.
Symptoms:
entry for stream '{...17 labels...}' has 17 label names; limit 15
Common high-label-count sources:
- Istio pods (ztunnel, istio-cni-node): 17+ labels
- Helm-deployed StatefulSets: 16+ labels
- Apps with many app.kubernetes.io/* labels
Solution - Selective labelmap:
Don't use labeldrop after labelmap - it doesn't work because relabel_configs are processed against the original label set, not the transformed labels.
Instead, use a selective labelmap regex to only capture essential labels:
# In promtail values.yaml scrapeConfigs
relabel_configs:
  # Fixed labels (always included)
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  - source_labels: [__meta_kubernetes_pod_container_name]
    target_label: container
  - source_labels: [__meta_kubernetes_pod_node_name]
    target_label: node
  # Selective labelmap - only capture useful labels
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(app|app_kubernetes_io_name|app_kubernetes_io_instance|app_kubernetes_io_component|app_kubernetes_io_part_of|k8s_app|service_name)
Label count after fix:
- Fixed: namespace, pod, container, node (4)
- Mapped: up to 7 essential labels (only if present)
- Auto: filename, stream (2)
- Total: max 13 labels (under 15 limit)
This fix was implemented after adding Istio Ambient mesh, which added many labels to pods.
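To confirm the fix took effect, check that Promtail has stopped logging the stream-rejection errors (a quick sketch using the error text from the Symptoms above):

# After Promtail restarts with the new relabel_configs, this should return nothing
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=200 | grep "label names; limit"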
Control-Plane Scheduling: Promtail includes a toleration to run on the control-plane node:
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
This enables log collection from critical control plane components:
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- etcd
- CoreDNS
Important Notes:
- Uses hostNetwork: false to avoid Calico CNI routing issues
- Learned from control plane monitoring troubleshooting
- Runs on ALL 5 nodes (4 workers + 1 control-plane)
Storage
PVC Configuration
- Size: 20Gi
- Storage Class: synology-iscsi-retain
- Access Mode: ReadWriteOnce
- Backend: Synology DS925+ NAS via iSCSI
Storage Calculation
Expected usage for the 5-node cluster:
- ~250 pods total (5 nodes × ~50 pods)
- ~1.8 GB/day compressed logs
- 7-day retention = ~12.6 GB
- 20Gi provides ~50% buffer for growth
Expanding Storage
If storage fills up, you can expand the PVC online:
# Edit PVC to increase size
kubectl edit pvc loki-chunks-loki-0 -n loki
# Change spec.resources.requests.storage to new size
# Synology CSI supports online expansion
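The same expansion can be done non-interactively (a sketch; the 30Gi target size is only an example):

kubectl patch pvc loki-chunks-loki-0 -n loki --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"30Gi"}}}}'
# Watch the resize complete
kubectl get pvc loki-chunks-loki-0 -n loki -w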
Retention Configuration
Loki is configured with automatic log retention:
loki:
  limits_config:
    retention_period: 168h  # 7 days
  compactor:
    retention_enabled: true
    compaction_interval: 10m
How it works:
- Compactor runs every 10 minutes
- Identifies log chunks older than 7 days
- Automatically deletes expired chunks
- Reduces storage usage and improves query performance
Adjusting Retention:
To change retention period, edit manifests/base/loki/values.yaml:
limits_config:
  retention_period: 120h  # 5 days (example)
Grafana Integration
Datasource Configuration
Loki datasource is automatically discovered by Grafana via ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-grafana-datasource
  namespace: loki
  labels:
    grafana_datasource: "1"  # Auto-discovery label
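For reference, the full ConfigMap carries the datasource definition in its data section, roughly like this (a sketch; the data key name and field values are illustrative, the actual file is manifests/base/loki/loki-datasource.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-grafana-datasource
  namespace: loki
  labels:
    grafana_datasource: "1"
data:
  loki-datasource.yaml: |-
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki.loki.svc.cluster.local:3100
        isDefault: false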
The Grafana sidecar automatically:
- Watches for ConfigMaps with label grafana_datasource: "1"
- Loads datasource configuration
- Makes Loki available in Grafana Explore and dashboards
No manual configuration needed!
Accessing Grafana
# Port-forward to Grafana
kubectl port-forward -n default svc/kube-prometheus-stack-grafana 3000:80
# Open browser
open http://localhost:3000
# Login credentials
username: admin
password: $(kubectl get secret kube-prometheus-stack-grafana -n default \
-o jsonpath="{.data.admin-password}" | base64 -d)
Using Loki in Grafana
- Navigate to Explore (compass icon)
- Select Loki from datasource dropdown
- Enter LogQL query
- Click Run query
LogQL Query Examples
Basic Queries
# All logs from default namespace
{namespace="default"}
# All logs from specific pod
{pod="loki-0"}
# All logs from specific node
{node="node04"}
# All logs from Loki namespace
{namespace="loki"}
Filtering Logs
# Filter for errors in default namespace
{namespace="default"} |= "error"
# Filter for errors OR fatal
{namespace="default"} |= "error" or "fatal"
# Exclude info logs
{namespace="default"} != "info"
# Case-insensitive search
{namespace="default"} |~ "(?i)error"
Pattern Matching
# All Prometheus pods
{pod=~"prometheus.*"}
# All kube-system errors
{namespace="kube-system"} |= "error"
# Critical system errors
{namespace="kube-system"} |= "error" |= "fatal"
# Application errors (exclude system namespaces)
{namespace!~"kube-.*"} |= "error"
Advanced Queries
# Count error rate per minute
sum(rate({namespace="default"} |= "error" [1m])) by (pod)
# Top 10 pods by log volume
topk(10, sum(rate({namespace=~".+"} [5m])) by (pod))
# Error log count over the last hour
count_over_time({namespace="default"} |= "error" [1h])
Common Use Cases
Troubleshooting Pod Crashes
# View logs from crashed pod
{namespace="default", pod="myapp-xyz"}
# Find errors before crash
{namespace="default", pod="myapp-xyz"} |= "error" or "fatal" or "panic"
Monitoring Deployments
# Watch logs during deployment
{namespace="default", pod=~"myapp-.*"} |= "Started" or "Ready" or "Error"
Debugging Network Issues
# Find connection errors
{namespace=~".+"} |= "connection refused" or "timeout" or "network"
Checking Control Plane Health
# API server errors
{namespace="kube-system", pod=~"kube-apiserver.*"} |= "error"
# CoreDNS issues
{namespace="kube-system", pod=~"coredns.*"} |= "error" or "timeout"
# Node-level issues
{namespace="kube-system"} |= "node" |= "not ready" or "evicted"
Grafana Dashboards
Community Dashboards
Import these dashboards in Grafana (Dashboard → Import):
| Dashboard ID | Name | Description |
|---|---|---|
| 12611 | Loki Dashboard | Quick search and log volume overview |
| 13639 | Logs / App | Application log analysis |
| 13407 | Kubernetes Logs | Kubernetes-specific log patterns |
Custom Dashboard Example
Create a custom dashboard to monitor errors across namespaces:
- Create new dashboard
- Add panel with query: sum(rate({namespace=~".+"} |= "error" [5m])) by (namespace)
- Visualization: Time series or Bar chart
- Set alert threshold for error rate > X/min
Troubleshooting
No Logs Appearing
Check Promtail pods are running:
kubectl get pods -n loki -l app.kubernetes.io/name=promtail
# Expect: 5 pods (one per node)
Check Promtail logs:
kubectl logs -n loki -l app.kubernetes.io/name=promtail --tail=50
# Look for connection errors or scrape failures
Verify Loki service:
kubectl get svc -n loki loki
# Should show ClusterIP on port 3100
Loki Pod Crashes or OOM
Check memory usage:
kubectl top pod -n loki loki-0
Increase memory limits if needed:
Edit manifests/base/loki/values.yaml:
singleBinary:
  resources:
    limits:
      memory: 1Gi  # Increase from 768Mi
Check query complexity:
- Avoid queries with very long time ranges
- Use the max_query_series limit to prevent expensive queries (see the sketch below)
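For example, in manifests/base/loki/values.yaml (a sketch; 500 is Loki's default, so a lower value tightens the cap):

loki:
  limits_config:
    max_query_series: 250  # Reject metric queries that would return too many series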
Slow Queries
Check compaction status:
kubectl logs -n loki loki-0 -c loki | grep compaction
Reduce retention if storage is filling:
limits_config:
  retention_period: 120h  # Reduce to 5 days
Storage Full
Check PVC usage:
kubectl exec -n loki loki-0 -c loki -- df -h /var/loki
Expand PVC:
kubectl edit pvc loki-chunks-loki-0 -n loki
# Increase spec.resources.requests.storage
Or reduce retention:
limits_config:
  retention_period: 72h  # 3 days
Monitoring Loki
ServiceMonitors are enabled for both Loki and Promtail, allowing Prometheus to scrape metrics automatically.
Prometheus Integration
ServiceMonitor Configuration:
# Loki
monitoring:
  serviceMonitor:
    enabled: true
    labels:
      release: kube-prometheus-stack

# Promtail
serviceMonitor:
  enabled: true
  labels:
    release: kube-prometheus-stack
Verify ServiceMonitors:
kubectl get servicemonitor -n loki
# Expected: loki and promtail ServiceMonitors
Check Prometheus Targets:
# Port-forward to Prometheus
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
# Navigate to: http://localhost:9090/targets
# Look for: serviceMonitor/loki/loki and serviceMonitor/loki/promtail
Key Metrics to Watch
Loki and Promtail metrics are available in Prometheus:
Loki Metrics:
# Ingestion rate (logs/second)
sum(rate(loki_distributor_lines_received_total[1m]))
# Query performance (99th percentile)
histogram_quantile(0.99, rate(loki_request_duration_seconds_bucket[5m]))
# Active log streams
loki_ingester_streams
# Storage usage
loki_store_chunk_entries
# Compaction status
loki_compactor_compaction_interval_seconds
Promtail Metrics (per pod × 5):
# Logs sent to Loki (rate)
rate(promtail_sent_entries_total[5m])
# Bytes read from log files
rate(promtail_read_bytes_total[5m])
# Active scrape targets (should show ~250 pods)
promtail_targets_active_total
# Files being watched
promtail_files_active_total
Health Checks
Check Loki readiness:
kubectl exec -n loki loki-0 -c loki -- wget -qO- http://localhost:3100/ready
# Should return: "ready"
Check Loki metrics endpoint:
kubectl exec -n loki loki-0 -c loki -- wget -qO- http://localhost:3100/metrics | head
Log-Based Alerting
Loki includes a Ruler component that evaluates LogQL expressions and sends alerts to AlertManager, enabling log-based monitoring and proactive incident detection.
Loki Ruler Overview
Deployment:
- Enabled: Yes (as of 2025-12-28)
- Mode: Deployment (1 replica)
- Namespace: loki
- Resources:
  - Requests: 50m CPU, 128Mi memory
  - Limits: 200m CPU, 256Mi memory
- Persistence: 1Gi PVC on synology-iscsi-retain
Purpose: Evaluates log-based alerting rules and sends alerts to AlertManager for notification routing (email, Slack, etc.)
AlertManager Integration
The Loki Ruler is configured to send alerts to the kube-prometheus-stack AlertManager:
loki:
  rulerConfig:
    alertmanager_url: http://kube-prometheus-stack-alertmanager.default.svc.cluster.local:9093
    enable_api: true
    enable_alertmanager_v2: true
Alert Flow:
Loki Logs → Ruler (LogQL evaluation) → AlertManager → Email/Slack/etc.
Alert Rules Configuration
Alert rules are defined using PrometheusRule CRDs with the following labels for Prometheus Operator discovery:
metadata:
  namespace: loki
  labels:
    prometheus: kube-prometheus
    role: alert-rules
    app: loki
Configuration File: manifests/base/loki/loki-alerts.yaml in homelab repository
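Putting the pieces together, a single rule in that file looks roughly like this (a sketch combining the metadata above with the HighErrorLogRate rule described below; the annotation text is illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-log-based-alerts
  namespace: loki
  labels:
    prometheus: kube-prometheus
    role: alert-rules
    app: loki
spec:
  groups:
    - name: loki.log_errors
      rules:
        - alert: HighErrorLogRate
          expr: |
            sum by (namespace, pod) (
              rate({namespace=~".+"} |~ "(?i)(error|err|fatal|exception|panic|fail)" [5m])
            ) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error log rate in {{ $labels.namespace }}/{{ $labels.pod }}"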
Implemented Alert Rules
The following 11 alert rules are deployed across 4 categories:
1. Log Errors (Group: loki.log_errors)
HighErrorLogRate:
- Expression: Error log rate > 1/sec for 5 minutes
- Severity: Warning
- Pattern: Matches "error", "err", "fatal", "exception", "panic", "fail" (case-insensitive)
- Purpose: Detect elevated error rates across all namespaces
CriticalErrorLogs:
- Expression: Critical error rate > 0.1/sec for 2 minutes
- Severity: Critical
- Pattern: Matches "critical", "fatal", "panic", "emergency" (case-insensitive)
- Purpose: Immediate notification for severe errors
2. Pod Failures (Group: loki.pod_failures)
CrashLoopBackOffDetected:
- Expression: CrashLoopBackOff messages detected for 5 minutes
- Severity: Critical
- Pattern: Matches "back-off restarting failed container", "crashloopbackoff"
- Purpose: Alert on pods unable to start successfully
OOMKilledDetected:
- Expression: Out-of-memory kill events detected for 1 minute
- Severity: Critical
- Pattern: Matches "oomkilled", "out of memory", "memory cgroup out of memory"
- Purpose: Identify pods exceeding memory limits
PersistentPodRestarts:
- Expression: More than 5 restart messages in 15 minutes
- Severity: Warning
- Pattern: Matches "restarting container", "restart count"
- Purpose: Detect unstable pods with frequent restarts
3. Application Errors (Group: loki.application_errors)
HighHTTP5xxErrorRate:
- Expression: HTTP 5xx error rate > 1/sec for 5 minutes
- Severity: Warning
- Pattern: Matches "status[= :]5[0-9]2", "http.*5[0-9]2"
- Purpose: Monitor application-level server errors
DatabaseConnectionErrors:
- Expression: Database error rate > 0.5/sec for 5 minutes
- Severity: Warning
- Pattern: Matches "connection.*refused", "connection.*timeout", "database.*error", "sql.*error"
- Purpose: Detect database connectivity issues
4. Security Events (Group: loki.security_events)
AuthenticationFailures:
- Expression: Authentication failure rate > 5/sec for 5 minutes
- Severity: Warning
- Pattern: Matches "authentication.*failed", "auth.*failed", "unauthorized", "invalid.*credentials"
- Purpose: Detect potential brute-force or misconfiguration
SuspiciousActivity:
- Expression: Any suspicious keywords detected for 2 minutes
- Severity: Critical
- Pattern: Matches "attack", "intrusion", "exploit", "malicious", "suspicious"
- Purpose: Early detection of potential security incidents
Verifying Alerting Setup
Check Ruler Pod:
kubectl get pods -n loki -l app.kubernetes.io/component=ruler
# Expected: 1 running pod
Check Ruler Logs:
kubectl logs -n loki -l app.kubernetes.io/component=ruler --tail=50
# Look for "ruler started" and "evaluating rules"
Verify PrometheusRule Created:
kubectl get prometheusrule -n loki loki-log-based-alerts
# Should show the alert rules resource
Check Alert Rules in Prometheus:
# Port-forward to Prometheus
kubectl port-forward -n default svc/kube-prometheus-stack-prometheus 9090:9090
# Navigate to: http://localhost:9090/alerts
# Look for alerts with prefix "loki" or check "loki.log_errors" group
Verify AlertManager Integration:
# Port-forward to AlertManager
kubectl port-forward -n default svc/kube-prometheus-stack-alertmanager 9093:9093
# Navigate to: http://localhost:9093
# Check for any firing Loki alerts
Testing Alert Rules
Trigger a Test Alert (HighErrorLogRate):
# Generate error logs in a test pod
kubectl run test-error-logs --image=busybox --restart=Never -- \
sh -c 'for i in $(seq 1 100); do echo "ERROR: Test error message $i"; sleep 1; done'
# Wait 5-6 minutes for alert to fire
# Check AlertManager UI for the alert
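While waiting, the same rate expression the alert uses can be watched in Grafana Explore, scoped to the test pod (a sketch using the pod name from the command above):

sum by (namespace, pod) (
  rate({pod="test-error-logs"} |~ "(?i)(error|err|fatal|exception|panic|fail)" [5m])
)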
Clean Up Test Pod:
kubectl delete pod test-error-logs
Tuning Alert Thresholds
Alert thresholds can be adjusted in manifests/base/loki/loki-alerts.yaml:
Example - Reduce HighErrorLogRate sensitivity:
- alert: HighErrorLogRate
  expr: |
    sum by (namespace, pod) (
      rate({namespace=~".+"} |~ "(?i)(error|err|fatal|exception|panic|fail)" [5m])
    ) > 5  # Changed from 1 to 5 errors/sec
  for: 10m  # Changed from 5m to 10m
After editing, commit and push changes - ArgoCD will auto-sync the new rules.
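A typical edit-and-sync cycle looks like this (a sketch; "loki" as the ArgoCD application name is an assumption):

git add manifests/base/loki/loki-alerts.yaml
git commit -m "chore: Raise HighErrorLogRate threshold to 5 errors/sec"
git push
# Optional: trigger an immediate sync instead of waiting for ArgoCD's poll interval
argocd app sync loki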
Monitoring Ruler Performance
Ruler Metrics (via Prometheus):
# Rules evaluated per second
rate(loki_prometheus_rule_evaluations_total[5m])
# Rule evaluation duration
histogram_quantile(0.99, rate(loki_prometheus_rule_evaluation_duration_seconds_bucket[5m]))
# Failed rule evaluations
rate(loki_prometheus_rule_evaluation_failures_total[5m])
Resource Usage:
kubectl top pod -n loki -l app.kubernetes.io/component=ruler
Alert Notification Channels
Alerts are routed through AlertManager, which supports:
- Email: Configured for homelab (Gmail SMTP)
- Slack: Can be configured for team notifications
- Discord: Alternative messaging platform
- Webhook: Custom integrations
- PagerDuty: For production on-call rotations
Email Configuration: See manifests/base/kube-prometheus-stack/alertmanager-secret.yaml in homelab repository
Troubleshooting Alerts
Alert Not Firing:
- Check if logs match the pattern: {namespace=~".+"} |~ "(?i)(error|err|fatal|exception|panic|fail)"
- Verify threshold is exceeded:
  sum by (namespace, pod) (
    rate({namespace=~".+"} |~ "(?i)(error|err|fatal|exception|panic|fail)" [5m])
  )
- Check Ruler logs for evaluation errors
Too Many False Positives:
- Adjust pattern to be more specific
- Increase threshold or evaluation duration
- Add exclusions for known noisy patterns
Alert Not Reaching Email/Slack:
- Verify AlertManager configuration
- Check AlertManager logs for routing errors
- Test AlertManager notification channels (see the example below)
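One way to exercise the notification path end to end is to post a synthetic alert directly to AlertManager's v2 API (a sketch; reuses the port-forward from the verification steps above, and the alert name is arbitrary):

curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"ManualTestAlert","severity":"warning"},"annotations":{"summary":"Synthetic alert posted via curl"}}]'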
Best Practices
When Creating Custom Alert Rules:
- Test LogQL queries in Grafana Explore before creating alert rules
- Use an appropriate for duration to avoid flapping alerts
- Set meaningful annotations with runbook links
- Include helpful context in alert descriptions (namespace, pod, error rate)
- Use label-based routing in AlertManager for different severities
- Monitor alert evaluation performance to avoid overloading Ruler
Alert Naming Convention:
- Use descriptive names: HighErrorLogRate, not Alert1
- Include severity in name when appropriate: CriticalErrorLogs
- Group related alerts with common prefix
Future Enhancements
Potential improvements for log-based alerting:
- SLO-Based Alerting: Define error budget and alert when budget is exhausted
- Anomaly Detection: ML-based detection of unusual log patterns
- Log Correlation: Combine log patterns with metric thresholds
- Dynamic Thresholds: Adjust thresholds based on time of day or traffic volume
- Custom Runbooks: Detailed troubleshooting guides for each alert type
Configuration Files
ArgoCD Applications
Loki:
- Path: manifests/applications/loki.yaml
- Sync Wave: -12
- Destination: loki namespace
Promtail:
- Path: manifests/applications/promtail.yaml
- Sync Wave: -11
- Destination: loki namespace
Helm Values
Loki Configuration:
- Path: manifests/base/loki/values.yaml
- Key settings: deployment mode, storage, retention
Promtail Configuration:
- Path: manifests/base/promtail/values.yaml
- Key settings: resources, scrape config, labels
Grafana Datasource:
- Path: manifests/base/loki/loki-datasource.yaml
- Auto-discovered by Grafana sidecar
Performance Tuning
For Raspberry Pi Cluster
Current settings are optimized for:
- 5 Raspberry Pi 5 nodes (16GB RAM each)
- ~250 pods total
- Moderate log volume (~1.8GB/day)
If experiencing performance issues:
- Reduce scrape frequency (less CPU on nodes):
  # In promtail values.yaml, not currently set (uses default)
- Reduce query parallelism (less memory in Loki):
  limits_config:
    max_query_parallelism: 16  # Default: 32
- Increase Loki memory (better query performance):
  singleBinary:
    resources:
      limits:
        memory: 1Gi  # From 768Mi
Security Considerations
Authentication
Loki is configured with auth_enabled: false for simplicity in the homelab environment.
For multi-tenant or production use, enable authentication:
loki:
  auth_enabled: true
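With authentication on, every client must send a tenant ID via the X-Scope-OrgID header. In Promtail that is set per client (a sketch for manifests/base/promtail/values.yaml; the tenant name is illustrative):

config:
  clients:
    - url: http://loki.loki.svc.cluster.local:3100/loki/api/v1/push
      tenant_id: homelab  # Sent as the X-Scope-OrgID header on every push

The Grafana datasource needs the same header configured before queries will succeed.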
Network Policies
Consider adding NetworkPolicy to restrict access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: loki-allow-promtail
  namespace: loki
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: promtail
      ports:
        - port: 3100
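Note that this policy only admits Promtail; Grafana queries Loki on the same port, so it needs its own ingress rule, for example an additional from entry like this (a sketch, assuming Grafana runs in the default namespace with its standard labels):

    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: default
          podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - port: 3100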
Upgrading
Loki Chart Upgrades
# Update chart version in manifests/applications/loki.yaml
targetRevision: 6.50.0 # Example new version
# Commit and push (ArgoCD will auto-sync)
git commit -am "chore: Upgrade Loki to 6.50.0"
git push
Important: Check Loki release notes for breaking changes.
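To see which chart versions are available before bumping targetRevision (a sketch; assumes the grafana Helm repo is added locally):

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm search repo grafana/loki --versions | head
helm search repo grafana/promtail --versions | head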
Promtail Chart Upgrades
# Update chart version in manifests/applications/promtail.yaml
targetRevision: 6.17.0 # Example new version
# Commit and push
git commit -am "chore: Upgrade Promtail to 6.17.0"
git push