# sbbb/base/monitoring/alertrules-observability.yaml
---
# PrometheusRule consumed by the prometheus-operator (selected via the
# kube-prometheus-stack release label). Defines availability and health
# alerts for the observability stack itself: Prometheus, Loki, Tempo,
# and Alertmanager notification delivery.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: observability-alerts
  namespace: monitoring
  labels:
    role: alert-rules
    # Must match the ruleSelector of the kube-prometheus-stack Prometheus CR.
    release: kube-prometheus-stack
spec:
  groups:
    # Prometheus self-monitoring: WAL integrity, rule evaluation, TSDB growth.
    - name: prometheus
      rules:
        - alert: PrometheusWALCorruption
          expr: increase(prometheus_tsdb_wal_corruptions_total[5m]) > 0
          # Fire immediately — any WAL corruption implies possible data loss.
          for: 0m
          labels:
            severity: critical
          annotations:
            summary: "Prometheus WAL corruption detected"
            description: "Prometheus detected WAL corruption — data loss may be occurring."
        - alert: PrometheusRuleFailures
          expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Prometheus rule evaluation failures"
            description: "Some Prometheus rules are failing to evaluate — alerts may not fire."
        - alert: PrometheusStorageFull
          # 85% of the 30Gi PVC: 0.85 * 30 * 2^30 bytes ≈ 27.4e9.
          # (Previous threshold 25.5e9 was 85% of 30 decimal GB, ~79% of 30Gi.)
          expr: prometheus_tsdb_storage_blocks_bytes > 27.4e9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Prometheus storage over 85% of 30Gi PVC"
            description: "Prometheus TSDB is using {{ $value | humanize1024 }}B of its 30Gi PVC."
    # Loki log aggregation availability.
    - name: loki
      rules:
        - alert: LokiDown
          expr: up{job=~".*loki.*", container="loki"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Loki is down"
            description: "Loki log aggregation is offline — logs are being dropped."
    # Tempo trace backend availability.
    - name: tempo
      rules:
        - alert: TempoDown
          expr: up{job=~".*tempo.*"} == 0
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Tempo is down"
            description: "Tempo trace backend is offline — traces are being dropped."
    # Alertmanager delivery health: failures here mean alerts never reach humans.
    - name: alertmanager
      rules:
        - alert: AlertManagerWebhookFailures
          expr: increase(alertmanager_notifications_failed_total{integration="webhook"}[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "AlertManager webhook delivery failing"
            description: "AlertManager cannot deliver alerts to Matrix webhook receiver."