feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules

The Longhorn memory leak went undetected for 14 days because alerting
was effectively broken: a dead email receiver, PrometheusRules missing
the operator's label selector, and no node-level alerts at all. This
overhaul brings alerting up to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting
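The label fix is the load-bearing one: the kube-prometheus-stack operator only loads PrometheusRules that match its ruleSelector, so unlabeled rules are silently ignored. A minimal sketch of a rule the selector now picks up (the rule name, namespace, and alert are illustrative, not taken from this commit):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts            # illustrative name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # without this, the operator ignores the rule
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeMemoryPressure  # example alert, assumes node_exporter metrics
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
          for: 10m
          labels:
            severity: warning
```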

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
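For the SLO items, a hedged sketch of what a fast-burn alert for the 99.9% target can look like, following the standard multiwindow burn-rate approach (a 14.4x burn exhausts a 30-day error budget in ~2 days). The metric names assume Linkerd's `response_total` with its `classification` label; the deployment label and windows are illustrative, not copied from the actual rules:

```yaml
- alert: AuthStackErrorBudgetBurn   # illustrative; real rules live in the new PrometheusRules
  expr: |
    (
      sum(rate(response_total{classification="failure", deployment="auth"}[5m]))
        /
      sum(rate(response_total{deployment="auth"}[5m]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Auth stack is burning its 30d error budget 14.4x too fast"
```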
commit e4987b4c58 (parent f07b3353aa)
Date: 2026-04-06 15:52:06 +01:00
22 changed files with 515 additions and 24 deletions

@@ -61,7 +61,7 @@ grafana:
     - name: Loki
       type: loki
       uid: loki
-      url: "http://loki-gateway.monitoring.svc.cluster.local:80"
+      url: "http://loki.monitoring.svc.cluster.local:3100"
       access: proxy
       isDefault: false
       jsonData:
@@ -130,10 +130,6 @@ alertmanager:
       requests:
         storage: 2Gi
   config:
-    global:
-      smtp_from: "alerts@DOMAIN_SUFFIX"
-      smtp_smarthost: "postfix.lasuite.svc.cluster.local:25"
-      smtp_require_tls: false
     route:
       group_by: [alertname, namespace]
       group_wait: 30s
@@ -143,30 +139,26 @@ alertmanager:
       routes:
         - matchers:
             - alertname = Watchdog
-          receiver: "null"
+          receiver: matrix
+          repeat_interval: 12h
         - matchers:
            - severity = critical
-          receiver: critical
+          receiver: matrix
         - matchers:
             - severity = warning
           receiver: matrix
     receivers:
-      - name: "null"
-      - name: email
-        email_configs:
-          - to: "ops@DOMAIN_SUFFIX"
-            send_resolved: true
       - name: matrix
         webhook_configs:
           - url: "http://matrix-alertmanager-receiver.monitoring.svc.cluster.local:3000/alerts/alerts"
             send_resolved: true
-      - name: critical
-        webhook_configs:
-          - url: "http://matrix-alertmanager-receiver.monitoring.svc.cluster.local:3000/alerts/alerts"
-            send_resolved: true
-        email_configs:
-          - to: "ops@DOMAIN_SUFFIX"
-            send_resolved: true
+    inhibitRules:
+      # Critical alerts suppress warnings for the same alertname+namespace
+      - source_matchers:
+          - severity = critical
+        target_matchers:
+          - severity = warning
+        equal: [alertname, namespace]

   # Disable monitors for components k3s doesn't expose
   kubeEtcd: