sbbb/base/monitoring/kustomization.yaml
Sienna Meridian Satterwhite e4987b4c58 feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver; alerting is now Matrix-only
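
The label fix matters because kube-prometheus-stack, with its default values, only loads PrometheusRule objects whose `release` label matches the Helm release name. A minimal sketch of a rule that would be picked up is shown here; the rule name, group name, alert name, and 90% threshold are illustrative assumptions, not taken from this repository.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-node-alerts          # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # without this label the default ruleSelector skips the object
spec:
  groups:
    - name: example-node             # hypothetical group
      rules:
        - alert: ExampleNodeMemoryHigh   # hypothetical alert
          # Fraction of memory in use, from standard node_exporter metrics.
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} memory above 90%"
```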

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
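
The inhibition and Watchdog behaviour described in the commit message could be expressed in Alertmanager configuration roughly as follows. This is a fragment under stated assumptions, not the repository's actual config: the receiver name `matrix` is hypothetical, and a complete configuration also needs a top-level `route.receiver` and a `receivers` list.

```yaml
# Fragment of an Alertmanager config (for example under alertmanager.config
# in the kube-prometheus-stack values file).
inhibit_rules:
  # A firing critical alert mutes the warning with the same alertname and namespace.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal:
      - alertname
      - namespace
route:
  routes:
    # Watchdog always fires; re-sending it every 12h acts as a heartbeat,
    # so a silent Matrix room indicates a dead alerting pipeline.
    - matchers:
        - alertname = "Watchdog"
      receiver: matrix               # hypothetical receiver name
      repeat_interval: 12h
```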

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
resources:
- namespace.yaml
- vault-secrets.yaml
- grafana-oauth2client.yaml
# Dashboards (one ConfigMap per Grafana folder)
- dashboards-ingress.yaml
- dashboards-observability.yaml
- dashboards-infrastructure.yaml
- dashboards-storage.yaml
- dashboards-identity.yaml
- dashboards-devtools.yaml
- dashboards-search.yaml
- dashboards-media.yaml
- dashboards-lasuite.yaml
- dashboards-comms.yaml
# AlertManager → Matrix bridge
- matrix-alertmanager-receiver-deployment.yaml
- matrix-bot-secret.yaml
# Alert rules
- alertrules-infrastructure.yaml
- alertrules-observability.yaml
- alertrules-slo.yaml
- recording-rules.yaml
helmCharts:
# helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
- name: kube-prometheus-stack
  repo: https://prometheus-community.github.io/helm-charts
  version: "82.9.0"
  releaseName: kube-prometheus-stack
  namespace: monitoring
  valuesFile: prometheus-values.yaml
  includeCRDs: true
# helm repo add grafana https://grafana.github.io/helm-charts
- name: loki
  repo: https://grafana.github.io/helm-charts
  version: "6.53.0"
  releaseName: loki
  namespace: monitoring
  valuesFile: loki-values.yaml
- name: tempo
  repo: https://grafana.github.io/helm-charts
  version: "1.24.4"
  releaseName: tempo
  namespace: monitoring
  valuesFile: tempo-values.yaml
# Grafana Alloy: DaemonSet that ships container logs → Loki
# and provides an in-cluster OTLP receiver → Tempo.
- name: alloy
  repo: https://grafana.github.io/helm-charts
  version: "0.12.0"
  releaseName: alloy
  namespace: monitoring
  valuesFile: alloy-values.yaml