The Longhorn memory leak went undetected for 14 days because alerting was broken (email receiver, missing label selector, no node alerts). This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
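For illustration, the `release: kube-prometheus-stack` label fix can be sketched as a minimal PrometheusRule. This is not one of the rules from this commit: the rule name, group name, threshold, and `for` duration are hypothetical, and it assumes the chart's default behaviour of only selecting rules that carry the release label. The expression uses standard node_exporter memory metrics.

```yaml
# Sketch only: names, threshold, and duration below are assumptions,
# not values taken from alertrules-infrastructure.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-node-memory          # hypothetical name
  namespace: monitoring
  labels:
    # Without this label the operator's default rule selector
    # ignores the rule, so the alert never loads into Prometheus.
    release: kube-prometheus-stack
spec:
  groups:
    - name: node-memory.example      # hypothetical group name
      rules:
        - alert: ExampleNodeMemoryHigh   # hypothetical alert name
          # Fraction of memory in use, from node_exporter metrics.
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Memory usage above 90% on {{ $labels.instance }}"
```

A rule that is missing the label is accepted by the API server but never appears under Prometheus' loaded rules, which is consistent with the silent failure described above.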
63 lines
1.8 KiB
YAML
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: monitoring

resources:
  - namespace.yaml
  - vault-secrets.yaml
  - grafana-oauth2client.yaml
  # Dashboards (one ConfigMap per Grafana folder)
  - dashboards-ingress.yaml
  - dashboards-observability.yaml
  - dashboards-infrastructure.yaml
  - dashboards-storage.yaml
  - dashboards-identity.yaml
  - dashboards-devtools.yaml
  - dashboards-search.yaml
  - dashboards-media.yaml
  - dashboards-lasuite.yaml
  - dashboards-comms.yaml
  # AlertManager → Matrix bridge
  - matrix-alertmanager-receiver-deployment.yaml
  - matrix-bot-secret.yaml
  # Alert rules
  - alertrules-infrastructure.yaml
  - alertrules-observability.yaml
  - alertrules-slo.yaml
  - recording-rules.yaml

helmCharts:
  # helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  - name: kube-prometheus-stack
    repo: https://prometheus-community.github.io/helm-charts
    version: "82.9.0"
    releaseName: kube-prometheus-stack
    namespace: monitoring
    valuesFile: prometheus-values.yaml
    includeCRDs: true

  # helm repo add grafana https://grafana.github.io/helm-charts
  - name: loki
    repo: https://grafana.github.io/helm-charts
    version: "6.53.0"
    releaseName: loki
    namespace: monitoring
    valuesFile: loki-values.yaml

  - name: tempo
    repo: https://grafana.github.io/helm-charts
    version: "1.24.4"
    releaseName: tempo
    namespace: monitoring
    valuesFile: tempo-values.yaml

  # Grafana Alloy — DaemonSet that ships container logs → Loki
  # and provides an in-cluster OTLP receiver → Tempo.
  - name: alloy
    repo: https://grafana.github.io/helm-charts
    version: "0.12.0"
    releaseName: alloy
    namespace: monitoring
    valuesFile: alloy-values.yaml