feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting was broken (email receiver, missing label selector, no node alerts). This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
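The inhibition behaviour described above maps onto Alertmanager's `inhibit_rules` section. A minimal sketch of that config (using the matchers syntax from Alertmanager ≥ 0.22; the exact `equal` label list in the repo's config may differ):

```yaml
inhibit_rules:
  # While a critical alert fires, mute warning-severity alerts
  # that share the same alertname and namespace.
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ["alertname", "namespace"]
```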
@@ -18,6 +18,7 @@ resources:
   - openbao-servicemonitor.yaml
   - postgres-alertrules.yaml
   - openbao-alertrules.yaml
+  - valkey-alertrules.yaml
   - searxng-deployment.yaml
 
 helmCharts:
@@ -5,6 +5,7 @@ metadata:
   namespace: data
   labels:
     role: alert-rules
+    release: kube-prometheus-stack
 spec:
   groups:
     - name: openbao
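The `release: kube-prometheus-stack` label is what makes these rules visible to Prometheus at all: by default the kube-prometheus-stack chart has the operator select only PrometheusRule objects carrying the Helm release label. Made explicit in chart values, the selector would look roughly like this (a sketch of the default behaviour, not a required setting):

```yaml
prometheus:
  prometheusSpec:
    # Equivalent of the chart default: only rules labelled with
    # the release name are loaded into Prometheus.
    ruleSelector:
      matchLabels:
        release: kube-prometheus-stack
```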
@@ -5,6 +5,7 @@ metadata:
   namespace: data
   labels:
     role: alert-rules
+    release: kube-prometheus-stack
 spec:
   groups:
     - name: opensearch
@@ -19,13 +20,16 @@ spec:
             description: "OpenSearch cluster {{ $labels.cluster }} health status is red."
 
         - alert: OpenSearchClusterYellow
-          expr: elasticsearch_cluster_health_status{color="yellow"} == 1
+          expr: |
+            elasticsearch_cluster_health_status{color="yellow"} == 1
+            and on(cluster)
+            elasticsearch_cluster_health_number_of_data_nodes > 1
           for: 10m
           labels:
             severity: warning
           annotations:
             summary: "OpenSearch cluster health is YELLOW"
-            description: "OpenSearch cluster {{ $labels.cluster }} health status is yellow."
+            description: "OpenSearch cluster {{ $labels.cluster }} health status is yellow (multi-node, so unassigned shards indicate a real problem)."
 
         - alert: OpenSearchHeapHigh
           expr: (elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}) > 0.85
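The single-node guard added to OpenSearchClusterYellow can be verified with `promtool test rules`. A hedged sketch of such a unit test — promtool reads plain rule files, so `opensearch-rules.yaml` here stands for the extracted `spec.groups` of this PrometheusRule, not the CR itself:

```yaml
rule_files:
  - opensearch-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Cluster reports yellow the whole time…
      - series: 'elasticsearch_cluster_health_status{color="yellow", cluster="main"}'
        values: '1x20'
      # …but has only one data node, so the "and on(cluster)" clause fails.
      - series: 'elasticsearch_cluster_health_number_of_data_nodes{cluster="main"}'
        values: '1x20'
    alert_rule_test:
      - eval_time: 15m
        alertname: OpenSearchClusterYellow
        exp_alerts: []  # single-node yellow must stay silent
```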
@@ -5,6 +5,7 @@ metadata:
   namespace: data
   labels:
     role: alert-rules
+    release: kube-prometheus-stack
 spec:
   groups:
     - name: postgres
@@ -35,3 +36,41 @@ spec:
           annotations:
             summary: "PostgreSQL connection count is high"
             description: "Pod {{ $labels.pod }} has {{ $value }} active connections."
+
+        - alert: PostgresBackupStale
+          expr: |
+            time() - cnpg_collector_last_available_backup_timestamp > 90000
+          for: 10m
+          labels:
+            severity: critical
+          annotations:
+            summary: "PostgreSQL backup is stale"
+            description: "No successful backup in over 25 hours (daily schedule expected)."
+
+        - alert: PostgresBackupFailed
+          expr: |
+            cnpg_collector_last_failed_backup_timestamp > cnpg_collector_last_available_backup_timestamp
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "PostgreSQL backup failed"
+            description: "Last backup failed more recently than last success. Check barman/S3."
+
+        - alert: PostgresWALArchivingStale
+          expr: cnpg_pg_stat_archiver_seconds_since_last_archival > 300
+          for: 5m
+          labels:
+            severity: critical
+          annotations:
+            summary: "PostgreSQL WAL archiving stale"
+            description: "No WAL archived in {{ $value | humanizeDuration }}. Point-in-time recovery may be impossible."
+
+        - alert: PostgresDeadlocks
+          expr: rate(cnpg_pg_stat_database_deadlocks[5m]) > 0
+          for: 5m
+          labels:
+            severity: warning
+          annotations:
+            summary: "PostgreSQL deadlocks detected"
+            description: "Database {{ $labels.datname }} is experiencing deadlocks."
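The 90000-second threshold in PostgresBackupStale is exactly 25 hours: one daily backup cycle plus an hour of slack. A minimal sketch of the same staleness check in plain Python (function name and timestamps are illustrative, not from the repo):

```python
def backup_is_stale(last_backup_ts: float, now: float,
                    max_age_s: int = 25 * 3600) -> bool:
    """Mirror of the PromQL expr:
    time() - cnpg_collector_last_available_backup_timestamp > 90000."""
    return now - last_backup_ts > max_age_s

# Sanity-check the constant and the boundary behaviour.
assert 25 * 3600 == 90000            # the threshold used in the rule
assert not backup_is_stale(0, 24 * 3600)  # 24h-old backup: fine
assert backup_is_stale(0, 26 * 3600)      # 26h-old backup: alert
```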
21  base/data/valkey-alertrules.yaml  Normal file
@@ -0,0 +1,21 @@
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: valkey-alerts
+  namespace: data
+  labels:
+    role: alert-rules
+    release: kube-prometheus-stack
+spec:
+  groups:
+    - name: valkey
+      rules:
+        - alert: ValkeyDown
+          expr: |
+            kube_deployment_status_replicas_available{namespace="data", deployment="valkey"} == 0
+          for: 2m
+          labels:
+            severity: critical
+          annotations:
+            summary: "Valkey (Redis) is down"
+            description: "Valkey cache server is down. All apps using Redis/Celery are affected."
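The firing path of ValkeyDown (zero available replicas, held for `for: 2m`) can also be unit-tested with `promtool test rules`. A sketch, assuming the rule group has been extracted from the PrometheusRule wrapper into a hypothetical `valkey-rules.yaml`:

```yaml
rule_files:
  - valkey-rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Deployment reports 0 available replicas for the whole window.
      - series: 'kube_deployment_status_replicas_available{namespace="data", deployment="valkey"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m   # past the 2m "for" duration, so the alert is firing
        alertname: ValkeyDown
        exp_alerts:
          - exp_labels:
              severity: critical
              namespace: data
              deployment: valkey
            exp_annotations:
              summary: "Valkey (Redis) is down"
              description: "Valkey cache server is down. All apps using Redis/Celery are affected."
```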