feat: add PrometheusRule alerts for all services
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).
Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in a separate commit; sketched below).
2026-03-24 12:20:55 +00:00
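A rough sketch of that severity routing, for orientation only: the receiver names (matrix, matrix-and-email), the webhook URL, and the email address are placeholders, not taken from the actual AlertManager commit.

route:
  receiver: matrix                  # default: warnings and anything unmatched reach Matrix only
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - severity = "critical"
      receiver: matrix-and-email    # critical fans out to Matrix and email
receivers:
  - name: matrix
    webhook_configs:
      - url: "http://alertmanager-matrix-bridge:8080/alert"   # placeholder Matrix bridge endpoint
  - name: matrix-and-email
    webhook_configs:
      - url: "http://alertmanager-matrix-bridge:8080/alert"
    email_configs:
      - to: "ops@example.org"       # placeholder; assumes SMTP settings under global

Warnings fall through to the default receiver and reach Matrix only; criticals match the sub-route and go out both ways.
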
feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken: a broken email receiver, a missing label selector, and no
node alerts. This overhaul brings alerting to production grade.
Fixes:
- Alloy's Loki URL pointed at the deleted loki-gateway; now loki:3100
- seaweedfs-bucket-init crashed on the unsupported `mc versioning` command
- All PrometheusRules now carry the `release: kube-prometheus-stack` label
- Removed the broken email receiver; alerting is now Matrix-only
New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection); sketched after this list
- Inhibition: critical suppresses warning for the same alert+namespace (also sketched below)
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
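Two of the AlertManager-side items above, sketched with the same caveat: the receiver name is a placeholder and the full config (including the receivers section) is omitted. The fragment shows the Watchdog heartbeat re-notified to Matrix every 12h, and the inhibition rule that lets a critical alert silence the warning of the same name in the same namespace.

route:
  receiver: matrix
  routes:
    - matchers:
        - alertname = "Watchdog"
      receiver: matrix
      repeat_interval: 12h          # keep re-sending the always-firing Watchdog; silence means the pipeline is dead
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "namespace"]   # only the warning copy of the same alert in the same namespace is suppressed

The ory-alerts PrometheusRule as it stands after the overhaul follows, now carrying the release: kube-prometheus-stack label.
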
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ory-alerts
  namespace: ory
  labels:
    role: alert-rules
    release: kube-prometheus-stack
spec:
  groups:
    - name: ory
      rules:
        - alert: HydraDown
          expr: up{job=~".*hydra.*"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Hydra is down"
            description: "Hydra instance {{ $labels.namespace }}/{{ $labels.pod }} is down."
        - alert: KratosDown
          expr: up{job=~".*kratos.*"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Kratos is down"
            description: "Kratos instance {{ $labels.namespace }}/{{ $labels.pod }} is down."
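        # The two rules below fire when the 5xx share of requests over the trailing
        # 5 minutes exceeds 5% and that condition holds for 5 minutes (for: 5m).
        # Assumption, not stated in the repo: http_requests_total is scraped from the
        # Ory services' own metrics endpoints, and the job regexes must line up with
        # the ServiceMonitor job names or these expressions match no series at all.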
        - alert: HydraHighErrorRate
          expr: sum(rate(http_requests_total{job=~".*hydra.*",code=~"5.."}[5m])) / sum(rate(http_requests_total{job=~".*hydra.*"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Hydra has a high HTTP error rate"
            description: "Hydra 5xx error rate is {{ $value | humanizePercentage }}."
        - alert: KratosHighErrorRate
          expr: sum(rate(http_requests_total{job=~".*kratos.*",code=~"5.."}[5m])) / sum(rate(http_requests_total{job=~".*kratos.*"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kratos has a high HTTP error rate"
            description: "Kratos 5xx error rate is {{ $value | humanizePercentage }}."
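One note for completeness on the release label: with the kube-prometheus-stack chart's defaults (ruleSelectorNilUsesHelmValues left true), the operator only loads PrometheusRules labelled with the Helm release name, which appears to be the "missing label selector" called out in the overhaul commit. The equivalent explicit selector in the chart values looks roughly like this (a sketch against the upstream chart's value names; the surrounding values file is assumed, not taken from this repo):

prometheus:
  prometheusSpec:
    ruleSelector:
      matchLabels:
        release: kube-prometheus-stack   # only PrometheusRules carrying this label are loaded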