Commit Graph

11 Commits

Author SHA1 Message Date
e4987b4c58 feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to the deleted loki-gateway; it now targets loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed the broken email receiver; alerting is now Matrix-only

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
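
Note on the SLO bullets in the commit above: they name availability targets (99.9% for the auth stack, 99.5% for Matrix) and burn-rate alerting. The arithmetic behind a burn rate can be sketched in Python as below. This is only an illustration under stated assumptions: the function names, the 30-day window, and any sample error ratios are hypothetical and not taken from the repository, whose actual rules are PromQL expressions not shown in this log.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO.

    For a 99.9% target this is roughly 0.001.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A value of 1.0 means the budget is spent exactly at the end of the
    SLO window; higher values exhaust it proportionally sooner.
    """
    return observed_error_ratio / error_budget(slo_target)


def hours_until_budget_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Hours until the whole budget is gone at a constant burn rate.

    The 30-day default window is an assumption for illustration only.
    """
    if rate <= 0.0:
        return float("inf")
    return window_days * 24.0 / rate
```

For example, `burn_rate(0.01, 0.999)` is about 10: a 1% error ratio consumes a 99.9% budget roughly ten times faster than the SLO allows.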
9f15f5099e fix: meet external-api route, drive media proxy, alertbot, misc tweaks
- Meet: add external-api backend path, CSRF trusted origins
- Drive: fix media proxy regex for preview URLs and S3 key signing
- OpenBao: enable Prometheus telemetry
- Postgres alerts: fix metric name (cnpg_backends_total)
- Gitea: bump memory limits for mirror workloads
- Alertbot: expanded deployment config
- Kratos: add find/cal/projects to allowed return URLs, settings path
- Pingora: meet external-api route fix
- Sol: config update
2026-03-25 18:01:15 +00:00
eab91eb85d feat(monitoring): expanded dashboards for all services
Enriched dashboards for DevTools (Gitea), Identity (Hydra/Kratos),
Infrastructure (Longhorn, PostgreSQL, cert-manager, OpenBao),
Ingress (Pingora), and Storage (SeaweedFS).
2026-03-25 17:58:51 +00:00
9ee40aaa69 feat(monitoring): comprehensive OpenSearch dashboard
6 collapsible rows covering all exporter metrics:
Overview, Search & Queries, Indexing, Storage & Indices,
Circuit Breakers & Thread Pools, OS & Process.
2026-03-25 17:54:39 +00:00
5e622ce316 feat: AlertManager Matrix integration with severity routing
Deploy matrix-alertmanager-receiver bridge (pending bot credentials in
OpenBao). Update AlertManager routing: critical → Matrix + email,
warning → Matrix only, Watchdog → null. Reduce repeat interval to 4h.
2026-03-24 12:21:29 +00:00
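
Note on the routing described in the commit above (critical → Matrix + email, warning → Matrix only, Watchdog → null): the decision can be modelled as a small Python function. This is a sketch of the decision logic only, not the AlertManager configuration itself. The receiver names `matrix` and `email` and the fallback for other severities are assumptions, since the commit message does not state them.

```python
from typing import Dict, List


def receivers_for(labels: Dict[str, str]) -> List[str]:
    """Return the receivers an alert with these labels would be sent to."""
    # Watchdog fires continuously by design, so it is routed to a null
    # receiver (modelled here as an empty list).
    if labels.get("alertname") == "Watchdog":
        return []

    severity = labels.get("severity")
    if severity == "critical":
        return ["matrix", "email"]
    if severity == "warning":
        return ["matrix"]

    # Assumption: anything else falls through to Matrix. The commit
    # message does not say how other severities are handled.
    return ["matrix"]
```

The Watchdog check comes first so that the heartbeat is dropped regardless of the severity label it carries.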
3fc54c8851 feat: add PrometheusRule alerts for all services
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).

Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
2026-03-24 12:20:55 +00:00
74bb59cfdc feat: split Grafana dashboards into per-folder ConfigMaps
Replace monolithic dashboards-configmap.yaml with 10 dedicated files,
one per Grafana folder: Ingress, Observability, Infrastructure, Storage,
Identity, DevTools, Search, Media, La Suite, Communications.

New dashboards for Longhorn, PostgreSQL/CNPG, Cert-Manager, SeaweedFS,
Hydra, Kratos, Gitea, OpenSearch, LiveKit, La Suite golden signals
(Linkerd metrics), Matrix, and Email Pipeline.
2026-03-24 12:20:42 +00:00
d3943c9a84 feat(monitoring): wire up full LGTM observability stack
- Prometheus: discover ServiceMonitors/PodMonitors in all namespaces,
  enable remote write receiver for Tempo metrics generator
- Tempo: enable metrics generator (service-graphs + span-metrics)
  with remote write to Prometheus
- Loki: add Grafana Alloy DaemonSet to ship container logs
- Grafana: enable dashboard sidecar, add Pingora/Loki/Tempo/OpenBao
  dashboards, add stable UIDs and cross-linking between datasources
  (Loki↔Tempo derived fields, traces→logs, traces→metrics, service map)
- Linkerd: enable proxy tracing to Alloy OTLP collector, point
  linkerd-viz at existing Prometheus instead of deploying its own
- Pingora: add OTLP rollout plan (endpoint commented out until proxy
  telemetry panic fix is deployed and Alloy is verified healthy)
2026-03-21 17:36:54 +00:00
ccfe8b877a feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
  proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
  monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
  remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
91983ddf29 feat(observability): enable OTLP tracing, fix Prometheus scraping, add proxy ServiceMonitor
- Set otlp_endpoint to Tempo HTTP receiver (port 4318) for request tracing
- Add hostNetwork to prometheusSpec so it can reach kubelet/node-exporter on node public IP
- Add ServiceMonitor for proxy metrics scrape on port 9090
- Add CORS origin and Grafana datasource config for monitoring stack
2026-03-09 08:20:42 +00:00
7ff35d3e0c feat(infra): production bootstrap — cert-manager, longhorn, monitoring
Add new bases for cert-manager (Let's Encrypt + wildcard cert), Longhorn
distributed storage, and monitoring (kube-prometheus-stack + Loki + Tempo
+ Grafana OIDC). Add cloud-init for Scaleway Elastic Metal provisioning.

Production overlay: add patches for postgres sizing, SeaweedFS volume,
OpenSearch storage, LiveKit service, Pingora host ports, resource limits,
and CNPG daily barman backups. Update cert-manager.yaml with full dnsNames
for all *.sunbeam.pt subdomains.
2026-03-06 12:06:27 +00:00