# base/monitoring/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: monitoring

resources:
  - namespace.yaml
  - vault-secrets.yaml
  - grafana-oauth2client.yaml
  # Dashboards (one ConfigMap per Grafana folder)
  - dashboards-ingress.yaml
  - dashboards-observability.yaml
  - dashboards-infrastructure.yaml
  - dashboards-storage.yaml
  - dashboards-identity.yaml
  - dashboards-devtools.yaml
  - dashboards-search.yaml
  - dashboards-media.yaml
  - dashboards-lasuite.yaml
  - dashboards-comms.yaml
  # AlertManager → Matrix bridge
  - matrix-alertmanager-receiver-deployment.yaml
  - matrix-bot-secret.yaml
  # Alert rules
  - alertrules-infrastructure.yaml
  - alertrules-observability.yaml
  - alertrules-slo.yaml
  - recording-rules.yaml

helmCharts:
  # helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  - name: kube-prometheus-stack
    repo: https://prometheus-community.github.io/helm-charts
    version: "82.9.0"
    releaseName: kube-prometheus-stack
    namespace: monitoring
    valuesFile: prometheus-values.yaml
    includeCRDs: true

  # helm repo add grafana https://grafana.github.io/helm-charts
  - name: loki
    repo: https://grafana.github.io/helm-charts
    version: "6.53.0"
    releaseName: loki
    namespace: monitoring
    valuesFile: loki-values.yaml

  - name: tempo
    repo: https://grafana.github.io/helm-charts
    version: "1.24.4"
    releaseName: tempo
    namespace: monitoring
    valuesFile: tempo-values.yaml

  # Grafana Alloy — DaemonSet that ships container logs → Loki
  # and provides an in-cluster OTLP receiver → Tempo.
  - name: alloy
    repo: https://grafana.github.io/helm-charts
    version: "0.12.0"
    releaseName: alloy
    namespace: monitoring
    valuesFile: alloy-values.yaml
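
# Build note (a sketch, not part of the original manifest): kustomize only
# renders the helmCharts entries above when Helm chart inflation is enabled,
# so a plain `kustomize build` is expected to reject this base.
# Assuming the working directory is the repository root and a `helm` binary
# is on PATH, the rendered manifests can be previewed with:
#
#   kustomize build --enable-helm base/monitoring
#
# or, through kubectl's bundled kustomize:
#
#   kubectl kustomize --enable-helm base/monitoring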