feat(infra): production bootstrap — cert-manager, longhorn, monitoring
Add new bases for cert-manager (Let's Encrypt + wildcard cert), Longhorn
distributed storage, and monitoring (kube-prometheus-stack + Loki + Tempo
+ Grafana OIDC). Add cloud-init for Scaleway Elastic Metal provisioning.
Production overlay: add patches for postgres sizing, SeaweedFS volume,
OpenSearch storage, LiveKit service, Pingora host ports, resource limits,
and CNPG daily barman backups. Update cert-manager.yaml with full dnsNames
for all *.sunbeam.pt subdomains.
2026-03-06 12:06:27 +00:00
# kube-prometheus-stack — Prometheus + AlertManager + Grafana + node-exporter + kube-state-metrics
#
# k3s quirks: kube-proxy is replaced by Cilium; etcd/scheduler/controller-manager
# don't expose metrics on standard ports. Disable their monitors to avoid noise.

grafana:
  adminUser: admin
  admin:
    existingSecret: grafana-admin
    passwordKey: admin-password
  persistence:
    enabled: true
    size: 2Gi
  # Inject Hydra OIDC client credentials (created by Hydra Maester from the OAuth2Client CRD)
  envFromSecret: grafana-oidc
  grafana.ini:
    server:
2026-03-09 08:20:42 +00:00
      root_url: "https://metrics.DOMAIN_SUFFIX"
    auth:
      # Keep local login as fallback (admin password from grafana-admin secret)
      disable_login_form: false
      signout_redirect_url: "https://auth.DOMAIN_SUFFIX/oauth2/sessions/logout"
    auth.generic_oauth:
      enabled: true
      name: Sunbeam
      icon: signin
      # CLIENT_ID / CLIENT_SECRET injected from grafana-oidc K8s Secret via envFromSecret
      client_id: "${CLIENT_ID}"
      client_secret: "${CLIENT_SECRET}"
      scopes: "openid email profile"
      auth_url: "https://auth.DOMAIN_SUFFIX/oauth2/auth"
      token_url: "https://auth.DOMAIN_SUFFIX/oauth2/token"
      api_url: "https://auth.DOMAIN_SUFFIX/userinfo"
      allow_sign_up: true
      # Small studio — anyone with a valid La Suite account is an admin.
      # To restrict to specific users, set role_attribute_path instead.
      auto_assign_org_role: Admin
      skip_org_role_sync: true
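      # Sketch only: one way to restrict roles, as the note above suggests.
      # It assumes the userinfo response carries a "groups" claim with an
      # "admins" entry; both names are hypothetical and not taken from this setup.
      # Role mapping from the provider only applies when skip_org_role_sync is false.
      #
      # role_attribute_path: "contains(groups[*], 'admins') && 'Admin' || 'Viewer'"
      # skip_org_role_sync: false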
  sidecar:
    datasources:
      defaultDatasourceEnabled: false
2026-03-21 17:36:54 +00:00
    dashboards:
      enabled: true
      # Pick up ConfigMaps with this label in any namespace
      label: grafana_dashboard
      labelValue: "1"
      searchNamespace: ALL
      folderAnnotation: grafana_folder
      provider:
        foldersFromFilesStructure: false
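      # Sketch only: a dashboard ConfigMap that the sidecar settings above
      # should discover. The name, namespace, and JSON body are hypothetical
      # placeholders; only the grafana_dashboard label comes from this file.
      #
      # apiVersion: v1
      # kind: ConfigMap
      # metadata:
      #   name: example-dashboard
      #   namespace: devtools
      #   labels:
      #     grafana_dashboard: "1"
      # data:
      #   example-dashboard.json: |
      #     {"title": "Example", "panels": []}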

  additionalDataSources:
    - name: Prometheus
      type: prometheus
      uid: prometheus
feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
      url: "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090"
      access: proxy
      isDefault: true
      jsonData:
        timeInterval: 30s
    - name: Loki
      type: loki
      uid: loki
feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.
Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting
New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
      url: "http://loki.monitoring.svc.cluster.local:3100"
      access: proxy
      isDefault: false
      jsonData:
        derivedFields:
          # Click a traceID in a log line → jump straight to Tempo
          - datasourceUid: tempo
            matcherRegex: '"traceID":"(\w+)"'
            name: TraceID
            url: "$${__value.raw}"
    - name: Tempo
      type: tempo
      uid: tempo
      url: "http://tempo.monitoring.svc.cluster.local:3200"
      access: proxy
      isDefault: false
      jsonData:
        tracesToLogsV2:
          datasourceUid: loki
          filterByTraceID: true
          filterBySpanID: false
          tags:
            - key: namespace
            - key: pod
        tracesToMetrics:
          datasourceUid: prometheus
          tags:
            - key: service.name
              value: service
        lokiSearch:
          datasourceUid: loki
        serviceMap:
          datasourceUid: prometheus

prometheus:
  prometheusSpec:
    # Discover ServiceMonitors / PodMonitors / PrometheusRules in ALL namespaces,
    # not just "monitoring". Without this, monitors in ingress, mesh,
    # cert-manager, devtools, etc. are invisible to Prometheus.
    serviceMonitorNamespaceSelector: {}
    podMonitorNamespaceSelector: {}
    ruleNamespaceSelector: {}
    serviceMonitorSelector: {}
    podMonitorSelector: {}
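    # Sketch only: a ServiceMonitor outside the monitoring namespace that the
    # namespace selectors above allow Prometheus to see. The app name,
    # namespace, and port name are hypothetical. The release label is an
    # assumption based on the chart's default label matching; it may be
    # unnecessary in this setup.
    #
    # apiVersion: monitoring.coreos.com/v1
    # kind: ServiceMonitor
    # metadata:
    #   name: example-app
    #   namespace: devtools
    #   labels:
    #     release: kube-prometheus-stack
    # spec:
    #   selector:
    #     matchLabels:
    #       app.kubernetes.io/name: example-app
    #   endpoints:
    #     - port: metrics
    #       interval: 30s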
    # Accept remote-write from Tempo metrics generator
    enableRemoteWriteReceiver: true
    retention: 90d
    additionalArgs:
      # Allow browser-direct queries from the Grafana UI origin.
      - name: web.cors.origin
        value: "https://metrics.DOMAIN_SUFFIX"
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 30Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 2Gi
  config:
    route:
      group_by: [alertname, namespace]
      group_wait: 30s
      group_interval: 5m
2026-03-24 12:21:29 +00:00
      repeat_interval: 4h
      receiver: matrix
      routes:
        - matchers:
            - alertname = Watchdog
          receiver: matrix
          repeat_interval: 12h
2026-04-06 17:40:56 +01:00
        - matchers:
            - alertname = InfoInhibitor
          receiver: "null"
        - matchers:
            - severity = critical
          receiver: matrix
        - matchers:
            - severity = warning
          receiver: matrix
    receivers:
      - name: "null"
      - name: matrix
        webhook_configs:
2026-03-25 18:01:15 +00:00
          - url: "http://matrix-alertmanager-receiver.monitoring.svc.cluster.local:3000/alerts/alerts"
            send_resolved: true
    inhibit_rules:
      # Critical alerts suppress warnings for the same alertname+namespace
      - source_matchers:
          - severity = critical
        target_matchers:
          - severity = warning
        equal: [alertname, namespace]
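      # Sketch only: with the rule above, a firing critical alert mutes the
      # warning that shares its alertname and namespace. The alert name,
      # metric, and thresholds are hypothetical; rules like these would live
      # in a PrometheusRule, not in this values file.
      #
      # - alert: ExampleDiskFull
      #   expr: example_disk_used_ratio > 0.80
      #   labels:
      #     severity: warning
      # - alert: ExampleDiskFull
      #   expr: example_disk_used_ratio > 0.95
      #   labels:
      #     severity: critical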

# Disable monitors for components k3s doesn't expose
kubeEtcd:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false