e4987b4c58
feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
...
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.
Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting
New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (single-node aware)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
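
The inhibition bullet above can be read as a label-matching rule. The following is a minimal Python sketch of that semantics, assuming Alertmanager-style equality on the `alertname` and `namespace` labels. The function name and the sample alerts are hypothetical; the real rule lives in the AlertManager config, not in code like this.

```python
# Illustrative model of "critical suppresses warning for same alert+namespace".
# Not Alertmanager code; the function name and sample alerts are hypothetical.

EQUAL_LABELS = ("alertname", "namespace")


def is_inhibited(target: dict, firing: list[dict]) -> bool:
    """Return True if a firing critical alert mutes this warning alert."""
    if target.get("severity") != "warning":
        return False  # only warnings are targets of the rule
    for source in firing:
        if source.get("severity") != "critical":
            continue  # only criticals act as sources
        # Every label in EQUAL_LABELS must carry the same value on both sides.
        if all(source.get(k) == target.get(k) for k in EQUAL_LABELS):
            return True
    return False


firing = [
    {"alertname": "PVCFull", "namespace": "data", "severity": "critical"},
    {"alertname": "PVCFull", "namespace": "data", "severity": "warning"},
    {"alertname": "PVCFull", "namespace": "media", "severity": "warning"},
]

# One flag per alert, in the same order as `firing`.
muted = [is_inhibited(alert, firing) for alert in firing]
```

Only the warning in the `data` namespace is muted here; the warning in `media` still fires because no critical alert shares its namespace.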
2026-04-06 15:52:06 +01:00
632099893a
feat(media): deploy Element Call fork + optimize LiveKit quality
...
- Deploy self-hosted Element Call at call.sunbeam.pt with SSO login
- LiveKit: VP9 > AV1 > H.264 codec preferences, Opus stereo
- LiveKit: congestion_control.allow_pause=false, larger NACK buffers
- LiveKit: resources bumped to 2Gi/4CPU for VP9 SVC
- Proxy: add call.* route, TLS cert SAN for call.sunbeam.pt
2026-03-26 09:38:53 +00:00
4837983380
feat(media): deploy lk-jwt-service for Matrix Element Call
...
Bridges Element Call to LiveKit by exchanging Matrix OpenID tokens for
LiveKit JWTs. Shares API credentials with livekit-server via the
existing VSO secret (removed excludeRaw so raw fields are available).
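
As a rough illustration of the token exchange described above, the sketch below covers only the minting half: building an HS256-signed JWT with the shared API key and secret. The claim layout (`iss`, `sub`, `video` grant) is an assumption about the LiveKit token format, the function names are hypothetical, and validation of the Matrix OpenID token is deliberately left out. It is not the lk-jwt-service implementation.

```python
# Sketch of the minting step only: sign an HS256 JWT with the shared secret.
# Claim names are an assumption about LiveKit's token layout; the Matrix
# OpenID token check that would precede this step is not shown.
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(signing_input: str, api_secret: str) -> str:
    digest = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return _b64url(digest)


def mint_livekit_jwt(api_key, api_secret, identity, room, ttl=3600, now=None):
    """Return a compact JWT granting `identity` permission to join `room`."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,            # API key shared with livekit-server
        "sub": identity,           # e.g. the Matrix user ID
        "nbf": now,
        "exp": now + ttl,
        "video": {"room": room, "roomJoin": True},
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, claims)
    )
    return f"{signing_input}.{_sign(signing_input, api_secret)}"


def verify_signature(token: str, api_secret: str) -> bool:
    """Recompute the HMAC over header.payload and compare in constant time."""
    signing_input, _, signature = token.rpartition(".")
    return hmac.compare_digest(_sign(signing_input, api_secret), signature)
```

Because both services read the same key and secret, a token minted this way verifies on the LiveKit side without any further coordination.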
2026-03-25 13:23:48 +00:00
e8c64e6f18
feat: add ServiceMonitors and enable metrics scraping
...
- SeaweedFS: enable -metricsPort=9091 on master/volume/filer, add
service labels, create ServiceMonitor
- Gitea: enable metrics in config, create ServiceMonitor
- Hydra/Kratos: standalone ServiceMonitors (chart templates require
.Capabilities.APIVersions, which is unavailable in kustomize helm template)
- LiveKit: add prometheus_port=6789, standalone ServiceMonitor
(disabled in kustomization — host firewall blocks port 6789)
- OpenSearch: revert prometheus-exporter attempt (no plugin for v3.x),
add service label for future exporter sidecar
2026-03-24 12:21:18 +00:00
3fc54c8851
feat: add PrometheusRule alerts for all services
...
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).
Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
2026-03-24 12:20:55 +00:00
ccfe8b877a
feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
...
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
f3faf31d4b
Fix meet: ALLOWED_HOSTS, OIDC callback, and LiveKit connectivity
...
- meet-config: rename ALLOWED_HOSTS → DJANGO_ALLOWED_HOSTS (django-configurations
ListValue uses DJANGO_ prefix by default; without it the list was empty and
every browser request got 400 DisallowedHost)
- meet-config: set LIVEKIT_API_URL to public https://livekit.DOMAIN_SUFFIX so
the meet frontend can reach LiveKit for WebSocket signaling
- pingora-config: add livekit.DOMAIN_SUFFIX → livekit-server:80 WebSocket route
- cert-manager: add livekit.DOMAIN_SUFFIX to TLS cert dnsNames
- oidc-clients: fix meet redirect URI /oidc/callback/ → /api/v1.0/callback/
(meet embeds mozilla-django-oidc inside the api/v1.0/ prefix); add
postLogoutRedirectUri for clean logout
- livekit-values: replace hardcoded devkey:secret-placeholder with key_file
loaded from a VSO-managed K8s Secret (secret/livekit in OpenBao)
- media/vault-secrets: add VaultAuth + VaultStaticSecret for media namespace
to sync livekit API credentials from OpenBao
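
The ALLOWED_HOSTS fix at the top of this list comes down to a prefixed environment lookup. The sketch below emulates that behaviour in plain Python, assuming a `DJANGO` prefix and comma-separated values as the commit describes. `read_list_value` is a hypothetical helper, not the django-configurations API, and the hostname is a placeholder.

```python
# Emulation of a prefix-based env lookup, as described in the commit message.
# This is NOT django-configurations itself; read_list_value is hypothetical.

def read_list_value(name, environ, prefix="DJANGO"):
    """Look up PREFIX_NAME in environ and split the value on commas."""
    raw = environ.get(f"{prefix}_{name}", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


# Before the fix: the variable lacks the prefix, so the lookup finds nothing.
before = read_list_value("ALLOWED_HOSTS", {"ALLOWED_HOSTS": "meet.example.org"})

# After the fix: the prefixed name is found and parsed into a list.
after = read_list_value("ALLOWED_HOSTS", {"DJANGO_ALLOWED_HOSTS": "meet.example.org"})
```

With an empty list, Django rejects every Host header, which matches the 400 DisallowedHost responses noted in the commit.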
2026-03-06 13:56:29 +00:00
d32d1435f9
feat(infra): data, storage, devtools, and ory layer updates
...
- data: CNPG cluster tuning, OpenBao values, OpenSearch deployment fixes,
OpenSearch PVC, barman vault secret for S3 backup credentials
- storage: SeaweedFS filer updates (s3.json via secret subPath), PVC for
filer persistent storage
- devtools: Gitea values (SSH service, custom theme), gitea-theme-cm ConfigMap
- ory: add kratos-selfservice-urls.yaml for self-service flow URLs
- media: LiveKit values updated (TURN config, STUN, resource limits)
- vso: kustomization cleanup
2026-03-06 12:07:28 +00:00
7de6e94a8d
fix: resource tuning — LiveKit Recreate strategy, OpenSearch JVM heap, login-ui
...
LiveKit: switch to Recreate deployment strategy. hostPorts (TURN UDP relay
range) block RollingUpdate because the new pod cannot schedule while the
old one still holds the ports.
OpenSearch: set OPENSEARCH_JAVA_OPTS to -Xms192m -Xmx256m. The upstream
default (-Xms512m -Xmx1g) immediately OOMs the container given our 512Mi
memory limit.
login-ui: raise memory limit from 64Mi to 192Mi and add a 64Mi request;
the previous limit was too tight and caused OOMKilled restarts under load.
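
The OpenSearch heap arithmetic above can be checked mechanically. This is a minimal sketch, assuming binary units on both sides (JVM `m`/`g` and Kubernetes `Mi`/`Gi`); the helper names are hypothetical. A heap below the limit is necessary but not sufficient, since the JVM also uses off-heap memory.

```python
# Compare a JVM max heap (-Xmx) against a container memory limit.
# Helper names are hypothetical; only whole-number quantities are handled.
import re

_JVM_UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}
_K8S_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def jvm_max_heap_bytes(java_opts: str) -> int:
    """Extract the -Xmx value from a JVM options string, in bytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx flag found")
    return int(match.group(1)) * _JVM_UNITS[match.group(2).lower()]


def k8s_memory_bytes(quantity: str) -> int:
    """Convert a Kubernetes binary quantity such as '512Mi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity}")
    return int(match.group(1)) * _K8S_UNITS[match.group(2)]


def heap_fits(java_opts: str, memory_limit: str) -> bool:
    """True if the max heap is strictly below the container limit."""
    return jvm_max_heap_bytes(java_opts) < k8s_memory_bytes(memory_limit)


upstream_ok = heap_fits("-Xms512m -Xmx1g", "512Mi")    # upstream default
tuned_ok = heap_fits("-Xms192m -Xmx256m", "512Mi")     # values from this commit
```

The upstream default asks for a 1 GiB heap inside a 512 MiB container, so it fails the check, while the tuned values leave 256 MiB of headroom.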
2026-03-02 18:33:42 +00:00
a589e6280d
feat: bring up local dev stack — all services running
...
- Ory Hydra + Kratos: fixed secret management, DSN config, DB migrations,
OAuth2Client CRD (helm template skips crds/ dir), login-ui env vars
- SeaweedFS: added s3.json credentials file via -s3.config CLI flag
- OpenBao: standalone mode with auto-unseal sidecar, keys in K8s secret
- OpenSearch: increased memory to 1.5Gi / JVM 1g heap
- Gitea: SSL_MODE disable, S3 bucket creation fixed
- Hive: automountServiceAccountToken: false (Lima virtiofs read-only rootfs quirk)
- LiveKit: API keys in values, hostPort conflict resolved
- Linkerd: native sidecar (proxy.nativeSidecar=true) to avoid blocking Jobs
- All placeholder images replaced: pingora→nginx:alpine, login-ui→oryd/kratos-selfservice-ui-node
Full stack running: postgres, valkey, openbao, opensearch, seaweedfs,
kratos, hydra, gitea, livekit, hive (placeholder), login-ui
2026-02-28 22:08:38 +00:00
886c4221b2
fix(local): kustomize render passes cleanly
...
- Remove base/mesh from local overlay (Linkerd installed via CLI in local-up.sh)
- Fix LiveKit namespace: chart doesn't set .Release.Namespace, add explicit patches
- Fix release names: livekit-server and cloudnative-pg match chart names (avoid double-prefix)
- Disable hydra-maester (not needed for local dev)
- Add memory limits for cloudnative-pg operator and livekit-server deployments
- Remove non-functional values-ory.yaml patch (DOMAIN_SUFFIX handled by sed in local-up.sh)
- Gitignore **/charts/ (kustomize helm cache, generated artifact)
2026-02-28 14:00:31 +00:00
5d9bd7b067
chore: initial infrastructure scaffold
...
Kustomize base + overlays for the full Sunbeam k3s stack:
- base/mesh — Linkerd edge (crds + control-plane + viz)
- base/ingress — custom Pingora edge proxy
- base/ory — Kratos 0.60.1 + Hydra 0.60.1 + login-ui
- base/data — CloudNativePG 0.27.1, Valkey 8, OpenSearch 2
- base/storage — SeaweedFS master + volume + filer (S3 on :8333)
- base/lasuite — Hive sync daemon + La Suite app placeholders
- base/media — LiveKit livekit-server 1.9.0
- base/devtools — Gitea 12.5.0 (external PG + Valkey)
overlays/local — sslip.io domain, mkcert TLS, Lima hostPort
overlays/production — stub (TODOs for sunbeam.pt values)
scripts/ — local-up/down/certs/urls helpers
justfile — up / down / certs / urls targets
2026-02-28 13:42:27 +00:00