Commit Graph

7 Commits

Author SHA1 Message Date
e4987b4c58 feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
efe574f48e feat(storage): sccache S3 build cache with scoped SeaweedFS identity
Add sunbeam-sccache bucket and a dedicated sccache S3 identity scoped
to Read/Write/List/Tagging on that bucket only. Bump volume server
max from 50 to 100 (was full, blocking all new writes).
2026-04-05 21:50:46 +01:00
048319f70b fix(devtools): stabilize Penpot MCP, fix S3 creds, OIDC registration
MCP server:
- Replace vite build --watch + livePreview with static vite preview
  (watch mode was reloading the plugin iframe, killing WebSocket)
- Bake WS_URI at Docker build time for production WebSocket URL
- Add server-side application-level keepalive messages every 25s
- Add client-side auto-reconnect with exponential backoff
- Set Pingora route timeout to 86400s for WebSocket idle tolerance

Penpot:
- Add AWS_ACCESS_KEY_ID/SECRET env vars for S3 SDK compatibility
- Set S3 region to satisfy AWS SDK credential chain
- Enable OIDC registration (disable-registration blocks OIDC signup)
- Fix frontend port (8080 not 80)
- Add penpot bucket to seaweedfs-buckets init job
2026-04-04 15:37:45 +01:00
0a322c8a7c remove: Docs (impress) and People (desk) from La Suite
Collabora stays (Drive needs it for WOPI document editing).
Removed: Helm charts, values, nginx configs, patches, OIDC clients,
Vault secrets, S3 buckets, Pingora routes, Kratos return URLs,
overlay image overrides and resource patches, local-up.sh restarts.
2026-03-25 17:53:43 +00:00
ccfe8b877a feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
  proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
  monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
  remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
5e36322a3b lasuite: declarative pre-work for La Suite app deployments
- Add find user and find_db to postgres-cluster.yaml (11th database)
- Add sunbeam-messages-imports and sunbeam-people buckets to SeaweedFS
- Configure Hydra Maester with enabledNamespaces: [lasuite] so it can
  create and update OAuth2Client secrets in the lasuite namespace
- Add find to Kratos allowed_return_urls
- Add shared ConfigMaps: lasuite-postgres, lasuite-valkey, lasuite-s3,
  lasuite-oidc-provider — single source of truth for all app env vars
- Add HydraOAuth2Client CRDs for all nine La Suite apps (docs, drive,
  meet, conversations, messages, people, find, gitea, hive); Maester
  will create oidc-<app> secrets with CLIENT_ID and CLIENT_SECRET
2026-03-01 18:03:13 +00:00
5d9bd7b067 chore: initial infrastructure scaffold
Kustomize base + overlays for the full Sunbeam k3s stack:
- base/mesh      — Linkerd edge (crds + control-plane + viz)
- base/ingress   — custom Pingora edge proxy
- base/ory       — Kratos 0.60.1 + Hydra 0.60.1 + login-ui
- base/data      — CloudNativePG 0.27.1, Valkey 8, OpenSearch 2
- base/storage   — SeaweedFS master + volume + filer (S3 on :8333)
- base/lasuite   — Hive sync daemon + La Suite app placeholders
- base/media     — LiveKit livekit-server 1.9.0
- base/devtools  — Gitea 12.5.0 (external PG + Valkey)
overlays/local   — sslip.io domain, mkcert TLS, Lima hostPort
overlays/production — stub (TODOs for sunbeam.pt values)
scripts/         — local-up/down/certs/urls helpers
justfile         — up / down / certs / urls targets
2026-02-28 13:42:27 +00:00