Commit Graph

14 Commits

Author SHA1 Message Date
e4987b4c58 feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
048319f70b fix(devtools): stabilize Penpot MCP, fix S3 creds, OIDC registration
MCP server:
- Replace vite build --watch + livePreview with static vite preview
  (watch mode was reloading the plugin iframe, killing WebSocket)
- Bake WS_URI at Docker build time for production WebSocket URL
- Add server-side application-level keepalive messages every 25s
- Add client-side auto-reconnect with exponential backoff
- Set Pingora route timeout to 86400s for WebSocket idle tolerance

Penpot:
- Add AWS_ACCESS_KEY_ID/SECRET env vars for S3 SDK compatibility
- Set S3 region to satisfy AWS SDK credential chain
- Enable OIDC registration (disable-registration blocks OIDC signup)
- Fix frontend port (8080 not 80)
- Add penpot bucket to seaweedfs-buckets init job
2026-04-04 15:37:45 +01:00
fcb80f1f37 feat(devtools): deploy Penpot + MCP server, wildcard TLS via DNS-01
Penpot (designer.sunbeam.pt):
- Frontend/backend/exporter deployments with OIDC-only auth via Hydra
- VSO-managed DB, S3, and app secrets from OpenBao
- PostgreSQL user/db in CNPG postInitSQL
- Hydra Maester enabledNamespaces extended to devtools

Penpot MCP server (mcp-designer.sunbeam.pt):
- Pre-built Node.js image pushed to Gitea registry
- Auth-gated via Pingora auth_request → Hydra /userinfo
- WebSocket path for browser plugin connection

Wildcard TLS:
- Switched cert-manager from HTTP-01 (per-SAN) to DNS-01 via Scaleway webhook
- Certificate collapsed to *.sunbeam.pt + sunbeam.pt
- Added scaleway-certmanager-webhook Helm chart
- VSO secret for Scaleway DNS API credentials in cert-manager namespace
- Added cert-manager to OpenBao VSO auth role
2026-04-04 12:53:27 +01:00
97628b0f6f feat(ory): OIDC group-to-team mapping, social login, Gitea OIDC-only mode
Identity permissions flow from Kratos metadata_admin.groups through
Hydra ID token claims to Gitea's OIDC group-to-team mapping:
- super-admin → site admin + Owners + Employees teams
- employee → Owners + Employees teams
- community → Contributors team (social sign-up users)

Kratos: Discord + GitHub social login providers, community identity
schema, OIDC method enabled with env-var credential injection via VSO.

Gitea: OIDC-only login (no local registration, no password form),
APP_NAME, favicon, auto-registration with account linking.

Also: messages-mta-in recreate strategy + liveness probe for milter.
2026-03-27 17:46:11 +00:00
9f15f5099e fix: meet external-api route, drive media proxy, alertbot, misc tweaks
- Meet: add external-api backend path, CSRF trusted origins
- Drive: fix media proxy regex for preview URLs and S3 key signing
- OpenBao: enable Prometheus telemetry
- Postgres alerts: fix metric name (cnpg_backends_total)
- Gitea: bump memory limits for mirror workloads
- Alertbot: expanded deployment config
- Kratos: add find/cal/projects to allowed return URLs, settings path
- Pingora: meet external-api route fix
- Sol: config update
2026-03-25 18:01:15 +00:00
e8c64e6f18 feat: add ServiceMonitors and enable metrics scraping
- SeaweedFS: enable -metricsPort=9091 on master/volume/filer, add
  service labels, create ServiceMonitor
- Gitea: enable metrics in config, create ServiceMonitor
- Hydra/Kratos: standalone ServiceMonitors (chart templates require
  .Capabilities.APIVersions unavailable in kustomize helm template)
- LiveKit: add prometheus_port=6789, standalone ServiceMonitor
  (disabled in kustomization — host firewall blocks port 6789)
- OpenSearch: revert prometheus-exporter attempt (no plugin for v3.x),
  add service label for future exporter sidecar
2026-03-24 12:21:18 +00:00
3fc54c8851 feat: add PrometheusRule alerts for all services
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).

Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
2026-03-24 12:20:55 +00:00
b9d9ad72fe fix(ory): enable MFA methods, fix font URL, clean up login-ui
Enable TOTP, WebAuthn, and lookup secret MFA methods in Kratos config.
Fix Monaspace Neon font CDN URL in Gitea theme ConfigMap. Remove
redundant Google Fonts preconnect from people-frontend nginx config.
Delete unused login-ui-deployment.yaml (login-ui is part of the Ory
Helm chart, not a standalone deployment).
2026-03-18 18:36:15 +00:00
ccfe8b877a feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
  proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
  monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
  remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
d32d1435f9 feat(infra): data, storage, devtools, and ory layer updates
- data: CNPG cluster tuning, OpenBao values, OpenSearch deployment fixes,
  OpenSearch PVC, barman vault secret for S3 backup credentials
- storage: SeaweedFS filer updates (s3.json via secret subPath), PVC for
  filer persistent storage
- devtools: Gitea values (SSH service, custom theme), gitea-theme-cm ConfigMap
- ory: add kratos-selfservice-urls.yaml for self-service flow URLs
- media: LiveKit values updated (TURN config, STUN, resource limits)
- vso: kustomization cleanup
2026-03-06 12:07:28 +00:00
7ffddcafcd fix(ory,lasuite): harden session security and fix logout + WebSocket routing
- Fix Hydra postLogoutRedirectUris for docs and people to match the
  actual URI sent by mozilla_django_oidc v5 (/api/v1.0/logout-callback/)
  instead of the root URL, resolving 599 logout errors.

- Fix docs y-provider WebSocket backend port: use Service port 443
  (not pod port 4444 which has no DNAT rule) in Pingora config.

- Tighten VSO VaultDynamicSecret rotation sync: add allowStaticCreds:true
  and reduce refreshAfter from 1h to 5m across all static-creds paths
  (kratos, hydra, gitea, hive, people, docs) so credential rotation is
  reflected within 5 minutes instead of up to 1 hour.

- Set Hydra token TTLs: access_token and id_token to 5m; refresh_token
  to 720h (30 days). Kratos session carries silent re-auth so the short
  access token TTL does not require users to log in manually.

- Set SESSION_COOKIE_AGE=3600 (1h) in docs and people backends. After
  1h, apps silently re-auth via the active Kratos session. Disabled
  identities (sunbeam user disable) cannot re-auth on next expiry.
2026-03-03 18:07:08 +00:00
8cb705fecc feat(devtools): migrate Gitea to OpenBao DB static role; sync admin creds via VSO
- gitea-db-credentials is now a VaultDynamicSecret reading from
  database/static-creds/gitea (OpenBao static role, 24h password rotation).
  Replaces the previous KV-based Secret that used a hardcoded localdev password.
- gitea-admin-credentials and gitea-s3-credentials remain VaultStaticSecrets
  synced from secret/gitea and secret/seaweedfs respectively.
- gitea-values.yaml adds gitea.admin.existingSecret so the chart reads the
  admin username/password from the VSO-managed Secret instead of values.
2026-03-02 18:33:16 +00:00
a589e6280d feat: bring up local dev stack — all services running
- Ory Hydra + Kratos: fixed secret management, DSN config, DB migrations,
  OAuth2Client CRD (helm template skips crds/ dir), login-ui env vars
- SeaweedFS: added s3.json credentials file via -s3.config CLI flag
- OpenBao: standalone mode with auto-unseal sidecar, keys in K8s secret
- OpenSearch: increased memory to 1.5Gi / JVM 1g heap
- Gitea: SSL_MODE disable, S3 bucket creation fixed
- Hive: automountServiceAccountToken: false (Lima virtiofs read-only rootfs quirk)
- LiveKit: API keys in values, hostPort conflict resolved
- Linkerd: native sidecar (proxy.nativeSidecar=true) to avoid blocking Jobs
- All placeholder images replaced: pingora→nginx:alpine, login-ui→oryd/kratos-selfservice-ui-node

Full stack running: postgres, valkey, openbao, opensearch, seaweedfs,
kratos, hydra, gitea, livekit, hive (placeholder), login-ui
2026-02-28 22:08:38 +00:00
5d9bd7b067 chore: initial infrastructure scaffold
Kustomize base + overlays for the full Sunbeam k3s stack:
- base/mesh      — Linkerd edge (crds + control-plane + viz)
- base/ingress   — custom Pingora edge proxy
- base/ory       — Kratos 0.60.1 + Hydra 0.60.1 + login-ui
- base/data      — CloudNativePG 0.27.1, Valkey 8, OpenSearch 2
- base/storage   — SeaweedFS master + volume + filer (S3 on :8333)
- base/lasuite   — Hive sync daemon + La Suite app placeholders
- base/media     — LiveKit livekit-server 1.9.0
- base/devtools  — Gitea 12.5.0 (external PG + Valkey)
overlays/local   — sslip.io domain, mkcert TLS, Lima hostPort
overlays/production — stub (TODOs for sunbeam.pt values)
scripts/         — local-up/down/certs/urls helpers
justfile         — up / down / certs / urls targets
2026-02-28 13:42:27 +00:00