Commit Graph

23 Commits

Author SHA1 Message Date
8662c79212 checkpoint: stalwart deploy, beam-design, migration scripts, config tweaks
Stalwart + Bulwark mail server deployment with OIDC, TLS cert, vault
secrets. Beam design service. Pingora config cleanup. SeaweedFS
replication fix. Kratos values tweak. Migration scripts for
mbox/messages/calendars from La Suite to Stalwart.
2026-04-06 17:52:30 +01:00
e4987b4c58 feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
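Note on e4987b4c58: two of the ideas in the message above, the SLO burn rate and the critical-over-warning inhibition, can be sketched in a few lines of Python. This is only an illustration of the logic. The actual rules live in PrometheusRule and AlertManager config, and the function names and the alert dict shape here are hypothetical.

```python
from __future__ import annotations


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target  # a 99.9% target leaves a 0.1% error budget
    return error_ratio / budget


def is_inhibited(alert: dict, firing: list[dict]) -> bool:
    """A warning is suppressed while a critical alert with the same
    alertname and namespace is firing. Dict keys are assumed label names."""
    if alert["severity"] != "warning":
        return False
    return any(
        other["severity"] == "critical"
        and other["alertname"] == alert["alertname"]
        and other["namespace"] == alert["namespace"]
        for other in firing
    )
```

With the 99.9% auth-stack target, an observed error ratio of 0.2% gives a burn rate of about 2, meaning the budget would be used up in half the SLO window.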
fcb80f1f37 feat(devtools): deploy Penpot + MCP server, wildcard TLS via DNS-01
Penpot (designer.sunbeam.pt):
- Frontend/backend/exporter deployments with OIDC-only auth via Hydra
- VSO-managed DB, S3, and app secrets from OpenBao
- PostgreSQL user/db in CNPG postInitSQL
- Hydra Maester enabledNamespaces extended to devtools

Penpot MCP server (mcp-designer.sunbeam.pt):
- Pre-built Node.js image pushed to Gitea registry
- Auth-gated via Pingora auth_request → Hydra /userinfo
- WebSocket path for browser plugin connection

Wildcard TLS:
- Switched cert-manager from HTTP-01 (per-SAN) to DNS-01 via Scaleway webhook
- Certificate collapsed to *.sunbeam.pt + sunbeam.pt
- Added scaleway-certmanager-webhook Helm chart
- VSO secret for Scaleway DNS API credentials in cert-manager namespace
- Added cert-manager to OpenBao VSO auth role
2026-04-04 12:53:27 +01:00
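Note on fcb80f1f37: the collapsed certificate lists both *.sunbeam.pt and sunbeam.pt because a wildcard stands in for exactly one leftmost DNS label and does not cover the apex. A minimal Python sketch of that matching rule follows. The helper names are hypothetical, and this is not how cert-manager or Pingora performs the check.

```python
def covered_by(hostname: str, san: str) -> bool:
    """True if a single certificate SAN covers the hostname.

    A leading "*" matches exactly one non-empty leftmost label, so
    *.sunbeam.pt covers designer.sunbeam.pt but neither sunbeam.pt
    nor a.b.sunbeam.pt.
    """
    host_labels = hostname.lower().split(".")
    san_labels = san.lower().split(".")
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        return host_labels[0] != "" and host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


# SANs taken from the commit message above.
SANS = ["*.sunbeam.pt", "sunbeam.pt"]


def cert_covers(hostname: str) -> bool:
    """True if any SAN on the collapsed certificate covers the hostname."""
    return any(covered_by(hostname, san) for san in SANS)
```

Both designer.sunbeam.pt and mcp-designer.sunbeam.pt are covered by the wildcard entry alone; only the bare domain needs the second SAN.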
97628b0f6f feat(ory): OIDC group-to-team mapping, social login, Gitea OIDC-only mode
Identity permissions flow from Kratos metadata_admin.groups through
Hydra ID token claims to Gitea's OIDC group-to-team mapping:
- super-admin → site admin + Owners + Employees teams
- employee → Owners + Employees teams
- community → Contributors team (social sign-up users)

Kratos: Discord + GitHub social login providers, community identity
schema, OIDC method enabled with env-var credential injection via VSO.

Gitea: OIDC-only login (no local registration, no password form),
APP_NAME, favicon, auto-registration with account linking.

Also: messages-mta-in recreate strategy + liveness probe for milter.
2026-03-27 17:46:11 +00:00
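Note on 97628b0f6f: the group-to-team mapping in the message above amounts to a lookup table. The Python sketch below restates it for illustration only. The dict layout and the resolve function are hypothetical and do not reflect Gitea's actual configuration format, and ignoring unknown groups is an assumption.

```python
from __future__ import annotations

# Mapping as described in the commit message.
GROUP_TEAMS = {
    "super-admin": {"site_admin": True, "teams": ["Owners", "Employees"]},
    "employee": {"site_admin": False, "teams": ["Owners", "Employees"]},
    "community": {"site_admin": False, "teams": ["Contributors"]},
}


def resolve(groups: list[str]) -> dict:
    """Combine the permissions granted by every group in the ID token claim."""
    site_admin = False
    teams: list[str] = []
    for group in groups:
        entry = GROUP_TEAMS.get(group)
        if entry is None:
            continue  # assumption: unknown groups grant nothing
        site_admin = site_admin or entry["site_admin"]
        for team in entry["teams"]:
            if team not in teams:
                teams.append(team)
    return {"site_admin": site_admin, "teams": teams}
```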
a912331f97 feat: CNPG PodMonitor, OpenBao ServiceMonitor, CLI OIDC client CRD
- CNPG PodMonitor for PostgreSQL cluster metrics
- OpenBao ServiceMonitor for vault metrics scraping
- Sunbeam CLI OAuth2Client CRD (moved from seed to declarative)
2026-03-25 18:01:52 +00:00
50a4abf94f fix(ory): harden Kratos and Hydra production security configuration
Kratos: xchacha20-poly1305 cipher for at-rest encryption, 12-char min
password with HaveIBeenPwned + similarity check, recovery/verification
switched to code (not link), anti-enumeration on unknown recipients,
15m privileged session, 24h session extend throttle, JSON structured
logging, WebAuthn passwordless enabled, additionalProperties: false on
all identity schemas, memory limits bumped to 256Mi.

Hydra: expose_internal_errors disabled, PKCE enforced for public
clients, janitor CronJob every 6h, cookie domain set explicitly,
SSRF prevention via disallow_private_ip_ranges, JSON structured
logging, Maester enabledNamespaces includes monitoring.

Also: fixed selfservice URL patch divergence (settings path, missing
allowed_return_urls), removed invalid responseTypes on Hive client.
2026-03-24 19:40:58 +00:00
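Note on 50a4abf94f: with PKCE enforced, a public client has to send a code challenge with the authorization request and the matching verifier with the token request. The Python sketch below shows the S256 derivation from RFC 7636. The commit does not say which challenge method Hydra is configured to accept, so S256 is an assumption, and the function names are illustrative.

```python
import base64
import hashlib
import secrets


def make_verifier() -> str:
    """Random code_verifier: 32 bytes encode to 43 URL-safe characters,
    the minimum length RFC 7636 allows."""
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def s256_challenge(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(code_verifier)), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client keeps the verifier private until the token exchange, so an intercepted authorization code cannot be redeemed without it.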
e8c64e6f18 feat: add ServiceMonitors and enable metrics scraping
- SeaweedFS: enable -metricsPort=9091 on master/volume/filer, add
  service labels, create ServiceMonitor
- Gitea: enable metrics in config, create ServiceMonitor
- Hydra/Kratos: standalone ServiceMonitors (chart templates require
  .Capabilities.APIVersions unavailable in kustomize helm template)
- LiveKit: add prometheus_port=6789, standalone ServiceMonitor
  (disabled in kustomization — host firewall blocks port 6789)
- OpenSearch: revert prometheus-exporter attempt (no plugin for v3.x),
  add service label for future exporter sidecar
2026-03-24 12:21:18 +00:00
3fc54c8851 feat: add PrometheusRule alerts for all services
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).

Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
2026-03-24 12:20:55 +00:00
b9d9ad72fe fix(ory): enable MFA methods, fix font URL, clean up login-ui
Enable TOTP, WebAuthn, and lookup secret MFA methods in Kratos config.
Fix Monaspace Neon font CDN URL in Gitea theme ConfigMap. Remove
redundant Google Fonts preconnect from people-frontend nginx config.
Delete unused login-ui-deployment.yaml (login-ui is part of the Ory
Helm chart, not a standalone deployment).
2026-03-18 18:36:15 +00:00
ccfe8b877a feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
  proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
  monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
  remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
e5741c4df6 feat: integrate tuwunel with Ory SSO, rename chat to messages subdomain
- Add matrix to hydra-maester enabledNamespaces for OAuth2Client CRD
- Update allowed_return_urls and selfservice URLs: chat→messages
- Add Kratos verification flow, employee/external identity schemas
- Extend session lifespan to 30 days with persistent cookies
- Route messages.* to tuwunel via Pingora with WebSocket support
- Replace login-ui with kratos-admin-ui as unified auth frontend
- Update TLS certificate SANs: chat→messages, add monitoring subdomains
- Add tuwunel + La Suite images to production overlay
- Switch DDoS/scanner detection to compiled-in ensemble models (observe_only)
2026-03-10 18:52:47 +00:00
d32d1435f9 feat(infra): data, storage, devtools, and ory layer updates
- data: CNPG cluster tuning, OpenBao values, OpenSearch deployment fixes,
  OpenSearch PVC, barman vault secret for S3 backup credentials
- storage: SeaweedFS filer updates (s3.json via secret subPath), PVC for
  filer persistent storage
- devtools: Gitea values (SSH service, custom theme), gitea-theme-cm ConfigMap
- ory: add kratos-selfservice-urls.yaml for self-service flow URLs
- media: LiveKit values updated (TURN config, STUN, resource limits)
- vso: kustomization cleanup
2026-03-06 12:07:28 +00:00
7ffddcafcd fix(ory,lasuite): harden session security and fix logout + WebSocket routing
- Fix Hydra postLogoutRedirectUris for docs and people to match the
  actual URI sent by mozilla_django_oidc v5 (/api/v1.0/logout-callback/)
  instead of the root URL, resolving 599 logout errors.

- Fix docs y-provider WebSocket backend port: use Service port 443
  (not pod port 4444 which has no DNAT rule) in Pingora config.

- Tighten VSO VaultDynamicSecret rotation sync: add allowStaticCreds:true
  and reduce refreshAfter from 1h to 5m across all static-creds paths
  (kratos, hydra, gitea, hive, people, docs) so credential rotation is
  reflected within 5 minutes instead of up to 1 hour.

- Set Hydra token TTLs: access_token and id_token to 5m; refresh_token
  to 720h (30 days). Kratos session carries silent re-auth so the short
  access token TTL does not require users to log in manually.

- Set SESSION_COOKIE_AGE=3600 (1h) in docs and people backends. After
  1h, apps silently re-auth via the active Kratos session. Disabled
  identities (sunbeam user disable) cannot re-auth on next expiry.
2026-03-03 18:07:08 +00:00
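Note on 7ffddcafcd: the lifetimes above interact, so it may help to see them side by side. The Python sketch below records the values from the message and models the stated behaviour that a disabled identity loses app access at the next session cookie expiry. The function names and the model itself are a simplification for illustration, not the code of the docs or people backends.

```python
from datetime import datetime, timedelta, timezone

# Values taken from the commit message.
ACCESS_TOKEN_TTL = timedelta(minutes=5)
ID_TOKEN_TTL = timedelta(minutes=5)
REFRESH_TOKEN_TTL = timedelta(hours=720)      # 30 days
SESSION_COOKIE_AGE = timedelta(seconds=3600)  # 1 hour
VSO_REFRESH_AFTER = timedelta(minutes=5)      # was 1 hour


def app_session_expiry(issued_at: datetime) -> datetime:
    """When the app session cookie lapses and a silent re-auth is needed."""
    return issued_at + SESSION_COOKIE_AGE


def can_still_use_app(issued_at: datetime, disabled_at: datetime, now: datetime) -> bool:
    """Simplified model: an identity disabled mid-session keeps the existing
    app session until the cookie expires, then fails to re-auth."""
    if now < disabled_at:
        return True
    return now < app_session_expiry(issued_at)
```

Under this model the longest a disabled identity can keep an already open app session is one SESSION_COOKIE_AGE, that is, one hour.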
b19e553f54 fix(ory): configure Kratos oauth2 provider, session cookie domain, and flows
- Add oauth2_provider.url pointing to hydra-admin so login_challenge
  params are accepted (fixes People OIDC login flow)
- Scope session cookie to parent DOMAIN_SUFFIX so admin.* subdomains
  share the session (fixes redirect loop on kratos-admin-ui)
- Add allowed_return_urls for admin.*, enable recovery flow, add error
  and recovery ui_url entries
- Fix KRATOS_PUBLIC_URL port in login-ui deployment (4433 → 80)
2026-03-03 11:31:00 +00:00
6cc60c66ff feat(ory): add kratos-admin-ui service
Deploy the custom Kratos admin UI (Deno/Hono + Cunningham React):
- K8s Deployment + Service in ory namespace
- VSO VaultStaticSecret for cookie/csrf/admin-identity-ids secrets
- Pingora route for admin.DOMAIN_SUFFIX
2026-03-03 11:30:52 +00:00
8621c0dd65 fix: correct Pingora upstream ports and kustomize namespace conflict
pingora-config.yaml: kratos-public and people-backend K8s Services
expose port 80, not 4433/8000. The wrong ports caused Pingora to
return timeouts for /kratos/* and all people.* routes.

ory/kustomization.yaml: remove kustomization-level namespace: ory
transformer. All non-Helm resources already declare namespace: ory
explicitly. The transformer was incorrectly moving hydra-maester's
enabledNamespaces Role (generated for the lasuite namespace) into ory,
producing a duplicate-name conflict during kustomize build.
2026-03-03 00:57:58 +00:00
c7b812dde8 feat(ory): replace hardcoded DSN + secrets with OpenBao DB engine + VSO
All Ory service credentials now flow from OpenBao through VSO instead
of being hardcoded in Helm values or Deployment env vars.

Kratos:
- Remove config.dsn; flip secret.enabled=false with nameOverride pointing
  at kratos-app-secrets (a VSO-managed Secret with secretsDefault,
  secretsCookie, smtpConnectionURI).
- Inject DSN at runtime via deployment.extraEnv from kratos-db-creds
  (VaultDynamicSecret backed by OpenBao database static role, 24h rotation).

Hydra:
- Remove config.dsn; inject DSN via deployment.extraEnv from hydra-db-creds
  (VaultDynamicSecret, same rotation scheme).

Login UI:
- Replace hardcoded COOKIE_SECRET/CSRF_COOKIE_SECRET env var values with
  secretKeyRef reads from login-ui-secrets (VaultStaticSecret → secret/login-ui).

vault-secrets.yaml adds: VaultAuth, Hydra VSS, kratos-app-secrets VSS,
login-ui-secrets VSS, kratos-db-creds VDS, hydra-db-creds VDS.
2026-03-02 18:32:33 +00:00
5e36322a3b lasuite: declarative pre-work for La Suite app deployments
- Add find user and find_db to postgres-cluster.yaml (11th database)
- Add sunbeam-messages-imports and sunbeam-people buckets to SeaweedFS
- Configure Hydra Maester with enabledNamespaces: [lasuite] so it can
  create and update OAuth2Client secrets in the lasuite namespace
- Add find to Kratos allowed_return_urls
- Add shared ConfigMaps: lasuite-postgres, lasuite-valkey, lasuite-s3,
  lasuite-oidc-provider — single source of truth for all app env vars
- Add HydraOAuth2Client CRDs for all nine La Suite apps (docs, drive,
  meet, conversations, messages, people, find, gitea, hive); Maester
  will create oidc-<app> secrets with CLIENT_ID and CLIENT_SECRET
2026-03-01 18:03:13 +00:00
cdddc334ff feat: replace nginx placeholder with custom Pingora proxy; add Postfix MTA
Ingress:
- Deploy custom sunbeam-proxy (Pingora/Rust) replacing nginx placeholder
- HTTPS termination with mkcert (local) / rustls-acme (production)
- Host-prefix routing with path-based sub-routing for auth virtual host:
  /oauth2 + /.well-known + /userinfo → Hydra, /kratos → Kratos (prefix stripped), default → login-ui
- HTTP→HTTPS redirect, WebSocket passthrough, JSON audit logging, OTEL stub
- cert-manager HTTP-01 ACME challenge routing via Ingress watcher
- RBAC for Ingress watcher (pingora-watcher ClusterRole)
- local overlay: hostPorts 80/443, LiveKit TURN demoted to ClusterIP to avoid klipper conflict

Infrastructure:
- socket_vmnet shared network for host↔VM reachability (192.168.105.2)
- local-up.sh: cert-manager installation, eth1-based LIMA_IP detection, correct DOMAIN_SUFFIX sed substitution
- Postfix MTA in lasuite namespace: outbound relay via Scaleway TEM, accepts SMTP from cluster pods
- Kratos SMTP courier pointed at postfix.lasuite.svc.cluster.local:25
- Production overlay: cert-manager ClusterIssuer, ACME-enabled Pingora values
2026-03-01 16:25:11 +00:00
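Note on cdddc334ff: the path-based sub-routing for the auth virtual host can be sketched as a small function. The real proxy is written in Rust on Pingora, so the Python below only illustrates the routing table from the message. The function names are hypothetical, and matching on path-segment boundaries (so that /kratosx is not treated as /kratos) is an assumption.

```python
from __future__ import annotations

# Prefixes listed in the commit message as routed to Hydra.
HYDRA_PREFIXES = ("/oauth2", "/.well-known", "/userinfo")


def _has_prefix(path: str, prefix: str) -> bool:
    """Match a prefix only on a path-segment boundary."""
    return path == prefix or path.startswith(prefix + "/")


def route_auth(path: str) -> tuple[str, str]:
    """Return (upstream, forwarded_path) for a request on the auth host."""
    if any(_has_prefix(path, p) for p in HYDRA_PREFIXES):
        return ("hydra", path)
    if _has_prefix(path, "/kratos"):
        # The /kratos prefix is stripped before forwarding.
        return ("kratos", path[len("/kratos"):] or "/")
    return ("login-ui", path)
```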
a589e6280d feat: bring up local dev stack — all services running
- Ory Hydra + Kratos: fixed secret management, DSN config, DB migrations,
  OAuth2Client CRD (helm template skips crds/ dir), login-ui env vars
- SeaweedFS: added s3.json credentials file via -s3.config CLI flag
- OpenBao: standalone mode with auto-unseal sidecar, keys in K8s secret
- OpenSearch: increased memory to 1.5Gi / JVM 1g heap
- Gitea: SSL_MODE disable, S3 bucket creation fixed
- Hive: automountServiceAccountToken: false (Lima virtiofs read-only rootfs quirk)
- LiveKit: API keys in values, hostPort conflict resolved
- Linkerd: native sidecar (proxy.nativeSidecar=true) to avoid blocking Jobs
- All placeholder images replaced: pingora→nginx:alpine, login-ui→oryd/kratos-selfservice-ui-node

Full stack running: postgres, valkey, openbao, opensearch, seaweedfs,
kratos, hydra, gitea, livekit, hive (placeholder), login-ui
2026-02-28 22:08:38 +00:00
92e80a761c fix(ory): re-enable hydra-maester, fix namespace, add memory limit
2026-02-28 14:02:47 +00:00
886c4221b2 fix(local): kustomize render passes cleanly
- Remove base/mesh from local overlay (Linkerd installed via CLI in local-up.sh)
- Fix LiveKit namespace: chart doesn't set .Release.Namespace, add explicit patches
- Fix release names: livekit-server and cloudnative-pg match chart names (avoid double-prefix)
- Disable hydra-maester (not needed for local dev)
- Add memory limits for cloudnative-pg operator and livekit-server deployments
- Remove non-functional values-ory.yaml patch (DOMAIN_SUFFIX handled by sed in local-up.sh)
- Gitignore **/charts/ (kustomize helm cache, generated artifact)
2026-02-28 14:00:31 +00:00
5d9bd7b067 chore: initial infrastructure scaffold
Kustomize base + overlays for the full Sunbeam k3s stack:
- base/mesh      — Linkerd edge (crds + control-plane + viz)
- base/ingress   — custom Pingora edge proxy
- base/ory       — Kratos 0.60.1 + Hydra 0.60.1 + login-ui
- base/data      — CloudNativePG 0.27.1, Valkey 8, OpenSearch 2
- base/storage   — SeaweedFS master + volume + filer (S3 on :8333)
- base/lasuite   — Hive sync daemon + La Suite app placeholders
- base/media     — LiveKit livekit-server 1.9.0
- base/devtools  — Gitea 12.5.0 (external PG + Valkey)
overlays/local   — sslip.io domain, mkcert TLS, Lima hostPort
overlays/production — stub (TODOs for sunbeam.pt values)
scripts/         — local-up/down/certs/urls helpers
justfile         — up / down / certs / urls targets
2026-02-28 13:42:27 +00:00