90 Commits

SHA1 Message Date
b13555607a docs(ops): COE-2026-002 Matrix RTC / Element Call + IPv6
Documents the missing lk-jwt-service, well-known URL fix, bare domain
routing, DNS apex records, and IPv6 cert-manager self-check failure.
Includes dual-stack K3s migration plan.
2026-03-25 13:25:05 +00:00
f8e9d32a8b feat(ingress): add bare domain sunbeam.pt to TLS cert SANs
Element X resolves .well-known from the server_name (sunbeam.pt),
not the homeserver URL. The bare domain needs a valid cert.
2026-03-25 13:24:38 +00:00
fdcc15080f fix(matrix): use https:// for livekit_url in well-known
Element Call expects livekit_service_url to be an HTTPS endpoint
(lk-jwt-service), not a WebSocket URL. The client connects to LiveKit
via WSS separately after getting a JWT.
2026-03-25 13:24:12 +00:00
84c5548f2e feat(ingress): route lk-jwt-service paths + bare domain well-known
Split livekit.* requests: /sfu/get, /healthz, /get_token → lk-jwt-service,
everything else → livekit-server (WebSocket). Add sunbeam.pt bare domain
route so Element X can discover RTC foci from the server_name.
2026-03-25 13:23:59 +00:00
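Note on 84c5548f2e: the path split can be sketched as a small routing function. This is a minimal illustration in Python, not the proxy's actual route table. The function name `pick_backend`, the backend labels, and the exact-match rule are assumptions; only the three paths and the two upstreams come from the commit message.

```python
# Hypothetical sketch of the livekit.* path split; all names are illustrative.
JWT_SERVICE_PATHS = ("/sfu/get", "/healthz", "/get_token")


def pick_backend(path: str) -> str:
    """Return the upstream that should serve a request path on livekit.*."""
    # Drop any query string before matching (assumption: exact-path match).
    bare_path = path.split("?", 1)[0]
    if bare_path in JWT_SERVICE_PATHS:
        return "lk-jwt-service"
    # Everything else, including the WebSocket upgrade, goes to LiveKit.
    return "livekit-server"
```

Calling `pick_backend("/sfu/get")` selects the JWT service, while any path outside the three listed ones falls through to livekit-server.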
4837983380 feat(media): deploy lk-jwt-service for Matrix Element Call
Bridges Element Call to LiveKit by exchanging Matrix OpenID tokens for
LiveKit JWTs. Shares API credentials with livekit-server via the
existing VSO secret (removed excludeRaw so raw fields are available).
2026-03-25 13:23:48 +00:00
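Note on 4837983380: the token exchange ends with the bridge signing a JWT using the API key and secret it shares with livekit-server. The sketch below shows only that signing step, using the Python standard library. It is not lk-jwt-service's code: `mint_room_token` is a hypothetical name, and the claim layout (in particular the `video` grant) is an assumption modeled loosely on LiveKit access tokens. A real bridge would first validate the Matrix OpenID token with the homeserver; that step is left out here.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_room_token(api_key: str, api_secret: str, identity: str, room: str,
                    ttl_seconds: int = 3600, now: Optional[int] = None) -> str:
    """Sign an HS256 JWT that lets `identity` join `room` (hypothetical helper)."""
    issued_at = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                 # shared API key identifies the issuer
        "sub": identity,                # e.g. the caller's Matrix user ID
        "nbf": issued_at,
        "exp": issued_at + ttl_seconds,
        "video": {"room": room, "roomJoin": True},  # assumed grant layout
    }
    # header.payload, each part compact JSON encoded as base64url
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(api_secret.encode("utf-8"),
                         signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

Because both services read the same credentials, a token signed this way can be verified by the media server without any extra key distribution.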
50a4abf94f fix(ory): harden Kratos and Hydra production security configuration
Kratos: xchacha20-poly1305 cipher for at-rest encryption, 12-char min
password with HaveIBeenPwned + similarity check, recovery/verification
switched to code (not link), anti-enumeration on unknown recipients,
15m privileged session, 24h session extend throttle, JSON structured
logging, WebAuthn passwordless enabled, additionalProperties: false on
all identity schemas, memory limits bumped to 256Mi.

Hydra: expose_internal_errors disabled, PKCE enforced for public
clients, janitor CronJob every 6h, cookie domain set explicitly,
SSRF prevention via disallow_private_ip_ranges, JSON structured
logging, Maester enabledNamespaces includes monitoring.

Also: fixed selfservice URL patch divergence (settings path, missing
allowed_return_urls), removed invalid responseTypes on Hive client.
2026-03-24 19:40:58 +00:00
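Note on 50a4abf94f: the password rules combine a length floor with a breach lookup. Kratos performs these checks itself; the sketch below only illustrates the shape of the two checks in Python. The helper names are hypothetical, and no network call is made. It assumes the lookup follows the Pwned Passwords range scheme, where only the first five hex characters of the SHA-1 hash are sent and the remainder is compared locally.

```python
import hashlib
from typing import Tuple

MIN_PASSWORD_LENGTH = 12  # minimum length stated in the commit message


def meets_min_length(password: str) -> bool:
    """Check the length floor only; similarity checks are out of scope here."""
    return len(password) >= MIN_PASSWORD_LENGTH


def hibp_range_parts(password: str) -> Tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into (5-char prefix, 35-char suffix).

    Only the prefix would be sent to the range endpoint; the suffix stays
    local and is matched against the returned candidate list.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]
```

The split means the full hash never leaves the server, which is why a breach check can be enabled without disclosing user passwords to a third party.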
4c02fe18ed fix: use Kratos session auth for observability endpoints
Observability routes (systemmetrics, systemlogs, systemtracing) use
Kratos /sessions/whoami for auth_request — validates browser session
cookies scoped to the parent domain. Admin API routes (id, hydra,
search, vault) keep Hydra /userinfo for Bearer token auth (CLI access).
2026-03-24 13:58:34 +00:00
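Note on 4c02fe18ed: the split between the two auth_request targets can be written as a lookup by subdomain. This is a sketch in Python under stated assumptions: `auth_check_for` and the returned labels are hypothetical, and the real proxy configuration may key on full hostnames rather than bare subdomain labels. The subdomain lists and the two endpoints are taken from the commit message.

```python
from typing import Optional

# Browser-facing observability UIs: validated via the session cookie.
OBSERVABILITY_SUBDOMAINS = {"systemmetrics", "systemlogs", "systemtracing"}
# Admin APIs used from the CLI: validated via a Bearer token.
ADMIN_API_SUBDOMAINS = {"id", "hydra", "search", "vault"}


def auth_check_for(subdomain: str) -> Optional[str]:
    """Return which auth_request target guards a subdomain, or None if ungated."""
    if subdomain in OBSERVABILITY_SUBDOMAINS:
        return "kratos:/sessions/whoami"
    if subdomain in ADMIN_API_SUBDOMAINS:
        return "hydra:/userinfo"
    return None  # assumption: other routes are handled elsewhere
```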
0498d1c6b3 fix: gate systemmetrics/systemlogs/systemtracing behind OIDC auth
Prometheus, Loki, and Tempo external endpoints were publicly accessible
with no authentication. Add auth_request to all three routes using
Hydra's userinfo endpoint (same pattern as admin APIs).
2026-03-24 13:48:27 +00:00
1147b1a5aa fix: WOPI registration on restart + Collabora readiness probes
- Add readiness/liveness probes to Collabora (GET /hosting/discovery)
- Add init container to Drive backend that waits for Collabora and runs
  trigger_wopi_configuration on every pod start — fixes WOPI silently
  breaking after server restarts (chart Job only ran on sunbeam apply)
- Add OIDC_RESPONSE_MODE=query to Projects config
2026-03-24 12:22:10 +00:00
5e622ce316 feat: AlertManager Matrix integration with severity routing
Deploy matrix-alertmanager-receiver bridge (pending bot credentials in
OpenBao). Update AlertManager routing: critical → Matrix + email,
warning → Matrix only, Watchdog → null. Reduce repeat interval to 4h.
2026-03-24 12:21:29 +00:00
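Note on 5e622ce316: the severity routing amounts to a three-way decision. The Python sketch below mirrors the rules in the commit message; it is not the AlertManager configuration itself. `receivers_for` and the receiver labels are hypothetical, and the fallback for severities the commit does not mention is an assumption.

```python
from typing import List


def receivers_for(alert_name: str, severity: str) -> List[str]:
    """Return the notification channels for an alert, per the stated routing."""
    if alert_name == "Watchdog":
        return []  # Watchdog is routed to the null receiver
    if severity == "critical":
        return ["matrix", "email"]
    if severity == "warning":
        return ["matrix"]
    return []  # assumption: unlisted severities notify nobody
```

Checking Watchdog first matters: it fires continuously by design, so it must be dropped before any severity match can send it to a channel.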
e8c64e6f18 feat: add ServiceMonitors and enable metrics scraping
- SeaweedFS: enable -metricsPort=9091 on master/volume/filer, add
  service labels, create ServiceMonitor
- Gitea: enable metrics in config, create ServiceMonitor
- Hydra/Kratos: standalone ServiceMonitors (chart templates require
  .Capabilities.APIVersions unavailable in kustomize helm template)
- LiveKit: add prometheus_port=6789, standalone ServiceMonitor
  (disabled in kustomization — host firewall blocks port 6789)
- OpenSearch: revert prometheus-exporter attempt (no plugin for v3.x),
  add service label for future exporter sidecar
2026-03-24 12:21:18 +00:00
3fc54c8851 feat: add PrometheusRule alerts for all services
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).

Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
2026-03-24 12:20:55 +00:00
74bb59cfdc feat: split Grafana dashboards into per-folder ConfigMaps
Replace monolithic dashboards-configmap.yaml with 10 dedicated files,
one per Grafana folder: Ingress, Observability, Infrastructure, Storage,
Identity, DevTools, Search, Media, La Suite, Communications.

New dashboards for Longhorn, PostgreSQL/CNPG, Cert-Manager, SeaweedFS,
Hydra, Kratos, Gitea, OpenSearch, LiveKit, La Suite golden signals
(Linkerd metrics), Matrix, and Email Pipeline.
2026-03-24 12:20:42 +00:00
234fe72707 chore: updated readme
Signed-off-by: Sienna Meridian Satterwhite <sienna@sunbeam.pt>
2026-03-24 12:04:35 +00:00
8037184a9e chore: updated README
Signed-off-by: Sienna Meridian Satterwhite <sienna@sunbeam.pt>
2026-03-24 11:51:34 +00:00
330d0758ff docs: archive system-design.md — replaced by new documentation suite
Moved to docs/archive/ as historical reference. All content has been
merged into the new boujee documentation.
2026-03-24 11:48:28 +00:00
6ad3fdeac9 docs: restyle COE-2026-001 with boujee tone
Same technical rigor, more personality. Timeline reads like a story,
5 Whys have flair, closes with wine. 🍷
2026-03-24 11:48:19 +00:00
ceb038382f docs: add infrastructure conventions — House Rules, Darling 🏠
Do's and don'ts, kustomize patterns, secret management, deployment
conventions, naming conventions, AI config, domain patterns.
2026-03-24 11:47:40 +00:00
e0afd0a4d7 docs: add ops runbook — When Things Go Sideways, Gorgeous 🚨
Diagnostic ladder, COE format, runbooks for backup/restore, secret
rotation, cert renewal, database recovery, Sol☀️ restart, alerts.
2026-03-24 11:47:08 +00:00
2f7785774b docs: add production deployment guide — Serving Looks in Production 👠
Scaleway setup, k3s, kustomize structure, deployment phases, DNS,
cert-manager, backup strategy, image registry.
2026-03-24 11:46:47 +00:00
265a68d85f docs: add local dev setup guide — Setting Up Your Vanity 💄
Lima VM, k3s, mkcert, sslip.io, sunbeam CLI setup, resource budget,
differences from production, common commands, troubleshooting.
2026-03-24 11:46:40 +00:00
977972d9f3 docs: add observability documentation — Keeping an Eye on the Girlies 👁️
Prometheus, Grafana (10 dashboards), Loki, Tempo, AlertManager → Matrix,
ServiceMonitors, PrometheusRules per component.
2026-03-24 11:46:33 +00:00
97e87c6dda docs: add identity & auth documentation — The Guest List 💋
OIDC auth flow, Kratos + Hydra, client registry (12 apps), session
management, identity schemas, self-service flows, Vault integration.
2026-03-24 11:46:28 +00:00
2b05cfd383 docs: add Sunbeam CLI documentation — The Remote Control 💅
Infrastructure lifecycle, service management, build/deploy, all
service-specific subcommands, OAuth2 PKCE login, self-update.
2026-03-24 11:46:19 +00:00
66e3692c8b docs: add Pingora proxy documentation — The Bouncer 💎
Security pipeline (DDoS, scanner, rate limiting), route table, ML
models, training pipeline, static serving, TLS, auth requests, metrics.
2026-03-24 11:46:11 +00:00
cb474ce0d4 docs: add Sol☀️ documentation — Meet Sol☀️
Covers capabilities (search, memory, code, research mode, compute,
web search, identity), engagement pipeline, multi-model orchestration,
integration depth, and deployment. They/them throughout.
2026-03-24 11:46:05 +00:00
041ef98b65 docs: add architectural overview — What's In The Box, Babe? 💅
Full tour of the SBBB stack: Pingora proxy, Ory identity, La Suite
apps, Linkerd mesh, OpenBao secrets, data layer, monitoring, Matrix,
Sol☀️, and the platform itself.
2026-03-24 11:45:56 +00:00
e1fbaa445d docs: rewrite README as the front door to The Super Boujee Business Box
Full rewrite with boujee tone — app inventory, architecture diagram,
custom components (Sol☀️, Pingora, Sunbeam CLI), team bios, and links
to the new documentation suite.
2026-03-24 11:45:39 +00:00
fe6634e9c2 docs: COE-2026-001 vault root token loss postmortem
Root token and unseal key were lost when a placeholder manifest
overwrote the openbao-keys Secret. Documents root cause, timeline,
5 whys, remediation actions, and monitoring requirements.
2026-03-23 13:43:51 +00:00
dc95e1d8ec sol v1.1.0: SearXNG web search, evaluator redesign, research agents
- SearXNG deployment in data namespace (free, no-tracking web search)
- sol-config: SearXNG URL, research config, identity agent, updated
  system prompt (DM search rules, research mode, silence, hard rules)
- sol-deployment: debug logging (RUST_LOG=sol=debug), full image path
- opensearch: tolerate missing prometheus-exporter plugin on OS 3
2026-03-23 09:54:56 +00:00
d7ff1da729 sol: identity agent, research mode, evaluator redesign, DM search
sol-config.yaml:
- added [services.kratos] with admin URL
- added research config (model, max_iterations, max_agents, max_depth)
- tool iterations bumped to 250
- updated system prompt: research mode guidance, DM search rules,
  run_script docs, room overlap explanation, silence mechanic
- time context uses {time_block} with midnight-based boundaries
- evaluator returns response_type (message/thread/react/ignore)
2026-03-23 08:47:40 +00:00
473e1ef3ab project rename
Signed-off-by: Sienna Meridian Satterwhite <sienna@sunbeam.pt>
2026-03-22 20:29:32 +00:00
a086049de6 fix: harden SeaweedFS storage and fix Drive presigned uploads
- SeaweedFS filer: Recreate strategy (prevents LevelDB lock contention),
  60s termination grace period, memory 256Mi→2Gi limit
- SeaweedFS volume: 60s termination grace period, memory 256Mi→1Gi limit
- Drive: add AWS_S3_DOMAIN_REPLACE so presigned upload URLs use
  s3.sunbeam.pt instead of internal cluster DNS
- Drive: relax liveness/readiness probes (failureThreshold 1→3,
  period 1s→10s, timeout 1s→5s) to prevent crash loops under load
2026-03-22 19:48:36 +00:00
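Note on a086049de6: the effect of the relaxed Drive probes is easiest to see as arithmetic. As a rough approximation, a container is restarted after `failureThreshold` consecutive failed probes spaced `periodSeconds` apart, so the product of the two gives the tolerance window. The Python helper below is a back-of-the-envelope sketch with a hypothetical name; it ignores probe timeouts and kubelet scheduling jitter.

```python
def failure_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate seconds of consecutive probe failures tolerated before action."""
    return failure_threshold * period_seconds


# Values from the commit message: failureThreshold 1→3, period 1s→10s.
before = failure_window_seconds(failure_threshold=1, period_seconds=1)
after = failure_window_seconds(failure_threshold=3, period_seconds=10)
```

Under this approximation the window grows from about 1 second to about 30 seconds, which is why a briefly overloaded pod no longer gets killed and restarted in a loop.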
9af3cd3c49 feat: expose admin APIs behind OIDC auth_request
Adds pingora routes for id, hydra, search, vault subdomains.
Each gated by auth_request to Hydra userinfo — only valid SSO
bearer tokens pass through. Adds new SANs to the TLS certificate.
2026-03-22 18:59:22 +00:00
fb91fcd284 sol: vault auth, gitea integration, search fixes
sol-config: added [vault] and [services.gitea] sections, fetch
allowlist (wttr.in, open-meteo, github), bumped context windows
to 200, updated system prompt with run_script docs and tool rules.

sol-deployment: added gitea admin credential env vars from
sol-secrets, automountServiceAccountToken for vault k8s auth.

vault-secrets: added gitea-admin-username and gitea-admin-password
templates to sol-secrets VSS.
2026-03-22 15:16:22 +00:00
e1e6a6bc31 update sol configmap: multi-agent architecture + conversations API
- Add db_path (/data/sol.db) for SQLite persistence
- Add memory_index, script_*, memory_extraction_enabled fields
- Add [agents] section: orchestrator model, compaction threshold, conversations API enabled
- Rewrite system prompt (687 → 150 lines): dense, few-shot, hard rules
- Add {room_context_rules} placeholder for group vs DM behavior
2026-03-21 22:25:54 +00:00
d3943c9a84 feat(monitoring): wire up full LGTM observability stack
- Prometheus: discover ServiceMonitors/PodMonitors in all namespaces,
  enable remote write receiver for Tempo metrics generator
- Tempo: enable metrics generator (service-graphs + span-metrics)
  with remote write to Prometheus
- Loki: add Grafana Alloy DaemonSet to ship container logs
- Grafana: enable dashboard sidecar, add Pingora/Loki/Tempo/OpenBao
  dashboards, add stable UIDs and cross-linking between datasources
  (Loki↔Tempo derived fields, traces→logs, traces→metrics, service map)
- Linkerd: enable proxy tracing to Alloy OTLP collector, point
  linkerd-viz at existing Prometheus instead of deploying its own
- Pingora: add OTLP rollout plan (endpoint commented out until proxy
  telemetry panic fix is deployed and Alloy is verified healthy)
2026-03-21 17:36:54 +00:00
5f923d14f9 feat(matrix): add Sol virtual librarian deployment manifests
Sol is a Matrix bot with E2EE that archives conversations to OpenSearch
and responds via Mistral AI function calling. Adds deployment, PVC,
ConfigMap (sol.toml + system prompt), VaultStaticSecret for credentials,
and production overlay image entry.
2026-03-20 21:38:48 +00:00
bfe0280732 feat(lasuite): add Projects (Planka Kanban) service
Deploy Planka-based project management at projects.DOMAIN_SUFFIX:
- ConfigMap with OIDC, S3, SMTP, La Gaufre widget config
- Deployment + Service (init container for DB migrations, Sails on 1337)
- OAuth2Client (client_secret_basic, redirect to /oidc-callback)
- VaultDynamicSecret for DATABASE_URL, VaultStaticSecret for SECRET_KEY
- Pingora route with websocket support (Socket.io)
- Image overrides in both local and production overlays
- TLS cert dnsNames updated for projects subdomain
- Integration service.json updated with Projects entry
- seaweedfs-s3-credentials rolloutRestartTargets includes projects
2026-03-20 13:41:54 +00:00
b9d9ad72fe fix(ory): enable MFA methods, fix font URL, clean up login-ui
Enable TOTP, WebAuthn, and lookup secret MFA methods in Kratos config.
Fix Monaspace Neon font CDN URL in Gitea theme ConfigMap. Remove
redundant Google Fonts preconnect from people-frontend nginx config.
Delete unused login-ui-deployment.yaml (login-ui is part of the Ory
Helm chart, not a standalone deployment).
2026-03-18 18:36:15 +00:00
3c7460f4a6 feat(lasuite): add calendars service deployment manifests
Add K8s manifests for calendars backend, frontend (Caddy), CalDAV
server, and Celery worker. Wire Pingora routing for cal.sunbeam.pt
with path-based backend/caldav/static splits. Add OAuth2Client for
OIDC, VaultDynamicSecret for DB credentials, VaultStaticSecret for
Django/CalDAV keys, and TLS cert coverage for the cal subdomain.
Register calendars in the integration service gaufre widget.
2026-03-18 18:36:05 +00:00
ccfe8b877a feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
  proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
  monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
  remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
e5741c4df6 feat: integrate tuwunel with Ory SSO, rename chat to messages subdomain
- Add matrix to hydra-maester enabledNamespaces for OAuth2Client CRD
- Update allowed_return_urls and selfservice URLs: chat→messages
- Add Kratos verification flow, employee/external identity schemas
- Extend session lifespan to 30 days with persistent cookies
- Route messages.* to tuwunel via Pingora with WebSocket support
- Replace login-ui with kratos-admin-ui as unified auth frontend
- Update TLS certificate SANs: chat→messages, add monitoring subdomains
- Add tuwunel + La Suite images to production overlay
- Switch DDoS/scanner detection to compiled-in ensemble models (observe_only)
2026-03-10 18:52:47 +00:00
584e98316b feat(data): upgrade OpenSearch to v3 with ML Commons for neural search
- Upgrade from OpenSearch 2 to 3 (required for ML Commons pre-trained models)
- Rename PLUGINS_SECURITY_DISABLED → DISABLE_SECURITY_PLUGIN (OS3 change)
- Enable ML Commons plugin settings for on-data-node inference
- Increase memory limits (2Gi) and JVM heap for neural model inference
- Add fsGroup security context for volume permissions
2026-03-10 18:52:29 +00:00
d2148335de feat(matrix): add tuwunel Matrix homeserver deployment manifests
Kubernetes manifests for tuwunel — a Rust Matrix homeserver using RocksDB
for storage. Includes deployment, service, PVC, ConfigMap (tuwunel.toml),
Hydra OAuth2Client for SSO, and Vault secrets for credentials injection.

Key design decisions:
- enableServiceLinks: false to prevent K8s TUWUNEL_* env var conflicts
- strategy: Recreate for RocksDB exclusive lock (no rolling updates)
- Identity provider configured entirely via env vars (client_id/secret
  from hydra-maester Secret, not hardcoded)
- OpenSearch model_id injected via ConfigMap from CLI post-apply hook
- SSO-only auth (login_with_password=false, single_sso=true)
- OpenSearch hybrid neural+BM25 search (768-dim, all-mpnet-base-v2)
2026-03-10 18:52:21 +00:00
91983ddf29 feat(observability): enable OTLP tracing, fix Prometheus scraping, add proxy ServiceMonitor
- Set otlp_endpoint to Tempo HTTP receiver (port 4318) for request tracing
- Add hostNetwork to prometheusSpec so it can reach kubelet/node-exporter on node public IP
- Add ServiceMonitor for proxy metrics scrape on port 9090
- Add CORS origin and Grafana datasource config for monitoring stack
2026-03-09 08:20:42 +00:00
caefb071a8 fix(ingress): use 10.0.0.0/8 bypass for all cluster-internal traffic
Pod IPs are in 10.0.0.0/24, not 10.42.0.0/16 as assumed. Broadening
to 10.0.0.0/8 covers pods, services, and CNI overlays.
2026-03-09 08:00:46 +00:00
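Note on caefb071a8: the CIDR reasoning can be checked with Python's `ipaddress` module. The sketch below is illustrative only and is not the proxy's implementation. `is_bypassed` is a hypothetical name, and the bypass list combines this commit's 10.0.0.0/8 with the localhost ranges from a101ea4b06.

```python
import ipaddress

# Ranges taken from the two commit messages.
BYPASS_CIDRS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "127.0.0.0/8", "::1/128")
]


def is_bypassed(client_ip: str) -> bool:
    """Return True if the address falls inside any bypass range."""
    addr = ipaddress.ip_address(client_ip)
    # Membership is False when the address and network IP versions differ.
    return any(addr in net for net in BYPASS_CIDRS)


# A pod address in 10.0.0.0/24 is outside the originally assumed range.
pod_ip = ipaddress.ip_address("10.0.0.17")
in_old_range = pod_ip in ipaddress.ip_network("10.42.0.0/16")
```

Here `in_old_range` is False while `is_bypassed("10.0.0.17")` is True, which matches the fix: the narrower 10.42.0.0/16 rule never matched the actual pod addresses.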
a101ea4b06 fix(ingress): add localhost to rate-limit bypass CIDRs
Adds 127.0.0.0/8 and ::1/128 so host-networked pods (buildkitd) are
not blocked by the detection pipeline.
2026-03-09 01:40:25 +00:00
27d3e3248c chore: added license
Signed-off-by: Sienna Meridian Satterwhite <sienna@r3t.io>
2026-03-08 21:02:21 +00:00
7c1676d2b9 feat(ingress): add detection pipeline config and metrics port
- Add DDoS, scanner, and rate limiter configuration to pingora-config
- Add kubernetes config section with configurable namespace/resource names
- Expose metrics port 9090 on deployment and service
2026-03-08 20:37:49 +00:00