106 Commits

Author SHA1 Message Date
dae6ac39d8 fix(monitoring): fix inhibit_rules snake_case, suppress InfoInhibitor spam
The Prometheus operator uses snake_case (inhibit_rules) not camelCase
(inhibitRules), causing alertmanager reconciliation to fail. Also route
InfoInhibitor alerts to null to stop flooding the Matrix alerts room.
2026-04-06 17:40:56 +01:00
e4987b4c58 feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
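
The inhibition rule in the commit above (critical suppresses warning for the same alert+namespace) can be pictured roughly as follows. This is a minimal sketch of the matching logic only, not the Alertmanager configuration itself; the function name and the dictionary shape of an alert are assumptions made for illustration.

```python
from typing import Dict, List

Alert = Dict[str, str]  # hypothetical shape: label name -> label value


def is_inhibited(alert: Alert, firing: List[Alert]) -> bool:
    """Return True if a warning is muted by a firing critical alert
    that shares the same alertname and namespace labels."""
    if alert.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and src.get("alertname") == alert.get("alertname")
        and src.get("namespace") == alert.get("namespace")
        for src in firing
    )
```

With this logic a warning in one namespace is not silenced by a critical alert of the same name in a different namespace.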
f07b3353aa fix(longhorn): upgrade to v1.11.1, fix 38GB instance-manager memory leak
v1.11.0 had a critical proxy connection leak in the instance-manager
(longhorn/longhorn#12575) that consumed 38.8GB on apollo, pushing the
server to 92% memory with swap exhausted.

v1.11.1 fixes the leak. Also adds a 2Gi per-container LimitRange in
longhorn-system as a safety net against future regressions.
2026-04-06 13:33:10 +01:00
efe574f48e feat(storage): sccache S3 build cache with scoped SeaweedFS identity
Add sunbeam-sccache bucket and a dedicated sccache S3 identity scoped
to Read/Write/List/Tagging on that bucket only. Bump volume server
max from 50 to 100 (was full, blocking all new writes).
2026-04-05 21:50:46 +01:00
1206cd0fe4 fix(ingress): add missing backend to cal route, fix Pingora crash
The cal route had a redirect but no backend, and the config parser
requires a backend. Pingora was in CrashLoopBackOff, taking down all
services.
2026-04-04 19:10:24 +01:00
048319f70b fix(devtools): stabilize Penpot MCP, fix S3 creds, OIDC registration
MCP server:
- Replace vite build --watch + livePreview with static vite preview
  (watch mode was reloading the plugin iframe, killing WebSocket)
- Bake WS_URI at Docker build time for production WebSocket URL
- Add server-side application-level keepalive messages every 25s
- Add client-side auto-reconnect with exponential backoff
- Set Pingora route timeout to 86400s for WebSocket idle tolerance

Penpot:
- Add AWS_ACCESS_KEY_ID/SECRET env vars for S3 SDK compatibility
- Set S3 region to satisfy AWS SDK credential chain
- Enable OIDC registration (disable-registration blocks OIDC signup)
- Fix frontend port (8080 not 80)
- Add penpot bucket to seaweedfs-buckets init job
2026-04-04 15:37:45 +01:00
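
The client-side auto-reconnect with exponential backoff mentioned above might follow a delay schedule like the one sketched here. The commit does not state the base delay, growth factor, or cap, so the values below are assumptions, and the real client presumably runs in the browser plugin rather than in Python.

```python
from typing import List


def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 6) -> List[float]:
    """Delay in seconds before each reconnect attempt.

    Each delay grows by `factor` and is clamped to `cap`.
    All default values are hypothetical, not taken from the commit.
    """
    return [min(cap, base * factor ** n) for n in range(attempts)]
```

A real client would usually add random jitter to each delay so that many clients do not reconnect at the same instant after an outage.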
fcb80f1f37 feat(devtools): deploy Penpot + MCP server, wildcard TLS via DNS-01
Penpot (designer.sunbeam.pt):
- Frontend/backend/exporter deployments with OIDC-only auth via Hydra
- VSO-managed DB, S3, and app secrets from OpenBao
- PostgreSQL user/db in CNPG postInitSQL
- Hydra Maester enabledNamespaces extended to devtools

Penpot MCP server (mcp-designer.sunbeam.pt):
- Pre-built Node.js image pushed to Gitea registry
- Auth-gated via Pingora auth_request → Hydra /userinfo
- WebSocket path for browser plugin connection

Wildcard TLS:
- Switched cert-manager from HTTP-01 (per-SAN) to DNS-01 via Scaleway webhook
- Certificate collapsed to *.sunbeam.pt + sunbeam.pt
- Added scaleway-certmanager-webhook Helm chart
- VSO secret for Scaleway DNS API credentials in cert-manager namespace
- Added cert-manager to OpenBao VSO auth role
2026-04-04 12:53:27 +01:00
97628b0f6f feat(ory): OIDC group-to-team mapping, social login, Gitea OIDC-only mode
Identity permissions flow from Kratos metadata_admin.groups through
Hydra ID token claims to Gitea's OIDC group-to-team mapping:
- super-admin → site admin + Owners + Employees teams
- employee → Owners + Employees teams
- community → Contributors team (social sign-up users)

Kratos: Discord + GitHub social login providers, community identity
schema, OIDC method enabled with env-var credential injection via VSO.

Gitea: OIDC-only login (no local registration, no password form),
APP_NAME, favicon, auto-registration with account linking.

Also: messages-mta-in recreate strategy + liveness probe for milter.
2026-03-27 17:46:11 +00:00
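
The group-to-team mapping described above can be read as a lookup from the groups claim to Gitea permissions. The sketch below only models that lookup; it is not Gitea's own mapping format, and the function name and result shape are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Group name -> (site admin?, teams), as listed in the commit message.
GROUP_MAP: Dict[str, Tuple[bool, List[str]]] = {
    "super-admin": (True, ["Owners", "Employees"]),
    "employee": (False, ["Owners", "Employees"]),
    "community": (False, ["Contributors"]),
}


def resolve_permissions(groups: Iterable[str]) -> Tuple[bool, Set[str]]:
    """Merge the permissions of every known group; unknown groups are ignored."""
    is_admin = False
    teams: Set[str] = set()
    for group in groups:
        admin, group_teams = GROUP_MAP.get(group, (False, []))
        is_admin = is_admin or admin
        teams.update(group_teams)
    return is_admin, teams
```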
33f0e44545 feat(build): mTLS for buildkitd + public exposure via TLS passthrough
cert-manager self-signed CA issues server and client certs for BuildKit
mTLS. Buildkitd serves TLS on its ClusterIP (hostNetwork removed) and
is publicly reachable at build.DOMAIN_SUFFIX:443 through Pingora's new
SNI-based TLS passthrough router. Clients authenticate with the client
certificate from the buildkitd-client-tls secret.
2026-03-26 14:23:56 +00:00
632099893a feat(media): deploy Element Call fork + optimize LiveKit quality
- Deploy self-hosted Element Call at call.sunbeam.pt with SSO login
- LiveKit: VP9 > AV1 > H.264 codec preferences, Opus stereo
- LiveKit: congestion_control.allow_pause=false, larger NACK buffers
- LiveKit: resources bumped to 2Gi/4CPU for VP9 SVC
- Proxy: add call.* route, TLS cert SAN for call.sunbeam.pt
2026-03-26 09:38:53 +00:00
a912331f97 feat: CNPG PodMonitor, OpenBao ServiceMonitor, CLI OIDC client CRD
- CNPG PodMonitor for PostgreSQL cluster metrics
- OpenBao ServiceMonitor for vault metrics scraping
- Sunbeam CLI OAuth2Client CRD (moved from seed to declarative)
2026-03-25 18:01:52 +00:00
9f15f5099e fix: meet external-api route, drive media proxy, alertbot, misc tweaks
- Meet: add external-api backend path, CSRF trusted origins
- Drive: fix media proxy regex for preview URLs and S3 key signing
- OpenBao: enable Prometheus telemetry
- Postgres alerts: fix metric name (cnpg_backends_total)
- Gitea: bump memory limits for mirror workloads
- Alertbot: expanded deployment config
- Kratos: add find/cal/projects to allowed return URLs, settings path
- Pingora: meet external-api route fix
- Sol: config update
2026-03-25 18:01:15 +00:00
eab91eb85d feat(monitoring): expanded dashboards for all services
Enriched dashboards for DevTools (Gitea), Identity (Hydra/Kratos),
Infrastructure (Longhorn, PostgreSQL, cert-manager, OpenBao),
Ingress (Pingora), and Storage (SeaweedFS).
2026-03-25 17:58:51 +00:00
9ee40aaa69 feat(monitoring): comprehensive OpenSearch dashboard
6 collapsible rows covering all exporter metrics:
Overview, Search & Queries, Indexing, Storage & Indices,
Circuit Breakers & Thread Pools, OS & Process.
2026-03-25 17:54:39 +00:00
7cb6bb1bd2 feat(data): OpenSearch prometheus-exporter sidecar
elasticsearch-exporter v1.7.0 runs as a sidecar, scrapes localhost:9200,
exposes elasticsearch_* metrics on :9114. ServiceMonitor re-enabled.
Alert rules updated to use elasticsearch_* metric names.
Flags: --es.all --es.indices --es.shards --collector.clustersettings
2026-03-25 17:53:59 +00:00
0a322c8a7c remove: Docs (impress) and People (desk) from La Suite
Collabora stays (Drive needs it for WOPI document editing).
Removed: Helm charts, values, nginx configs, patches, OIDC clients,
Vault secrets, S3 buckets, Pingora routes, Kratos return URLs,
overlay image overrides and resource patches, local-up.sh restarts.
2026-03-25 17:53:43 +00:00
b13555607a docs(ops): COE-2026-002 Matrix RTC / Element Call + IPv6
Documents the missing lk-jwt-service, well-known URL fix, bare domain
routing, DNS apex records, and IPv6 cert-manager self-check failure.
Includes dual-stack K3s migration plan.
2026-03-25 13:25:05 +00:00
f8e9d32a8b feat(ingress): add bare domain sunbeam.pt to TLS cert SANs
Element X resolves .well-known from the server_name (sunbeam.pt),
not the homeserver URL. The bare domain needs a valid cert.
2026-03-25 13:24:38 +00:00
fdcc15080f fix(matrix): use https:// for livekit_url in well-known
Element Call expects livekit_service_url to be an HTTPS endpoint
(lk-jwt-service), not a WebSocket URL. The client connects to LiveKit
via WSS separately after getting a JWT.
2026-03-25 13:24:12 +00:00
84c5548f2e feat(ingress): route lk-jwt-service paths + bare domain well-known
Split livekit.* requests: /sfu/get, /healthz, /get_token → lk-jwt-service,
everything else → livekit-server (WebSocket). Add sunbeam.pt bare domain
route so Element X can discover RTC foci from the server_name.
2026-03-25 13:23:59 +00:00
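
The livekit.* split described above amounts to a path-based backend choice. A minimal sketch follows, assuming exact path matches (the commit does not say whether Pingora matches exactly or by prefix) and a hypothetical function name.

```python
# Paths the commit routes to lk-jwt-service; everything else goes to
# livekit-server. Exact matching is an assumption.
JWT_SERVICE_PATHS = frozenset({"/sfu/get", "/healthz", "/get_token"})


def livekit_backend(path: str) -> str:
    """Pick the backend for a request to the livekit.* host."""
    if path in JWT_SERVICE_PATHS:
        return "lk-jwt-service"
    return "livekit-server"
```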
4837983380 feat(media): deploy lk-jwt-service for Matrix Element Call
Bridges Element Call to LiveKit by exchanging Matrix OpenID tokens for
LiveKit JWTs. Shares API credentials with livekit-server via the
existing VSO secret (removed excludeRaw so raw fields are available).
2026-03-25 13:23:48 +00:00
50a4abf94f fix(ory): harden Kratos and Hydra production security configuration
Kratos: xchacha20-poly1305 cipher for at-rest encryption, 12-char min
password with HaveIBeenPwned + similarity check, recovery/verification
switched to code (not link), anti-enumeration on unknown recipients,
15m privileged session, 24h session extend throttle, JSON structured
logging, WebAuthn passwordless enabled, additionalProperties: false on
all identity schemas, memory limits bumped to 256Mi.

Hydra: expose_internal_errors disabled, PKCE enforced for public
clients, janitor CronJob every 6h, cookie domain set explicitly,
SSRF prevention via disallow_private_ip_ranges, JSON structured
logging, Maester enabledNamespaces includes monitoring.

Also: fixed selfservice URL patch divergence (settings path, missing
allowed_return_urls), removed invalid responseTypes on Hive client.
2026-03-24 19:40:58 +00:00
4c02fe18ed fix: use Kratos session auth for observability endpoints
Observability routes (systemmetrics, systemlogs, systemtracing) use
Kratos /sessions/whoami for auth_request — validates browser session
cookies scoped to the parent domain. Admin API routes (id, hydra,
search, vault) keep Hydra /userinfo for Bearer token auth (CLI access).
2026-03-24 13:58:34 +00:00
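
The two auth_request targets described above can be summarised as a lookup by subdomain. This is an illustrative sketch only: the function name and the returned strings are assumptions, and the real routes live in the Pingora configuration.

```python
from typing import Optional

OBSERVABILITY_HOSTS = {"systemmetrics", "systemlogs", "systemtracing"}
ADMIN_API_HOSTS = {"id", "hydra", "search", "vault"}


def auth_request_target(subdomain: str) -> Optional[str]:
    """Return which identity endpoint validates requests for a subdomain."""
    if subdomain in OBSERVABILITY_HOSTS:
        # Browser session cookie scoped to the parent domain.
        return "kratos /sessions/whoami"
    if subdomain in ADMIN_API_HOSTS:
        # Bearer token, for example from the CLI.
        return "hydra /userinfo"
    return None  # other routes are outside the scope of this commit
```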
0498d1c6b3 fix: gate systemmetrics/systemlogs/systemtracing behind OIDC auth
Prometheus, Loki, and Tempo external endpoints were publicly accessible
with no authentication. Add auth_request to all three routes using
Hydra's userinfo endpoint (same pattern as admin APIs).
2026-03-24 13:48:27 +00:00
1147b1a5aa fix: WOPI registration on restart + Collabora readiness probes
- Add readiness/liveness probes to Collabora (GET /hosting/discovery)
- Add init container to Drive backend that waits for Collabora and runs
  trigger_wopi_configuration on every pod start — fixes WOPI silently
  breaking after server restarts (chart Job only ran on sunbeam apply)
- Add OIDC_RESPONSE_MODE=query to Projects config
2026-03-24 12:22:10 +00:00
5e622ce316 feat: AlertManager Matrix integration with severity routing
Deploy matrix-alertmanager-receiver bridge (pending bot credentials in
OpenBao). Update AlertManager routing: critical → Matrix + email,
warning → Matrix only, Watchdog → null. Reduce repeat interval to 4h.
2026-03-24 12:21:29 +00:00
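
The severity routing above (critical → Matrix + email, warning → Matrix only, Watchdog → null) can be sketched as a small decision function. The receiver names are placeholders, and the handling of severities the commit does not mention is an assumption.

```python
from typing import Dict, List


def receivers_for(labels: Dict[str, str]) -> List[str]:
    """Return the receivers an alert is delivered to under this routing."""
    if labels.get("alertname") == "Watchdog":
        return []  # routed to the null receiver
    severity = labels.get("severity")
    if severity == "critical":
        return ["matrix", "email"]
    if severity == "warning":
        return ["matrix"]
    return []  # assumption: the commit does not describe other severities
```

Checking Watchdog first mirrors how a more specific route is normally listed ahead of the severity routes.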
e8c64e6f18 feat: add ServiceMonitors and enable metrics scraping
- SeaweedFS: enable -metricsPort=9091 on master/volume/filer, add
  service labels, create ServiceMonitor
- Gitea: enable metrics in config, create ServiceMonitor
- Hydra/Kratos: standalone ServiceMonitors (chart templates require
  .Capabilities.APIVersions, which is unavailable in kustomize helm template)
- LiveKit: add prometheus_port=6789, standalone ServiceMonitor
  (disabled in kustomization — host firewall blocks port 6789)
- OpenSearch: revert prometheus-exporter attempt (no plugin for v3.x),
  add service label for future exporter sidecar
2026-03-24 12:21:18 +00:00
3fc54c8851 feat: add PrometheusRule alerts for all services
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).

Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
2026-03-24 12:20:55 +00:00
74bb59cfdc feat: split Grafana dashboards into per-folder ConfigMaps
Replace monolithic dashboards-configmap.yaml with 10 dedicated files,
one per Grafana folder: Ingress, Observability, Infrastructure, Storage,
Identity, DevTools, Search, Media, La Suite, Communications.

New dashboards for Longhorn, PostgreSQL/CNPG, Cert-Manager, SeaweedFS,
Hydra, Kratos, Gitea, OpenSearch, LiveKit, La Suite golden signals
(Linkerd metrics), Matrix, and Email Pipeline.
2026-03-24 12:20:42 +00:00
234fe72707 chore: updated readme
Signed-off-by: Sienna Meridian Satterwhite <sienna@sunbeam.pt>
2026-03-24 12:04:35 +00:00
8037184a9e chore: updated README
Signed-off-by: Sienna Meridian Satterwhite <sienna@sunbeam.pt>
2026-03-24 11:51:34 +00:00
330d0758ff docs: archive system-design.md — replaced by new documentation suite
Moved to docs/archive/ as historical reference. All content has been
merged into the new boujee documentation.
2026-03-24 11:48:28 +00:00
6ad3fdeac9 docs: restyle COE-2026-001 with boujee tone
Same technical rigor, more personality. Timeline reads like a story,
5 Whys have flair, closes with wine. 🍷
2026-03-24 11:48:19 +00:00
ceb038382f docs: add infrastructure conventions — House Rules, Darling 🏠
Do's and don'ts, kustomize patterns, secret management, deployment
conventions, naming conventions, AI config, domain patterns.
2026-03-24 11:47:40 +00:00
e0afd0a4d7 docs: add ops runbook — When Things Go Sideways, Gorgeous 🚨
Diagnostic ladder, COE format, runbooks for backup/restore, secret
rotation, cert renewal, database recovery, Sol☀️ restart, alerts.
2026-03-24 11:47:08 +00:00
2f7785774b docs: add production deployment guide — Serving Looks in Production 👠
Scaleway setup, k3s, kustomize structure, deployment phases, DNS,
cert-manager, backup strategy, image registry.
2026-03-24 11:46:47 +00:00
265a68d85f docs: add local dev setup guide — Setting Up Your Vanity 💄
Lima VM, k3s, mkcert, sslip.io, sunbeam CLI setup, resource budget,
differences from production, common commands, troubleshooting.
2026-03-24 11:46:40 +00:00
977972d9f3 docs: add observability documentation — Keeping an Eye on the Girlies 👁️
Prometheus, Grafana (10 dashboards), Loki, Tempo, AlertManager → Matrix,
ServiceMonitors, PrometheusRules per component.
2026-03-24 11:46:33 +00:00
97e87c6dda docs: add identity & auth documentation — The Guest List 💋
OIDC auth flow, Kratos + Hydra, client registry (12 apps), session
management, identity schemas, self-service flows, Vault integration.
2026-03-24 11:46:28 +00:00
2b05cfd383 docs: add Sunbeam CLI documentation — The Remote Control 💅
Infrastructure lifecycle, service management, build/deploy, all
service-specific subcommands, OAuth2 PKCE login, self-update.
2026-03-24 11:46:19 +00:00
66e3692c8b docs: add Pingora proxy documentation — The Bouncer 💎
Security pipeline (DDoS, scanner, rate limiting), route table, ML
models, training pipeline, static serving, TLS, auth requests, metrics.
2026-03-24 11:46:11 +00:00
cb474ce0d4 docs: add Sol☀️ documentation — Meet Sol☀️
Covers capabilities (search, memory, code, research mode, compute,
web search, identity), engagement pipeline, multi-model orchestration,
integration depth, and deployment. They/them throughout.
2026-03-24 11:46:05 +00:00
041ef98b65 docs: add architectural overview — What's In The Box, Babe? 💅
Full tour of the SBBB stack: Pingora proxy, Ory identity, La Suite
apps, Linkerd mesh, OpenBao secrets, data layer, monitoring, Matrix,
Sol☀️, and the platform itself.
2026-03-24 11:45:56 +00:00
e1fbaa445d docs: rewrite README as the front door to The Super Boujee Business Box
Full rewrite with boujee tone — app inventory, architecture diagram,
custom components (Sol☀️, Pingora, Sunbeam CLI), team bios, and links
to the new documentation suite.
2026-03-24 11:45:39 +00:00
fe6634e9c2 docs: COE-2026-001 vault root token loss postmortem
Root token and unseal key were lost when a placeholder manifest
overwrote the openbao-keys Secret. Documents root cause, timeline,
5 whys, remediation actions, and monitoring requirements.
2026-03-23 13:43:51 +00:00
dc95e1d8ec sol v1.1.0: SearXNG web search, evaluator redesign, research agents
- SearXNG deployment in data namespace (free, no-tracking web search)
- sol-config: SearXNG URL, research config, identity agent, updated
  system prompt (DM search rules, research mode, silence, hard rules)
- sol-deployment: debug logging (RUST_LOG=sol=debug), full image path
- opensearch: tolerate missing prometheus-exporter plugin on OS 3
2026-03-23 09:54:56 +00:00
d7ff1da729 sol: identity agent, research mode, evaluator redesign, DM search
sol-config.yaml:
- added [services.kratos] with admin URL
- added research config (model, max_iterations, max_agents, max_depth)
- tool iterations bumped to 250
- updated system prompt: research mode guidance, DM search rules,
  run_script docs, room overlap explanation, silence mechanic
- time context uses {time_block} with midnight-based boundaries
- evaluator returns response_type (message/thread/react/ignore)
2026-03-23 08:47:40 +00:00
473e1ef3ab project rename
Signed-off-by: Sienna Meridian Satterwhite <sienna@sunbeam.pt>
2026-03-22 20:29:32 +00:00
a086049de6 fix: harden SeaweedFS storage and fix Drive presigned uploads
- SeaweedFS filer: Recreate strategy (prevents LevelDB lock contention),
  60s termination grace period, memory 256Mi→2Gi limit
- SeaweedFS volume: 60s termination grace period, memory 256Mi→1Gi limit
- Drive: add AWS_S3_DOMAIN_REPLACE so presigned upload URLs use
  s3.sunbeam.pt instead of internal cluster DNS
- Drive: relax liveness/readiness probes (failureThreshold 1→3,
  period 1s→10s, timeout 1s→5s) to prevent crash loops under load
2026-03-22 19:48:36 +00:00
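
To see why the relaxed Drive probes above prevent crash loops under load, it helps to compare how long a container can keep failing before the kubelet acts. The helper below is a rough approximation (consecutive failures times the probe period, ignoring the per-probe timeout); its name is hypothetical.

```python
def failure_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate time a probe must keep failing before action is taken."""
    return failure_threshold * period_seconds


# Values from the commit: failureThreshold 1 -> 3, period 1s -> 10s.
before = failure_window_seconds(1, 1)
after = failure_window_seconds(3, 10)
```

Under this approximation a single slow response used to be enough to trigger a restart, while the new settings tolerate roughly half a minute of failures.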
9af3cd3c49 feat: expose admin APIs behind OIDC auth_request
Adds Pingora routes for the id, hydra, search, and vault subdomains.
Each gated by auth_request to Hydra userinfo — only valid SSO
bearer tokens pass through. Adds new SANs to the TLS certificate.
2026-03-22 18:59:22 +00:00