# Sunbeam Studio — Infrastructure Design Document
**Version:** 0.1.0-draft
**Date:** 2026-02-28
**Author:** Sienna Satterthwaite, Chief Engineer
**Status:** Planning
---
## 1. Overview
Sunbeam is a three-person game studio founded by Sienna, Lonni, and Amber. This document describes the self-hosted collaboration and development infrastructure that supports studio operations — document editing, video calls, email, version control, AI tooling, and game asset management.
**Guiding principles:**
- **One box, one bill.** Single Scaleway Elastic Metal server in Paris. No multi-vendor sprawl.
- **European data sovereignty.** All data resides in France, GDPR-compliant by default.
- **Self-hosted, open source.** No per-seat SaaS fees. MIT-licensed where possible.
- **Consistent experience.** Unified authentication, shared design language, single login across all tools.
- **Operationally honest.** The stack is architecturally rich but the operational surface is small: three users, one node, one cluster.
---
## 2. Platform
### 2.1 Compute
| Property | Value |
|---|---|
| Provider | Scaleway Elastic Metal |
| Region | Paris (PAR1/PAR2) |
| RAM | 64 GB minimum |
| Storage | Local NVMe (k3s + OS + SeaweedFS volumes) |
| Network | Public IPv4, configurable reverse DNS |
### 2.2 Orchestration
k3s — single-node Kubernetes. Traefik disabled at install (replaced by custom Pingora proxy):
```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik" sh -
```
### 2.3 External Scaleway Services
| Service | Purpose | Estimated Cost |
|---|---|---|
| Object Storage | PostgreSQL backups (barman), cold asset overflow | ~€5–10/mo |
| Transactional Email (TEM) | Outbound SMTP relay for notifications | ~€1/mo |
| Generative APIs | AI inference for all La Suite components | ~€15/mo |
---
## 3. Namespace Layout
```
k3s cluster
├── ory/        Identity & auth (Kratos, Hydra, Login UI)
├── lasuite/    Docs, Meet, Drive, Messages, Conversations, People, Hive
├── media/      LiveKit server + TURN
├── storage/    SeaweedFS (master, volume, filer)
├── data/       CloudNativePG, Redis, OpenSearch
├── devtools/   Gitea
├── mesh/       Linkerd control plane
└── ingress/    Pingora edge proxy
```
---
## 4. Core Infrastructure
### 4.1 Authentication — Ory Kratos + Hydra
Replaces the Keycloak default from La Suite's French government deployments. No JVM, no XML — lightweight Go binaries that fit k3s cleanly.
| Component | Role |
|---|---|
| **Kratos** | Identity management (registration, login, profile, recovery) |
| **Hydra** | OAuth2 / OpenID Connect provider |
| **Login UI** | Sunbeam-branded login and consent pages |
Every La Suite app authenticates via `mozilla-django-oidc`. Each app registers as an OIDC client in Hydra with a client ID, secret, and redirect URI. Swapping Keycloak for Hydra is transparent at the app level.
**Auth flow:**
```
User → any *.sunbeam.pt app
  → 302 to auth.sunbeam.pt
  → Hydra → Kratos login UI
  → authenticate
  → Hydra issues OIDC token
  → 302 back to app
  → app validates via mozilla-django-oidc
  → session established
```
### 4.2 Database — CloudNativePG
Single PostgreSQL cluster via CloudNativePG operator. One cluster, multiple logical databases:
```
PostgreSQL (CloudNativePG)
├── kratos_db
├── hydra_db
├── docs_db
├── meet_db
├── drive_db
├── messages_db
├── conversations_db
├── people_db
├── gitea_db
└── hive_db
```
### 4.3 Object Storage — SeaweedFS
S3-compatible distributed storage. Apache 2.0 licensed (chosen over MinIO post-AGPL relicensing).
**Components:** master (metadata/topology), volume servers (data on local NVMe), filer (S3 API gateway).
**S3 endpoint:** `http://seaweedfs-filer.storage.svc:8333` (cluster-internal). For local dev access outside the cluster, expose via ingress at `s3.sunbeam.pt` or `kubectl port-forward`.
**Buckets:**
| Bucket | Consumer | Contents |
|---|---|---|
| `sunbeam-docs` | Docs | Document content, images, exports |
| `sunbeam-meet` | Meet | Recordings (if enabled) |
| `sunbeam-drive` | Drive | Uploaded/shared files |
| `sunbeam-messages` | Messages | Email attachments |
| `sunbeam-conversations` | Conversations | Chat attachments |
| `sunbeam-git-lfs` | Gitea | Git LFS objects (game assets) |
| `sunbeam-game-assets` | Hive | Game assets synced between Drive and S3 |
### 4.4 Cache — Redis
Single Redis instance in `data` namespace. Shared by Messages (Celery broker), Conversations (session/cache), Meet (LiveKit ephemeral state).
### 4.5 Search — OpenSearch
Required by Messages for full-text email search. Single-node deployment in `data` namespace.
### 4.6 Edge Proxy — Pingora (Custom Rust Binary)
Custom proxy built on Cloudflare's Pingora framework. A few hundred lines of Rust handling:
- **HTTPS termination** — Let's Encrypt certs via `rustls-acme` compiled into the proxy binary
- **Hostname routing** — static mapping of `*.sunbeam.pt` hostnames to backend ClusterIP:port
- **WebSocket passthrough** — LiveKit signaling (Meet), Y.js CRDT sync (Docs)
- **Raw UDP forwarding** — TURN relay ports (3478 + 49152–49252). Forwards bytes, not protocol. LiveKit handles TURN/STUN internally per RFC 5766. 100 relay ports is vastly more than three users need.
Nine hostnames that rarely change. No dynamic service discovery required.
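The routing table is small enough to express as a static map. A minimal sketch follows (illustrative Python; the production proxy is Rust, and the backend service DNS names and ports here are assumptions except `seaweedfs-filer.storage.svc:8333`, which is pinned elsewhere in this document):

```python
# Static hostname -> backend map, as described above. Backend names and
# ports are illustrative assumptions, not confirmed cluster values.
ROUTES = {
    "docs":   ("docs.lasuite.svc", 8000),
    "meet":   ("meet.lasuite.svc", 8000),
    "drive":  ("drive.lasuite.svc", 8000),
    "mail":   ("messages.lasuite.svc", 8000),
    "chat":   ("conversations.lasuite.svc", 8000),
    "people": ("people.lasuite.svc", 8000),
    "src":    ("gitea.devtools.svc", 3000),
    "auth":   ("hydra.ory.svc", 4444),
    "s3":     ("seaweedfs-filer.storage.svc", 8333),
}

def route(host: str):
    """Route on the first label only, so docs.sunbeam.pt and
    docs.192.168.5.2.sslip.io resolve to the same backend."""
    return ROUTES.get(host.split(".", 1)[0])
```

Matching on the first label only is what lets the same routing config serve both `*.sunbeam.pt` in production and `*.sslip.io` locally.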
### 4.7 Service Mesh — Linkerd
mTLS between all pods with zero application changes. Sidecar injection provides:
- Mutual TLS on all internal east-west traffic
- Automatic certificate rotation
- Per-route observability (request rate, success rate, latency)
Rust-based data plane — lightweight on a single node.
---
## 5. La Suite Numérique Applications
All La Suite apps share a common pattern: Django backend, React frontend, PostgreSQL, S3 storage, OIDC auth. Independent services, not a monolith.
### 5.1 Docs — `docs.sunbeam.pt`
Collaborative document editing. GDD, lore bibles, specs, meeting notes.
| Property | Detail |
|---|---|
| Editor | BlockNote (Tiptap-based) |
| Realtime | Y.js CRDT over WebSocket |
| AI | BlockNote XL AI extension — rephrase, summarize, translate, fix typos, freeform prompts. Available via formatting toolbar and `/ai` slash command. |
| Export | .odt, .docx, .pdf |
BlockNote XL packages (AI, PDF export) are GPL-licensed. Fine for internal use — GPL triggers on distribution, not deployment.
### 5.2 Meet — `meet.sunbeam.pt`
Video conferencing. Standups, playtests, partner calls.
| Property | Detail |
|---|---|
| Backend | LiveKit (self-hosted, Apache 2.0) |
| Media | DTLS-SRTP encrypted WebRTC |
| TURN | LiveKit built-in, UDP ports exposed through Pingora |
### 5.3 Drive — `drive.sunbeam.pt`
File sharing and document management. Game assets, reference material, shared resources.
Granular access control, workspace organization, linked to Messages for email attachments and Docs for file references.
### 5.4 Messages — `mail.sunbeam.pt`
Full email platform with team and personal mailboxes.
**Architecture:**
```
Inbound: Internet → MX → Pingora → Postfix MTA-in → Rspamd → Django MDA → Postgres + OpenSearch
Outbound: User → Django → Postfix MTA-out (DKIM) → Scaleway TEM relay → recipient
```
**Mailboxes:**
- Personal: `sienna@`, `lonni@`, `amber@sunbeam.pt`
- Shared: `hello@sunbeam.pt` (all three see incoming business email)
**AI features:** Thread summaries, compose assistance, auto-labelling.
**Limitation:** No IMAP/POP3 — web UI only. Deliberate upstream design choice. Acceptable for a three-person studio living in the browser.
**DNS requirements:** MX, SPF, DKIM, DMARC, PTR (reverse DNS configurable in Scaleway console).
### 5.5 Conversations — `chat.sunbeam.pt`
AI chatbot / team assistant.
| Property | Detail |
|---|---|
| AI Framework | Pydantic AI (backend), Vercel AI SDK (frontend streaming) |
| Tools | Extensible agent tools — wire into Docs search, Drive queries, Messages summaries |
| Attachments | PDF and image upload for analysis |
| Helm | Official chart at `suitenumerique.github.io/conversations/` |
Primary force multiplier. Custom tools can search GDD content, query shared files, and summarize email threads.
### 5.6 People — `people.sunbeam.pt`
Centralized user and team management. Creates users/teams and propagates permissions across all La Suite apps. Interoperates with dimail (Messages email backend) for mailbox provisioning.
Admin-facing, not a daily-use interface.
### 5.7 La Suite Integration Layer
Apps share a unified experience through:
- **`@gouvfr-lasuite/integration`** — npm package providing the shared navigation bar, header, branding. Fork/configure for Sunbeam logo, colors, and nav links.
- **`lasuite-django`** — shared Python library for OIDC helpers and common Django patterns.
- Per-app env vars for branding: `DJANGO_EMAIL_BRAND_NAME=Sunbeam`, `DJANGO_EMAIL_LOGO_IMG`, etc.
---
## 6. Development Tools
### 6.1 Gitea — `src.sunbeam.pt`
Self-hosted Git with issue tracking, wiki, and CI.
| Property | Detail |
|---|---|
| Runtime | Single Go binary |
| Auth | OIDC via Hydra (same login as everything else) |
| LFS | Built-in Git LFS, S3 backend → SeaweedFS `sunbeam-git-lfs` bucket |
| CI | Gitea Actions (GitHub Actions compatible YAML). Lightweight jobs: compiles, tests, linting. Platform-specific builds offloaded to external providers. |
| Theming | `custom/` directory for Sunbeam logo, colors, CSS |
Replaces GitHub for private repos and eliminates GitHub LFS bandwidth costs. Game assets (textures, models, audio) flow through LFS into SeaweedFS.
### 6.2 Hive — Asset Sync Service (Custom Rust Binary)
Bidirectional sync between Drive and a dedicated S3 bucket (`sunbeam-game-assets`). Lonni and Amber manage game assets through Drive's UI; the build pipeline and Sienna's tooling address the same assets via S3. Hive keeps both views consistent.
**Architecture:**
```
 Drive REST API                      SeaweedFS S3
 (Game Assets workspace)             (sunbeam-game-assets bucket)
           │                                  │
           └──────────► Hive ◄────────────────┘
                          │
                     PostgreSQL
                      (hive_db)
```
**Reconciliation loop** (configurable, default 30s):
1. Poll Drive API — list files in watched workspace (IDs, paths, modified timestamps)
2. Poll S3 — `ListObjectsV2` on game assets bucket (keys, ETags, LastModified)
3. Diff both sides against Hive's state in `hive_db`
4. For each difference:
- New in Drive → download from Drive, upload to S3, record state
- New in S3 → download from S3, upload to Drive, record state
- Drive newer → overwrite S3, update state
- S3 newer → overwrite Drive, update state
- Deleted from Drive → delete from S3, remove state
- Deleted from S3 → delete from Drive, remove state
**Conflict resolution:** Last-write-wins by timestamp. For three users this is sufficient. Log a warning when both sides change the same file within the same poll interval.
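The per-file decisions above, including last-write-wins, reduce to pure diff logic. An illustrative sketch in Python (the real Hive is Rust; timestamps are simplified to comparable integers):

```python
def plan_actions(drive, s3, state):
    """Return (action, path) decisions for one reconciliation pass.

    drive: {path: drive_modified_at}  -- from the Drive API poll
    s3:    {path: s3_last_modified}   -- from ListObjectsV2
    state: {path: last_synced_at}     -- paths recorded in hive_db
    """
    actions = []
    for path in sorted(set(drive) | set(s3) | set(state)):
        in_drive, in_s3, known = path in drive, path in s3, path in state
        if in_drive and not in_s3:
            # known -> previously synced, so the S3-side deletion wins;
            # unknown -> brand new file created in Drive
            actions.append(("delete_from_drive" if known else "copy_to_s3", path))
        elif in_s3 and not in_drive:
            actions.append(("delete_from_s3" if known else "copy_to_drive", path))
        elif in_drive and in_s3:
            # last-write-wins on modification time; equal means in sync
            if drive[path] > s3[path]:
                actions.append(("overwrite_s3", path))
            elif s3[path] > drive[path]:
                actions.append(("overwrite_drive", path))
        else:
            actions.append(("forget_state", path))  # gone from both sides
    return actions
```

A production version would also compare each side against the stored state to tell "Drive changed" apart from "S3 changed" before logging the same-interval conflict warning.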
**Path mapping:** Direct 1:1. Drive workspace folder structure maps to S3 key prefixes. `Game Assets/textures/hero_sprite.png` in Drive becomes `textures/hero_sprite.png` in S3 (workspace root stripped). Lonni creates a folder in Drive, it appears as an S3 prefix. Sienna runs `aws s3 cp` into a prefix, it appears in Drive's folder.
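Under that convention the mapping is a pair of inverse functions. A sketch (Python for illustration; the workspace name comes from the Hive configuration below):

```python
WORKSPACE = "Game Assets"  # the watched Drive workspace

def drive_path_to_s3_key(drive_path: str) -> str:
    """'Game Assets/textures/hero_sprite.png' -> 'textures/hero_sprite.png'"""
    prefix = WORKSPACE + "/"
    if not drive_path.startswith(prefix):
        raise ValueError(f"outside watched workspace: {drive_path!r}")
    return drive_path[len(prefix):]

def s3_key_to_drive_path(s3_key: str) -> str:
    """'textures/hero_sprite.png' -> 'Game Assets/textures/hero_sprite.png'"""
    return f"{WORKSPACE}/{s3_key}"
```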
**State table (`hive_db`):**
| Column | Type | Purpose |
|---|---|---|
| `id` | UUID | Primary key |
| `drive_file_id` | TEXT | Drive's internal file ID |
| `drive_path` | TEXT | Human-readable path in Drive |
| `s3_key` | TEXT | S3 object key |
| `drive_modified_at` | TIMESTAMPTZ | Last modification on Drive side |
| `s3_etag` | TEXT | S3 object ETag |
| `s3_last_modified` | TIMESTAMPTZ | Last modification on S3 side |
| `last_synced_at` | TIMESTAMPTZ | When Hive last reconciled this file |
| `sync_source` | TEXT | Which side was source of truth (`drive` or `s3`) |
**Large file handling:** Files over 50 MB stream to a temp file before uploading to the other side. Multipart upload for S3 targets. No large files held in memory.
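The size-based branching can be sketched as follows (illustrative Python; `upload` stands in for the Drive or S3 client call, which is hypothetical here, and the real S3 path would use multipart upload):

```python
import io
import shutil
import tempfile

LARGE_FILE_THRESHOLD = 50 * 1024 * 1024  # 50 MB, matching large_file_threshold_mb

def transfer(src_stream, size: int, upload) -> None:
    """Move one file to the other side without holding large files in memory.

    `upload` is a stand-in for the destination client call (hypothetical).
    """
    if size < LARGE_FILE_THRESHOLD:
        upload(io.BytesIO(src_stream.read()))      # small file: buffer in RAM
    else:
        with tempfile.TemporaryFile() as tmp:      # large file: spill to disk
            shutil.copyfileobj(src_stream, tmp, length=8 * 1024 * 1024)
            tmp.seek(0)
            upload(tmp)                            # real S3 path: multipart
```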
**Authentication:** OIDC client credentials via Hydra (same as every other service). Registered as client `hive` in the OIDC registry.
**Crate dependencies:**
| Crate | Purpose |
|---|---|
| `reqwest` | HTTP client for Drive REST API |
| `aws-sdk-s3` | S3 client for SeaweedFS |
| `sqlx` | Async PostgreSQL driver |
| `tokio` | Async runtime |
| `serde` / `serde_json` | Serialization |
| `tracing` | Structured logging |
**Configuration:**
```toml
[drive]
base_url = "https://drive.sunbeam.pt"
workspace = "Game Assets"
oidc_client_id = "hive"
oidc_client_secret_file = "/run/secrets/hive-oidc"
oidc_token_url = "https://auth.sunbeam.pt/oauth2/token"
[s3]
endpoint = "http://seaweedfs-filer.storage.svc:8333"
bucket = "sunbeam-game-assets"
region = "us-east-1"
access_key_file = "/run/secrets/seaweedfs-key"
secret_key_file = "/run/secrets/seaweedfs-secret"
[postgres]
url_file = "/run/secrets/hive-db-url"
[sync]
interval_seconds = 30
temp_dir = "/tmp/hive"
large_file_threshold_mb = 50
```
**Deployment:** Single pod in `lasuite` namespace. No PVC needed — state lives in PostgreSQL, temp files are ephemeral. OIDC credentials and S3 keys via Kubernetes secrets.
**Size estimate:** ~800–1200 lines of Rust. Reconciliation logic is the bulk; Drive API and S3 clients are mostly configuration of existing crates.
---
## 7. AI Integration
All AI features across the stack share a single backend.
### 7.1 Backend
**Scaleway Generative APIs** — hosted in Paris, GDPR-compliant. Fully OpenAI-compatible endpoint. Prompts and outputs are not read, reused, or analyzed by Scaleway.
### 7.2 Model
**`mistral-small-3.2-24b-instruct-2506`**
| Property | Value |
|---|---|
| Input | €0.15 / M tokens |
| Output | €0.35 / M tokens |
| Capabilities | Chat + Vision |
| Strengths | Summarization, rephrasing, translation, instruction following |
Estimated 25–50M tokens/month for three users ≈ €12/month at the high end, after the 1M free tier.
**Upgrade path:** If Conversations needs heavier reasoning, route it to `qwen3-235b-a22b-instruct` (€0.75/€2.25 per M tokens) while keeping Docs and Messages on Mistral Small.
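The arithmetic behind the estimate, assuming an even input/output token split (the split is an assumption; the per-token prices come from the table above):

```python
PRICE_IN, PRICE_OUT = 0.15, 0.35   # EUR per million tokens (Mistral Small)
FREE_TIER_M = 1.0                  # 1M free tokens per month

def monthly_cost(million_tokens: float) -> float:
    """Estimated EUR/month, splitting billable tokens 50/50 input/output."""
    billable = max(million_tokens - FREE_TIER_M, 0.0)
    return billable / 2 * PRICE_IN + billable / 2 * PRICE_OUT
```

For example, 50M tokens at an even split comes to roughly €12.25/month, in line with the figure above; usage under the 1M free tier costs nothing.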
### 7.3 Configuration
Three env vars, identical across all components:
```bash
AI_BASE_URL=https://api.scaleway.ai/v1/
AI_API_KEY=<SCW_SECRET_KEY>
AI_MODEL=mistral-small-3.2-24b-instruct-2506
```
### 7.4 Capabilities by Component
| Component | What AI Does |
|---|---|
| Docs | Rephrase, summarize, fix typos, translate, freeform prompts on selected text |
| Messages | Thread summaries, compose assistance, auto-labelling |
| Conversations | Full chat interface, extensible agent tools, attachment analysis |
---
## 8. DNS Map
All A records point to the Elastic Metal public IP. TLS terminated by Pingora.
| Hostname | Backend |
|---|---|
| `docs.sunbeam.pt` | Docs |
| `meet.sunbeam.pt` | Meet |
| `drive.sunbeam.pt` | Drive |
| `mail.sunbeam.pt` | Messages |
| `chat.sunbeam.pt` | Conversations |
| `people.sunbeam.pt` | People |
| `src.sunbeam.pt` | Gitea |
| `auth.sunbeam.pt` | Ory Hydra + Login UI |
| `s3.sunbeam.pt` | SeaweedFS S3 endpoint (dev access) |
**Email DNS (sunbeam.pt zone):**
| Record | Value |
|---|---|
| MX | → Elastic Metal IP |
| TXT (SPF) | `v=spf1 ip4:<EM_IP> include:tem.scaleway.com ~all` |
| TXT (DKIM) | Generated by Postfix/Messages |
| TXT (DMARC) | `v=DMARC1; p=quarantine; rua=mailto:dmarc@sunbeam.pt` |
| PTR | Configured in Scaleway console |
---
## 9. OIDC Client Registry
Each application registered in Ory Hydra:
| Client | Redirect URI | Scopes |
|---|---|---|
| Docs | `https://docs.sunbeam.pt/oidc/callback/` | `openid profile email` |
| Meet | `https://meet.sunbeam.pt/oidc/callback/` | `openid profile email` |
| Drive | `https://drive.sunbeam.pt/oidc/callback/` | `openid profile email` |
| Messages | `https://mail.sunbeam.pt/oidc/callback/` | `openid profile email` |
| Conversations | `https://chat.sunbeam.pt/oidc/callback/` | `openid profile email` |
| People | `https://people.sunbeam.pt/oidc/callback/` | `openid profile email` |
| Gitea | `https://src.sunbeam.pt/user/oauth2/sunbeam/callback` | `openid profile email` |
| Hive | Client credentials grant (no redirect URI) | `openid` |
---
## 10. Local Development Environment
### 10.1 Goal
The local dev stack is **structurally identical** to production. Same k3s orchestrator, same namespaces, same manifests, same service DNS, same Linkerd mesh, same Pingora edge proxy, same TLS termination, same OIDC flows. The only differences are resource limits, the TLS cert source (mkcert vs Let's Encrypt), and the domain suffix (sslip.io vs sunbeam.pt). Traffic flows through the same path locally as it does in production: browser → Pingora → Linkerd sidecar → app → Linkerd sidecar → data stores. Bugs caught locally are bugs that would have happened in production.
### 10.2 Platform
| Property | Value |
|---|---|
| Machine | MacBook Pro M1 Pro, 10-core, 32 GB RAM |
| VM | Lima (lightweight Linux VM, virtiofs, Apple Virtualization.framework) |
| Orchestration | k3s inside Lima VM (`--disable=traefik`, identical to production) |
| Architecture | arm64 native (no Rosetta overhead) |
```bash
# Install Lima + k3s
brew install lima mkcert
# Create Lima VM with sufficient resources for the full stack
limactl start --name=sunbeam template://k3s \
--memory=12 \
--cpus=6 \
--disk=60 \
--vm-type=vz \
--mount-type=virtiofs
# Confirm
limactl shell sunbeam kubectl get nodes
```
12 GB VM allocation covers the full stack (~6 GB pods + kubelet/OS overhead) and leaves 20 GB for macOS, IDE, browser, and builds.
### 10.3 What Stays the Same
Everything:
- **Namespace layout** — all namespaces identical: `ory/`, `lasuite/`, `media/`, `storage/`, `data/`, `devtools/`, `mesh/`, `ingress/`
- **Kubernetes manifests** — same Deployments, Services, ConfigMaps, Secrets. Applied with `kubectl apply` or Helm.
- **Service DNS** — `seaweedfs-filer.storage.svc`, `kratos.ory.svc`, `hydra.ory.svc`, etc. Apps resolve the same internal names.
- **Service mesh** — Linkerd injected into all application namespaces. mTLS between all pods. Same topology as production.
- **Edge proxy** — Pingora runs in `ingress/` namespace, routes by hostname, terminates TLS. Same binary, same routing config (different cert source).
- **Database structure** — same CloudNativePG operator, same logical databases, same schemas.
- **S3 bucket structure** — same SeaweedFS filer, same bucket names.
- **OIDC flow** — same Kratos + Hydra, same client registrations. Redirect URIs point at sslip.io hostnames instead of `sunbeam.pt`.
- **AI configuration** — same `AI_BASE_URL` / `AI_API_KEY` / `AI_MODEL` env vars, same Scaleway endpoint.
- **Hive sync** — same reconciliation loop against local Drive and SeaweedFS.
- **TURN/UDP** — Pingora forwards UDP to LiveKit on the same port range (49152–49252).
### 10.4 Local DNS — sslip.io
[sslip.io](https://sslip.io) provides wildcard DNS that embeds the IP address in the hostname. The Lima VM gets a routable IP on the host (e.g., `192.168.5.2`), and all services resolve through it:
| Production | Local |
|---|---|
| `docs.sunbeam.pt` | `docs.192.168.5.2.sslip.io` |
| `meet.sunbeam.pt` | `meet.192.168.5.2.sslip.io` |
| `drive.sunbeam.pt` | `drive.192.168.5.2.sslip.io` |
| `mail.sunbeam.pt` | `mail.192.168.5.2.sslip.io` |
| `chat.sunbeam.pt` | `chat.192.168.5.2.sslip.io` |
| `people.sunbeam.pt` | `people.192.168.5.2.sslip.io` |
| `src.sunbeam.pt` | `src.192.168.5.2.sslip.io` |
| `auth.sunbeam.pt` | `auth.192.168.5.2.sslip.io` |
| `s3.sunbeam.pt` | `s3.192.168.5.2.sslip.io` |
Pingora hostname routing works identically — it just matches on `docs.*`, `meet.*`, etc. regardless of the domain suffix. The domain suffix is the only thing that changes between overlays.
```bash
# Get the Lima VM IP
LIMA_IP=$(limactl shell sunbeam hostname -I | awk '{print $1}')
echo "Local base domain: ${LIMA_IP}.sslip.io"
```
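The production-to-local hostname mapping is purely mechanical, which is the whole point. A sketch:

```python
def local_host(prod_host: str, lima_ip: str) -> str:
    """Swap the domain suffix: docs.sunbeam.pt -> docs.<IP>.sslip.io."""
    label = prod_host.split(".", 1)[0]
    return f"{label}.{lima_ip}.sslip.io"
```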
### 10.5 Local TLS — mkcert
Production uses `rustls-acme` with Let's Encrypt. Locally, Pingora loads a self-signed wildcard cert generated by [mkcert](https://github.com/FiloSottile/mkcert), which installs a local CA trusted by the system and browsers:
```bash
brew install mkcert
mkcert -install # Trust the local CA
LIMA_IP=$(limactl shell sunbeam hostname -I | awk '{print $1}')
mkcert "*.${LIMA_IP}.sslip.io"
# Creates: _wildcard.<IP>.sslip.io.pem + _wildcard.<IP>.sslip.io-key.pem
```
The certs are mounted into the Pingora pod via a Secret. The local Pingora config differs from production only in the cert source — file path to the mkcert cert instead of `rustls-acme` ACME negotiation. All other routing logic is identical.
### 10.6 What Changes (Local Overrides)
Managed via `values-local.yaml` overlays per component. The list is intentionally short:
| Concern | Production | Local |
|---|---|---|
| **Resource limits** | Sized for 64 GB server | Capped tight (see §10.7) |
| **TLS cert source** | `rustls-acme` + Let's Encrypt | mkcert wildcard cert mounted as Secret |
| **Domain suffix** | `sunbeam.pt` | `<LIMA_IP>.sslip.io` |
| **OIDC redirect URIs** | `https://*.sunbeam.pt/...` | `https://*.sslip.io/...` |
| **Pingora listen** | Bound to public IP, ports 80/443 + UDP 3478/49152–49252 | hostPort on Lima VM |
| **Backups** | barman → Scaleway Object Storage | Disabled |
| **Email DNS** | MX, SPF, DKIM, DMARC, PTR | Not applicable (no inbound email) |
Everything else — mesh injection, mTLS, proxy routing, service discovery, OIDC flows, S3 paths, AI integration — is the same.
### 10.7 Resource Limits (Local)
Target: **~6–8 GB total** for the full stack including mesh and edge, leaving 24+ GB for IDE, browser, builds.
| Component | Memory Limit | Notes |
|---|---|---|
| **Mesh + Edge** | | |
| Linkerd control plane | 128 MB | destination, identity, proxy-injector combined |
| Linkerd proxies (sidecars) | ~15 MB each | ~20 injected pods ≈ 300 MB total |
| Pingora | 64 MB | Rust binary, lightweight |
| **Data** | | |
| PostgreSQL (CloudNativePG) | 512 MB | Handles all 10 databases fine at this scale |
| Redis | 64 MB | |
| OpenSearch | 512 MB | `ES_JAVA_OPTS=-Xms256m -Xmx512m` |
| **Storage** | | |
| SeaweedFS (master) | 64 MB | Metadata only |
| SeaweedFS (volume) | 256 MB | Actual data storage |
| SeaweedFS (filer) | 256 MB | S3 API gateway |
| **Auth** | | |
| Ory Kratos | 64 MB | Go binary, tiny footprint |
| Ory Hydra | 64 MB | Go binary, tiny footprint |
| Login UI | 64 MB | |
| **Apps** | | |
| Docs (Django) | 256 MB | |
| Docs (Next.js) | 256 MB | |
| Meet | 128 MB | |
| LiveKit | 128 MB | |
| Drive (Django) | 256 MB | |
| Drive (Next.js) | 256 MB | |
| Messages (Django + MDA) | 256 MB | |
| Messages (Next.js) | 256 MB | |
| Postfix MTA-in/out | 64 MB each | |
| Rspamd | 128 MB | |
| Conversations (Django) | 256 MB | |
| Conversations (Next.js) | 256 MB | |
| People (Django) | 128 MB | |
| **Dev Tools** | | |
| Gitea | 256 MB | Go binary |
| Hive | 64 MB | Rust binary, tiny |
| **Total** | **~5.5 GB** | Including mesh overhead. Well within budget. |
The Linkerd sidecar proxies add ~300 MB across all pods. Still leaves plenty of headroom on 32 GB. You don't need to run everything simultaneously — working on Hive? Skip Meet, Messages, Conversations. Testing the email flow? Skip Meet, Gitea, Hive. But you *can* run it all if you want to.
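A quick arithmetic check of the budget, with limits copied from the table above (sidecar and Postfix rows entered as totals, not per-pod):

```python
# Memory limits in MB, transcribed from the local resource table.
LIMITS_MB = {
    "linkerd-control-plane": 128, "linkerd-sidecars-total": 300, "pingora": 64,
    "postgresql": 512, "redis": 64, "opensearch": 512,
    "seaweedfs-master": 64, "seaweedfs-volume": 256, "seaweedfs-filer": 256,
    "kratos": 64, "hydra": 64, "login-ui": 64,
    "docs-django": 256, "docs-nextjs": 256,
    "meet": 128, "livekit": 128,
    "drive-django": 256, "drive-nextjs": 256,
    "messages-django": 256, "messages-nextjs": 256,
    "postfix-in-out": 128, "rspamd": 128,
    "conversations-django": 256, "conversations-nextjs": 256,
    "people": 128, "gitea": 256, "hive": 64,
}
total_mb = sum(LIMITS_MB.values())  # 5356 MB, consistent with the ~5.5 GB total
```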
### 10.8 Access Pattern
Traffic flows through Pingora, exactly like production. Browser hits `https://docs.<LIMA_IP>.sslip.io` → Pingora terminates TLS → routes to Docs service → Linkerd sidecar handles mTLS to backend.
```bash
# After deploying the local stack:
LIMA_IP=$(limactl shell sunbeam hostname -I | awk '{print $1}')
echo "Docs: https://docs.${LIMA_IP}.sslip.io"
echo "Meet: https://meet.${LIMA_IP}.sslip.io"
echo "Drive: https://drive.${LIMA_IP}.sslip.io"
echo "Mail: https://mail.${LIMA_IP}.sslip.io"
echo "Chat: https://chat.${LIMA_IP}.sslip.io"
echo "People: https://people.${LIMA_IP}.sslip.io"
echo "Source: https://src.${LIMA_IP}.sslip.io"
echo "Auth: https://auth.${LIMA_IP}.sslip.io"
echo "S3: https://s3.${LIMA_IP}.sslip.io"
echo "Linkerd: kubectl port-forward -n mesh svc/linkerd-viz 8084:8084"
```
Direct `kubectl port-forward` is still available as a fallback for debugging individual services, but the normal workflow goes through the edge — same as production.
### 10.9 Manifest Organization
```
sunbeam-infra/ ← Gitea repo (and GitHub mirror)
├── base/ ← Shared manifests (both environments)
│ ├── mesh/
│ ├── ingress/
│ ├── ory/
│ ├── lasuite/
│ ├── media/
│ ├── storage/
│ ├── data/
│ └── devtools/
├── overlays/
│ ├── production/ ← Production-specific values
│ │ ├── values-ory.yaml (sunbeam.pt redirect URIs)
│ │ ├── values-pingora.yaml (rustls-acme, LE certs)
│ │ ├── values-docs.yaml
│ │ ├── values-linkerd.yaml
│ │ └── ...
│ └── local/ ← Local dev overrides
│ ├── values-domain.yaml (sslip.io suffix, mkcert cert path)
│ ├── values-ory.yaml (sslip.io redirect URIs)
│ ├── values-pingora.yaml (mkcert TLS, hostPort binding)
│ ├── values-resources.yaml (global memory caps)
│ └── ...
├── secrets/
│ ├── production/ ← Sealed Secrets or SOPS-encrypted
│ └── local/ ← Plaintext (gitignored), includes mkcert certs
└── scripts/
├── local-up.sh ← Start Lima VM, deploy full stack
├── local-down.sh ← Tear down
├── local-certs.sh ← Generate mkcert wildcard for current Lima IP
└── local-urls.sh ← Print all https://*.sslip.io URLs
```
Deploy to either environment:
```bash
# Local
kubectl apply -k overlays/local/
# Production
kubectl apply -k overlays/production/
```
Same base manifests. Same mesh. Same edge. Different certs and domain suffix. One repo.
---
## 11. Deployment Sequence (Production)
### Phase 0: Local Validation (MacBook k3s)
Every phase below is first deployed and tested on the local Lima + k3s stack before touching production. The workflow:
1. Apply manifests to local k3s using `kubectl apply -k overlays/local/`
2. Verify the component starts, passes health checks, and integrates with dependencies
3. Run the phase's integration test through the full edge path (`https://*.sslip.io` — same Pingora routing, same Linkerd mesh, same OIDC flows)
4. Commit manifests to `sunbeam-infra` repo
5. Apply to production using `kubectl apply -k overlays/production/`
6. Verify on production
This catches misconfigurations, missing env vars, broken OIDC flows, and service connectivity issues before they hit production. The local stack is structurally identical — same namespaces, same service DNS, same manifests — so a successful local deploy is a high-confidence signal for production.
### Phase 1: Foundation
1. Provision Elastic Metal, install k3s (`--disable=traefik`)
2. Deploy Linkerd service mesh
3. Deploy CloudNativePG operator + PostgreSQL cluster
4. Deploy Redis
5. Deploy OpenSearch
6. Deploy SeaweedFS (master + volume + filer)
7. Deploy Pingora with TLS for `*.sunbeam.pt`
### Phase 2: Authentication
8. Deploy Ory Kratos + Hydra
9. Deploy Sunbeam-branded login UI at `auth.sunbeam.pt`
10. Create initial identities (Sienna, Lonni, Amber)
11. Verify OIDC flow end-to-end
### Phase 3: Core Apps
12. Deploy Docs → verify Y.js WebSocket, AI slash command
13. Deploy Meet → verify WebSocket signaling + TURN/UDP
14. Deploy Drive → verify S3 uploads
15. Deploy People → verify user/team management
16. For each: create database, create S3 bucket, register OIDC client, deploy, verify
### Phase 4: Communication
17. Configure email DNS (MX, SPF, DKIM, DMARC, PTR)
18. Deploy Messages (Postfix MTA-in/out, Rspamd, Django MDA)
19. Provision mailboxes via People: personal + `hello@` shared inbox
20. Test send/receive with external addresses
### Phase 5: AI + Dev Tools
21. Generate Scaleway Generative APIs key
22. Set `AI_BASE_URL` / `AI_API_KEY` / `AI_MODEL` across all components
23. Deploy Conversations → verify chat, tool calls, streaming
24. Deploy Gitea → configure OIDC, LFS → SeaweedFS S3 backend
25. Apply Sunbeam theming to Gitea
26. Create "Game Assets" workspace in Drive
27. Deploy Hive → configure Drive workspace, S3 bucket, OIDC client credentials
28. Verify bidirectional sync: upload file in Drive → appears in S3, `aws s3 cp` to bucket → appears in Drive
### Phase 6: Hardening
29. Configure CloudNativePG backups → Scaleway Object Storage (barman)
30. Configure SeaweedFS replication for critical buckets
31. Create `sunbeam-studio` GitHub org, create private mirror repos
32. Add `GITHUB_MIRROR_TOKEN` secret to Gitea, deploy mirror workflow to all repos
33. Verify nightly mirror: check GitHub repos reflect Gitea state
34. Full integration smoke test: create user → log in → create doc → send email → push code → upload asset in Drive → verify in S3 → ask AI
35. Enable Linkerd dashboard + Scaleway Cockpit for monitoring
---
## 12. Backup & Replication Strategy
### 12.1 Offsite Replication — Scaleway Object Storage
SeaweedFS runs on local NVMe (single node). Scaleway Object Storage in Paris serves as the offsite replication target for disaster recovery.
**Scaleway Object Storage pricing (Paris):**
| Tier | Cost | Use Case |
|---|---|---|
| Standard Multi-AZ | ~€0.015/GB/month | Critical data (barman backups, active game assets) |
| Standard One Zone | ~€0.008/GB/month | Less critical replicas |
| Glacier | ~€0.003/GB/month | Deep archive (old builds, historical assets) |
| Egress | 75 GB free/month, then €0.01/GB | |
| Requests + Ingress | Included | |
**Estimated replication cost:** 100 GB on Multi-AZ ≈ €1.50/month. Even 500 GB Multi-AZ ≈ €7.50/month. Glacier for deep archive of old builds is essentially free.
### 12.2 Code Backup — GitHub Mirror
All Gitea repositories are mirrored daily to private GitHub repos as an offsite code backup. This is **code only** — Git LFS objects are excluded (covered by SeaweedFS → Scaleway Object Storage replication above).
**Implementation:** Gitea Actions cron job, runs nightly at 03:00 UTC.
```yaml
# .gitea/workflows/github-mirror.yaml (placed in each repo)
name: Mirror to GitHub
on:
  schedule:
    - cron: '0 3 * * *'
jobs:
  mirror:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
          lfs: false
      - name: Push mirror
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_MIRROR_TOKEN }}
        run: |
          git remote add github "https://${GITHUB_TOKEN}@github.com/sunbeam-studio/${{ github.event.repository.name }}.git" 2>/dev/null || true
          # actions/checkout leaves non-default branches as remote-tracking
          # refs; push those as real branches so the mirror gets every branch
          git push github --force 'refs/remotes/origin/*:refs/heads/*'
          git push github --tags --force
```
**GitHub org:** `sunbeam-studio` (all repos private, free tier covers unlimited private repos).
**Mirrored repos:** `sunbeam-infra`, `pingora-proxy`, `hive`, `game`, and any future Sunbeam repositories. **Not mirrored:** Git LFS objects (game assets, large binaries) and secrets (never in Git).
This gives triple redundancy on source code: Gitea on Elastic Metal, GitHub mirror, and every developer's local clone. If the server and all Scaleway backups vanish simultaneously, the code is still safe.
### 12.3 Backup Schedule
| What | Method | Destination | Frequency | Retention |
|---|---|---|---|---|
| PostgreSQL (all DBs) | CloudNativePG barmanObjectStore | Scaleway Object Storage (Multi-AZ) | Continuous WAL + daily base | 30 days PITR, 90 days base |
| SeaweedFS (all buckets) | Nightly sync to Scaleway Object Storage | Scaleway Object Storage (One Zone) | Nightly | 30 days |
| Git repositories (code) | Gitea Actions → GitHub mirror | GitHub (`sunbeam-studio` org, private) | Nightly 03:00 UTC | Indefinite |
| Git repositories (local) | Distributed by nature (every clone) | Developer machines | Every push | Indefinite |
| Git LFS objects | In SeaweedFS → covered by SeaweedFS sync | Scaleway Object Storage | Per SeaweedFS schedule | 30 days |
| Cluster config (manifests, Helm values) | Committed to Gitea (mirrored to GitHub) | Distributed + GitHub | Every commit | Indefinite |
| Ory config | Committed to Gitea (secrets via Sealed Secrets or Scaleway Secret Manager) | Distributed + GitHub | Every commit | Indefinite |
| Pingora config | Committed to Gitea (mirrored to GitHub) | Distributed + GitHub | Every commit | Indefinite |
**Monthly verification:** Restore a random database to a scratch namespace, verify integrity and app startup. Spot-check a GitHub mirror repo against Gitea (compare `git log --oneline -5` on both remotes). Automate via Gitea Actions cron job.
---
## 13. Operational Runbooks
### 13.1 Add a New User
1. Create identity in Kratos (via People UI or Kratos admin API)
2. People propagates permissions to La Suite apps
3. Messages provisions personal mailbox (`name@sunbeam.pt`)
4. Gitea account auto-provisions on first OIDC login
5. User visits any `*.sunbeam.pt` URL, authenticates once, has access everywhere
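Step 1 against the Kratos admin API is a single POST. A sketch of the payload; the trait names (`email`, `name`) depend on the configured identity schema, and the in-cluster admin URL in the comment is an assumption:

```shell
#!/bin/sh
# Identity payload for the Kratos admin API (trait names assume a default-style schema).
payload='{
  "schema_id": "default",
  "traits": { "email": "amber@sunbeam.pt", "name": "Amber" }
}'

# POST it to the Kratos admin endpoint (service name and port 4434 are assumptions):
#   curl -s -X POST http://kratos-admin.ory.svc:4434/admin/identities \
#        -H "Content-Type: application/json" -d "$payload"
echo "$payload"
```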
### 13.2 Deploy a New La Suite Component
1. Create logical database in CloudNativePG
2. Create S3 bucket in SeaweedFS
3. Register OIDC client in Hydra (ID, secret, redirect URIs)
4. Deploy to `lasuite` namespace with standard env vars:
- `DJANGO_DATABASE_URL`, `AWS_S3_ENDPOINT_URL`, `AWS_S3_BUCKET_NAME`
- `OIDC_RP_CLIENT_ID`, `OIDC_RP_CLIENT_SECRET`
- `AI_BASE_URL`, `AI_API_KEY`, `AI_MODEL`
5. Add hostname route in Pingora
6. Verify auth flow, S3 access, AI connectivity
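The standard env block in step 4 might look like this in the Deployment manifest. A sketch only: the in-cluster service names, secret names, Scaleway base URL, and model ID are illustrative assumptions (the SeaweedFS S3 gateway default port is 8333):

```yaml
env:
  - name: DJANGO_DATABASE_URL
    valueFrom:
      secretKeyRef: { name: newapp-db-credentials, key: url }
  - name: AWS_S3_ENDPOINT_URL
    value: "http://seaweedfs-s3.storage.svc:8333"   # SeaweedFS S3 gateway (default port)
  - name: AWS_S3_BUCKET_NAME
    value: "newapp"
  - name: OIDC_RP_CLIENT_ID
    value: "newapp"
  - name: OIDC_RP_CLIENT_SECRET
    valueFrom:
      secretKeyRef: { name: newapp-oidc, key: client-secret }
  - name: AI_BASE_URL
    value: "https://api.scaleway.ai/v1"             # Scaleway Generative APIs (assumed path)
  - name: AI_API_KEY
    valueFrom:
      secretKeyRef: { name: scaleway-ai, key: api-key }
  - name: AI_MODEL
    value: "mistral-small-latest"                   # illustrative model ID
```

Keeping secrets in `secretKeyRef`s (backed by Sealed Secrets or Scaleway Secret Manager, per §12.3) means the manifest itself can live in Gitea.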
### 13.3 Restore PostgreSQL from Backup
**Full cluster:** CloudNativePG bootstraps new cluster from barman backup in Scaleway Object Storage. Specify `recoveryTarget.targetTime` for PITR. Verify integrity, swap service endpoints.
**Single database:** `pg_dump` from recovered cluster → `pg_restore` into production.
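The full-cluster PITR bootstrap is declarative in CloudNativePG. A sketch, with the cluster name, bucket path, target time, and secret names as assumptions:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-restore
spec:
  instances: 1
  bootstrap:
    recovery:
      source: pg-main
      recoveryTarget:
        targetTime: "2026-02-27 03:00:00+00"   # PITR target (example)
  externalClusters:
    - name: pg-main
      barmanObjectStore:
        destinationPath: s3://sunbeam-pg-backups/   # bucket name is an assumption
        endpointURL: https://s3.fr-par.scw.cloud
        s3Credentials:
          accessKeyId:
            name: scaleway-s3
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: scaleway-s3
            key: SECRET_ACCESS_KEY
```

Once the recovered cluster is verified, swap the application services' database endpoints over to it.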
### 13.4 Recover from Elastic Metal Failure
1. Provision new Elastic Metal instance
2. Install k3s, deploy Linkerd
3. Restore CloudNativePG from barman (Scaleway Object Storage)
4. Restore SeaweedFS data from Scaleway Object Storage replicas
5. Re-deploy all manifests from Gitea (every developer has a clone)
6. Update DNS A records to new IP
7. Update PTR record in Scaleway console
8. Verify OIDC, email, TURN, AI connectivity
### 13.5 Troubleshoot LiveKit TURN
Symptoms: Users connect to Meet but have no audio/video.
1. Verify UDP 3478 + 49152–49252 reachable from outside
2. Check Pingora UDP forwarding is active
3. Check LiveKit logs for TURN allocation failures
4. Verify Elastic Metal firewall rules
5. Test with external STUN/TURN tester
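Step 1 can be probed from any outside machine. A sketch assuming Meet's public hostname; note that UDP "port open" checks via `nc` are best-effort, since UDP has no handshake to confirm:

```shell
#!/bin/sh
# External reachability probe for LiveKit TURN ports (hostname is an assumption).
HOST="meet.sunbeam.pt"
PORTS="3478 49152 49252"   # TURN control port + edges of the relay range

probe() {
  # -u UDP, -z scan only, -w 2 two-second timeout; UDP results are best-effort.
  nc -vzu -w 2 "$HOST" "$1"
}

# Run only when explicitly asked, so sourcing this file sends no packets.
if [ "${RUN_PROBE:-0}" = 1 ]; then
  for p in $PORTS; do
    probe "$p"
  done
fi
```

If the control port answers but calls still have no media, the relay range or Pingora's UDP forwarding (steps 2 and 4) is the usual culprit.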
### 13.6 Certificate Renewal Failure
1. Check Pingora logs for ACME errors
2. Verify port 80 reachable for HTTP-01 challenge (or DNS-01 if configured)
3. Restart Pingora to force `rustls-acme` renewal retry
---
## 14. Maintenance Schedule
### Weekly
- Check CloudNativePG backup status (latest successful timestamp)
- Glance at Linkerd dashboard for error rate anomalies
- Review Scaleway billing for unexpected charges
### Monthly
- Apply k3s patch releases if available
- Check suitenumerique GitHub for new La Suite releases, review changelogs
- Update container images one at a time, verify after each
- Review SeaweedFS storage utilization
- Run backup restore test (random database → scratch namespace)
### Quarterly
- **La Suite upstream sync:** Test new releases in local Docker Compose before deploying. One component at a time.
- **Ory updates:** Kratos/Hydra migrations may involve schema changes. Always backup first.
- **Linkerd updates:** Follow upgrade guide. Data plane sidecars roll automatically.
- **Security audit:** Review exposed ports, DNS, TLS config. Run `testssl.sh` against all endpoints. Check CVEs in deployed images.
- **Storage rebalance:** Evaluate SeaweedFS vs Scaleway Object Storage split. Move cold game assets to Scaleway if NVMe is filling.
- **AI model review:** Check Scaleway for new models. Evaluate cost/performance. Test in Conversations before switching.
### Annually
- Review Elastic Metal spec — more RAM, more disk?
- Evaluate new La Suite components
- Domain renewal for `sunbeam.pt`
- Full disaster recovery drill: simulate Elastic Metal loss, restore everything to a fresh instance from backups
---
## 15. Cost Estimate
| Item | Monthly |
|---|---|
| Scaleway Elastic Metal (64GB, NVMe) | ~€80–120 |
| Scaleway Object Storage (backups + replication) | ~€2–10 |
| Scaleway Transactional Email | ~€1 |
| Scaleway Generative APIs | ~€15 |
| Domain (amortized) | ~€2 |
| **Total** | **~€86–138** |
For comparison: Google Workspace (€12/user × 3) + Zoom (€13) + Notion (€8/user × 3) + GitHub Team (€4/user × 3) + Linear (€8/user × 3) + email hosting ≈ €130+/month — with no data control, no customization, and costs that scale per seat.
---
## 16. Architecture Diagram (Text)
```
Internet
┌──────────┴──────────┐
│ Pingora Edge │
│ HTTPS + WS + UDP │
└──────────┬──────────┘
┌──────────┴──────────┐
│ Linkerd mTLS mesh │
└──────────┬──────────┘
┌────────┬───────┬───┴───┬────────┬────────┐
│ │ │ │ │ │
┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌──┴──┐ ┌───┴──┐ ┌──┴──┐
│Docs │ │Meet │ │Drive│ │Msgs │ │Convos│ │Gitea│
└──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └───┬──┘ └──┬──┘
│ │ │ │ │ │
│ ┌──┴──┐ │ ┌──┴──┐ │ │
│ │Live │ │ │Post │ │ │
│ │Kit │ │ │fix │ │ │
│ └─────┘ │ └─────┘ │ │
│ │ │ │
│ ┌──┴──┐ │ │
│ │Hive │ ◄── sync ──►│ │
│ └──┬──┘ │ │
│ │ │ │
┌─────┴───────────────┴────────────────┴───────┴─────┐
│ │
┌───┴────┐ ┌─────────┐ ┌───────┐ ┌──────────────────┐ │
│Postgres│ │SeaweedFS│ │ Redis │ │ OpenSearch │ │
│ (CNPG) │ │ (S3) │ │ │ │ │ │
└────────┘ └─────────┘ └───────┘ └──────────────────┘ │
│ │
│ ┌──────────────────────┐ │
│ │ Ory Kratos/Hydra │◄───── all apps ────┘
│ │ (auth.sunbeam.*) │ via OIDC
│ └──────────────────────┘
└──── barman ──── Scaleway Object Storage (backups)
Scaleway Generative APIs (AI)
│ HTTPS
└── Docs, Messages, Conversations
```
---
## 17. Open Questions
- **Game build pipeline details** — Gitea Actions handles lightweight CI (compiles, tests, linting). Platform-specific builds (console SDKs, platform cert signing) offloaded to external providers. All build artifacts land in SeaweedFS. Exact pipeline TBD as game toolchain solidifies.
- **Drive REST API surface** — Hive's Drive client depends on Drive's exact file list/upload/download endpoints. Need to read Drive source to confirm: pagination strategy, file version handling, multipart upload support, how folder hierarchy is represented in API responses.
---
## Appendix: Repository References
| Component | Repository | License |
|---|---|---|
| Docs | `github.com/suitenumerique/docs` | MIT |
| Meet | `github.com/suitenumerique/meet` | MIT |
| Drive | `github.com/suitenumerique/drive` | MIT |
| Messages | `github.com/suitenumerique/messages` | MIT |
| Conversations | `github.com/suitenumerique/conversations` | MIT |
| People | `github.com/suitenumerique/people` | MIT |
| Integration bar | `github.com/suitenumerique/integration` | MIT |
| Django shared lib | `github.com/suitenumerique/django-lasuite` | MIT |
| Ory Kratos | `github.com/ory/kratos` | Apache 2.0 |
| Ory Hydra | `github.com/ory/hydra` | Apache 2.0 |
| SeaweedFS | `github.com/seaweedfs/seaweedfs` | Apache 2.0 |
| CloudNativePG | `github.com/cloudnative-pg/cloudnative-pg` | Apache 2.0 |
| Linkerd | `github.com/linkerd/linkerd2` | Apache 2.0 |
| Pingora | `github.com/cloudflare/pingora` | Apache 2.0 |
| Gitea | `github.com/go-gitea/gitea` | MIT |
| LiveKit | `github.com/livekit/livekit` | Apache 2.0 |