# When Things Go Sideways, Gorgeous 🚨

Things break. Servers misbehave. Secrets expire. It's not a matter of if; it's a matter of having a plan when it happens. This is the ops playbook for The Super Boujee Business Box ✨.

---

## The Diagnostic Ladder

When something's wrong, check in this order:

```bash
# 1. What's actually running?
sunbeam status

# 2. What's unhealthy?
sunbeam check

# 3. What are the logs saying?
sunbeam logs {namespace}/{service} -f

# 4. Is OpenBao sealed? (breaks everything if yes)
sunbeam vault status

# 5. Is the database up?
sunbeam check data

# 6. Check AlertManager for firing alerts
sunbeam mon grafana alerts

# 7. Check recent logs across everything
sunbeam mon loki logs '{level="error"}'
```

If OpenBao is sealed, unseal it first: almost everything depends on it for credentials.

---

## COE Format (Cause of Error)

When incidents happen, we write them up. Not as blame, but as learning. We've all broken prod at 2am, darling, the point is we don't break it the same way twice.

COEs live in `docs/ops/` and follow this format:

```markdown
# COE-YYYY-NNN: Brief Title

**Date:** YYYY-MM-DD
**Duration:** How long it lasted
**Severity:** P1/P2/P3
**Author:** Who wrote it up

## Summary
One paragraph: what happened, who was affected, how it was resolved.

## Timeline
Chronological events with timestamps.

## Root Cause
What actually went wrong (not symptoms, the cause).

## Resolution
What was done to fix it.

## Lessons Learned
What we'll do differently.

## Action Items
- [ ] Specific follow-up tasks with owners
```

**Existing COEs:**

- [COE-2026-001: Vault Root Token Loss](ops/COE-2026-001-vault-root-token-loss.md)

---

## Common Runbooks

### Backup & Restore (PostgreSQL)

**Backup:** Automated via barman → Scaleway Object Storage. 30-day retention.

```bash
# Check backup status
sunbeam k8s get pods -n data -l app=barman

# Manual backup trigger
sunbeam k8s exec -n data barman-0 -- barman backup postgres
```

**Restore:**

```bash
# CloudNativePG handles point-in-time recovery
# Edit the Cluster resource to specify a recovery target
# See CloudNativePG docs for PITR configuration
```

> **Known gap:** 30-day retention doesn't meet 7-year cold storage requirements for financial records. This needs addressing; see backup retention project notes.

---

### Secret Rotation

**Dynamic secrets (automatic):**

Database credentials rotate every 5 minutes via Vault Secrets Operator. No manual intervention. The operator:

1. Requests new creds from OpenBao
2. Updates the K8s Secret
3. Triggers rollout restart on affected Deployments

**Static secrets (manual when needed):**

```bash
# Rotate a static secret
sunbeam vault kv put secret/{app}/{key} value="new-value"
# VSO picks up changes within 30 seconds

# Rotate Django secret key (causes session invalidation)
sunbeam vault kv put secret/messages/django-secret DJANGO_SECRET_KEY="$(openssl rand -hex 32)"
```

**OpenBao root token:**

The root token is critical. If lost, you need to reinitialize (see COE-2026-001). Keep it somewhere safe and offline.

```bash
# Check seal status
sunbeam vault status

# Unseal (needs 3 of 5 keys)
sunbeam vault unseal
sunbeam vault unseal
sunbeam vault unseal
```

---

### Certificate Renewal

**Automatic:** cert-manager renews Let's Encrypt certs before expiry. Pingora watches the K8s Secret and hot-reloads. Zero downtime.

**If auto-renewal fails:**

```bash
# Check cert-manager logs
sunbeam logs cert-manager/cert-manager

# Check certificate status
sunbeam k8s get certificates -A
sunbeam k8s describe certificate -n ingress pingora-tls

# Force renewal
sunbeam k8s delete certificate -n ingress pingora-tls
# cert-manager will recreate it
```

**ACME challenge issues:** Pingora routes HTTP-01 challenges by watching Ingress objects.
If challenges fail, check that port 80 is open and Pingora is routing `.well-known/acme-challenge/` correctly.

---

### Database Recovery

**Pod restart (transient failure):**

```bash
sunbeam restart data/postgres
```

**Data corruption (use backups):**

CloudNativePG supports point-in-time recovery from barman backups. Edit the Cluster resource with a recovery section specifying the target time.

**Connection exhaustion:**

Check connection counts in Grafana (`dashboards-infrastructure`). All apps use connection pooling, so if connections are exhausted, something is leaking.

---

### SeaweedFS Recovery

**Volume server down:**

```bash
sunbeam logs storage/seaweedfs-volume
sunbeam restart storage/seaweedfs-volume
```

**Filer issues (S3 API):**

```bash
sunbeam check storage
sunbeam logs storage/seaweedfs-filer
```

**Data integrity:**

Volumes live on local NVMe. No replication on single-node: if the disk dies, data is gone. Keep critical assets in multiple places (Drive + S3 via Hive, Git LFS in Gitea). Redundancy is self-care. 💅

---

### Sol☀️ Restart / Conversation Reset

Sol☀️ uses SQLite for conversation state. On restart:

1. Sol☀️ backfills recent messages from the OpenSearch archive
2. Recreates the orchestrator agent if the system prompt has changed
3. Sends `*sneezes*` to all rooms to signal the hiccup

```bash
# Normal restart
sunbeam restart matrix/sol

# Check Sol☀️ is healthy
sunbeam logs matrix/sol -f
sunbeam check matrix
```

If conversations seem confused after restart, Sol☀️ will auto-compact and recover. The `*sneezes*` message is intentional: it tells the team "I restarted, give me a moment." They bounce back fast.

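
The restart-then-verify pattern used in these runbooks can be wrapped in a small helper. This is a minimal sketch, not part of `sunbeam`: `restart_and_wait` is a hypothetical function name, and the loop assumes `sunbeam check <namespace>` exits non-zero while the namespace is unhealthy, which this playbook does not confirm. Verify the exit-code behavior before relying on it.

```bash
#!/usr/bin/env bash
# Sketch: restart a service, then poll its namespace health check.
# ASSUMPTION: `sunbeam check <namespace>` exits non-zero while unhealthy.
# restart_and_wait is a hypothetical helper, not a sunbeam command.

restart_and_wait() {
  local target="$1"          # namespace/service, e.g. matrix/sol
  local attempts="${2:-10}"  # health checks to try before giving up
  local delay="${3:-5}"      # seconds to wait between checks
  local namespace="${target%%/*}"
  local i=1

  sunbeam restart "$target" || return 1

  while [ "$i" -le "$attempts" ]; do
    if sunbeam check "$namespace" >/dev/null 2>&1; then
      echo "healthy after ${i} check(s): ${target}"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done

  echo "still unhealthy after ${attempts} check(s): ${target}" >&2
  return 1
}

# Usage: restart_and_wait matrix/sol 12 5
```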

---

## Alerts → Matrix

AlertManager routes all alerts to a Matrix room via the `matrix-alertmanager-receiver`:

```
Prometheus alert fires
  → AlertManager evaluates severity + routing
    → Webhook POST to matrix-alertmanager-receiver
      → Bot formats alert and posts to Matrix room
        → Team sees it in chat
```

The bot account credentials are in `matrix-bot-secret`.

If alerts stop appearing in Matrix, check:

```bash
sunbeam logs monitoring/matrix-alertmanager-receiver
sunbeam check monitoring
```

---

## Emergency Contacts

| Situation | What to do |
|-----------|-----------|
| Server unreachable | Check Scaleway console, reboot if needed |
| OpenBao sealed | Unseal with 3 of 5 keys |
| Database down | Check CloudNativePG operator logs, restart if needed |
| All apps 502 | Check Pingora + Linkerd |
| Email not sending | Check Postfix MTA-out, Scaleway TEM status |
| Sol☀️ unresponsive | Restart, check Mistral API connectivity |
| Certs expired | Delete cert resource, let cert-manager recreate |
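
For a quick first pass before reaching for the table above, the non-interactive rungs of the Diagnostic Ladder can be scripted. The sketch below is an illustration only: `run_step` and `diagnostic_ladder` are hypothetical helper names, and it assumes each `sunbeam` command exits non-zero when it finds a problem, which this playbook does not state. It skips the log-following and alert steps, since those need a human reading the output.

```bash
#!/usr/bin/env bash
# Sketch: run the non-interactive rungs of the Diagnostic Ladder and
# report how many failed.
# ASSUMPTION: each sunbeam command exits non-zero when it finds a problem.
# run_step and diagnostic_ladder are hypothetical helpers.

run_step() {
  local label="$1"
  shift
  echo "==> ${label}"
  if ! "$@"; then
    echo "FAILED: ${label}" >&2
    return 1
  fi
}

diagnostic_ladder() {
  local failed=0

  run_step "What's actually running?" sunbeam status || failed=$((failed + 1))
  run_step "What's unhealthy?" sunbeam check || failed=$((failed + 1))
  run_step "Is OpenBao sealed?" sunbeam vault status || failed=$((failed + 1))
  run_step "Is the database up?" sunbeam check data || failed=$((failed + 1))

  echo "${failed} step(s) failed"
  [ "$failed" -eq 0 ]
}
```

Running every step instead of stopping at the first failure is a deliberate choice here: a sealed OpenBao tends to make earlier checks fail too, so seeing all four results at once makes that pattern easier to spot.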