docs: add ops runbook — When Things Go Sideways, Gorgeous 🚨

Diagnostic ladder, COE format, runbooks for backup/restore, secret rotation, cert renewal, database recovery, Sol☀️ restart, alerts.
2026-03-24 11:47:08 +00:00
parent 2f7785774b
commit e0afd0a4d7
1 changed files with 239 additions and 0 deletions
--- a/docs/ops.md
+++ b/docs/ops.md
@@ -0,0 +1,239 @@
+# When Things Go Sideways, Gorgeous 🚨
+
+Things break. Servers misbehave. Secrets expire. It's not a matter of if — it's a matter of having a plan when it happens. This is the ops playbook for The Super Boujee Business Box ✨.
+
+---
+
+## The Diagnostic Ladder
+
+When something's wrong, check in this order:
+
+```bash
+# 1. What's actually running?
+sunbeam status
+
+# 2. What's unhealthy?
+sunbeam check
+
+# 3. What are the logs saying?
+sunbeam logs {namespace}/{service} -f
+
+# 4. Is OpenBao sealed? (breaks everything if yes)
+sunbeam vault status
+
+# 5. Is the database up?
+sunbeam check data
+
+# 6. Check AlertManager for firing alerts
+sunbeam mon grafana alerts
+
+# 7. Check recent error-level logs across everything
+sunbeam mon loki logs '{level="error"}'
+```
+
+If OpenBao is sealed, unseal it first — almost everything depends on it for credentials.
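+
+The ladder above can also be scripted so it stops at the first failing rung. This is a minimal sketch, assuming each `sunbeam` subcommand exits non-zero when its check fails (not confirmed here); `run_ladder` is a hypothetical helper name, not part of the CLI.
+
+```bash
+# Sketch only: assumes `sunbeam` subcommands exit non-zero on failure.
+# run_ladder is a hypothetical helper, not a sunbeam feature.
+run_ladder() {
+  local step
+  # Same order as the ladder, skipping the log-tailing and alert steps.
+  for step in "status" "check" "vault status" "check data"; do
+    echo "==> sunbeam ${step}"
+    # Intentionally unquoted so "vault status" splits into two arguments.
+    if ! sunbeam ${step}; then
+      echo "Stopped at: sunbeam ${step}" >&2
+      return 1
+    fi
+  done
+  echo "Ladder passed; move on to logs and alerts."
+}
+```
+
+Call `run_ladder` from an interactive shell; on a non-zero return, the last line names the rung to investigate.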
+
+---
+
+## COE Format (Cause of Error)
+
+When incidents happen, we write them up. Not as blame — as learning. We've all broken prod at 2am, darling, the point is we don't break it the same way twice. COEs live in `docs/ops/` and follow this format:
+
+```markdown
+# COE-YYYY-NNN: Brief Title
+
+**Date:** YYYY-MM-DD
+**Duration:** How long it lasted
+**Severity:** P1/P2/P3
+**Author:** Who wrote it up
+
+## Summary
+One paragraph: what happened, who was affected, how it was resolved.
+
+## Timeline
+Chronological events with timestamps.
+
+## Root Cause
+What actually went wrong (not symptoms — the cause).
+
+## Resolution
+What was done to fix it.
+
+## Lessons Learned
+What we'll do differently.
+
+## Action Items
+- [ ] Specific follow-up tasks with owners
+```
+
+**Existing COEs:**
+- [COE-2026-001: Vault Root Token Loss](ops/COE-2026-001-vault-root-token-loss.md)
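+
+A small helper can pick the next COE number instead of counting by hand. This is a minimal sketch, assuming files keep the `COE-YYYY-NNN-slug.md` naming used by the existing COE; `next_coe_id` is a hypothetical helper, not an existing script.
+
+```bash
+# Sketch only: next_coe_id is a hypothetical helper.
+# Usage: next_coe_id <dir> <year>
+next_coe_id() {
+  local dir="$1" year="$2" last=0 f n
+  for f in "$dir"/COE-"$year"-[0-9][0-9][0-9]*.md; do
+    [ -e "$f" ] || continue            # glob matched nothing
+    n="${f##*/COE-${year}-}"           # drop everything up to the number
+    n="${n%%[!0-9]*}"                  # keep only the leading digits
+    while [ "${n#0}" != "$n" ]; do n="${n#0}"; done   # strip leading zeros
+    n="${n:-0}"
+    if [ "$n" -gt "$last" ]; then last="$n"; fi
+  done
+  printf 'COE-%s-%03d\n' "$year" "$((last + 1))"
+}
+```
+
+For example, `next_coe_id docs/ops 2026` prints the ID to use in the new file's name and title.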
+
+---
+
+## Common Runbooks
+
+### Backup & Restore (PostgreSQL)
+
+**Backup:** Automated via barman → Scaleway Object Storage. 30-day retention.
+
+```bash
+# Check backup status
+sunbeam k8s get pods -n data -l app=barman
+
+# Manual backup trigger
+sunbeam k8s exec -n data barman-0 -- barman backup postgres
+```
+
+**Restore:**
+```bash
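+# CloudNativePG handles point-in-time recovery (PITR).
+# Recovery bootstraps a new Cluster from the barman backups; it is not applied
+# in place, so define a Cluster with a recovery bootstrap and a recovery target.
+# See the CloudNativePG docs for PITR configuration.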
+```
+
+> **Known gap:** 30-day retention doesn't meet 7-year cold storage requirements for financial records. This needs addressing — see backup retention project notes.
+
+---
+
+### Secret Rotation
+
+**Dynamic secrets (automatic):**
+Database credentials rotate every 5 minutes via Vault Secrets Operator. No manual intervention. The operator:
+1. Requests new creds from OpenBao
+2. Updates the K8s Secret
+3. Triggers rollout restart on affected Deployments
+
+**Static secrets (manual when needed):**
+```bash
+# Rotate a static secret
+sunbeam vault kv put secret/{app}/{key} value="new-value"
+# VSO picks up changes within 30 seconds
+
+# Rotate Django secret key (causes session invalidation)
+sunbeam vault kv put secret/messages/django-secret DJANGO_SECRET_KEY="$(openssl rand -hex 32)"
+```
+
+**OpenBao root token:**
+The root token is critical. If lost, you need to reinitialize (see COE-2026-001). Keep it somewhere safe and offline.
+
+```bash
+# Check seal status
+sunbeam vault status
+
+# Unseal (needs 3 of 5 keys)
+sunbeam vault unseal <key1>
+sunbeam vault unseal <key2>
+sunbeam vault unseal <key3>
+```
+
+---
+
+### Certificate Renewal
+
+**Automatic:** cert-manager renews Let's Encrypt certs before expiry. Pingora watches the K8s Secret and hot-reloads. Zero downtime.
+
+**If auto-renewal fails:**
+```bash
+# Check cert-manager logs
+sunbeam logs cert-manager/cert-manager
+
+# Check certificate status
+sunbeam k8s get certificates -A
+sunbeam k8s describe certificate -n ingress pingora-tls
+
+# Force renewal
+sunbeam k8s delete certificate -n ingress pingora-tls
+# cert-manager will recreate it
+```
+
+**ACME challenge issues:** Pingora routes HTTP-01 challenges by watching Ingress objects. If challenges fail, check that port 80 is reachable from the internet and that Pingora is routing `/.well-known/acme-challenge/` correctly.
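+
+To catch a stalled renewal before the cert actually expires, the served certificate can be checked against a time window. This is a minimal sketch: `cert_expires_within` is a hypothetical helper, while `openssl x509 -checkend` is a standard OpenSSL option. The Secret name and the `sunbeam k8s get secret ... -o jsonpath` passthrough in the usage comment are assumptions, not confirmed by this runbook.
+
+```bash
+# Sketch only: exit 0 if the PEM certificate on stdin expires within N days.
+# cert_expires_within is a hypothetical helper.
+cert_expires_within() {
+  local days="$1"
+  # -checkend exits 0 when the cert is still valid after the given seconds,
+  # so invert it to answer "does it expire inside this window?".
+  # Unreadable input also counts as "expiring", which fails safe.
+  ! openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1
+}
+
+# Usage (Secret name and kubectl-style flags are assumptions; adjust as needed):
+#   sunbeam k8s get secret -n ingress pingora-tls -o jsonpath='{.data.tls\.crt}' \
+#     | base64 -d | cert_expires_within 14 && echo "renewal is overdue"
+```
+
+Let's Encrypt certs are normally renewed well before the last two weeks, so a hit on a 14-day window suggests cert-manager is stuck rather than merely waiting.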
+
+---
+
+### Database Recovery
+
+**Pod restart (transient failure):**
+```bash
+sunbeam restart data/postgres
+```
+
+**Data corruption (use backups):**
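+CloudNativePG supports point-in-time recovery from barman backups. Recovery bootstraps a new cluster rather than repairing the existing one in place: create a Cluster resource whose bootstrap section uses recovery and specifies the target time.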
+
+**Connection exhaustion:**
+Check connection counts in Grafana (`dashboards-infrastructure`). All apps use connection pooling — if connections are exhausted, something is leaking.
+
+---
+
+### SeaweedFS Recovery
+
+**Volume server down:**
+```bash
+sunbeam logs storage/seaweedfs-volume
+sunbeam restart storage/seaweedfs-volume
+```
+
+**Filer issues (S3 API):**
+```bash
+sunbeam check storage
+sunbeam logs storage/seaweedfs-filer
+```
+
+**Data integrity:** Volumes live on local NVMe. No replication on single-node — if the disk dies, data is gone. Keep critical assets in multiple places (Drive + S3 via Hive, Git LFS in Gitea). Redundancy is self-care. 💅
+
+---
+
+### Sol☀️ Restart / Conversation Reset
+
+Sol☀️ uses SQLite for conversation state. On restart:
+
+1. Sol☀️ backfills recent messages from the OpenSearch archive
+2. Recreates the orchestrator agent if the system prompt has changed
+3. Sends `*sneezes*` to all rooms to signal the hiccup
+
+```bash
+# Normal restart
+sunbeam restart matrix/sol
+
+# Check Sol☀️ is healthy
+sunbeam logs matrix/sol -f
+sunbeam check matrix
+```
+
+If conversations seem confused after restart, Sol☀️ will auto-compact and recover. The `*sneezes*` message is intentional — it tells the team "I restarted, give me a moment." They bounce back fast.
+
+---
+
+## Alerts → Matrix
+
+AlertManager routes all alerts to a Matrix room via the `matrix-alertmanager-receiver`:
+
+```text
+Prometheus alert fires
+  → AlertManager evaluates severity + routing
+  → Webhook POST to matrix-alertmanager-receiver
+  → Bot formats alert and posts to Matrix room
+  → Team sees it in chat
+```
+
+The bot account credentials are in `matrix-bot-secret`. If alerts stop appearing in Matrix, check:
+
+```bash
+sunbeam logs monitoring/matrix-alertmanager-receiver
+sunbeam check monitoring
+```
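+
+When the logs look clean but the room stays quiet, a synthetic alert exercises the whole chain end to end. This is a minimal sketch against Alertmanager's v2 HTTP API (`POST /api/v2/alerts`); `ALERTMANAGER_URL`, the helper names, and the `RunbookTestAlert` alert name are assumptions for illustration, and the default URL presumes a local port-forward to Alertmanager's standard port 9093.
+
+```bash
+# Sketch only: helper names and ALERTMANAGER_URL are assumptions.
+test_alert_payload() {
+  # The v2 API takes a JSON array of alerts; only labels are required.
+  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"Synthetic alert, safe to ignore"}}]\n' "$1"
+}
+
+send_test_alert() {
+  test_alert_payload "${1:-RunbookTestAlert}" |
+    curl -fsS -X POST -H 'Content-Type: application/json' \
+      --data-binary @- "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
+}
+```
+
+If `send_test_alert` succeeds but nothing shows up in Matrix, the break is downstream of AlertManager (the webhook or the bot); if the request itself fails, start with AlertManager.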
+
+---
+
+## Emergency Quick Reference
+
+| Situation | What to do |
+|-----------|-----------|
+| Server unreachable | Check Scaleway console, reboot if needed |
+| OpenBao sealed | Unseal with 3 of 5 keys |
+| Database down | Check CloudNativePG operator logs, restart if needed |
+| All apps 502 | Check Pingora + Linkerd |
+| Email not sending | Check Postfix MTA-out, Scaleway TEM status |
+| Sol☀️ unresponsive | Restart, check Mistral API connectivity |
+| Certs expired | Delete cert resource, let cert-manager recreate |