From e0afd0a4d757eb78cdfde1b894bf0dba84e0047b Mon Sep 17 00:00:00 2001
From: Sienna Meridian Satterwhite
Date: Tue, 24 Mar 2026 11:47:08 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20add=20ops=20runbook=20=E2=80=94=20When?=
 =?UTF-8?q?=20Things=20Go=20Sideways,=20Gorgeous=20=F0=9F=9A=A8?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Diagnostic ladder, COE format, runbooks for backup/restore, secret
rotation, cert renewal, database recovery, Sol☀️ restart, alerts.
---
 docs/ops.md | 239 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 239 insertions(+)
 create mode 100644 docs/ops.md

diff --git a/docs/ops.md b/docs/ops.md
new file mode 100644
index 0000000..5296c2f
--- /dev/null
+++ b/docs/ops.md
@@ -0,0 +1,239 @@
+# When Things Go Sideways, Gorgeous 🚨
+
+Things break. Servers misbehave. Secrets expire. It's not a matter of if — it's a matter of having a plan when it happens. This is the ops playbook for The Super Boujee Business Box ✨.
+
+---
+
+## The Diagnostic Ladder
+
+When something's wrong, check in this order:
+
+```bash
+# 1. What's actually running?
+sunbeam status
+
+# 2. What's unhealthy?
+sunbeam check
+
+# 3. What are the logs saying?
+sunbeam logs {namespace}/{service} -f
+
+# 4. Is OpenBao sealed? (breaks everything if yes)
+sunbeam vault status
+
+# 5. Is the database up?
+sunbeam check data
+
+# 6. Check AlertManager for firing alerts
+sunbeam mon grafana alerts
+
+# 7. Check recent logs across everything
+sunbeam mon loki logs '{level="error"}'
+```
+
+If OpenBao is sealed, unseal it first — almost everything depends on it for credentials.
+
+---
+
+## COE Format (Cause of Error)
+
+When incidents happen, we write them up. Not as blame — as learning. We've all broken prod at 2am, darling, the point is we don't break it the same way twice. COEs live in `docs/ops/` and follow this format:
+
+```markdown
+# COE-YYYY-NNN: Brief Title
+
+**Date:** YYYY-MM-DD
+**Duration:** How long it lasted
+**Severity:** P1/P2/P3
+**Author:** Who wrote it up
+
+## Summary
+One paragraph: what happened, who was affected, how it was resolved.
+
+## Timeline
+Chronological events with timestamps.
+
+## Root Cause
+What actually went wrong (not symptoms — the cause).
+
+## Resolution
+What was done to fix it.
+
+## Lessons Learned
+What we'll do differently.
+
+## Action Items
+- [ ] Specific follow-up tasks with owners
+```
+
+**Existing COEs:**
+- [COE-2026-001: Vault Root Token Loss](ops/COE-2026-001-vault-root-token-loss.md)
+
+---
+
+## Common Runbooks
+
+### Backup & Restore (PostgreSQL)
+
+**Backup:** Automated via barman → Scaleway Object Storage. 30-day retention.
+
+```bash
+# Check backup status
+sunbeam k8s get pods -n data -l app=barman
+
+# Manual backup trigger
+sunbeam k8s exec -n data barman-0 -- barman backup postgres
+```
+
+**Restore:**
+```bash
+# CloudNativePG handles point-in-time recovery
+# Edit the Cluster resource to specify a recovery target
+# See CloudNativePG docs for PITR configuration
+```
+
+> **Known gap:** 30-day retention doesn't meet 7-year cold storage requirements for financial records. This needs addressing — see backup retention project notes.
+
+---
+
+### Secret Rotation
+
+**Dynamic secrets (automatic):**
+Database credentials rotate every 5 minutes via Vault Secrets Operator. No manual intervention. The operator:
+1. Requests new creds from OpenBao
+2. Updates the K8s Secret
+3. Triggers rollout restart on affected Deployments
+
+**Static secrets (manual when needed):**
+```bash
+# Rotate a static secret
+sunbeam vault kv put secret/{app}/{key} value="new-value"
+# VSO picks up changes within 30 seconds
+
+# Rotate Django secret key (causes session invalidation)
+sunbeam vault kv put secret/messages/django-secret DJANGO_SECRET_KEY="$(openssl rand -hex 32)"
+```
+
+**OpenBao root token:**
+The root token is critical. If lost, you need to reinitialize (see COE-2026-001). Keep it somewhere safe and offline.
+
+```bash
+# Check seal status
+sunbeam vault status
+
+# Unseal (needs 3 of 5 keys)
+sunbeam vault unseal
+sunbeam vault unseal
+sunbeam vault unseal
+```
+
+---
+
+### Certificate Renewal
+
+**Automatic:** cert-manager renews Let's Encrypt certs before expiry. Pingora watches the K8s Secret and hot-reloads. Zero downtime.
+
+**If auto-renewal fails:**
+```bash
+# Check cert-manager logs
+sunbeam logs cert-manager/cert-manager
+
+# Check certificate status
+sunbeam k8s get certificates -A
+sunbeam k8s describe certificate -n ingress pingora-tls
+
+# Force renewal
+sunbeam k8s delete certificate -n ingress pingora-tls
+# cert-manager will recreate it
+```
+
+**ACME challenge issues:** Pingora routes HTTP-01 challenges by watching Ingress objects. If challenges fail, check that port 80 is open and Pingora is routing `.well-known/acme-challenge/` correctly.
+
+---
+
+### Database Recovery
+
+**Pod restart (transient failure):**
+```bash
+sunbeam restart data/postgres
+```
+
+**Data corruption (use backups):**
+CloudNativePG supports point-in-time recovery from barman backups. Edit the Cluster resource with a recovery section specifying the target time.
+
+**Connection exhaustion:**
+Check connection counts in Grafana (`dashboards-infrastructure`). All apps use connection pooling — if connections are exhausted, something is leaking.
+
+---
+
+### SeaweedFS Recovery
+
+**Volume server down:**
+```bash
+sunbeam logs storage/seaweedfs-volume
+sunbeam restart storage/seaweedfs-volume
+```
+
+**Filer issues (S3 API):**
+```bash
+sunbeam check storage
+sunbeam logs storage/seaweedfs-filer
+```
+
+**Data integrity:** Volumes live on local NVMe. No replication on single-node — if the disk dies, data is gone. Keep critical assets in multiple places (Drive + S3 via Hive, Git LFS in Gitea). Redundancy is self-care. 💅
+
+---
+
+### Sol☀️ Restart / Conversation Reset
+
+Sol☀️ uses SQLite for conversation state. On restart:
+
+1. Sol☀️ backfills recent messages from the OpenSearch archive
+2. Recreates the orchestrator agent if the system prompt has changed
+3. Sends `*sneezes*` to all rooms to signal the hiccup
+
+```bash
+# Normal restart
+sunbeam restart matrix/sol
+
+# Check Sol☀️ is healthy
+sunbeam logs matrix/sol -f
+sunbeam check matrix
+```
+
+If conversations seem confused after restart, Sol☀️ will auto-compact and recover. The `*sneezes*` message is intentional — it tells the team "I restarted, give me a moment." They bounce back fast.
+
+---
+
+## Alerts → Matrix
+
+AlertManager routes all alerts to a Matrix room via the `matrix-alertmanager-receiver`:
+
+```
+Prometheus alert fires
+  → AlertManager evaluates severity + routing
+  → Webhook POST to matrix-alertmanager-receiver
+  → Bot formats alert and posts to Matrix room
+  → Team sees it in chat
+```
+
+The bot account credentials are in `matrix-bot-secret`. If alerts stop appearing in Matrix, check:
+
+```bash
+sunbeam logs monitoring/matrix-alertmanager-receiver
+sunbeam check monitoring
+```
+
+---
+
+## Emergency Contacts
+
+| Situation | What to do |
+|-----------|-----------|
+| Server unreachable | Check Scaleway console, reboot if needed |
+| OpenBao sealed | Unseal with 3 of 5 keys |
+| Database down | Check CloudNativePG operator logs, restart if needed |
+| All apps 502 | Check Pingora + Linkerd |
+| Email not sending | Check Postfix MTA-out, Scaleway TEM status |
+| Sol☀️ unresponsive | Restart, check Mistral API connectivity |
+| Certs expired | Delete cert resource, let cert-manager recreate |
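
For the "Certs expired" row in the Emergency Contacts table, it may help to confirm how close a certificate really is to expiry before deleting anything. The sketch below is not part of the sunbeam tooling: `cert_expires_within` is a hypothetical helper name, and it assumes `openssl` is installed and that the certificate is already available as a local PEM file.

```bash
# Hypothetical helper (not a sunbeam command).
# Exit status: 0 = cert expires within the given number of days (or has already expired),
#              1 = cert is still valid beyond that window,
#              2 = file is unreadable or is not a parseable certificate.
cert_expires_within() {
  local pem_file="$1" days="$2"

  # Refuse to guess when the input is missing or not a certificate.
  [ -r "$pem_file" ] || { echo "cannot read $pem_file" >&2; return 2; }
  openssl x509 -in "$pem_file" -noout >/dev/null 2>&1 || {
    echo "$pem_file is not a PEM certificate" >&2
    return 2
  }

  # `openssl x509 -checkend N` exits 0 when the cert is still valid N seconds from now.
  if openssl x509 -in "$pem_file" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example call (hypothetical path):
#   cert_expires_within ./tls.crt 14 && echo "expires within 14 days"
```

One possible way to get the PEM file, assuming the TLS Secret shares the Certificate's name (`pingora-tls` in namespace `ingress`) and that `sunbeam k8s` passes `-o jsonpath` through unchanged, is to read `tls.crt` from that Secret and base64-decode it to a local file first. Neither assumption is confirmed by this runbook.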