sbbb/docs/ops.md
Sienna Meridian Satterwhite e0afd0a4d7 docs: add ops runbook — When Things Go Sideways, Gorgeous 🚨
Diagnostic ladder, COE format, runbooks for backup/restore, secret
rotation, cert renewal, database recovery, Sol☀️ restart, alerts.
2026-03-24 11:47:08 +00:00

# When Things Go Sideways, Gorgeous 🚨
Things break. Servers misbehave. Secrets expire. It's not a matter of if — it's a matter of having a plan when it happens. This is the ops playbook for The Super Boujee Business Box ✨.
---
## The Diagnostic Ladder
When something's wrong, check in this order:
```bash
# 1. What's actually running?
sunbeam status
# 2. What's unhealthy?
sunbeam check
# 3. What are the logs saying?
sunbeam logs {namespace}/{service} -f
# 4. Is OpenBao sealed? (breaks everything if yes)
sunbeam vault status
# 5. Is the database up?
sunbeam check data
# 6. Check AlertManager for firing alerts
sunbeam mon grafana alerts
# 7. Check recent error logs across everything
sunbeam mon loki logs '{level="error"}'
```
If OpenBao is sealed, unseal it first — almost everything depends on it for credentials.
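The non-interactive rungs of the ladder can be scripted. This is a minimal sketch, not something `sunbeam` ships: `run_ladder` is a hypothetical helper, and it assumes each command exits non-zero when it finds a problem, which this runbook does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run the non-interactive ladder checks in order and
# stop at the first one that fails.
# ASSUMPTION: each sunbeam command exits non-zero when something is wrong.
run_ladder() {
  local step=0 cmd
  for cmd in "sunbeam status" "sunbeam check" "sunbeam vault status" "sunbeam check data"; do
    step=$((step + 1))
    echo "== ${step}. ${cmd}"
    # Unquoted on purpose: split the string into command + arguments.
    if ! $cmd; then
      echo "Stopped at check ${step}: ${cmd}" >&2
      return "$step"
    fi
  done
  echo "All checks passed"
}
```

The log-tailing steps are left out because `-f` never returns. Run `run_ladder` first, then go read logs for whichever check it stopped on.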
---
## COE Format (Cause of Error)
When incidents happen, we write them up. Not as blame — as learning. We've all broken prod at 2am, darling, the point is we don't break it the same way twice. COEs live in `docs/ops/` and follow this format:
```markdown
# COE-YYYY-NNN: Brief Title
**Date:** YYYY-MM-DD
**Duration:** How long it lasted
**Severity:** P1/P2/P3
**Author:** Who wrote it up
## Summary
One paragraph: what happened, who was affected, how it was resolved.
## Timeline
Chronological events with timestamps.
## Root Cause
What actually went wrong (not symptoms — the cause).
## Resolution
What was done to fix it.
## Lessons Learned
What we'll do differently.
## Action Items
- [ ] Specific follow-up tasks with owners
```
**Existing COEs:**
- [COE-2026-001: Vault Root Token Loss](ops/COE-2026-001-vault-root-token-loss.md)
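
Scaffolding a new COE from the template can be automated. A minimal sketch: `new_coe` is a hypothetical helper, and it assumes the `COE-YYYY-NNN-slug.md` naming seen in the existing entry is the convention going forward.

```bash
#!/usr/bin/env bash
# Hypothetical helper: scaffold the next COE file in a directory.
# ASSUMPTION: files are named COE-YYYY-NNN-slug.md, as in the existing entry.
new_coe() {
  local dir="$1" year="$2" title="$3"
  local last=0 f n id slug path
  for f in "$dir"/COE-"$year"-[0-9][0-9][0-9]-*.md; do
    [ -e "$f" ] || continue          # glob matched nothing
    n="${f##*/}"                     # file name only
    n="${n#COE-"$year"-}"            # drop the "COE-YYYY-" prefix
    n=$((10#${n%%-*}))               # keep NNN, force base 10 (avoid octal)
    if [ "$n" -gt "$last" ]; then last="$n"; fi
  done
  id="$(printf 'COE-%s-%03d' "$year" "$((last + 1))")"
  slug="$(printf '%s' "$title" | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')"
  path="$dir/$id-$slug.md"
  cat > "$path" <<EOF
# $id: $title
**Date:** $(date +%F)
**Duration:**
**Severity:**
**Author:**
## Summary
## Timeline
## Root Cause
## Resolution
## Lessons Learned
## Action Items
- [ ]
EOF
  printf '%s\n' "$path"
}
```

For example, `new_coe docs/ops 2026 "Postgres Disk Full"` would create the next numbered file for 2026 and print its path.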
---
## Common Runbooks
### Backup & Restore (PostgreSQL)
**Backup:** Automated via barman → Scaleway Object Storage. 30-day retention.
```bash
# Check backup status
sunbeam k8s get pods -n data -l app=barman
# Manual backup trigger
sunbeam k8s exec -n data barman-0 -- barman backup postgres
```
**Restore:**
```bash
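# CloudNativePG handles point-in-time recovery (PITR).
# Recovery bootstraps a new Cluster from the barman backups (it is not done in place):
# define a Cluster with bootstrap.recovery and a recoveryTarget.
# See the CloudNativePG docs for PITR configuration.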
```
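
To make the restore step more concrete, here is a hedged sketch that prints the recovery-related part of a Cluster manifest. CloudNativePG's documentation describes recovery as bootstrapping a new Cluster from the backup. `render_pitr_cluster` and all four arguments are hypothetical, and the output is deliberately incomplete: storage, instance count, the object store endpoint, and credentials have to come from your existing Cluster definition.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a CloudNativePG Cluster manifest that recovers
# to a point in time. Every argument is a placeholder you must supply.
# ASSUMPTION: backups are reachable through a barmanObjectStore external
# cluster; credentials and endpoint settings are deliberately left out.
render_pitr_cluster() {
  local new_name="$1" source_name="$2" target_time="$3" dest_path="$4"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${new_name}
  namespace: data
spec:
  # instances, storage, etc.: copy from the existing Cluster definition
  bootstrap:
    recovery:
      source: ${source_name}
      recoveryTarget:
        targetTime: "${target_time}"   # RFC 3339 timestamp
  externalClusters:
    - name: ${source_name}
      barmanObjectStore:
        destinationPath: ${dest_path}
        # endpointURL and s3Credentials go here; see the CloudNativePG docs
EOF
}
```

Review the rendered YAML and fill in the missing fields before applying it. Pick a `targetTime` just before the incident started.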
> **Known gap:** 30-day retention doesn't meet 7-year cold storage requirements for financial records. This needs addressing — see backup retention project notes.
---
### Secret Rotation
**Dynamic secrets (automatic):**
Database credentials rotate every 5 minutes via Vault Secrets Operator. No manual intervention. The operator:
1. Requests new creds from OpenBao
2. Updates the K8s Secret
3. Triggers rollout restart on affected Deployments
**Static secrets (manual when needed):**
```bash
# Rotate a static secret
sunbeam vault kv put secret/{app}/{key} value="new-value"
# VSO picks up changes within 30 seconds
# Rotate Django secret key (causes session invalidation)
sunbeam vault kv put secret/messages/django-secret DJANGO_SECRET_KEY="$(openssl rand -hex 32)"
```
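
`openssl rand -hex 32` emits 32 random bytes as 64 hex characters. If you want a guard against writing an empty or truncated value, something like the following could wrap the rotation. Both function names are hypothetical, and the wrapper assumes `sunbeam vault kv put` accepts `field=value` exactly as in the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: generate N random bytes as lowercase hex.
# Falls back to /dev/urandom if openssl is not installed.
gen_hex_secret() {
  local bytes="${1:-32}"
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex "$bytes"
  else
    head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

# Hypothetical wrapper: refuse to write a value that is not the expected length.
rotate_hex_secret() {
  local path="$1" field="$2" value
  value="$(gen_hex_secret 32)"
  if [ "${#value}" -ne 64 ]; then
    echo "refusing to write: expected 64 hex chars, got ${#value}" >&2
    return 1
  fi
  sunbeam vault kv put "$path" "${field}=${value}"
}
```

Usage would look like `rotate_hex_secret secret/messages/django-secret DJANGO_SECRET_KEY`.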
**OpenBao root token:**
The root token is critical. If lost, you need to reinitialize (see COE-2026-001). Keep it somewhere safe and offline.
```bash
# Check seal status
sunbeam vault status
# Unseal (needs 3 of 5 keys)
sunbeam vault unseal <key1>
sunbeam vault unseal <key2>
sunbeam vault unseal <key3>
```
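
Typing unseal keys directly on the command line leaves them in shell history. A small sketch that prompts for them instead; `unseal_prompt` is a hypothetical helper, not part of `sunbeam`, and the key is still passed as an argument, so it remains visible in the process list while the command runs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read unseal keys without echoing them or leaving
# them in shell history, then pass each one to `sunbeam vault unseal`.
# NOTE: the key is still visible in the process list while the command runs.
unseal_prompt() {
  local needed="${1:-3}" i key
  for ((i = 1; i <= needed; i++)); do
    printf 'Unseal key %d of %d: ' "$i" "$needed" >&2
    IFS= read -rs key || return 1     # -s: do not echo on a terminal
    echo >&2
    sunbeam vault unseal "$key" || return 1
  done
}
```

Call `unseal_prompt 3`, paste each key when asked, then confirm with `sunbeam vault status`.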
---
### Certificate Renewal
**Automatic:** cert-manager renews Let's Encrypt certs before expiry. Pingora watches the K8s Secret and hot-reloads. Zero downtime.
**If auto-renewal fails:**
```bash
# Check cert-manager logs
sunbeam logs cert-manager/cert-manager
# Check certificate status
sunbeam k8s get certificates -A
sunbeam k8s describe certificate -n ingress pingora-tls
# Force renewal
sunbeam k8s delete certificate -n ingress pingora-tls
# cert-manager will recreate it
```
**ACME challenge issues:** Pingora routes HTTP-01 challenges by watching Ingress objects. If challenges fail, check that port 80 is open and that Pingora is routing `/.well-known/acme-challenge/` correctly.
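
To see how close a certificate is to expiry without waiting for browsers to complain, `openssl x509 -checkend` can be wrapped as below. A sketch only: `cert_expires_within` is a hypothetical helper, and it works on a PEM file you have already exported. The name of the K8s Secret that holds the Pingora cert is not stated in this runbook, so fetching it is left to you.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the PEM certificate in $1 expires within
# $2 days. `openssl x509 -checkend N` exits 0 when the certificate is still
# valid N seconds from now, so the result is inverted here.
# NOTE: an unreadable or invalid file also counts as "expiring".
cert_expires_within() {
  local pem="$1" days="$2"
  ! openssl x509 -in "$pem" -noout -checkend "$((days * 86400))" >/dev/null 2>&1
}
```

For example, `cert_expires_within /tmp/pingora.pem 14 && echo "renew soon"` flags anything inside a two-week window.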
---
### Database Recovery
**Pod restart (transient failure):**
```bash
sunbeam restart data/postgres
```
**Data corruption (use backups):**
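CloudNativePG supports point-in-time recovery from barman backups. Define a Cluster resource with a `bootstrap.recovery` section specifying the target time; CloudNativePG restores into a new cluster rather than repairing the existing one in place.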
**Connection exhaustion:**
Check connection counts in Grafana (`dashboards-infrastructure`). All apps use connection pooling — if connections are exhausted, something is leaking.
---
### SeaweedFS Recovery
**Volume server down:**
```bash
sunbeam logs storage/seaweedfs-volume
sunbeam restart storage/seaweedfs-volume
```
**Filer issues (S3 API):**
```bash
sunbeam check storage
sunbeam logs storage/seaweedfs-filer
```
**Data integrity:** Volumes live on local NVMe. No replication on single-node — if the disk dies, data is gone. Keep critical assets in multiple places (Drive + S3 via Hive, Git LFS in Gitea). Redundancy is self-care. 💅
---
### Sol☀ Restart / Conversation Reset
Sol☀ uses SQLite for conversation state. On restart:
1. Sol☀ backfills recent messages from the OpenSearch archive
2. Recreates the orchestrator agent if the system prompt has changed
3. Sends `*sneezes*` to all rooms to signal the hiccup
```bash
# Normal restart
sunbeam restart matrix/sol
# Check Sol☀ is healthy
sunbeam logs matrix/sol -f
sunbeam check matrix
```
If conversations seem confused after restart, Sol☀ will auto-compact and recover. The `*sneezes*` message is intentional — it tells the team "I restarted, give me a moment." They bounce back fast.
---
## Alerts → Matrix
AlertManager routes all alerts to a Matrix room via the `matrix-alertmanager-receiver`:
```
Prometheus alert fires
→ AlertManager evaluates severity + routing
→ Webhook POST to matrix-alertmanager-receiver
→ Bot formats alert and posts to Matrix room
→ Team sees it in chat
```
The bot account credentials are in `matrix-bot-secret`. If alerts stop appearing in Matrix, check:
```bash
sunbeam logs monitoring/matrix-alertmanager-receiver
sunbeam check monitoring
```
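
If the logs look clean, pushing a synthetic alert through the pipeline shows where delivery stops. The sketch below posts to Alertmanager's `/api/v2/alerts` endpoint. Both helpers are hypothetical, and the base URL is an assumption (for example, a local port-forward to the AlertManager service), since this runbook does not say how AlertManager is exposed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body for a synthetic alert.
# Inputs must not contain double quotes or backslashes (no escaping is done).
test_alert_payload() {
  local name="${1:-PipelineTest}" severity="${2:-info}"
  printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Synthetic alert to test Matrix delivery"}}]\n' \
    "$name" "$severity"
}

# Hypothetical helper: POST the payload to an Alertmanager base URL.
# ASSUMPTION: $1 is a reachable URL such as a local port-forward.
send_test_alert() {
  local base_url="$1"
  test_alert_payload "${2:-PipelineTest}" "${3:-info}" \
    | curl -fsS -X POST -H 'Content-Type: application/json' \
        --data-binary @- "${base_url}/api/v2/alerts"
}
```

If the alert shows up in AlertManager but not in Matrix, the problem is on the receiver or bot side. If it never reaches AlertManager, look upstream.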
---
## Emergency Contacts
| Situation | What to do |
|-----------|-----------|
| Server unreachable | Check Scaleway console, reboot if needed |
| OpenBao sealed | Unseal with 3 of 5 keys |
| Database down | Check CloudNativePG operator logs, restart if needed |
| All apps 502 | Check Pingora + Linkerd |
| Email not sending | Check Postfix MTA-out, Scaleway TEM status |
| Sol☀ unresponsive | Restart, check Mistral API connectivity |
| Certs expired | Delete cert resource, let cert-manager recreate |