
Sienna Meridian Satterwhite e0afd0a4d7

docs: add ops runbook — When Things Go Sideways, Gorgeous 🚨

Diagnostic ladder, COE format, runbooks for backup/restore, secret
rotation, cert renewal, database recovery, Sol☀️ restart, alerts.

2026-03-24 11:47:08 +00:00

6.4 KiB


When Things Go Sideways, Gorgeous 🚨

Things break. Servers misbehave. Secrets expire. It's not a matter of if — it's a matter of having a plan when it happens. This is the ops playbook for The Super Boujee Business Box ✨.

The Diagnostic Ladder

When something's wrong, check in this order:

# 1. What's actually running?
sunbeam status

# 2. What's unhealthy?
sunbeam check

# 3. What are the logs saying?
sunbeam logs {namespace}/{service} -f

# 4. Is OpenBao sealed? (breaks everything if yes)
sunbeam vault status

# 5. Is the database up?
sunbeam check data

# 6. Check AlertManager for firing alerts
sunbeam mon grafana alerts

# 7. Check recent error-level logs across everything
sunbeam mon loki logs '{level="error"}'

If OpenBao is sealed, unseal it first — almost everything depends on it for credentials.
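The pass/fail rungs of the ladder can be chained so the first failing step is obvious. This is a minimal sketch, assuming each `sunbeam` subcommand exits non-zero when it finds a problem (the runbook does not state this). The `run_ladder` name is hypothetical. Steps 3, 6 and 7 are left out because they print output to read rather than a verdict.

```shell
#!/bin/sh
# Sketch only: assumes sunbeam subcommands exit non-zero on a problem.
# run_ladder is a hypothetical helper name, not part of sunbeam.
run_ladder() {
    n=0
    # Same order as the ladder: status, check, vault status, check data.
    for step in "status" "check" "vault status" "check data"; do
        n=$((n + 1))
        printf '== quick check %d: sunbeam %s\n' "$n" "$step"
        # $step is deliberately unquoted so "vault status" becomes two arguments.
        if ! sunbeam $step; then
            printf 'FAILED: sunbeam %s\n' "$step" >&2
            return "$n"
        fi
    done
    printf 'Quick checks passed\n'
}
```

The return code is the position of the failing quick check, so a return of 3 means `sunbeam vault status` failed and unsealing is the next thing to try.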

COE Format (Correction of Error)

When incidents happen, we write them up. Not as blame — as learning. We've all broken prod at 2am, darling, the point is we don't break it the same way twice. COEs live in docs/ops/ and follow this format:

# COE-YYYY-NNN: Brief Title

**Date:** YYYY-MM-DD
**Duration:** How long it lasted
**Severity:** P1/P2/P3
**Author:** Who wrote it up

## Summary
One paragraph: what happened, who was affected, how it was resolved.

## Timeline
Chronological events with timestamps.

## Root Cause
What actually went wrong (not symptoms — the cause).

## Resolution
What was done to fix it.

## Lessons Learned
What we'll do differently.

## Action Items
- [ ] Specific follow-up tasks with owners

Existing COEs:

COE-2026-001: Vault Root Token Loss
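Picking the next `COE-YYYY-NNN` identifier by hand is easy to get wrong at 2am. The sketch below scans a directory for existing COEs and prints the next free number for a given year. It assumes one file per COE with a name starting `COE-YYYY-NNN` (a hypothetical layout, since the runbook only says COEs live in docs/ops/), and `next_coe_id` is an illustrative name.

```shell
#!/bin/sh
# Sketch only: assumes files are named COE-YYYY-NNN<anything> inside the given directory.
# next_coe_id is a hypothetical helper name.
next_coe_id() {
    dir=$1
    year=$2
    max=0
    for f in "$dir"/COE-"$year"-[0-9][0-9][0-9]*; do
        [ -e "$f" ] || continue              # glob matched nothing
        name=${f##*/}                        # strip the directory part
        num=${name#COE-"$year"-}             # strip the "COE-YYYY-" prefix
        num=$(printf '%s' "$num" | cut -c1-3)
        num=${num#0}                         # drop leading zeros so the number
        num=${num#0}                         # is not read as octal
        if [ "$num" -gt "$max" ]; then
            max=$num
        fi
    done
    printf 'COE-%s-%03d\n' "$year" "$((max + 1))"
}
```

For example, `next_coe_id docs/ops "$(date +%Y)"` prints the identifier to use in the new write-up's title line.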

Common Runbooks

Backup & Restore (PostgreSQL)

Backup: Automated via barman → Scaleway Object Storage. 30-day retention.

# Check backup status
sunbeam k8s get pods -n data -l app=barman

# Manual backup trigger
sunbeam k8s exec -n data barman-0 -- barman backup postgres

Restore:

# CloudNativePG handles point-in-time recovery
# Edit the Cluster resource to specify a recovery target
# See CloudNativePG docs for PITR configuration

Known gap: 30-day retention doesn't meet 7-year cold storage requirements for financial records. This needs addressing — see backup retention project notes.

Secret Rotation

Dynamic secrets (automatic): Database credentials rotate every 5 minutes via Vault Secrets Operator. No manual intervention. The operator:

1. Requests new creds from OpenBao
2. Updates the K8s Secret
3. Triggers rollout restart on affected Deployments

Static secrets (manual when needed):

# Rotate a static secret
sunbeam vault kv put secret/{app}/{key} value="new-value"
# VSO picks up changes within 30 seconds

# Rotate Django secret key (causes session invalidation)
sunbeam vault kv put secret/messages/django-secret DJANGO_SECRET_KEY="$(openssl rand -hex 32)"

OpenBao root token: The root token is critical. If lost, you need to reinitialize (see COE-2026-001). Keep it somewhere safe and offline.

# Check seal status
sunbeam vault status

# Unseal (needs 3 of 5 keys)
sunbeam vault unseal <key1>
sunbeam vault unseal <key2>
sunbeam vault unseal <key3>
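Typing keys straight onto the command line leaves them in shell history. A minimal sketch of an alternative: read keys from standard input, one per line, and stop once the threshold is met. `unseal_from_stdin` is a hypothetical helper, and it still passes each key to `sunbeam vault unseal` as an argument, exactly as the commands above do, so the key is briefly visible in the process list.

```shell
#!/bin/sh
# Sketch only: unseal_from_stdin is a hypothetical wrapper around the
# "sunbeam vault unseal <key>" command shown above.
unseal_from_stdin() {
    threshold=${1:-3}                  # 3 of 5 keys, per this runbook
    count=0
    while [ "$count" -lt "$threshold" ] && IFS= read -r key; do
        [ -n "$key" ] || continue      # skip blank lines
        # </dev/null stops the command from swallowing the remaining keys.
        sunbeam vault unseal "$key" </dev/null || return 1
        count=$((count + 1))
    done
    unset key
    if [ "$count" -lt "$threshold" ]; then
        printf 'only %d of %d keys supplied\n' "$count" "$threshold" >&2
        return 1
    fi
    sunbeam vault status
}
```

Run it as `unseal_from_stdin 3` and paste the keys one per line, or pipe them in from wherever the key holders keep them.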

Certificate Renewal

Automatic: cert-manager renews Let's Encrypt certs before expiry. Pingora watches the K8s Secret and hot-reloads. Zero downtime.

If auto-renewal fails:

# Check cert-manager logs
sunbeam logs cert-manager/cert-manager

# Check certificate status
sunbeam k8s get certificates -A
sunbeam k8s describe certificate -n ingress pingora-tls

# Force renewal
sunbeam k8s delete certificate -n ingress pingora-tls
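# cert-manager recreates it only if the Certificate is generated from an
# annotated Ingress; otherwise re-apply the Certificate manifest afterwards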

ACME challenge issues: Pingora routes HTTP-01 challenges by watching Ingress objects. If challenges fail, check that port 80 is open and Pingora is routing .well-known/acme-challenge/ correctly.
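To catch a stalled renewal before the cert actually expires, the certificate's end date can be checked locally with `openssl x509 -checkend`. This is a sketch, not the project's tooling: `cert_expires_within` is a hypothetical name, and it expects a PEM file you have already pulled out of the TLS Secret.

```shell
#!/bin/sh
# Sketch only: cert_expires_within is a hypothetical helper.
# Returns 0 when the certificate expires within the given number of days
# (or cannot be read at all), 1 when it is still valid beyond that window.
cert_expires_within() {
    pem=$1
    days=$2
    # -checkend takes seconds and exits 0 if the cert is still valid by then.
    if openssl x509 -in "$pem" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

A typical use would be `cert_expires_within /tmp/pingora.crt 14 && echo "renewal looks stuck"`. How the PEM gets to `/tmp/pingora.crt` depends on the name of the Secret behind the `pingora-tls` Certificate, which this runbook does not state.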

Database Recovery

Pod restart (transient failure):

sunbeam restart data/postgres

Data corruption (use backups): CloudNativePG supports point-in-time recovery from barman backups. Edit the Cluster resource with a recovery section specifying the target time.

Connection exhaustion: Check connection counts in Grafana (dashboards-infrastructure). All apps use connection pooling — if connections are exhausted, something is leaking.

SeaweedFS Recovery

Volume server down:

sunbeam logs storage/seaweedfs-volume
sunbeam restart storage/seaweedfs-volume

Filer issues (S3 API):

sunbeam check storage
sunbeam logs storage/seaweedfs-filer

Data integrity: Volumes live on local NVMe. No replication on single-node — if the disk dies, data is gone. Keep critical assets in multiple places (Drive + S3 via Hive, Git LFS in Gitea). Redundancy is self-care. 💅

Sol☀️ Restart / Conversation Reset

Sol☀️ uses SQLite for conversation state. On restart:

1. Sol☀️ backfills recent messages from the OpenSearch archive
2. Recreates the orchestrator agent if the system prompt has changed
3. Sends *sneezes* to all rooms to signal the hiccup

# Normal restart
sunbeam restart matrix/sol

# Check Sol☀️ is healthy
sunbeam logs matrix/sol -f
sunbeam check matrix

If conversations seem confused after restart, Sol☀️ will auto-compact and recover. The *sneezes* message is intentional — it tells the team "I restarted, give me a moment." They bounce back fast.
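Rather than eyeballing logs after a restart, the health check can be polled until it passes. A minimal sketch, assuming `sunbeam check <target>` exits non-zero while the target is unhealthy (not confirmed by this runbook). `wait_healthy` and its retry defaults are illustrative.

```shell
#!/bin/sh
# Sketch only: assumes "sunbeam check <target>" exits non-zero when unhealthy.
# wait_healthy is a hypothetical helper name.
wait_healthy() {
    target=$1
    attempts=${2:-10}
    delay=${3:-5}                      # seconds between checks
    i=1
    while [ "$i" -le "$attempts" ]; do
        if sunbeam check "$target" >/dev/null 2>&1; then
            printf '%s healthy after %d check(s)\n' "$target" "$i"
            return 0
        fi
        # No point sleeping after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    printf '%s still unhealthy after %d checks\n' "$target" "$attempts" >&2
    return 1
}
```

Usage after a restart: `sunbeam restart matrix/sol && wait_healthy matrix`.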

Alerts → Matrix

AlertManager routes all alerts to a Matrix room via the matrix-alertmanager-receiver:

Prometheus alert fires
  → AlertManager evaluates severity + routing
  → Webhook POST to matrix-alertmanager-receiver
  → Bot formats alert and posts to Matrix room
  → Team sees it in chat

The bot account credentials are in matrix-bot-secret. If alerts stop appearing in Matrix, check:

sunbeam logs monitoring/matrix-alertmanager-receiver
sunbeam check monitoring
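If the logs look clean but the room is still quiet, pushing a synthetic alert through the pipeline shows which hop is dropping it. The sketch below builds a payload for AlertManager's `POST /api/v2/alerts` endpoint. The alert name, severity label and summary text are made up for illustration, and the function does no JSON escaping, so keep the arguments to plain words.

```shell
#!/bin/sh
# Sketch only: test_alert_payload is a hypothetical helper; label values are
# illustrative and are inserted without JSON escaping.
test_alert_payload() {
    name=${1:-RunbookTestAlert}
    severity=${2:-info}
    printf '[{"labels":{"alertname":"%s","severity":"%s"},"annotations":{"summary":"Manual test of the Matrix alert route"}}]\n' \
        "$name" "$severity"
}
```

With AlertManager reachable locally (for example through a port-forward to `localhost:9093`, which is an assumption about this setup), the payload can be sent with `test_alert_payload | curl -fsS -H 'Content-Type: application/json' -d @- http://localhost:9093/api/v2/alerts`. If the alert shows up in AlertManager but not in Matrix, the problem is in the receiver or the bot credentials.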

Emergency Contacts

Situation	What to do
Server unreachable	Check Scaleway console, reboot if needed
OpenBao sealed	Unseal with 3 of 5 keys
Database down	Check CloudNativePG operator logs, restart if needed
All apps 502	Check Pingora + Linkerd
Email not sending	Check Postfix MTA-out, Scaleway TEM status
Sol☀️ unresponsive	Restart, check Mistral API connectivity
Certs expired	Delete cert resource, let cert-manager recreate
