# When Things Go Sideways, Gorgeous 🚨

Things break. Servers misbehave. Secrets expire. It's not a matter of if; it's a matter of having a plan when it happens. This is the ops playbook for The Super Boujee Business Box ✨.

---

## The Diagnostic Ladder

When something's wrong, check in this order:

```bash
# 1. What's actually running?
sunbeam status

# 2. What's unhealthy?
sunbeam check

# 3. What are the logs saying?
sunbeam logs {namespace}/{service} -f

# 4. Is OpenBao sealed? (breaks everything if yes)
sunbeam vault status

# 5. Is the database up?
sunbeam check data

# 6. Check AlertManager for firing alerts
sunbeam mon grafana alerts

# 7. Check recent error logs across everything
sunbeam mon loki logs '{level="error"}'
```
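
The non-interactive rungs of the ladder can be chained into one script. This is a minimal sketch, not part of `sunbeam` itself: `run_ladder` is a hypothetical helper, and it assumes each `sunbeam` subcommand exits non-zero when it finds a problem (worth verifying before you trust it at 2am). The log-tailing and `sunbeam mon` rungs are left out because they stream or need human eyes. Calling `run_ladder` prints each rung as it runs and stops at the first failure.

```bash
# Sketch: walk the diagnostic ladder in order and report the first failing rung.
# Assumption: each sunbeam subcommand exits non-zero when it finds a problem.

run_ladder() {
  # One rung per line in the here-document: "label|command".
  while IFS='|' read -r ladder_label ladder_cmd; do
    printf '== %s\n' "$ladder_label"
    # $ladder_cmd is left unquoted on purpose so it splits into command + args.
    # stdin comes from /dev/null so the command cannot eat the rung list.
    if ! $ladder_cmd </dev/null; then
      printf 'FAILED at: %s (%s)\n' "$ladder_label" "$ladder_cmd"
      return 1
    fi
  done <<'EOF'
running services|sunbeam status
health checks|sunbeam check
OpenBao seal status|sunbeam vault status
database|sunbeam check data
EOF
  printf 'All rungs passed\n'
}
```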

If OpenBao is sealed, unseal it first: almost everything depends on it for credentials.

---

## COE Format (Cause of Error)

When incidents happen, we write them up. Not as blame, but as learning. We've all broken prod at 2am, darling; the point is we don't break it the same way twice. COEs live in `docs/ops/` and follow this format:

```markdown
# COE-YYYY-NNN: Brief Title

**Date:** YYYY-MM-DD
**Duration:** How long it lasted
**Severity:** P1/P2/P3
**Author:** Who wrote it up

## Summary
One paragraph: what happened, who was affected, how it was resolved.

## Timeline
Chronological events with timestamps.

## Root Cause
What actually went wrong (the cause, not the symptoms).

## Resolution
What was done to fix it.

## Lessons Learned
What we'll do differently.

## Action Items
- [ ] Specific follow-up tasks with owners
```
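
To save typing while the adrenaline is still wearing off, the template can be scaffolded from the shell. A sketch under assumptions: `new_coe` is a hypothetical helper, its argument order is made up for illustration, and the file name copies the pattern of the existing `COE-2026-001-vault-root-token-loss.md`. For example, `new_coe 2026 002 "Brief Title"` would create `docs/ops/COE-2026-002-brief-title.md` and print its path.

```bash
# Sketch: scaffold a new COE file from the template.
# Assumptions: new_coe and its arguments (year, number, title, optional dir)
# are hypothetical; the file name mirrors the existing COE-2026-001 entry.

new_coe() {
  coe_year="$1"; coe_num="$2"; coe_title="$3"
  coe_dir="${4:-docs/ops}"

  # "Vault Root Token Loss" -> "vault-root-token-loss"
  coe_slug="$(printf '%s' "$coe_title" | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' | sed 's/^-*//; s/-*$//')"
  coe_file="$coe_dir/COE-$coe_year-$coe_num-$coe_slug.md"

  # Refuse to overwrite an existing write-up.
  if [ -e "$coe_file" ]; then
    printf 'refusing to overwrite %s\n' "$coe_file" >&2
    return 1
  fi

  mkdir -p "$coe_dir"
  cat > "$coe_file" <<EOF
# COE-$coe_year-$coe_num: $coe_title

**Date:** $(date +%Y-%m-%d)
**Duration:**
**Severity:**
**Author:**

## Summary

## Timeline

## Root Cause

## Resolution

## Lessons Learned

## Action Items
- [ ]
EOF
  # Print the path so the caller can open it in an editor.
  printf '%s\n' "$coe_file"
}
```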

**Existing COEs:**
- [COE-2026-001: Vault Root Token Loss](ops/COE-2026-001-vault-root-token-loss.md)

---

## Common Runbooks

### Backup & Restore (PostgreSQL)

**Backup:** Automated via barman → Scaleway Object Storage. 30-day retention.

```bash
# Check backup status
sunbeam k8s get pods -n data -l app=barman

# Manual backup trigger
sunbeam k8s exec -n data barman-0 -- barman backup postgres
```

**Restore:**
```bash
# CloudNativePG handles point-in-time recovery (PITR).
# Recovery bootstraps a new Cluster from the backups; it does not rewind the existing one.
# Define a Cluster with a bootstrap.recovery section that names a recovery target.
# See the CloudNativePG docs for PITR configuration.
```
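
As a rough illustration of a PITR setup: CloudNativePG recovers by bootstrapping a new cluster from the object-store backups, with the target time under `bootstrap.recovery.recoveryTarget`. The helper below only prints a manifest for review. Everything specific in it is an assumption: the helper name `pitr_manifest`, the cluster names, bucket path, endpoint region, Secret name and keys, and storage size are placeholders, and newer CloudNativePG releases may prefer the Barman Cloud plugin over the in-tree `barmanObjectStore` block. Typical use would be `pitr_manifest "2026-01-15T08:00:00Z" > restore.yaml`, followed by a careful read before applying anything.

```bash
# Sketch: print a CloudNativePG Cluster manifest that bootstraps a NEW cluster
# from object-store backups, recovered to a target time.
# Assumptions (all hypothetical): names, bucket path, endpoint, Secret, size.

pitr_manifest() {
  pitr_target_time="$1"   # RFC 3339 timestamp, e.g. 2026-01-15T08:00:00Z
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-restore          # hypothetical name for the recovered cluster
  namespace: data
spec:
  instances: 1
  storage:
    size: 20Gi                    # placeholder
  bootstrap:
    recovery:
      source: postgres            # must match an externalClusters entry
      recoveryTarget:
        targetTime: "$pitr_target_time"
  externalClusters:
    - name: postgres
      barmanObjectStore:
        destinationPath: s3://example-backups/postgres   # placeholder bucket
        endpointURL: https://s3.fr-par.scw.cloud          # assumed region
        s3Credentials:
          accessKeyId:
            name: backup-creds    # hypothetical Secret
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-creds
            key: ACCESS_SECRET_KEY
EOF
}
```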

> **Known gap:** 30-day retention doesn't meet 7-year cold storage requirements for financial records. This needs addressing; see the backup retention project notes.

---

### Secret Rotation

**Dynamic secrets (automatic):**
Database credentials rotate every 5 minutes via the Vault Secrets Operator (VSO). No manual intervention is needed. The operator:
1. Requests new creds from OpenBao
2. Updates the K8s Secret
3. Triggers a rollout restart on affected Deployments

**Static secrets (manual when needed):**
```bash
# Rotate a static secret
sunbeam vault kv put secret/{app}/{key} value="new-value"
# VSO picks up changes within 30 seconds

# Rotate the Django secret key (invalidates existing sessions)
sunbeam vault kv put secret/messages/django-secret DJANGO_SECRET_KEY="$(openssl rand -hex 32)"
```

**OpenBao root token:**
The root token is critical. If it is lost, you need to reinitialize OpenBao (see COE-2026-001). Keep it somewhere safe and offline.

```bash
# Check seal status
sunbeam vault status

# Unseal (needs 3 of 5 keys)
sunbeam vault unseal <key1>
sunbeam vault unseal <key2>
sunbeam vault unseal <key3>
```
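
Typing unseal keys straight into the command line leaves them in shell history. One way around that is to prompt for each key, sketched below. Assumptions: `unseal_interactive` and `is_unsealed` are hypothetical helpers, and `sunbeam vault status` is assumed to print a `Sealed ... false` line once unsealed, the way the upstream status table does. The key is still handed to `sunbeam vault unseal` as an argument, as in the commands above, so this only keeps it out of history, not out of the process list.

```bash
# Sketch: prompt for unseal keys until OpenBao reports unsealed.
# Assumption: `sunbeam vault status` prints "Sealed    false" when unsealed.

is_unsealed() {
  sunbeam vault status </dev/null 2>/dev/null \
    | grep -Eq '^Sealed[[:space:]]+false'
}

unseal_interactive() {
  unseal_needed=3   # threshold from this runbook: 3 of 5 keys
  unseal_count=0
  while [ "$unseal_count" -lt "$unseal_needed" ] && ! is_unsealed; do
    printf 'Unseal key %d of %d: ' "$((unseal_count + 1))" "$unseal_needed" >&2
    # Hide typed keys when stdin is a terminal.
    [ -t 0 ] && stty -echo
    IFS= read -r unseal_key || { [ -t 0 ] && stty echo; return 1; }
    [ -t 0 ] && stty echo
    printf '\n' >&2
    sunbeam vault unseal "$unseal_key" </dev/null >/dev/null || return 1
    unseal_count=$((unseal_count + 1))
  done
  unset unseal_key
  # Succeed only if the status check now reports unsealed.
  is_unsealed
}
```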

---

### Certificate Renewal

**Automatic:** cert-manager renews Let's Encrypt certs before expiry. Pingora watches the K8s Secret and hot-reloads. Zero downtime.

**If auto-renewal fails:**
```bash
# Check cert-manager logs
sunbeam logs cert-manager/cert-manager

# Check certificate status
sunbeam k8s get certificates -A
sunbeam k8s describe certificate -n ingress pingora-tls

# Force renewal
sunbeam k8s delete certificate -n ingress pingora-tls
# cert-manager will recreate it
```
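
It can help to catch a stalled renewal before the cert actually expires. The sketch below checks a PEM file with `openssl x509 -checkend`. Assumptions: `cert_expires_within` is a hypothetical helper, and getting the PEM out of the cluster (for example from the TLS Secret behind `pingora-tls`) is left to you. `cert_expires_within cert.pem 14` succeeds when the cert expires within 14 days, so it slots into an `if` or a cron alert. An unreadable file also counts as "expiring", which errs on the loud side.

```bash
# Sketch: succeed when a PEM certificate expires within N days.
# Assumption: cert_expires_within is a hypothetical helper; it needs openssl.

cert_expires_within() {
  cert_file="$1"
  cert_days="$2"
  # -checkend exits 0 if the cert is still valid N seconds from now.
  if openssl x509 -checkend "$((cert_days * 86400))" -noout \
       -in "$cert_file" >/dev/null 2>&1; then
    return 1   # not expiring inside the window
  fi
  return 0     # expiring, already expired, or unreadable
}
```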

**ACME challenge issues:** Pingora routes HTTP-01 challenges by watching Ingress objects. If challenges fail, check that port 80 is open and that Pingora is routing `/.well-known/acme-challenge/` correctly.

---

### Database Recovery

**Pod restart (transient failure):**
```bash
sunbeam restart data/postgres
```

**Data corruption (use backups):**
CloudNativePG supports point-in-time recovery from barman backups. Recovery bootstraps a new cluster rather than repairing the existing one in place, so define a Cluster resource whose `bootstrap.recovery` section specifies the target time.

**Connection exhaustion:**
Check connection counts in Grafana (`dashboards-infrastructure`). All apps use connection pooling, so if connections are exhausted, something is leaking.

---

### SeaweedFS Recovery

**Volume server down:**
```bash
sunbeam logs storage/seaweedfs-volume
sunbeam restart storage/seaweedfs-volume
```

**Filer issues (S3 API):**
```bash
sunbeam check storage
sunbeam logs storage/seaweedfs-filer
```

**Data integrity:** Volumes live on local NVMe. There is no replication on a single node: if the disk dies, the data is gone. Keep critical assets in multiple places (Drive + S3 via Hive, Git LFS in Gitea). Redundancy is self-care. 💅

---

### Sol☀️ Restart / Conversation Reset

Sol☀️ uses SQLite for conversation state. On restart:

1. Sol☀️ backfills recent messages from the OpenSearch archive
2. Recreates the orchestrator agent if the system prompt has changed
3. Sends `*sneezes*` to all rooms to signal the hiccup

```bash
# Normal restart
sunbeam restart matrix/sol

# Check Sol☀️ is healthy
sunbeam logs matrix/sol -f
sunbeam check matrix
```
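
If you'd rather not babysit the logs, the restart and the health check can be combined into a poll loop. A sketch, assuming `sunbeam check matrix` exits non-zero while Sol☀️ is still unhealthy; `restart_and_wait` and the `WAIT_ATTEMPTS` / `WAIT_DELAY` knobs are hypothetical names. With the defaults it gives Sol☀️ up to five minutes (30 checks, 10 seconds apart) before giving up.

```bash
# Sketch: restart Sol and poll the health check until it passes or times out.
# Assumption: `sunbeam check matrix` exits non-zero while unhealthy.

restart_and_wait() {
  wait_attempts="${WAIT_ATTEMPTS:-30}"   # polls before giving up
  wait_delay="${WAIT_DELAY:-10}"         # seconds between polls
  sunbeam restart matrix/sol || return 1

  wait_try=1
  while [ "$wait_try" -le "$wait_attempts" ]; do
    if sunbeam check matrix >/dev/null 2>&1; then
      printf 'healthy after %d check(s)\n' "$wait_try"
      return 0
    fi
    sleep "$wait_delay"
    wait_try=$((wait_try + 1))
  done

  printf 'still unhealthy after %d checks\n' "$wait_attempts" >&2
  return 1
}
```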

If conversations seem confused after restart, Sol☀️ will auto-compact and recover. The `*sneezes*` message is intentional: it tells the team "I restarted, give me a moment." They bounce back fast.

---

## Alerts → Matrix

AlertManager routes all alerts to a Matrix room via the `matrix-alertmanager-receiver`:

```
Prometheus alert fires
→ AlertManager evaluates severity + routing
→ Webhook POST to matrix-alertmanager-receiver
→ Bot formats alert and posts to Matrix room
→ Team sees it in chat
```

The bot account credentials are in `matrix-bot-secret`. If alerts stop appearing in Matrix, check:

```bash
sunbeam logs monitoring/matrix-alertmanager-receiver
sunbeam check monitoring
```
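
A quick end-to-end test is to push a synthetic alert into Alertmanager and watch for it in the room. The sketch below uses Alertmanager's v2 HTTP API (`POST /api/v2/alerts`). The helper names, the alert labels, and the default URL `http://localhost:9093` (for example via a port-forward) are assumptions, not part of this stack's documented setup. Run `send_test_alert` and the bot should post `MatrixPipelineTest` shortly after; an alert sent without an end time resolves on its own once Alertmanager's resolve timeout passes.

```bash
# Sketch: push a synthetic alert into Alertmanager to exercise the Matrix path.
# Assumptions: the URL is hypothetical; helper names and labels are illustrative.

test_alert_payload() {
  # Alertmanager's v2 API accepts a JSON array of alerts.
  cat <<'EOF'
[{
  "labels": {"alertname": "MatrixPipelineTest", "severity": "info"},
  "annotations": {"summary": "Synthetic alert: testing the Matrix route"}
}]
EOF
}

send_test_alert() {
  am_url="${1:-http://localhost:9093}"   # e.g. a port-forwarded Alertmanager
  # -f makes curl fail on HTTP errors; the payload is read from stdin.
  test_alert_payload | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- "$am_url/api/v2/alerts"
}
```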

---

## Emergency Contacts

| Situation | What to do |
|-----------|-----------|
| Server unreachable | Check Scaleway console, reboot if needed |
| OpenBao sealed | Unseal with 3 of 5 keys |
| Database down | Check CloudNativePG operator logs, restart if needed |
| All apps 502 | Check Pingora + Linkerd |
| Email not sending | Check Postfix MTA-out, Scaleway TEM status |
| Sol☀️ unresponsive | Restart, check Mistral API connectivity |
| Certs expired | Delete cert resource, let cert-manager recreate |