# COE-2026-001: The One Where We Almost Lost the Keys to the House 🔑

**Date:** 2026-03-23

**Severity:** Critical (like, "one pod restart away from total blackout" critical)

**Author:** Sienna Satterwhite

**Status:** Resolved ✨

---
## What Happened
On 2026-03-23, during routine CLI development, we discovered that the OpenBao (Vault) root token and unseal key had vanished. Gone. Poof. The `openbao-keys` Kubernetes Secret in the `data` namespace had been silently overwritten with empty data by a placeholder manifest during a previous `sunbeam apply data` run.

Here's the terrifying part: the vault was still working — but only because the `openbao-0` pod hadn't restarted. The unseal state was living in memory like a ghost. One pod restart, one node reboot, one sneeze from Kubernetes, and the vault would have sealed itself permanently with no way back in. That would have taken down *everything* — SSO, identity, Git, Drive, Messages, Meet, Calendar, Sol☀️, monitoring. The whole house. 💀

We fixed it in ~2 hours: implemented an encrypted local keystore (so this can never happen again), re-initialized the vault with fresh keys, and re-seeded every credential in the system. Nobody lost any data. The platform had about 5 minutes of "please log in again" disruption while SSO sessions refreshed.

**How long were we exposed?** Days to weeks. We don't know exactly when the placeholder overwrote the secret. That's the scary part — and exactly why we're writing this up.

---
## The Damage Report
- **Direct impact:** Root token permanently gone. No way to write new secrets, rotate credentials, or configure vault policies.
- **What would've happened if the pod restarted:** Total platform outage. Every service that reads from Vault (which is all of them) would have lost access to their credentials. Full blackout.
- **What actually happened to users:** Brief SSO session invalidation during the fix. Everyone had to log in again. 5 minutes of mild inconvenience.
- **Data loss:** Zero. All application data — databases, messages, repos, files, search indices — completely untouched. Only vault secrets were regenerated.

---
## Timeline
All times UTC. Grab a drink, this is a ride. 🥂

| Time | What went down |
|------|---------------|
| Unknown (days prior) | `sunbeam apply data` runs and applies `openbao-keys-placeholder.yaml`, which quietly overwrites the `openbao-keys` Secret with... nothing. Empty. Blank. |
| Unknown | The auto-unseal sidecar's volume mount refreshes. The key file disappears from `/openbao/unseal/`. |
| Unknown | Vault stays unsealed because the pod hasn't restarted — seal state is held in memory. The house of cards stands. |
| ~11:30 | During CLI testing, `sunbeam seed` says "No root token available — skipping KV seeding." Sienna's eyebrows go up. |
| ~11:40 | `sunbeam k8s get secret openbao-keys -n data -o yaml` — the Secret exists but has zero data fields. The keys are just... gone. |
| ~11:45 | `sunbeam k8s exec -n data openbao-0 -- bao status` confirms vault is initialized and unsealed (in memory). One restart away from disaster. |
| ~11:50 | Frantic search through local files, Claude Code transcripts, shell history. No copy of the root token anywhere. Keys are confirmed permanently lost. |
| ~12:00 | Deep breath. Decision: build a proper encrypted keystore *before* reinitializing, so this literally cannot happen again. |
| ~12:30 | `vault_keystore.rs` implemented — AES-256-GCM encryption, Argon2id KDF, 26 unit tests passing. Built under pressure, built right. |
| ~13:00 | Keystore wired into seed flow. `vault reinit/keys/export-keys` CLI commands added. Placeholder YAML deleted from infra manifests forever. |
| 13:10 | All secrets from all namespaces backed up to `/tmp/sunbeam-secrets-backup/` (75 files, 304K). Belt and suspenders. |
| 13:12 | `sunbeam vault reinit` — vault storage wiped, new root token and unseal key generated. The moment of truth. |
| 13:13 | New keys saved to local encrypted keystore at `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc`. Never again. |
| 13:14 | `sunbeam seed` completes — all 19 KV paths written, database engine configured, K8s Secrets created, policies set. |
| 13:15 | Sol☀️'s manual secrets (matrix-access-token, matrix-device-id, mistral-api-key) restored from backup. |
| 13:20 | All service deployments restarted across ory, devtools, lasuite, matrix, media, monitoring namespaces. |
| 13:25 | All critical services confirmed running. Platform operational. Sienna pours a glass of wine. 🍷 |

---
## By the Numbers
- **Time to detect:** Unknown (days to weeks — the empty secret had no monitoring)
- **Time to resolve (from detection):** ~2 hours
- **Services affected during resolution:** All (brief SSO session invalidation)
- **Data loss:** None
- **Secrets regenerated:** 19 KV paths, ~75 K8s Secrets across 8 namespaces
- **Manual secrets requiring restore:** 3 (Sol☀️'s matrix-access-token, matrix-device-id, mistral-api-key)

---
## The Autopsy 🔍
**Q: How was this caught?**

During routine `sunbeam seed` testing. The command said "no root token" and we went "excuse me?"

**Q: Why didn't we catch it sooner?**

No monitoring on the `openbao-keys` Secret. The vault pod hadn't restarted, so everything kept working on cached credentials. Silent and deadly.

**Q: What was the single point of failure?**

The root token and unseal key existed in exactly one place — a K8s Secret — with no local backup, no external copy, and no integrity monitoring. One location. One overwrite. One loss.

**Q: Was any data exposed?**

No. This was about losing access, not unauthorized access. The vault stayed unsealed in memory and kept serving with its existing valid credentials; secrets remained protected the whole time.

---
## 5 Whys (or: How We Got Here)
**Why was the root token lost?**

The `openbao-keys` K8s Secret was overwritten with empty data.

**Why was it overwritten?**

A manifest called `openbao-keys-placeholder.yaml` was listed in `sbbb/base/data/kustomization.yaml` and got applied during `sunbeam apply data`, replacing the real Secret with an empty one.

**Why was there a placeholder in the manifests?**

It was added so the auto-unseal sidecar's volume mount would succeed even before the first `sunbeam seed` run. The assumption was that server-side apply with no `data` field would leave existing data alone. That assumption was wrong. 💅
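For illustration, the placeholder probably looked something like this. This is a hypothetical reconstruction, not the actual manifest (which has been deleted):

```yaml
# Hypothetical reconstruction of openbao-keys-placeholder.yaml (illustration only;
# the real file is gone). Note the missing data/stringData field: per the failure
# above, applying this manifest replaced the live Secret's data with nothing.
apiVersion: v1
kind: Secret
metadata:
  name: openbao-keys
  namespace: data
type: Opaque
```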
**Why was there no backup?**

The CLI stored keys exclusively in the K8s Secret. No local backup, no external copy, no validation on subsequent runs.

**Why was there no monitoring?**

Vault key integrity wasn't part of the observability setup. We were watching service health, not infrastructure credential integrity. Lesson learned.

---
## What We Fixed
| # | What | Severity | Status |
|---|------|----------|--------|
| 1 | Encrypted local vault keystore (`vault_keystore.rs`) | Critical | **Done** ✅ |
| | AES-256-GCM, Argon2id KDF, 26 unit tests. Keys at `~/.local/share/sunbeam/vault/{domain}.enc`. | | |
| 2 | Keystore wired into seed flow | Critical | **Done** ✅ |
| | Save after init, load as fallback, backfill from cluster, restore from local. | | |
| 3 | `vault reinit/keys/export-keys` CLI commands | Critical | **Done** ✅ |
| | Recovery, inspection, and migration tools. | | |
| 4 | Removed `openbao-keys-placeholder.yaml` from manifests | Critical | **Done** ✅ |
| | Eliminated the overwrite vector. Auto-unseal volume mount now has `optional: true`. | | |
| 5 | `sunbeam.dev/managed-by: sunbeam` label on programmatic secrets | High | **Done** ✅ |
| | Prevents future manifest overwrites of seed-managed secrets. | | |
| 6 | `kv_put` fallback when `kv_patch` returns 404 | Medium | **Done** ✅ |
| | Handles fresh vault initialization where KV paths don't exist yet. | | |
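To make fix #5 concrete, here's roughly what a seed-managed Secret carrying the guard label looks like. Values are redacted and the exact field names are a sketch:

```yaml
# Sketch of a seed-managed Secret with the guard label from fix #5.
# Apply tooling can check for this label and refuse to overwrite the Secret
# from a static manifest.
apiVersion: v1
kind: Secret
metadata:
  name: openbao-keys
  namespace: data
  labels:
    sunbeam.dev/managed-by: sunbeam
type: Opaque
data:
  key: "..."        # base64-encoded unseal key (redacted)
  root-token: "..." # base64-encoded root token (redacted)
```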
---
## Monitoring We Need (So This Never Happens Again)
### Vault Seal Status Alert

If OpenBao reports `sealed: true` for more than 60 seconds, something is very wrong. The auto-unseal sidecar should handle it in seconds — if it doesn't, the key is missing or corrupt.

- **How:** PrometheusRule against `/v1/sys/health` (returns 503 when sealed). `for: 1m`, severity: critical.
- **Runbook:** Check `openbao-keys` Secret for the `key` field. If empty, restore from local keystore via `sunbeam vault keys` / `sunbeam seed`.
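A sketch of what that rule could look like, assuming a blackbox-exporter probe job (here named `openbao-health`) hitting `/v1/sys/health`; the job name and namespace are placeholders, not existing infra:

```yaml
# Sketch of the seal-status alert. Assumes a blackbox-exporter probe scraping
# /v1/sys/health, which returns 503 when the vault is sealed. All names here
# are assumptions to be adapted to the actual monitoring setup.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-seal-status
  namespace: monitoring
spec:
  groups:
    - name: openbao
      rules:
        - alert: OpenBaoSealed
          expr: probe_http_status_code{job="openbao-health"} == 503
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "OpenBao has reported sealed for over 60s"
            runbook: "Check the openbao-keys Secret's key field; if empty, restore via sunbeam vault keys / sunbeam seed"
```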
### Vault Key Secret Integrity Alert

This is the exact failure mode that bit us — alert when `openbao-keys` has zero data fields or is missing `key`/`root-token`.

- **How:** CronJob or Alloy scraper checking the Secret contents. Alert if empty.
- **Runbook:** Run `sunbeam seed` (restores from local keystore). If no local keystore exists either, escalate immediately.
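One possible shape for that checker, sketched as a CronJob. The image, service account, and schedule are assumptions, not existing infra:

```yaml
# Sketch of the integrity CronJob: fails (and can alert via job-failure
# monitoring) if openbao-keys is missing either of its two required fields.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: openbao-keys-integrity
  namespace: data
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: openbao-keys-checker  # needs get on this one Secret
          restartPolicy: Never
          containers:
            - name: check
              image: bitnami/kubectl
              command: ["/bin/sh", "-c"]
              args:
                - |
                  for f in key root-token; do
                    if [ -z "$(kubectl get secret openbao-keys -n data -o jsonpath="{.data.$f}")" ]; then
                      echo "ALERT: openbao-keys missing field: $f"
                      exit 1
                    fi
                  done
```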
### Local Keystore Sync Check

On every `sunbeam seed` and `sunbeam vault status`, verify local keystore matches cluster state.

- **How:** Already implemented in `verify_vault_keys()` and the seed flow's backfill logic. Emits warnings on mismatch.
- **Runbook:** Determine which copy is authoritative (usually local — cluster may have been overwritten) and re-sync.
### VSO Secret Sync Failure Alert

If Vault Secrets Operator can't sync for more than 5 minutes, app secrets go stale.

- **How:** PrometheusRule on `vso_secret_sync_errors_total` or `vso_secret_last_sync_timestamp` exceeding threshold.
- **Runbook:** Check VSO logs. Verify vault is unsealed and the `vso-reader` policy/role exists.
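A sketch using the first metric suggested above; verify the metric name against the VSO version actually deployed before relying on it:

```yaml
# Sketch of the VSO sync alert. The metric name is taken from the suggestion
# above and is an assumption until checked against the deployed operator.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vso-sync-health
  namespace: monitoring
spec:
  groups:
    - name: vso
      rules:
        - alert: VSOSecretSyncFailing
          expr: increase(vso_secret_sync_errors_total[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Vault Secrets Operator sync errors for over 5 minutes"
```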
### Pod Restart After Credential Rotation

After `sunbeam seed`, warn if any deployment hasn't been restarted within 10 minutes.

- **How:** Compare seed completion timestamp with deployment restart annotations. Could be a post-seed CLI check.
- **Runbook:** `sunbeam k8s rollout restart -n <namespace> deployment/<name>` for each stale deployment.
### Node Memory Pressure

During this investigation, we found the node at 95% memory (Longhorn instance-manager leaked 38GB), which crashed PostgreSQL. Memory pressure can cascade into vault pod restarts, triggering the sealed-vault failure mode.

- **How:** PrometheusRule: warning at 85% used, critical at 95%.
- **Runbook:** `sunbeam k8s top pods -A --sort-by=memory`. Restart Longhorn instance-manager if above 10GB.
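A sketch of both thresholds, assuming standard node_exporter metrics are being scraped:

```yaml
# Sketch of the node memory alerts using node_exporter's standard gauges.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-pressure
  namespace: monitoring
spec:
  groups:
    - name: node-memory
      rules:
        - alert: NodeMemoryHigh
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node memory above 85%; check for Longhorn instance-manager leaks"
        - alert: NodeMemoryCritical
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.95
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Node memory above 95%; vault pod restarts become likely"
```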
---
## Related
- **Security audit:** Conducted 2026-03-22, identified 22 findings across proxy, identity, monitoring, storage, matrix, and Kubernetes layers.
- **Local keystore:** `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc`
- **Backup location:** `/tmp/sunbeam-secrets-backup/` — plaintext backup taken before reinit (temporary — move to secure storage)
- **Longhorn memory leak:** COE-2026-002 (pending) — the 38GB leak that added extra spice to this day

---

*Lessons: back up your keys, monitor your secrets, don't trust placeholder YAMLs, and always have wine on hand for incident response.* 🍷✨