COE-2026-001: The One Where We Almost Lost the Keys to the House 🔑
Date: 2026-03-23
Severity: Critical (like, "one pod restart away from total blackout" critical)
Author: Sienna Satterwhite
Status: Resolved ✨
What Happened
On 2026-03-23, during routine CLI development, we discovered that the OpenBao (Vault) root token and unseal key had vanished. Gone. Poof. The openbao-keys Kubernetes Secret in the data namespace had been silently overwritten with empty data by a placeholder manifest during a previous sunbeam apply data run.
Here's the terrifying part: the vault was still working — but only because the openbao-0 pod hadn't restarted. The unseal state was living in memory like a ghost. One pod restart, one node reboot, one sneeze from Kubernetes, and the vault would have sealed itself permanently with no way back in. That would have taken down everything — SSO, identity, Git, Drive, Messages, Meet, Calendar, Sol☀️, monitoring. The whole house. 💀
We fixed it in ~2 hours: implemented an encrypted local keystore (so this can never happen again), re-initialized the vault with fresh keys, and re-seeded every credential in the system. Nobody lost any data. The platform had about 5 minutes of "please log in again" disruption while SSO sessions refreshed.
How long were we exposed? Days to weeks. We don't know exactly when the placeholder overwrote the secret. That's the scary part — and exactly why we're writing this up.
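The core of the fix is boring in the best way: the keys now live in two places, and the seed flow prefers whichever copy survives. Here's a minimal Python sketch of that fallback logic (names and the plain-JSON file are illustrative; the real keystore in vault_keystore.rs is Rust and encrypts the file with AES-256-GCM under an Argon2id-derived key, which this sketch deliberately skips):

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Stand-in for the real keystore path; the real file is encrypted,
# this demo one is plain JSON in a temp dir.
KEYSTORE = Path(tempfile.gettempdir()) / "sunbeam-keystore-demo.json"

def save_local(keys: dict) -> None:
    KEYSTORE.write_text(json.dumps(keys))

def load_keys(cluster_secret: Optional[dict]) -> dict:
    """Prefer the cluster Secret; fall back to the local keystore.

    Whenever the cluster copy is intact, backfill the local copy so the
    two stay in sync; if the cluster copy is empty or gone, restore from
    the local keystore instead of giving up."""
    if cluster_secret and cluster_secret.get("root-token"):
        save_local(cluster_secret)  # keep the local copy fresh
        return cluster_secret
    if KEYSTORE.exists():
        # Cluster copy lost or wiped: recover from the local keystore.
        return json.loads(KEYSTORE.read_text())
    raise RuntimeError("vault keys lost: no cluster Secret and no local keystore")
```

Point being: losing either copy is now an inconvenience, not an incident. Only losing both at once hurts.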
The Damage Report
- Direct impact: Root token permanently gone. No way to write new secrets, rotate credentials, or configure vault policies.
- What would've happened if the pod restarted: Total platform outage. Every service that reads from Vault (which is all of them) would have lost access to their credentials. Full blackout.
- What actually happened to users: Brief SSO session invalidation during the fix. Everyone had to log in again. 5 minutes of mild inconvenience.
- Data loss: Zero. All application data — databases, messages, repos, files, search indices — completely untouched. Only vault secrets were regenerated.
Timeline
All times UTC. Grab a drink, this is a ride. 🥂
| Time | What went down |
|---|---|
| Unknown (days prior) | sunbeam apply data runs and applies openbao-keys-placeholder.yaml, which quietly overwrites the openbao-keys Secret with... nothing. Empty. Blank. |
| Unknown | The auto-unseal sidecar's volume mount refreshes. The key file disappears from /openbao/unseal/. |
| Unknown | Vault stays unsealed because the pod hasn't restarted — seal state is held in memory. The house of cards stands. |
| ~11:30 | During CLI testing, sunbeam seed says "No root token available — skipping KV seeding." Sienna's eyebrows go up. |
| ~11:40 | sunbeam k8s get secret openbao-keys -n data -o yaml — the Secret exists but has zero data fields. The keys are just... gone. |
| ~11:45 | sunbeam k8s exec -n data openbao-0 -- bao status confirms vault is initialized and unsealed (in memory). One restart away from disaster. |
| ~11:50 | Frantic search through local files, Claude Code transcripts, shell history. No copy of the root token anywhere. Keys are confirmed permanently lost. |
| ~12:00 | Deep breath. Decision: build a proper encrypted keystore before reinitializing, so this literally cannot happen again. |
| ~12:30 | vault_keystore.rs implemented — AES-256-GCM encryption, Argon2id KDF, 26 unit tests passing. Built under pressure, built right. |
| ~13:00 | Keystore wired into seed flow. vault reinit/keys/export-keys CLI commands added. Placeholder YAML deleted from infra manifests forever. |
| 13:10 | All secrets from all namespaces backed up to /tmp/sunbeam-secrets-backup/ (75 files, 304K). Belt and suspenders. |
| 13:12 | sunbeam vault reinit — vault storage wiped, new root token and unseal key generated. The moment of truth. |
| 13:13 | New keys saved to local encrypted keystore at ~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc. Never again. |
| 13:14 | sunbeam seed completes — all 19 KV paths written, database engine configured, K8s Secrets created, policies set. |
| 13:15 | Sol☀️'s manual secrets (matrix-access-token, matrix-device-id, mistral-api-key) restored from backup. |
| 13:20 | All service deployments restarted across ory, devtools, lasuite, matrix, media, monitoring namespaces. |
| 13:25 | All critical services confirmed running. Platform operational. Sienna pours a glass of wine. 🍷 |
By the Numbers
- Time to detect: Unknown (days to weeks — the empty secret had no monitoring)
- Time to resolve (from detection): ~2 hours
- Services affected during resolution: All (brief SSO session invalidation)
- Data loss: None
- Secrets regenerated: 19 KV paths, ~75 K8s Secrets across 8 namespaces
- Manual secrets requiring restore: 3 (Sol☀️'s matrix-access-token, matrix-device-id, mistral-api-key)
The Autopsy 🔍
Q: How was this caught?
During routine sunbeam seed testing. The command said "no root token" and we went "excuse me?"
Q: Why didn't we catch it sooner?
No monitoring on the openbao-keys Secret. The vault pod hadn't restarted, so everything kept working on cached credentials. Silent and deadly.
Q: What was the single point of failure?
The root token and unseal key existed in exactly one place — a K8s Secret — with no local backup, no external copy, and no integrity monitoring. One location. One overwrite. One loss.
Q: Was any data exposed?
No. This was about losing access, not unauthorized access. The vault stayed unsealed in memory, its credentials remained valid, and nothing leaked.
5 Whys (or: How We Got Here)
Why was the root token lost?
The openbao-keys K8s Secret was overwritten with empty data.
Why was it overwritten?
A manifest called openbao-keys-placeholder.yaml was in sbbb/base/data/kustomization.yaml and got applied during sunbeam apply data, replacing the real Secret with an empty one.
Why was there a placeholder in the manifests?
It was added so the auto-unseal sidecar's volume mount would succeed even before the first sunbeam seed run. The assumption was that server-side apply with no data field would leave existing data alone. That assumption was wrong. 💅
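A toy illustration of that bad assumption (deliberately simplified Python, not real Kubernetes apply code; real server-side apply tracks per-field ownership, but for our omitted data field the observable effect matched "replace", not "merge"):

```python
# Toy model of the wrong assumption. Field values are fake demo data.
def merge_apply(existing: dict, manifest: dict) -> dict:
    """What we assumed: keys the manifest omits survive."""
    merged = dict(existing)
    merged.update(manifest)
    return merged

def replace_apply(existing: dict, manifest: dict) -> dict:
    """What we effectively got: keys the manifest omits are dropped."""
    return dict(manifest)

live = {"data": {"key": "s3cr3t-unseal", "root-token": "s3cr3t-root"}}
placeholder = {}  # openbao-keys-placeholder.yaml carried no data field

assert merge_apply(live, placeholder).get("data")       # assumption: data survives
assert "data" not in replace_apply(live, placeholder)   # reality: data wiped
```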
Why was there no backup?
The CLI stored keys exclusively in the K8s Secret. No local backup, no external copy, no validation on subsequent runs.
Why was there no monitoring?
Vault key integrity wasn't part of the observability setup. We were watching service health, not infrastructure credential integrity. Lesson learned.
What We Fixed
| # | What | Severity | Status | Notes |
|---|---|---|---|---|
| 1 | Encrypted local vault keystore (`vault_keystore.rs`) | Critical | Done ✅ | AES-256-GCM, Argon2id KDF, 26 unit tests. Keys at `~/.local/share/sunbeam/vault/{domain}.enc`. |
| 2 | Keystore wired into seed flow | Critical | Done ✅ | Save after init, load as fallback, backfill from cluster, restore from local. |
| 3 | `vault reinit`/`keys`/`export-keys` CLI commands | Critical | Done ✅ | Recovery, inspection, and migration tools. |
| 4 | Removed `openbao-keys-placeholder.yaml` from manifests | Critical | Done ✅ | Eliminated the overwrite vector. Auto-unseal volume mount now has `optional: true`. |
| 5 | `sunbeam.dev/managed-by: sunbeam` label on programmatic secrets | High | Done ✅ | Prevents future manifest overwrites of seed-managed secrets. |
| 6 | `kv_put` fallback when `kv_patch` returns 404 | Medium | Done ✅ | Handles fresh vault initialization where KV paths don't exist yet. |
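Fix #6 is small enough to sketch here. The client interface below is hypothetical (the real code lives in the Rust CLI), but the shape of the fallback is this:

```python
# Sketch of fix #6: fall back to kv_put when kv_patch 404s on a fresh vault.
# The client interface is made up for illustration.
class NotFound(Exception):
    pass

def write_kv(client, path: str, data: dict) -> str:
    try:
        client.kv_patch(path, data)   # normal path: merge into an existing secret
        return "patched"
    except NotFound:
        client.kv_put(path, data)     # fresh vault: the KV path doesn't exist yet
        return "put"

class FreshVault:
    """Fake client: every path starts out missing, as after a reinit."""
    def __init__(self):
        self.store = {}
    def kv_patch(self, path, data):
        if path not in self.store:
            raise NotFound(path)
        self.store[path].update(data)
    def kv_put(self, path, data):
        self.store[path] = dict(data)
```

First write to a fresh path takes the `kv_put` branch; every later write patches.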
Monitoring We Need (So This Never Happens Again)
Vault Seal Status Alert
If OpenBao reports sealed: true for more than 60 seconds, something is very wrong. The auto-unseal sidecar should handle it in seconds — if it doesn't, the key is missing or corrupt.
- How: PrometheusRule against `/v1/sys/health` (returns 503 when sealed). `for: 1m`, severity: critical.
- Runbook: Check the `openbao-keys` Secret for the `key` field. If empty, restore from local keystore via `sunbeam vault keys` / `sunbeam seed`.
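The alert condition itself is simple. A hedged Python sketch (assuming a 15-second scrape interval, mirroring a PrometheusRule with `for: 1m`):

```python
from typing import List, Optional

def is_sealed(status_code: Optional[int]) -> bool:
    """/v1/sys/health returns 503 when sealed; treat an unreachable
    endpoint (None) as sealed for alerting purposes."""
    return status_code is None or status_code == 503

def should_alert(samples: List[Optional[int]],
                 for_seconds: int = 60, interval: int = 15) -> bool:
    """Fire only if every sample in the `for:` window says sealed,
    so a single flapped scrape doesn't page anyone."""
    window = for_seconds // interval
    recent = samples[-window:]
    return len(recent) == window and all(is_sealed(s) for s in recent)
```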
Vault Key Secret Integrity Alert
This is the exact failure mode that bit us — alert when openbao-keys has zero data fields or is missing key/root-token.
- How: CronJob or Alloy scraper checking the Secret contents. Alert if empty.
- Runbook: Run `sunbeam seed` (restores from local keystore). If no local keystore exists either, escalate immediately.
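The check the scraper needs is tiny. A sketch, using the two field names from the `openbao-keys` Secret:

```python
# A Secret is "broken" if its data map is empty or missing either
# key-material field. This is exactly the state that bit us.
REQUIRED = ("key", "root-token")

def secret_is_broken(secret: dict) -> bool:
    data = secret.get("data") or {}
    return any(not data.get(field) for field in REQUIRED)
```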
Local Keystore Sync Check
On every sunbeam seed and sunbeam vault status, verify local keystore matches cluster state.
- How: Already implemented in `verify_vault_keys()` and the seed flow's backfill logic. Emits warnings on mismatch.
- Runbook: Determine which copy is authoritative (usually local; the cluster copy may have been overwritten) and re-sync.
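A sketch of what that comparison looks like (illustrative Python; the real implementation is `verify_vault_keys()` in the Rust CLI):

```python
from typing import List

def check_sync(local: dict, cluster: dict) -> List[str]:
    """Compare the local keystore against the cluster Secret and report
    every field that is missing on one side or differs between them."""
    warnings = []
    for field in ("key", "root-token"):
        l, c = local.get(field), cluster.get(field)
        if l and not c:
            warnings.append(f"{field}: present locally, missing in cluster")
        elif c and not l:
            warnings.append(f"{field}: present in cluster, missing locally")
        elif l != c:
            warnings.append(f"{field}: local and cluster copies differ")
    return warnings
```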
VSO Secret Sync Failure Alert
If Vault Secrets Operator can't sync for more than 5 minutes, app secrets go stale.
- How: PrometheusRule on `vso_secret_sync_errors_total` or `vso_secret_last_sync_timestamp` exceeding a threshold.
- Runbook: Check VSO logs. Verify vault is unsealed and the `vso-reader` policy/role exists.
Pod Restart After Credential Rotation
After sunbeam seed, warn if any deployment hasn't been restarted within 10 minutes.
- How: Compare seed completion timestamp with deployment restart annotations. Could be a post-seed CLI check.
- Runbook: `sunbeam k8s rollout restart -n <namespace> deployment/<name>` for each stale deployment.
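A sketch of the post-seed check (timestamps as Unix seconds; the deployment names in the test are made up for illustration, not our actual inventory):

```python
from typing import Dict, List, Optional

def stale_deployments(seed_done_at: float,
                      restarts: Dict[str, Optional[float]],
                      now: float, window: float = 600.0) -> List[str]:
    """restarts maps deployment name -> last restart time (None = never).
    Anything restarted before seed completion is still running on old
    credentials; we only start nagging after the 10-minute grace window."""
    if now - seed_done_at < window:
        return []
    return [name for name, restarted_at in restarts.items()
            if restarted_at is None or restarted_at < seed_done_at]
```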
Node Memory Pressure
During this investigation, we found the node at 95% memory (Longhorn instance-manager leaked 38GB), which crashed PostgreSQL. Memory pressure can cascade into vault pod restarts, triggering the sealed-vault failure mode.
- How: PrometheusRule: warning at 85% used, critical at 95%.
- Runbook: `sunbeam k8s top pods -A --sort-by=memory`. Restart the Longhorn instance-manager if it's above 10GB.
Related
- Security audit: Conducted 2026-03-22, identified 22 findings across proxy, identity, monitoring, storage, matrix, and Kubernetes layers.
- Local keystore: `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc`
- Backup location: `/tmp/sunbeam-secrets-backup/` — plaintext backup taken before reinit (temporary — move to secure storage)
- Longhorn memory leak: COE-2026-002 (pending) — the 38GB leak that added extra spice to this day
Lessons: back up your keys, monitor your secrets, don't trust placeholder YAMLs, and always have wine on hand for incident response. 🍷✨