docs: restyle COE-2026-001 with boujee tone

Same technical rigor, more personality. Timeline reads like a story,
5 Whys have flair, closes with wine. 🍷
This commit is contained in:
2026-03-24 11:48:19 +00:00
parent ceb038382f
commit 6ad3fdeac9

# COE-2026-001: The One Where We Almost Lost the Keys to the House 🔑

**Date:** 2026-03-23
**Severity:** Critical (like, "one pod restart away from total blackout" critical)
**Author:** Sienna Satterwhite
**Status:** Resolved

---

## What Happened

On 2026-03-23, during routine CLI development, we discovered that the OpenBao (Vault) root token and unseal key had vanished. Gone. Poof. The `openbao-keys` Kubernetes Secret in the `data` namespace had been silently overwritten with empty data by a placeholder manifest during a previous `sunbeam apply data` run.

Here's the terrifying part: the vault was still working — but only because the `openbao-0` pod hadn't restarted. The unseal state was living in memory like a ghost. One pod restart, one node reboot, one sneeze from Kubernetes, and the vault would have sealed itself permanently with no way back in. That would have taken down *everything* — SSO, identity, Git, Drive, Messages, Meet, Calendar, Sol☀️, monitoring. The whole house. 💀

We fixed it in ~2 hours: implemented an encrypted local keystore (so this can never happen again), re-initialized the vault with fresh keys, and re-seeded every credential in the system. Nobody lost any data. The platform had about 5 minutes of "please log in again" disruption while SSO sessions refreshed.

**How long were we exposed?** Days to weeks. We don't know exactly when the placeholder overwrote the secret. That's the scary part — and exactly why we're writing this up.

---

## The Damage Report

- **Direct impact:** Root token permanently gone. No way to write new secrets, rotate credentials, or configure vault policies.
- **What would've happened if the pod restarted:** Total platform outage. Every service that reads from Vault (which is all of them) would have lost access to their credentials. Full blackout.
- **What actually happened to users:** Brief SSO session invalidation during the fix. Everyone had to log in again. 5 minutes of mild inconvenience.
- **Data loss:** Zero. All application data — databases, messages, repos, files, search indices — completely untouched. Only vault secrets were regenerated.

---
## Timeline

All times UTC. Grab a drink, this is a ride. 🥂

| Time | What went down |
|------|----------------|
| Unknown (days prior) | `sunbeam apply data` runs and applies `openbao-keys-placeholder.yaml`, which quietly overwrites the `openbao-keys` Secret with... nothing. Empty. Blank. |
| Unknown | The auto-unseal sidecar's volume mount refreshes. The key file disappears from `/openbao/unseal/`. |
| Unknown | Vault stays unsealed because the pod hasn't restarted — seal state is held in memory. The house of cards stands. |
| ~11:30 | During CLI testing, `sunbeam seed` says "No root token available — skipping KV seeding." Sienna's eyebrows go up. |
| ~11:40 | `sunbeam k8s get secret openbao-keys -n data -o yaml` — the Secret exists but has zero data fields. The keys are just... gone. |
| ~11:45 | `sunbeam k8s exec -n data openbao-0 -- bao status` confirms vault is initialized and unsealed (in memory). One restart away from disaster. |
| ~11:50 | Frantic search through local files, Claude Code transcripts, shell history. No copy of the root token anywhere. Keys are confirmed permanently lost. |
| ~12:00 | Deep breath. Decision: build a proper encrypted keystore *before* reinitializing, so this literally cannot happen again. |
| ~12:30 | `vault_keystore.rs` implemented — AES-256-GCM encryption, Argon2id KDF, 26 unit tests passing. Built under pressure, built right. |
| ~13:00 | Keystore wired into seed flow. `vault reinit/keys/export-keys` CLI commands added. Placeholder YAML deleted from infra manifests forever. |
| 13:10 | All secrets from all namespaces backed up to `/tmp/sunbeam-secrets-backup/` (75 files, 304K). Belt and suspenders. |
| 13:12 | `sunbeam vault reinit` — vault storage wiped, new root token and unseal key generated. The moment of truth. |
| 13:13 | New keys saved to local encrypted keystore at `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc`. Never again. |
| 13:14 | `sunbeam seed` completes — all 19 KV paths written, database engine configured, K8s Secrets created, policies set. |
| 13:15 | Sol☀️'s manual secrets (matrix-access-token, matrix-device-id, mistral-api-key) restored from backup. |
| 13:20 | All service deployments restarted across ory, devtools, lasuite, matrix, media, monitoring namespaces. |
| 13:25 | All critical services confirmed running. Platform operational. Sienna pours a glass of wine. 🍷 |
---

## By the Numbers

- **Time to detect:** Unknown (days to weeks — the empty secret had no monitoring)
- **Time to resolve (from detection):** ~2 hours
- **Services affected during resolution:** All (brief SSO session invalidation)
- **Data loss:** None
- **Secrets regenerated:** 19 KV paths, ~75 K8s Secrets across 8 namespaces
- **Manual secrets requiring restore:** 3 (Sol☀️'s matrix-access-token, matrix-device-id, mistral-api-key)

---

## The Autopsy 🔍

**Q: How was this caught?**
During routine `sunbeam seed` testing. The command said "no root token" and we went "excuse me?"

**Q: Why didn't we catch it sooner?**
No monitoring on the `openbao-keys` Secret. The vault pod hadn't restarted, so everything kept working on cached credentials. Silent and deadly.

**Q: What was the single point of failure?**
The root token and unseal key existed in exactly one place — a K8s Secret — with no local backup, no external copy, and no integrity monitoring. One location. One overwrite. One loss.

**Q: Was any data exposed?**
No. This was about losing access, not unauthorized access. The vault was still unsealed (in memory) with valid credentials.
---

## 5 Whys (or: How We Got Here)

**Why was the root token lost?**
The `openbao-keys` K8s Secret was overwritten with empty data.

**Why was it overwritten?**
A manifest called `openbao-keys-placeholder.yaml` was in `sbbb/base/data/kustomization.yaml` and got applied during `sunbeam apply data`, replacing the real Secret with an empty one.

**Why was there a placeholder in the manifests?**
It was added so the auto-unseal sidecar's volume mount would succeed even before the first `sunbeam seed` run. The assumption was that server-side apply with no `data` field would leave existing data alone. That assumption was wrong. 💅

**Why was there no backup?**
The CLI stored keys exclusively in the K8s Secret. No local backup, no external copy, no validation on subsequent runs.

**Why was there no monitoring?**
Vault key integrity wasn't part of the observability setup. We were watching service health, not infrastructure credential integrity. Lesson learned.
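The wrong assumption is easy to model. A minimal sketch (plain Python, illustration only — not real Kubernetes apply semantics, and `apply_secret_manifest` is a name we made up): a manifest that carries an explicit empty `data` map becomes the new desired state for that field, rather than leaving the populated map alone.

```python
def apply_secret_manifest(existing, manifest):
    """Simplified model of the overwrite: the applied manifest's `data`
    map (here, an explicit empty one) replaces the existing map."""
    result = dict(existing)
    result["data"] = manifest.get("data", {})
    return result

populated = {"metadata": {"name": "openbao-keys"},
             "data": {"key": "b64-unseal-key", "root-token": "b64-token"}}
placeholder = {"metadata": {"name": "openbao-keys"}, "data": {}}

print(apply_secret_manifest(populated, placeholder)["data"])  # {} -- keys gone
```

Same name, different content, no merge: that one apply was enough to wipe both keys.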
---
## What We Fixed

| # | What | Severity | Status |
|---|------|----------|--------|
| 1 | Encrypted local vault keystore (`vault_keystore.rs`) | Critical | **Done** ✅ |
| | AES-256-GCM, Argon2id KDF, 26 unit tests. Keys at `~/.local/share/sunbeam/vault/{domain}.enc`. | | |
| 2 | Keystore wired into seed flow | Critical | **Done** ✅ |
| | Save after init, load as fallback, backfill from cluster, restore from local. | | |
| 3 | `vault reinit/keys/export-keys` CLI commands | Critical | **Done** ✅ |
| | Recovery, inspection, and migration tools. | | |
| 4 | Removed `openbao-keys-placeholder.yaml` from manifests | Critical | **Done** ✅ |
| | Eliminated the overwrite vector. Auto-unseal volume mount now has `optional: true`. | | |
| 5 | `sunbeam.dev/managed-by: sunbeam` label on programmatic secrets | High | **Done** ✅ |
| | Prevents future manifest overwrites of seed-managed secrets. | | |
| 6 | `kv_put` fallback when `kv_patch` returns 404 | Medium | **Done** ✅ |
| | Handles fresh vault initialization where KV paths don't exist yet. | | |
---

## Monitoring We Need (So This Never Happens Again)
### Vault Seal Status Alert

If OpenBao reports `sealed: true` for more than 60 seconds, something is very wrong. The auto-unseal sidecar should handle it in seconds — if it doesn't, the key is missing or corrupt.

- **How:** PrometheusRule against `/v1/sys/health` (returns 503 when sealed). `for: 1m`, severity: critical.
- **Runbook:** Check the `openbao-keys` Secret for the `key` field. If empty, restore from the local keystore via `sunbeam vault keys` / `sunbeam seed`.
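The decision logic is tiny. The status-code mapping is real OpenBao/Vault behavior (`/v1/sys/health` returns 200 unsealed+active, 429 standby, 503 sealed); the function names are ours, and the sampling model is a simplified stand-in for Prometheus's `for: 1m`:

```python
def is_sealed(health_status_code):
    """/v1/sys/health returns 503 when the vault is sealed."""
    return health_status_code == 503

def should_page(sealed_samples, interval_s=15, threshold_s=60):
    """Mimics `for: 1m`: page only if every scrape in the last minute saw
    the vault sealed, giving the auto-unseal sidecar time to do its job."""
    needed = threshold_s // interval_s
    recent = sealed_samples[-needed:]
    return len(recent) >= needed and all(recent)

samples = [is_sealed(code) for code in (200, 503, 503, 503)]
print(should_page(samples))                      # False: not sealed a full minute yet
print(should_page([True, True, True, True]))     # True: page someone
```

A brief seal during a pod restart stays quiet; a minute of sealed scrapes means the auto-unseal key is gone and someone gets paged.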
### Vault Key Secret Integrity Alert

This is the exact failure mode that bit us — alert when `openbao-keys` has zero data fields or is missing `key`/`root-token`.

- **How:** CronJob or Alloy scraper checking the Secret contents. Alert if empty.
- **Runbook:** Run `sunbeam seed` (restores from the local keystore). If no local keystore exists either, escalate immediately.
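The check the scraper needs is one predicate over the Secret object. A sketch under stated assumptions (field names match this incident's Secret; `secret_is_healthy` is a name we made up):

```python
import base64

REQUIRED_FIELDS = ("key", "root-token")

def secret_is_healthy(secret):
    """False for the exact failure mode from this incident: a Secret
    whose data map is empty or missing a required, non-empty field."""
    data = secret.get("data") or {}
    return all(data.get(field) for field in REQUIRED_FIELDS)

# What the wiped Secret looked like vs. a healthy one:
wiped = {"metadata": {"name": "openbao-keys"}, "data": {}}
healthy = {"metadata": {"name": "openbao-keys"},
           "data": {"key": base64.b64encode(b"unseal-key").decode(),
                    "root-token": base64.b64encode(b"hvs.example").decode()}}

print(secret_is_healthy(wiped))    # False -> fire the alert
print(secret_is_healthy(healthy))  # True
```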
### Local Keystore Sync Check

On every `sunbeam seed` and `sunbeam vault status`, verify the local keystore matches cluster state.

- **How:** Already implemented in `verify_vault_keys()` and the seed flow's backfill logic. Emits warnings on mismatch.
- **Runbook:** Determine which copy is authoritative (usually local — the cluster may have been overwritten) and re-sync.
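The real check lives in `verify_vault_keys()` in the Rust CLI; as a rough Python sketch of the idea (our names, our structure), compare digests rather than raw key material so the result can be logged safely:

```python
from __future__ import annotations
import hashlib

def fingerprint(material: bytes) -> str:
    """Short digest of key material, safe to print in warnings."""
    return hashlib.sha256(material).hexdigest()[:16]

def keystore_in_sync(local_key: bytes, cluster_key: bytes | None) -> tuple[bool, str]:
    """Compare the local keystore copy against the cluster Secret copy."""
    if not cluster_key:
        return False, "cluster secret empty -- restore from local keystore"
    if fingerprint(local_key) != fingerprint(cluster_key):
        return False, "mismatch -- decide which copy is authoritative"
    return True, "in sync"

ok, msg = keystore_in_sync(b"unseal-key", b"")
print(ok, msg)  # False cluster secret empty -- restore from local keystore
```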
### VSO Secret Sync Failure Alert

If Vault Secrets Operator can't sync for more than 5 minutes, app secrets go stale.

- **How:** PrometheusRule on `vso_secret_sync_errors_total` or `vso_secret_last_sync_timestamp` exceeding a threshold.
- **Runbook:** Check the VSO logs. Verify the vault is unsealed and the `vso-reader` policy/role exists.
### Pod Restart After Credential Rotation

After `sunbeam seed`, warn if any deployment hasn't been restarted within 10 minutes.

- **How:** Compare the seed completion timestamp with deployment restart annotations. Could be a post-seed CLI check.
- **Runbook:** `sunbeam k8s rollout restart -n <namespace> deployment/<name>` for each stale deployment.
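The staleness comparison could look roughly like this (a sketch with hypothetical names; the real check would read the `restartedAt` annotation via the Kubernetes API):

```python
from __future__ import annotations
from datetime import datetime

def stale_deployments(seed_done: datetime,
                      restarted_at: dict[str, datetime | None]) -> list[str]:
    """Deployments whose last recorded restart predates the seed run
    (or that were never restarted) still hold pre-rotation credentials."""
    return sorted(name for name, ts in restarted_at.items()
                  if ts is None or ts < seed_done)

seed = datetime(2026, 3, 23, 13, 14)
restarts = {
    "ory/hydra": datetime(2026, 3, 23, 13, 20),    # restarted after seed: fine
    "lasuite/drive": datetime(2026, 3, 23, 9, 0),  # restarted before seed: stale
    "matrix/tuwunel": None,                        # no restart annotation: stale
}
print(stale_deployments(seed, restarts))  # ['lasuite/drive', 'matrix/tuwunel']
```

Run the check 10 minutes after seed completes; anything still in the stale list gets a `rollout restart`.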
### Node Memory Pressure

During this investigation, we found the node at 95% memory (Longhorn instance-manager leaked 38GB), which crashed PostgreSQL. Memory pressure can cascade into vault pod restarts, triggering the sealed-vault failure mode.

- **How:** PrometheusRule on `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes`: warning at 85% used, critical at 95%.
- **Runbook:** `sunbeam k8s top pods -A --sort-by=memory`. Restart Longhorn instance-manager if it's above 10GB.
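The thresholds map onto the node-exporter metrics like so — a sketch of the severity logic (our function name; the metric names are standard node-exporter ones):

```python
from __future__ import annotations

def memory_severity(available_bytes: int, total_bytes: int) -> str | None:
    """Mirrors the PromQL idea: alert on the fraction of memory still
    available (MemAvailable / MemTotal), not on raw usage."""
    frac_available = available_bytes / total_bytes
    if frac_available < 0.05:   # more than 95% used
        return "critical"
    if frac_available < 0.15:   # more than 85% used
        return "warning"
    return None

GIB = 1024 ** 3
print(memory_severity(20 * GIB, 64 * GIB))  # None: plenty of headroom
print(memory_severity(3 * GIB, 64 * GIB))   # critical: the state we found the node in
```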
---

## Related

- **Security audit:** Conducted 2026-03-22, identified 22 findings across proxy, identity, monitoring, storage, matrix, and Kubernetes layers.
- **Local keystore:** `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc`
- **Backup location:** `/tmp/sunbeam-secrets-backup/` — plaintext backup taken before reinit (temporary — move to secure storage)
- **Longhorn memory leak:** COE-2026-002 (pending) — the 38GB leak that added extra spice to this day
---
*Lessons: back up your keys, monitor your secrets, don't trust placeholder YAMLs, and always have wine on hand for incident response.* 🍷✨