diff --git a/docs/ops/COE-2026-001-vault-root-token-loss.md b/docs/ops/COE-2026-001-vault-root-token-loss.md index 5b82c79..ee16c44 100644 --- a/docs/ops/COE-2026-001-vault-root-token-loss.md +++ b/docs/ops/COE-2026-001-vault-root-token-loss.md @@ -1,173 +1,165 @@ -# COE-2026-001: OpenBao Vault Root Token and Unseal Key Loss +# COE-2026-001: The One Where We Almost Lost the Keys to the House 🔑 **Date:** 2026-03-23 -**Severity:** Critical +**Severity:** Critical (like, "one pod restart away from total blackout" critical) **Author:** Sienna Satterwhite -**Status:** Resolved +**Status:** Resolved ✨ --- -## Summary +## What Happened -On 2026-03-23, during routine CLI development and infrastructure testing, the OpenBao (Vault) root token and unseal key were discovered to be missing from the `openbao-keys` Kubernetes Secret in the `data` namespace. The secret had been overwritten with empty data by a placeholder manifest (`openbao-keys-placeholder.yaml`) during a previous `sunbeam apply data` operation. The root token — the sole administrative credential for the vault — was permanently lost, with no local backup or external copy. +On 2026-03-23, during routine CLI development, we discovered that the OpenBao (Vault) root token and unseal key had vanished. Gone. Poof. The `openbao-keys` Kubernetes Secret in the `data` namespace had been silently overwritten with empty data by a placeholder manifest during a previous `sunbeam apply data` run. -The vault remained operational only because the `openbao-0` pod had not restarted since the secret was wiped (the process held the unseal state in memory). A pod restart would have sealed the vault permanently with no way to unseal or authenticate, causing a total platform outage affecting all services that depend on vault-managed secrets (Hydra, Kratos, Gitea, all La Suite services, Matrix/Tuwunel, LiveKit, monitoring). +Here's the terrifying part: the vault was still working — but only because the `openbao-0` pod hadn't restarted. 
The unseal state was living in memory like a ghost. One pod restart, one node reboot, one sneeze from Kubernetes, and the vault would have sealed itself permanently with no way back in. That would have taken down *everything* — SSO, identity, Git, Drive, Messages, Meet, Calendar, Sol☀️, monitoring. The whole house. 💀 -The incident was resolved by re-initializing the vault with new keys, implementing an encrypted local keystore to prevent future key loss, and re-seeding all service credentials. +We fixed it in ~2 hours: implemented an encrypted local keystore (so this can never happen again), re-initialized the vault with fresh keys, and re-seeded every credential in the system. Nobody lost any data. The platform had about 5 minutes of "please log in again" disruption while SSO sessions refreshed. -**Duration of exposure:** Unknown — estimated days to weeks. The placeholder overwrite likely occurred during a prior `sunbeam apply data` run. - -**Duration of resolution:** ~2 hours (keystore implementation, reinit, reseed, service restart). +**How long were we exposed?** Days to weeks. We don't know exactly when the placeholder overwrote the secret. That's the scary part — and exactly why we're writing this up. --- -## Impact +## The Damage Report -- **Direct impact:** Root token permanently lost. No ability to write new vault secrets, rotate credentials, or configure new vault policies. -- **Blast radius (if pod had restarted):** Total platform outage — all services using vault-managed secrets (SSO, identity, git hosting, file storage, messaging, video conferencing, calendars, email, monitoring) would have lost access to their credentials. -- **Actual user impact:** None during the exposure window (vault pod did not restart). During resolution, all users were logged out of SSO and had to re-authenticate (~5 minutes of login disruption). -- **Data loss:** Zero. 
All application data (PostgreSQL databases, Matrix messages, Git repositories, S3 files, OpenSearch indices) was unaffected. Only vault KV secrets were regenerated.
+- **Direct impact:** Root token permanently gone. No way to write new secrets, rotate credentials, or configure vault policies.
+- **What would've happened if the pod restarted:** Total platform outage. Every service that reads from Vault (which is all of them) would have lost access to their credentials. Full blackout.
+- **What actually happened to users:** Brief SSO session invalidation during the fix. Everyone had to log in again. 5 minutes of mild inconvenience.
+- **Data loss:** Zero. All application data — databases, messages, repos, files, search indices — completely untouched. Only vault secrets were regenerated.

---

## Timeline

-All times UTC.
+All times UTC; timestamped rows are 2026-03-23. Grab a drink, this is a ride. 🥂

-| Time | Event |
-|------|-------|
-| Unknown (days prior) | `sunbeam apply data` runs, applying `openbao-keys-placeholder.yaml` which overwrites the `openbao-keys` Secret with empty data. |
-| Unknown | Auto-unseal sidecar's volume mount refreshes from the now-empty Secret. The `key` file disappears from `/openbao/unseal/`. |
-| Unknown | The vault remains unsealed because the `openbao-0` process has not restarted — seal state is held in memory. |
-| 2026-03-23 ~11:30 | During CLI testing, `sunbeam seed` reports "No root token available — skipping KV seeding." Investigation begins. |
-| 2026-03-23 ~11:40 | `sunbeam k8s get secret openbao-keys -n data -o yaml` reveals the Secret exists but has zero data fields. |
-| 2026-03-23 ~11:45 | `sunbeam k8s exec -n data openbao-0 -- bao status` confirms vault is initialized and unsealed (in memory). |
-| 2026-03-23 ~11:50 | Search of local files, Claude Code transcripts, and shell history finds no copy of the root token or unseal key. Keys are confirmed permanently lost.
| -| 2026-03-23 ~12:00 | Decision made to implement a local encrypted keystore before reinitializing, to prevent recurrence. | -| 2026-03-23 ~12:30 | `vault_keystore.rs` module implemented — AES-256-GCM encryption with Argon2id KDF, 26 unit tests passing. | -| 2026-03-23 ~13:00 | Keystore wired into seed flow, `vault reinit/keys/export-keys` CLI commands added, placeholder YAML removed from infra manifests. | -| 2026-03-23 ~13:10 | All secrets from all namespaces backed up to `/tmp/sunbeam-secrets-backup/` (75 files, 304K). | -| 2026-03-23 13:12 | `sunbeam vault reinit` executed — vault storage wiped, new root token and unseal key generated. | -| 2026-03-23 13:13 | New keys saved to local encrypted keystore at `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc`. | -| 2026-03-23 13:14 | `sunbeam seed` completes — all 19 KV paths written, database engine configured, K8s Secrets created, policies set. | -| 2026-03-23 13:15 | Sol's manual secrets (matrix-access-token, matrix-device-id, mistral-api-key) restored from backup. | -| 2026-03-23 13:20 | All service deployments restarted across ory, devtools, lasuite, matrix, media, monitoring namespaces. | -| 2026-03-23 13:25 | All critical services confirmed running (Hydra, Kratos, Gitea, Drive, Tuwunel, Sol). Platform operational. | +| Time | What went down | +|------|---------------| +| Unknown (days prior) | `sunbeam apply data` runs and applies `openbao-keys-placeholder.yaml`, which quietly overwrites the `openbao-keys` Secret with... nothing. Empty. Blank. | +| Unknown | The auto-unseal sidecar's volume mount refreshes. The key file disappears from `/openbao/unseal/`. | +| Unknown | Vault stays unsealed because the pod hasn't restarted — seal state is held in memory. The house of cards stands. | +| ~11:30 | During CLI testing, `sunbeam seed` says "No root token available — skipping KV seeding." Sienna's eyebrows go up. 
| +| ~11:40 | `sunbeam k8s get secret openbao-keys -n data -o yaml` — the Secret exists but has zero data fields. The keys are just... gone. | +| ~11:45 | `sunbeam k8s exec -n data openbao-0 -- bao status` confirms vault is initialized and unsealed (in memory). One restart away from disaster. | +| ~11:50 | Frantic search through local files, Claude Code transcripts, shell history. No copy of the root token anywhere. Keys are confirmed permanently lost. | +| ~12:00 | Deep breath. Decision: build a proper encrypted keystore *before* reinitializing, so this literally cannot happen again. | +| ~12:30 | `vault_keystore.rs` implemented — AES-256-GCM encryption, Argon2id KDF, 26 unit tests passing. Built under pressure, built right. | +| ~13:00 | Keystore wired into seed flow. `vault reinit/keys/export-keys` CLI commands added. Placeholder YAML deleted from infra manifests forever. | +| 13:10 | All secrets from all namespaces backed up to `/tmp/sunbeam-secrets-backup/` (75 files, 304K). Belt and suspenders. | +| 13:12 | `sunbeam vault reinit` — vault storage wiped, new root token and unseal key generated. The moment of truth. | +| 13:13 | New keys saved to local encrypted keystore at `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc`. Never again. | +| 13:14 | `sunbeam seed` completes — all 19 KV paths written, database engine configured, K8s Secrets created, policies set. | +| 13:15 | Sol☀️'s manual secrets (matrix-access-token, matrix-device-id, mistral-api-key) restored from backup. | +| 13:20 | All service deployments restarted across ory, devtools, lasuite, matrix, media, monitoring namespaces. | +| 13:25 | All critical services confirmed running. Platform operational. Sienna pours a glass of wine. 🍷 | --- -## Metrics +## By the Numbers -- **Time to detect:** Unknown (days to weeks — the empty secret was not monitored). -- **Time to resolve (from detection):** ~2 hours. -- **Services affected during resolution:** All (brief SSO session invalidation). 
-- **Data loss:** None. -- **Secrets regenerated:** 19 KV paths, ~75 K8s Secrets across 8 namespaces. -- **Manual secrets requiring restore:** 3 (Sol's matrix-access-token, matrix-device-id, mistral-api-key). +- **Time to detect:** Unknown (days to weeks — the empty secret had no monitoring) +- **Time to resolve (from detection):** ~2 hours +- **Services affected during resolution:** All (brief SSO session invalidation) +- **Data loss:** None +- **Secrets regenerated:** 19 KV paths, ~75 K8s Secrets across 8 namespaces +- **Manual secrets requiring restore:** 3 (Sol☀️'s matrix-access-token, matrix-device-id, mistral-api-key) --- -## Incident Questions +## The Autopsy 🔍 -**Q: How was the incident detected?** -A: During routine `sunbeam seed` testing, the command reported "No root token available." Manual inspection revealed the empty K8s Secret. +**Q: How was this caught?** +During routine `sunbeam seed` testing. The command said "no root token" and we went "excuse me?" -**Q: Why wasn't this detected earlier?** -A: No monitoring or alerting on the `openbao-keys` Secret contents. The vault pod hadn't restarted, so all services continued operating normally on cached credentials. +**Q: Why didn't we catch it sooner?** +No monitoring on the `openbao-keys` Secret. The vault pod hadn't restarted, so everything kept working on cached credentials. Silent and deadly. **Q: What was the single point of failure?** -A: The root token and unseal key were stored in exactly one location — a K8s Secret — with no local backup, no external copy, and no integrity monitoring. +The root token and unseal key existed in exactly one place — a K8s Secret — with no local backup, no external copy, and no integrity monitoring. One location. One overwrite. One loss. **Q: Was any data exposed?** -A: No. The vault was still sealed-in-memory with valid credentials. The risk was total loss of access (not unauthorized access). +No. This was about losing access, not unauthorized access. 
The vault was still unsealed (that state held only in memory) with valid credentials.

---

-## 5 Whys
+## 5 Whys (or: How We Got Here)

**Why was the root token lost?**
The `openbao-keys` K8s Secret was overwritten with empty data.

**Why was it overwritten?**
-The infrastructure manifest `openbao-keys-placeholder.yaml` was included in `sbbb/base/data/kustomization.yaml` and applied during `sunbeam apply data`, replacing the populated Secret.
+A manifest called `openbao-keys-placeholder.yaml` was in `sbbb/base/data/kustomization.yaml` and got applied during `sunbeam apply data`, replacing the real Secret with an empty one.

-**Why was a placeholder in the manifests?**
-It was added to ensure the auto-unseal sidecar's volume mount would succeed even before the first `sunbeam seed` run. The intention was that server-side apply with no `data` field would leave existing data untouched, but this assumption was incorrect.
+**Why was there a placeholder in the manifests?**
+It was added so the auto-unseal sidecar's volume mount would succeed even before the first `sunbeam seed` run. The assumption was that server-side apply with no `data` field would leave existing data alone. That assumption was wrong. 💅

-**Why was there no backup of the keys?**
-The CLI's seed flow stored keys exclusively in the K8s Secret. There was no local encrypted backup, no external backup, and no validation check on subsequent operations.
+**Why was there no backup?**
+The CLI stored keys exclusively in the K8s Secret. No local backup, no external copy, no validation on subsequent runs.

-**Why was there no monitoring for this?**
-Vault key integrity was not considered in the operational monitoring setup. The platform's observability focused on service health, not infrastructure credential integrity.
+**Why was there no monitoring?**
+Vault key integrity wasn't part of the observability setup. We were watching service health, not infrastructure credential integrity. Lesson learned.
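+The fix for that last "why" is almost embarrassingly small. Here's a sketch of the kind of probe that would have caught the empty Secret (the `check_key` helper is illustrative, not shipped code; the Secret name, namespace, and jsonpath come straight from this report):

```shell
# Sketch of an integrity probe (hypothetical helper; not shipped code).
# Empty key material means the vault seals for good on the next pod restart.
check_key() {
  if [ -z "$1" ]; then
    echo "CRITICAL: openbao-keys Secret has no 'key' field"
    return 1
  fi
  echo "OK: unseal key present"
}

# In a real CronJob the argument would come from:
#   kubectl get secret openbao-keys -n data -o jsonpath='{.data.key}'
```

+Run something like this from a CronJob or a `sunbeam check`-style probe and page on non-zero exit.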
---

-## Action Items
+## What We Fixed

-| # | Action | Severity | Status | Notes |
-|---|--------|----------|--------|-------|
-| 1 | Implement encrypted local vault keystore | Critical | **Done** | `vault_keystore.rs` — AES-256-GCM, Argon2id KDF, 26 unit tests. Keys stored at `~/.local/share/sunbeam/vault/{domain}.enc`. |
-| 2 | Wire keystore into seed flow | Critical | **Done** | Save after init, load as fallback, backfill from cluster, restore K8s Secret from local. |
-| 3 | Add `vault reinit/keys/export-keys` CLI commands | Critical | **Done** | Recovery, inspection, and migration tools. |
-| 4 | Remove `openbao-keys-placeholder.yaml` from infra manifests | Critical | **Done** | Eliminated the overwrite vector. Auto-unseal volume mount has `optional: true`. |
-| 5 | Add `sunbeam.dev/managed-by: sunbeam` label to programmatic secrets | High | **Done** | Prevents future manifest overwrites of seed-managed secrets. |
-| 6 | Use `kv_put` fallback when `kv_patch` returns 404 | Medium | **Done** | Handles fresh vault initialization where KV paths don't exist yet. |
+| # | What | Severity | Status | Notes |
+|---|------|----------|--------|-------|
+| 1 | Encrypted local vault keystore (`vault_keystore.rs`) | Critical | **Done** ✅ | AES-256-GCM, Argon2id KDF, 26 unit tests. Keys at `~/.local/share/sunbeam/vault/{domain}.enc`. |
+| 2 | Keystore wired into seed flow | Critical | **Done** ✅ | Save after init, load as fallback, backfill from cluster, restore from local. |
+| 3 | `vault reinit/keys/export-keys` CLI commands | Critical | **Done** ✅ | Recovery, inspection, and migration tools. |
+| 4 | Removed `openbao-keys-placeholder.yaml` from manifests | Critical | **Done** ✅ | Eliminated the overwrite vector. Auto-unseal volume mount now has `optional: true`. |
+| 5 | `sunbeam.dev/managed-by: sunbeam` label on programmatic secrets | High | **Done** ✅ | Prevents future manifest overwrites of seed-managed secrets. |
+| 6 | `kv_put` fallback when `kv_patch` returns 404 | Medium | **Done** ✅ | Handles fresh vault initialization where KV paths don't exist yet. |

---

-## Operational Monitoring and Alerting Requirements
-
-The following monitoring and alerting must be implemented to detect similar incidents before they become critical.
+## Monitoring We Need (So This Never Happens Again)

### Vault Seal Status Alert
-
-**What:** Alert when OpenBao reports `sealed: true` for more than 60 seconds.
-**Why:** A sealed vault means no service can read secrets. The auto-unseal sidecar should unseal within seconds — if it doesn't, the unseal key is missing or corrupt.
-**How:** Prometheus query against the OpenBao metrics endpoint (`/v1/sys/health` returns HTTP 503 when sealed). PrometheusRule with `for: 1m` and Severity: critical.
-**Runbook:** Check `openbao-keys` Secret for the `key` field. If empty, restore from local keystore via `sunbeam vault keys` / `sunbeam seed`.
+If OpenBao reports `sealed: true` for more than 60 seconds, something is very wrong. The auto-unseal sidecar should handle it in seconds — if it doesn't, the key is missing or corrupt.
+- **How:** PrometheusRule against `/v1/sys/health` (returns 503 when sealed). `for: 1m`, severity: critical.
+- **Runbook:** Check `openbao-keys` Secret for the `key` field. If empty, restore from local keystore via `sunbeam vault keys` / `sunbeam seed`.

### Vault Key Secret Integrity Alert
-
-**What:** Alert when the `openbao-keys` Secret in the `data` namespace has zero data fields or is missing the `key` or `root-token` fields.
-**Why:** This is the exact failure mode that caused this incident — the secret was silently overwritten with empty data.
-**How:** A CronJob or Alloy scraper that periodically checks `kubectl get secret openbao-keys -n data -o jsonpath='{.data.key}'` and alerts if empty. Alternatively, a custom Prometheus exporter or a `sunbeam check` probe.
-**Runbook:** If empty, run `sunbeam seed` which restores from the local keystore. If `sunbeam vault keys` shows no local keystore either, escalate immediately — this is the scenario we just recovered from. +This is the exact failure mode that bit us — alert when `openbao-keys` has zero data fields or is missing `key`/`root-token`. +- **How:** CronJob or Alloy scraper checking the Secret contents. Alert if empty. +- **Runbook:** Run `sunbeam seed` (restores from local keystore). If no local keystore exists either, escalate immediately. ### Local Keystore Sync Check - -**What:** On every `sunbeam seed` and `sunbeam vault status` invocation, verify local keystore matches cluster state. -**Why:** Drift between local and cluster keys means one copy may be stale or corrupt. -**How:** Already implemented in the `verify_vault_keys()` function and the seed flow's backfill/restore logic. Emits warnings on mismatch. -**Runbook:** If mismatch, determine which is authoritative (usually local — cluster may have been overwritten) and re-sync. +On every `sunbeam seed` and `sunbeam vault status`, verify local keystore matches cluster state. +- **How:** Already implemented in `verify_vault_keys()` and the seed flow's backfill logic. Emits warnings on mismatch. +- **Runbook:** Determine which copy is authoritative (usually local — cluster may have been overwritten) and re-sync. ### VSO Secret Sync Failure Alert +If Vault Secrets Operator can't sync for more than 5 minutes, app secrets go stale. +- **How:** PrometheusRule on `vso_secret_sync_errors_total` or `vso_secret_last_sync_timestamp` exceeding threshold. +- **Runbook:** Check VSO logs. Verify vault is unsealed and the `vso-reader` policy/role exists. -**What:** Alert when Vault Secrets Operator fails to sync secrets for more than 5 minutes. -**Why:** VSO sync failures mean application K8s Secrets go stale. After credential rotation (like this reinit), services won't pick up new creds. -**How:** VSO exposes metrics. 
PrometheusRule on `vso_secret_sync_errors_total` increasing or `vso_secret_last_sync_timestamp` older than threshold. -**Runbook:** Check VSO logs (`sunbeam k8s logs -n vault-secrets-operator deploy/vault-secrets-operator-controller-manager`). Verify vault is unsealed and the `vso-reader` policy/role exists. +### Pod Restart After Credential Rotation +After `sunbeam seed`, warn if any deployment hasn't been restarted within 10 minutes. +- **How:** Compare seed completion timestamp with deployment restart annotations. Could be a post-seed CLI check. +- **Runbook:** `sunbeam k8s rollout restart -n deployment/` for each stale deployment. -### Pod Restart After Credential Rotation Alert - -**What:** After `sunbeam seed` completes, warn if any deployment hasn't been restarted within 10 minutes. -**Why:** Services running on old credentials will fail when the old K8s Secrets are overwritten by VSO sync. -**How:** Compare `sunbeam seed` completion timestamp with deployment `spec.template.metadata.annotations.restartedAt`. Can be a post-seed check in the CLI itself. -**Runbook:** Run `sunbeam k8s rollout restart -n deployment/` for each stale deployment. - -### Node Memory Pressure Alert (Related) - -**What:** Alert when node memory exceeds 85%. -**Why:** During this incident investigation, we discovered the node was at 95% memory (Longhorn instance-manager leaked 38GB), which caused PostgreSQL to crash. Memory pressure can cascade into vault pod restarts, triggering the sealed-vault failure mode. -**How:** PrometheusRule on `node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15` for 5 minutes. Severity: warning at 85%, critical at 95%. -**Runbook:** Check `sunbeam k8s top pods -A --sort-by=memory` for the largest consumers. Restart Longhorn instance-manager if it's above 10GB. Scale down non-critical workloads if needed. 
+### Node Memory Pressure +During this investigation, we found the node at 95% memory (Longhorn instance-manager leaked 38GB), which crashed PostgreSQL. Memory pressure can cascade into vault pod restarts, triggering the sealed-vault failure mode. +- **How:** PrometheusRule: warning at 85% used, critical at 95%. +- **Runbook:** `sunbeam k8s top pods -A --sort-by=memory`. Restart Longhorn instance-manager if above 10GB. --- -## Related Items +## Related -- **Plan file:** `~/.claude/plans/delightful-meandering-thacker.md` — full security hardening plan (22 steps). - **Security audit:** Conducted 2026-03-22, identified 22 findings across proxy, identity, monitoring, storage, matrix, and Kubernetes layers. -- **Backup location:** `/tmp/sunbeam-secrets-backup/` — plaintext backup of all 75 K8s Secrets taken before reinit (temporary — move to secure storage). -- **Local keystore:** `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc` — encrypted vault keys. -- **Commits:** `sunbeam-sdk` vault_keystore module, seeding integration, CLI commands, kube label, infra placeholder removal. -- **Longhorn memory leak:** COE-2026-002 (pending) — Longhorn instance-manager consumed 38GB causing PostgreSQL crash during the same session. +- **Local keystore:** `~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc` +- **Backup location:** `/tmp/sunbeam-secrets-backup/` — plaintext backup taken before reinit (temporary — move to secure storage) +- **Longhorn memory leak:** COE-2026-002 (pending) — the 38GB leak that added extra spice to this day + +--- + +*Lessons: back up your keys, monitor your secrets, don't trust placeholder YAMLs, and always have wine on hand for incident response.* 🍷✨
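+For the record, the seal-status and memory-pressure alerts above sketch out roughly like this as a PrometheusRule. Treat it as a starting point, not deployed config — the rule/alert names and the `probe_success` health probe are assumptions; the `node_memory_*` expression and thresholds are the ones from this incident:

```yaml
# Sketch only — names and labels are placeholders; thresholds come from this COE.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vault-and-node-alerts   # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: vault.rules
      rules:
        - alert: OpenBaoSealed
          # /v1/sys/health returns HTTP 503 while sealed; probe_success here
          # assumes a blackbox-style HTTP probe of that endpoint.
          expr: probe_success{job="openbao-health"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "OpenBao has been sealed for more than 60s"
    - name: node.rules
      rules:
        - alert: NodeMemoryPressure
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node memory above 85% used for 5 minutes"
```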