sbbb/docs/ops/COE-2026-001-vault-root-token-loss.md
Commit fe6634e9c2 (Sienna Meridian Satterwhite, 2026-03-23 13:43:51 +00:00): docs: COE-2026-001 vault root token loss postmortem

COE-2026-001: OpenBao Vault Root Token and Unseal Key Loss

Date: 2026-03-23
Severity: Critical
Author: Sienna Satterwhite
Status: Resolved


Summary

On 2026-03-23, during routine CLI development and infrastructure testing, the OpenBao (Vault) root token and unseal key were discovered to be missing from the openbao-keys Kubernetes Secret in the data namespace. The secret had been overwritten with empty data by a placeholder manifest (openbao-keys-placeholder.yaml) during a previous sunbeam apply data operation. The root token — the sole administrative credential for the vault — was permanently lost, with no local backup or external copy.

The vault remained operational only because the openbao-0 pod had not restarted since the secret was wiped (the process held the unseal state in memory). A pod restart would have sealed the vault permanently with no way to unseal or authenticate, causing a total platform outage affecting all services that depend on vault-managed secrets (Hydra, Kratos, Gitea, all La Suite services, Matrix/Tuwunel, LiveKit, monitoring).

The incident was resolved by re-initializing the vault with new keys, implementing an encrypted local keystore to prevent future key loss, and re-seeding all service credentials.

Duration of exposure: Unknown — estimated days to weeks. The placeholder overwrite likely occurred during a prior sunbeam apply data run.

Duration of resolution: ~2 hours (keystore implementation, reinit, reseed, service restart).


Impact

  • Direct impact: Root token permanently lost. No ability to write new vault secrets, rotate credentials, or configure new vault policies.
  • Blast radius (if pod had restarted): Total platform outage — all services using vault-managed secrets (SSO, identity, git hosting, file storage, messaging, video conferencing, calendars, email, monitoring) would have lost access to their credentials.
  • Actual user impact: None during the exposure window (vault pod did not restart). During resolution, all users were logged out of SSO and had to re-authenticate (~5 minutes of login disruption).
  • Data loss: Zero. All application data (PostgreSQL databases, Matrix messages, Git repositories, S3 files, OpenSearch indices) was unaffected. Only vault KV secrets were regenerated.

Timeline

All times UTC.

  • Unknown (days prior): sunbeam apply data runs, applying openbao-keys-placeholder.yaml, which overwrites the openbao-keys Secret with empty data.
  • Unknown: The auto-unseal sidecar's volume mount refreshes from the now-empty Secret. The key file disappears from /openbao/unseal/.
  • Unknown: The vault remains unsealed because the openbao-0 process has not restarted; seal state is held in memory.
  • 2026-03-23 ~11:30: During CLI testing, sunbeam seed reports "No root token available — skipping KV seeding." Investigation begins.
  • 2026-03-23 ~11:40: sunbeam k8s get secret openbao-keys -n data -o yaml reveals the Secret exists but has zero data fields.
  • 2026-03-23 ~11:45: sunbeam k8s exec -n data openbao-0 -- bao status confirms the vault is initialized and unsealed (in memory).
  • 2026-03-23 ~11:50: A search of local files, Claude Code transcripts, and shell history finds no copy of the root token or unseal key. Keys are confirmed permanently lost.
  • 2026-03-23 ~12:00: Decision made to implement a local encrypted keystore before reinitializing, to prevent recurrence.
  • 2026-03-23 ~12:30: vault_keystore.rs module implemented: AES-256-GCM encryption with Argon2id KDF, 26 unit tests passing.
  • 2026-03-23 ~13:00: Keystore wired into the seed flow; vault reinit/keys/export-keys CLI commands added; placeholder YAML removed from infra manifests.
  • 2026-03-23 ~13:10: All secrets from all namespaces backed up to /tmp/sunbeam-secrets-backup/ (75 files, 304K).
  • 2026-03-23 13:12: sunbeam vault reinit executed; vault storage wiped, new root token and unseal key generated.
  • 2026-03-23 13:13: New keys saved to the local encrypted keystore at ~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc.
  • 2026-03-23 13:14: sunbeam seed completes; all 19 KV paths written, database engine configured, K8s Secrets created, policies set.
  • 2026-03-23 13:15: Sol's manual secrets (matrix-access-token, matrix-device-id, mistral-api-key) restored from backup.
  • 2026-03-23 13:20: All service deployments restarted across the ory, devtools, lasuite, matrix, media, monitoring namespaces.
  • 2026-03-23 13:25: All critical services confirmed running (Hydra, Kratos, Gitea, Drive, Tuwunel, Sol). Platform operational.

Metrics

  • Time to detect: Unknown (days to weeks — the empty secret was not monitored).
  • Time to resolve (from detection): ~2 hours.
  • Services affected during resolution: All (brief SSO session invalidation).
  • Data loss: None.
  • Secrets regenerated: 19 KV paths, ~75 K8s Secrets across 8 namespaces.
  • Manual secrets requiring restore: 3 (Sol's matrix-access-token, matrix-device-id, mistral-api-key).

Incident Questions

Q: How was the incident detected? A: During routine sunbeam seed testing, the command reported "No root token available." Manual inspection revealed the empty K8s Secret.

Q: Why wasn't this detected earlier? A: No monitoring or alerting on the openbao-keys Secret contents. The vault pod hadn't restarted, so all services continued operating normally on cached credentials.

Q: What was the single point of failure? A: The root token and unseal key were stored in exactly one location — a K8s Secret — with no local backup, no external copy, and no integrity monitoring.

Q: Was any data exposed? A: No. The vault was still unsealed in memory with valid credentials. The risk was total loss of access, not unauthorized access.


5 Whys

Why was the root token lost? The openbao-keys K8s Secret was overwritten with empty data.

Why was it overwritten? The infrastructure manifest openbao-keys-placeholder.yaml was included in sbbb/base/data/kustomization.yaml and applied during sunbeam apply data, replacing the populated Secret.

Why was a placeholder in the manifests? It was added to ensure the auto-unseal sidecar's volume mount would succeed even before the first sunbeam seed run. The intention was that server-side apply with no data field would leave existing data untouched, but this assumption was incorrect.
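To make the failed assumption in the third why concrete, here is a hypothetical reconstruction of the placeholder (the real file was removed during remediation, so the exact contents are an assumption). Whether an apply prunes fields that are absent from the manifest depends on the apply mode and field managers involved, but in this incident the observed result was an empty Secret:

```yaml
# Hypothetical reconstruction of openbao-keys-placeholder.yaml.
# The assumption was that applying this with no `data:` field would leave
# the populated Secret's data untouched; in practice the apply left the
# Secret with zero data fields.
apiVersion: v1
kind: Secret
metadata:
  name: openbao-keys
  namespace: data
type: Opaque
# no `data:` field here -- this manifest replaced the populated Secret
```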

Why was there no backup of the keys? The CLI's seed flow stored keys exclusively in the K8s Secret. There was no local encrypted backup, no external backup, and no validation check on subsequent operations.

Why was there no monitoring for this? Vault key integrity was not considered in the operational monitoring setup. The platform's observability focused on service health, not infrastructure credential integrity.


Action Items

  1. Implement encrypted local vault keystore (Critical, Done). vault_keystore.rs: AES-256-GCM, Argon2id KDF, 26 unit tests. Keys stored at ~/.local/share/sunbeam/vault/{domain}.enc.
  2. Wire keystore into seed flow (Critical, Done). Save after init, load as fallback, backfill from cluster, restore the K8s Secret from local.
  3. Add vault reinit/keys/export-keys CLI commands (Critical, Done). Recovery, inspection, and migration tools.
  4. Remove openbao-keys-placeholder.yaml from infra manifests (Critical, Done). Eliminates the overwrite vector. The auto-unseal volume mount has optional: true.
  5. Add sunbeam.dev/managed-by: sunbeam label to programmatic secrets (High, Done). Prevents future manifest overwrites of seed-managed secrets.
  6. Use kv_put fallback when kv_patch returns 404 (Medium, Done). Handles fresh vault initialization where KV paths don't exist yet.
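The fallback in item 6 can be sketched as a small decision helper. The function name and shape below are illustrative only (the real logic lives in the Rust SDK); the 404 behavior is KV v2's response to a PATCH on a path that has never been written:

```shell
# Illustrative sketch of the kv_patch -> kv_put fallback (action item 6).
# A fresh vault after `sunbeam vault reinit` has no KV paths yet, so a
# PATCH (merge) returns 404 and the write must fall back to PUT (create).
kv_write_method() {
  local patch_status="$1"   # HTTP status returned by the PATCH attempt
  if [ "$patch_status" = "404" ]; then
    echo "put"              # path missing: create it
  else
    echo "patch"            # path exists: merge fields
  fi
}
```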

Operational Monitoring and Alerting Requirements

The following monitoring and alerting must be implemented to detect similar incidents before they become critical.

Vault Seal Status Alert

What: Alert when OpenBao reports sealed: true for more than 60 seconds.
Why: A sealed vault means no service can read secrets. The auto-unseal sidecar should unseal within seconds; if it doesn't, the unseal key is missing or corrupt.
How: Prometheus check against the OpenBao health endpoint (/v1/sys/health returns HTTP 503 when sealed). PrometheusRule with for: 1m and severity: critical.
Runbook: Check the openbao-keys Secret for the key field. If empty, restore from the local keystore via sunbeam vault keys / sunbeam seed.
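A sketch of the rule, assuming a blackbox-exporter-style probe job named openbao-health scrapes /v1/sys/health (the job name and probe_success metric are assumptions about the monitoring stack, not existing config):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-seal-status
  namespace: monitoring
spec:
  groups:
    - name: openbao
      rules:
        - alert: OpenBaoSealed
          # probe_success comes from a blackbox exporter probing /v1/sys/health,
          # which returns HTTP 503 while the vault is sealed
          expr: probe_success{job="openbao-health"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: OpenBao has reported sealed for more than 1 minute
            runbook: Check the openbao-keys Secret; if empty, restore via sunbeam vault keys / sunbeam seed
```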

Vault Key Secret Integrity Alert

What: Alert when the openbao-keys Secret in the data namespace has zero data fields or is missing the key or root-token fields.
Why: This is the exact failure mode that caused this incident; the secret was silently overwritten with empty data.
How: A CronJob or Alloy scraper that periodically checks kubectl get secret openbao-keys -n data -o jsonpath='{.data.key}' and alerts if empty. Alternatively, a custom Prometheus exporter or a sunbeam check probe.
Runbook: If empty, run sunbeam seed, which restores from the local keystore. If sunbeam vault keys shows no local keystore either, escalate immediately; this is the scenario we just recovered from.
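The CronJob probe body can be sketched in a few lines of shell. check_unseal_key is a hypothetical helper, not existing code; the caller would feed it the jsonpath output from the kubectl command above:

```shell
# Sketch of the integrity probe. The caller would supply:
#   key_b64="$(kubectl get secret openbao-keys -n data -o jsonpath='{.data.key}')"
# An empty string means the field is missing or was wiped -- the incident's
# exact failure mode.
check_unseal_key() {
  local key_b64="$1"
  if [ -z "$key_b64" ]; then
    echo "ALERT: openbao-keys has no 'key' field"
    return 1
  fi
  echo "OK: unseal key present"
}
```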

Local Keystore Sync Check

What: On every sunbeam seed and sunbeam vault status invocation, verify that the local keystore matches cluster state.
Why: Drift between local and cluster keys means one copy may be stale or corrupt.
How: Already implemented in the verify_vault_keys() function and the seed flow's backfill/restore logic. Emits warnings on mismatch.
Runbook: If they differ, determine which copy is authoritative (usually local; the cluster copy may have been overwritten) and re-sync.
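The drift check in verify_vault_keys() amounts to comparing the two copies without logging the raw values. A rough shell equivalent (the function name and fingerprint scheme here are illustrative, not the actual Rust implementation):

```shell
# Illustrative drift check: compare short fingerprints of the local keystore's
# root token ($1) and the cluster copy ($2), never the raw secrets themselves.
keys_in_sync() {
  local local_fp cluster_fp
  local_fp=$(printf '%s' "$1" | sha256sum | cut -c1-12)
  cluster_fp=$(printf '%s' "$2" | sha256sum | cut -c1-12)
  [ "$local_fp" = "$cluster_fp" ]
}
```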

VSO Secret Sync Failure Alert

What: Alert when Vault Secrets Operator fails to sync secrets for more than 5 minutes.
Why: VSO sync failures mean application K8s Secrets go stale. After credential rotation (like this reinit), services won't pick up new creds.
How: VSO exposes Prometheus metrics. PrometheusRule on vso_secret_sync_errors_total increasing, or vso_secret_last_sync_timestamp older than a threshold.
Runbook: Check VSO logs (sunbeam k8s logs -n vault-secrets-operator deploy/vault-secrets-operator-controller-manager). Verify the vault is unsealed and the vso-reader policy/role exists.
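A sketch of the rule body; the metric name should be confirmed against the metrics actually exposed by the deployed VSO version before this is committed:

```yaml
# Sketch only: verify the metric name against the VSO version in use.
- alert: VSOSecretSyncFailing
  expr: increase(vso_secret_sync_errors_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Vault Secrets Operator has been failing to sync secrets for 5+ minutes
```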

Pod Restart After Credential Rotation Alert

What: After sunbeam seed completes, warn if any deployment hasn't been restarted within 10 minutes.
Why: Services running on old credentials will fail when the old K8s Secrets are overwritten by VSO sync.
How: Compare the sunbeam seed completion timestamp with each deployment's spec.template.metadata.annotations.restartedAt. Can be a post-seed check in the CLI itself.
Runbook: Run sunbeam k8s rollout restart -n <namespace> deployment/<name> for each stale deployment.
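The post-seed comparison reduces to a timestamp check per deployment. A hypothetical helper (names and the epoch-seconds convention are assumptions for illustration; a deployment with no restartedAt annotation counts as stale):

```shell
# Hypothetical post-seed staleness check: compare epoch-second timestamps of
# seed completion ($1) and a deployment's restartedAt annotation ($2).
stale_after_seed() {
  local seed_done="$1" restarted_at="$2"
  if [ -z "$restarted_at" ] || [ "$restarted_at" -lt "$seed_done" ]; then
    echo "stale"    # never restarted, or restarted before the seed finished
  else
    echo "fresh"
  fi
}
```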

Node Memory Pressure Alert

What: Alert when node memory usage exceeds 85%.
Why: During this incident's investigation, the node was found at 95% memory (the Longhorn instance-manager had leaked 38GB), which caused PostgreSQL to crash. Memory pressure can cascade into vault pod restarts, triggering the sealed-vault failure mode.
How: PrometheusRule on node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15 for 5 minutes. Severity: warning at 85%, critical at 95%.
Runbook: Check sunbeam k8s top pods -A --sort-by=memory for the largest consumers. Restart the Longhorn instance-manager if it is above 10GB. Scale down non-critical workloads if needed.
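The two-tier threshold can be expressed as two rules using the expression from the text (< 0.15 available corresponds to > 85% used, < 0.05 to > 95%); this is a sketch of the rule bodies, not deployed config:

```yaml
# Sketch: warning at 85% used, critical at 95% used.
- alert: NodeMemoryHigh
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
  for: 5m
  labels:
    severity: warning
- alert: NodeMemoryCritical
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05
  for: 5m
  labels:
    severity: critical
```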


References

  • Plan file: ~/.claude/plans/delightful-meandering-thacker.md — full security hardening plan (22 steps).
  • Security audit: Conducted 2026-03-22, identified 22 findings across proxy, identity, monitoring, storage, matrix, and Kubernetes layers.
  • Backup location: /tmp/sunbeam-secrets-backup/ — plaintext backup of all 75 K8s Secrets taken before reinit (temporary — move to secure storage).
  • Local keystore: ~/Library/Application Support/sunbeam/vault/sunbeam.pt.enc — encrypted vault keys.
  • Commits: sunbeam-sdk vault_keystore module, seeding integration, CLI commands, kube label, infra placeholder removal.
  • Longhorn memory leak: COE-2026-002 (pending) — Longhorn instance-manager consumed 38GB causing PostgreSQL crash during the same session.