313-line walkthrough for adopting the `sunbeam.pt/*` label scheme on
existing manifests in sbbb. Documents the required labels, optional
annotations, virtual-service ConfigMap pattern, and the
multi-deployment grouping convention. Includes a complete table of
the 33 services with their target K8s resources and the values to put
on each. Teams onboarding new services can follow this without having
to read the registry source.
Adds the user-facing half of the service registry refactor.
src/service_cmds.rs (new):
- cmd_deploy: resolves a service/category/namespace target via the
registry, applies manifests for each unique namespace, then
rollout-restarts the resolved deployments.
- cmd_secrets: looks up the service's `sunbeam.pt/kv-path`
annotation, port-forwards to OpenBao, and either lists every key
in the path (with values masked) or — given `get <key>` —
prints the single field. Replaces a hand-rolled secret-fetching
flow with one that's driven by the same registry as everything else.
- cmd_shell: drops into a shell on a service's pod. Special-cases
`postgres` to spawn psql against the CNPG primary; everything else
gets `/bin/sh` via kubectl exec.
src/services.rs:
- Drop the static `SERVICES_TO_RESTART` table and the static
MANAGED_NS dependency. Both `cmd_status` and `cmd_restart` now ask
the registry. The legacy `namespace` and `namespace/name` syntaxes
still work as a fallback when the registry can't resolve the input,
so existing muscle memory keeps working during the transition.
- The two static-table tests are removed (they tested the static
tables that no longer exist); the icon helper test stays.
Together with the earlier `Verb::{Deploy,Secrets,Shell}` additions
in src/cli.rs, this completes the service-oriented command surface
for status / logs / restart / deploy / secrets / shell.
Adds `sunbeam_sdk::registry`, the discovery layer that the new
service-oriented CLI commands use to resolve names like "hydra",
"auth", or "ory" into the right Kubernetes resources.
Instead of duplicating service definitions in Rust code, the registry
queries Deployments, StatefulSets, DaemonSets, and ConfigMaps that
carry the `sunbeam.pt/service` label and reads everything else from
labels and annotations:
- sunbeam.pt/service / sunbeam.pt/category — required, the primary keys
- sunbeam.pt/display-name — human-readable label for status output
- sunbeam.pt/kv-path — OpenBao KV v2 path (for `sunbeam secrets <svc>`)
- sunbeam.pt/db-user / sunbeam.pt/db-name — CNPG postgres credentials
- sunbeam.pt/build-target — buildkit target for `sunbeam build`
- sunbeam.pt/depends-on — comma-separated dependency names
- sunbeam.pt/health-check — pod-ready / cnpg / seal-status / HTTP path
- sunbeam.pt/virtual=true — for ConfigMap-only "external" services
`ServiceRegistry::resolve(input)` does name → category → namespace
matching in that order, so `sunbeam logs hydra`, `sunbeam restart auth`,
and `sunbeam status ory` all work uniformly.
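A minimal std-only sketch of that resolution order, assuming a flattened view of the registry (the real `ServiceRegistry` works over labeled K8s resources and returns richer definitions):

```rust
// Flattened stand-in for a registry entry; the real ServiceDefinition
// carries much more (deployments, annotations, health checks, ...).
#[derive(Debug, PartialEq)]
struct Svc {
    name: &'static str,
    category: &'static str,
    namespace: &'static str,
}

// Try service name, then category, then namespace; the first tier with
// any case-insensitive hits wins, so `hydra`, `auth`, and `ory` each
// resolve to something sensible.
fn resolve<'a>(services: &'a [Svc], input: &str) -> Vec<&'a Svc> {
    let tiers: [fn(&Svc) -> &str; 3] = [|s| s.name, |s| s.category, |s| s.namespace];
    for key in tiers {
        let hits: Vec<&Svc> = services
            .iter()
            .filter(|&s| key(s).eq_ignore_ascii_case(input))
            .collect();
        if !hits.is_empty() {
            return hits;
        }
    }
    Vec::new()
}

fn main() {
    let services = [
        Svc { name: "hydra", category: "auth", namespace: "ory" },
        Svc { name: "kratos", category: "auth", namespace: "ory" },
    ];
    assert_eq!(resolve(&services, "Hydra").len(), 1); // by name, any case
    assert_eq!(resolve(&services, "auth").len(), 2);  // by category
    assert_eq!(resolve(&services, "ory").len(), 2);   // by namespace
    assert!(resolve(&services, "nope").is_empty());
}
```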
Multi-deployment services (e.g. messages-{backend,mta-in,mta-out})
share a service label and the registry merges them into a single
ServiceDefinition with multiple `deployments`.
Includes 14 unit tests covering name/category/namespace resolution,
case-insensitivity, virtual services, and the empty registry case.
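The multi-deployment merge is essentially a group-by on the service label; a sketch under that assumption (`merge_by_service` is an illustrative name):

```rust
use std::collections::BTreeMap;

// Discovered Deployments that share a `sunbeam.pt/service` label value
// collapse into one entry with several deployment names.
fn merge_by_service(found: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut merged: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (service_label, deployment) in found {
        merged
            .entry(service_label.to_string())
            .or_default()
            .push(deployment.to_string());
    }
    merged
}

fn main() {
    let merged = merge_by_service(&[
        ("messages", "messages-backend"),
        ("messages", "messages-mta-in"),
        ("messages", "messages-mta-out"),
        ("hydra", "hydra"),
    ]);
    assert_eq!(merged["messages"].len(), 3); // one service, three deployments
    assert_eq!(merged["hydra"], vec!["hydra".to_string()]);
}
```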
Adds `headscale` to the lists that drive the seed workflow so the
existing CNPG role/database creation and OpenBao KV path provisioning
pick up the new VPN coordination service alongside everything else:
- src/secrets.rs: PG_USERS list grows from 15 → 16 (test asserts the
full ordered list, so it's updated to match)
- src/workflows/seed/steps/postgres.rs: pg_db_map adds headscale →
headscale_db
- src/workflows/seed/definition.rs: bumps the role/db step count
assertions from 15 → 16
- src/workflows/primitives/kv_service_configs.rs: new headscale entry
with a single `api-key` field generated as `static:` (placeholder).
The user runs `kubectl exec -n vpn deploy/headscale -- headscale
apikeys create` and pastes the result into vault before calling
`sunbeam vpn create-key`. Bumps service_count test from 18 → 19.
- src/constants.rs: add `vpn` to MANAGED_NS so the legacy namespace
list includes the new namespace.
`sunbeam vpn create-key` calls Headscale's REST API at
`/api/v1/preauthkey` to mint a new pre-auth key for onboarding a new
client. Reads `vpn-url` and `vpn-api-key` from the active context;
the user generates the API key once via `headscale apikeys create`
on the cluster and stores it in their context config.
Flags:
- --user <name> Headscale user the key belongs to
- --reusable allow multiple registrations with the same key
- --ephemeral auto-delete the node when its map stream drops
- --expiration <dur> human-friendly lifetime ("30d", "1h", "2w")
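The `--expiration` values suggest a small suffix-based parser; a sketch assuming an s/m/h/d/w suffix set (the real flag may accept more forms):

```rust
use std::time::Duration;

// Parse "30d", "1h", "2w" style lifetimes into a Duration.
fn parse_duration(s: &str) -> Option<Duration> {
    let (num, unit) = s.split_at(s.len().checked_sub(1)?);
    let n: u64 = num.parse().ok()?;
    let secs = match unit {
        "s" => n,
        "m" => n * 60,
        "h" => n * 3_600,
        "d" => n * 86_400,
        "w" => n * 604_800,
        _ => return None,
    };
    Some(Duration::from_secs(secs))
}

fn main() {
    assert_eq!(parse_duration("1h"), Some(Duration::from_secs(3_600)));
    assert_eq!(parse_duration("30d"), Some(Duration::from_secs(30 * 86_400)));
    assert_eq!(parse_duration("2w"), Some(Duration::from_secs(2 * 604_800)));
    assert_eq!(parse_duration("soon"), None); // no numeric prefix
}
```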
Also adds a `vpn-tls-insecure` context flag that controls TLS
verification across the whole VPN integration: it's now used by both
the daemon (for the Noise control connection + DERP relay) and the
new create-key REST client. Test stacks with self-signed certs set
this to true; production stacks leave it false.
Verified end-to-end against the docker test stack:
$ sunbeam vpn create-key --user test --reusable --expiration 1h
==> Creating pre-auth key on https://localhost:8443
Pre-auth key for user 'test':
ebcd77f51bf30ef373c9070382b834859935797a90c2647f
Add it to a context with:
sunbeam config set --context <ctx> vpn-auth-key ebcd77f5...
The docker-compose stack now serves Headscale (and its embedded DERP)
over TLS on port 8443 with a self-signed cert covering localhost,
127.0.0.1, and the docker-network hostname `headscale`. Tailscale
peers trust the cert via SSL_CERT_FILE; our test daemon uses
`derp_tls_insecure: true` (gated on the SUNBEAM_NET_TEST_DERP_INSECURE
env var) since pinning a self-signed root in tests is more trouble
than it's worth.
With TLS DERP working, the previously-ignored
`test_e2e_tcp_through_tunnel` test now passes: the daemon spawns,
registers, completes a Noise handshake over TLS, opens a TLS DERP
relay session, runs a real WireGuard handshake with peer-a (verified
via boringtun ↔ tailscale interop), and TCP-tunnels an HTTP GET
through smoltcp ↔ engine ↔ proxy ↔ test client. The 191-byte echo
response round-trips and the test asserts on its body.
- tests/config/headscale.yaml: tls_cert_path + tls_key_path, listen on
8443, server_url=https://headscale:8443
- tests/config/test-cert.pem + test-key.pem: 365-day self-signed RSA
cert with SAN DNS:localhost, DNS:headscale, IP:127.0.0.1
- tests/docker-compose.yml: mount certs into headscale + both peers,
set SSL_CERT_FILE on the peers, expose 8443 instead of 8080
- tests/run.sh: switch to https://localhost:8443, set
SUNBEAM_NET_TEST_DERP_INSECURE=1
- tests/integration.rs: drop the #[ignore] on test_e2e_tcp_through_tunnel,
read derp_tls_insecure from env in all four test configs
Production Headscale terminates TLS for both the control plane (via the
TS2021 HTTP CONNECT upgrade endpoint) and the embedded DERP relay.
Without TLS, the daemon could only talk to plain-HTTP test stacks.
- New crate::tls module: shared TlsMode (Verify | InsecureSkipVerify)
+ tls_wrap helper. webpki roots in Verify mode; an explicit
ServerCertVerifier that accepts any cert in InsecureSkipVerify
(test-only).
- Cargo.toml: add tokio-rustls, webpki-roots, rustls-pemfile.
- noise/handshake: perform_handshake is now generic over the underlying
stream and takes an explicit `host_header` argument instead of using
`peer_addr`. Lets callers pass either a TcpStream or a TLS-wrapped
stream.
- noise/stream: NoiseStream<S> is generic over the underlying transport
with `S = TcpStream` as the default. The AsyncRead+AsyncWrite impls
forward to whatever S provides.
- control/client: ControlClient::connect detects `https://` in
coordination_url and TLS-wraps the TCP stream before the Noise
handshake. fetch_server_key now also TLS-wraps when needed. Both
honor the new derp_tls_insecure config flag (which is misnamed but
controls all TLS verification, not just DERP).
- derp/client: DerpClient::connect_with_tls accepts a TlsMode and uses
the shared tls::tls_wrap helper instead of duplicating it. The
client struct's inner Framed is now generic over a Box<dyn
DerpTransport> so it can hold either a plain or TLS-wrapped stream.
- daemon/lifecycle: derive the DERP URL scheme from coordination_url
  (an https control URL yields an https DERP URL) and pass
  derp_tls_insecure through.
- config.rs: new `derp_tls_insecure: bool` field on VpnConfig.
- src/vpn_cmds.rs: pass `derp_tls_insecure: false` for production.
Two bug fixes found while wiring this up:
- proxy/engine: bridge_connection used to set remote_done on any
smoltcp recv error, including the transient InvalidState that
smoltcp returns while a TCP socket is still in SynSent. That meant
the engine gave up on the connection before the WG handshake even
finished. Distinguish "not ready yet" (returns Ok(0)) from
"actually closed" (returns Err) inside tcp_recv, and only mark
remote_done on the latter.
- proxy/engine: the connection's "done" condition required
local_read_done, but most clients (curl, kubectl) keep their write
side open until they read EOF. The engine never closed its local
TCP, so clients sat in read_to_end forever. Drop the connection as
soon as the remote side has finished and we've drained its buffer
to the local socket — the local TcpStream drop closes the socket
and the client sees EOF.
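The first fix can be modeled in isolation; `RecvState` below is an illustrative stand-in for smoltcp's socket states, not the real API:

```rust
// tcp_recv distinguishes a socket that isn't ready yet (Ok with no
// bytes: keep polling) from one that is actually closed (Err), and the
// bridge marks remote_done only on the latter.
enum RecvState {
    NotReadyYet, // transient, e.g. still in SynSent
    Data(Vec<u8>),
    Closed,
}

fn tcp_recv(state: RecvState) -> Result<Vec<u8>, &'static str> {
    match state {
        RecvState::NotReadyYet => Ok(Vec::new()), // not an error
        RecvState::Data(d) => Ok(d),
        RecvState::Closed => Err("closed"),
    }
}

// One poll step of the bridge: returns (bytes_bridged, remote_done).
fn bridge_step(state: RecvState) -> (usize, bool) {
    match tcp_recv(state) {
        Ok(buf) => (buf.len(), false),
        Err(_) => (0, true),
    }
}

fn main() {
    assert_eq!(bridge_step(RecvState::NotReadyYet), (0, false)); // don't give up
    assert_eq!(bridge_step(RecvState::Data(vec![1, 2, 3])), (3, false));
    assert_eq!(bridge_step(RecvState::Closed), (0, true));
}
```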
Adds an optional `cluster_api_host` field to VpnConfig. When set, the
daemon resolves it against the netmap's peer list once the first
netmap arrives and uses that peer's tailnet IP as the proxy backend,
overriding the static `cluster_api_addr`. Falls back to the static
addr if the hostname doesn't match any peer.
The resolver tries hostname first, then peer name (FQDN), then a
prefix match against name. Picks v4 over v6 from the peer's address
list.
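That resolution order can be sketched as follows (the `Peer` fields are illustrative; the real records come from the netmap):

```rust
use std::net::IpAddr;

struct Peer {
    hostname: String,
    name: String, // FQDN, e.g. "kube-api.tailnet.example"
    addresses: Vec<IpAddr>,
}

// Hostname first, then exact peer name, then prefix match on the name;
// within the winning peer, prefer an IPv4 address over IPv6.
fn resolve_peer_ip(peers: &[Peer], host: &str) -> Option<IpAddr> {
    let peer = peers
        .iter()
        .find(|p| p.hostname == host)
        .or_else(|| peers.iter().find(|p| p.name == host))
        .or_else(|| peers.iter().find(|p| p.name.starts_with(host)))?;
    peer.addresses
        .iter()
        .find(|a| a.is_ipv4())
        .or(peer.addresses.first())
        .copied()
}

fn main() {
    let peers = vec![Peer {
        hostname: "kube-api".to_string(),
        name: "kube-api.tailnet.example".to_string(),
        addresses: vec![
            "fd7a:115c:a1e0::9".parse().unwrap(),
            "100.64.0.9".parse().unwrap(),
        ],
    }];
    let v4: IpAddr = "100.64.0.9".parse().unwrap();
    assert_eq!(resolve_peer_ip(&peers, "kube-api"), Some(v4)); // hostname hit
    assert_eq!(resolve_peer_ip(&peers, "kube"), Some(v4));     // prefix of name
    assert_eq!(resolve_peer_ip(&peers, "missing"), None);      // caller falls back
}
```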
- sunbeam-net/src/config.rs: new `cluster_api_host: Option<String>`
- sunbeam-net/src/daemon/lifecycle.rs: resolve_peer_ip helper +
resolution at proxy bind time
- sunbeam-net/tests/integration.rs: pass cluster_api_host: None in
the existing VpnConfig literals
- src/config.rs: new context field `vpn-cluster-host`
- src/vpn_cmds.rs: thread it from context → VpnConfig
`sunbeam connect` now fork-execs itself with a hidden `__vpn-daemon`
subcommand instead of running the daemon in-process. The user-facing
command spawns the child detached (stdio → log file, setsid for no
controlling TTY), polls the IPC socket until the daemon reaches
Running, prints a one-line status, and exits. The user gets back to
their shell immediately.
- src/cli.rs: `Connect { foreground }` instead of unit. Add hidden
`__vpn-daemon` Verb that the spawned child runs.
- src/vpn_cmds.rs: split into spawn_background_daemon (default path)
and run_daemon_foreground (used by both `connect --foreground` and
`__vpn-daemon`). Detached child uses pre_exec(setsid) and inherits
--context from the parent so it resolves the same VPN config.
Refuses to start if a daemon is already running on the control
socket; cleans up stale socket files. Switches the proxy bind from
16443 (sienna's existing SSH tunnel uses it) to 16579.
- sunbeam-net/src/daemon/lifecycle: add a SocketGuard RAII type so the
IPC control socket is unlinked when the daemon exits, regardless of
shutdown path. Otherwise `vpn status` after a clean disconnect would
see a stale socket and report an error.
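The SocketGuard idea is plain RAII on the socket path; a std-only sketch (a regular file stands in for the Unix socket):

```rust
use std::fs::{self, File};
use std::path::{Path, PathBuf};

// Unlinks the path when dropped, regardless of how the scope exits.
struct SocketGuard(PathBuf);

impl Drop for SocketGuard {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.0); // best effort; may already be gone
    }
}

fn guarded_touch(path: &Path) {
    let _guard = SocketGuard(path.to_path_buf());
    File::create(path).unwrap(); // stand-in for binding the IPC socket
    assert!(path.exists());
} // _guard dropped here -> file unlinked

fn main() {
    let path = std::env::temp_dir().join("sunbeam-demo.sock");
    guarded_touch(&path);
    assert!(!path.exists()); // no stale socket left behind
}
```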
End-to-end smoke test against the docker stack:
$ sunbeam connect
==> VPN daemon spawned (pid 90072, ...)
Connected (100.64.0.154, fd7a:115c:a1e0::9a) — 2 peers visible
$ sunbeam vpn status
VPN: running
addresses: 100.64.0.154, fd7a:115c:a1e0::9a
peers: 2
derp home: region 0
$ sunbeam disconnect
==> Asking VPN daemon to stop...
Daemon acknowledged shutdown.
$ sunbeam vpn status
VPN: not running
DaemonHandle's shutdown_tx (oneshot) is replaced with a CancellationToken
shared between the daemon loop and the IPC server. The token is the
single source of truth for "should we shut down" — `DaemonHandle::shutdown`
cancels it, and an IPC `Stop` request also cancels it.
- daemon/state: store the CancellationToken on DaemonHandle and clone it
on Clone (so cached IPC handles can still trigger shutdown).
- daemon/ipc: IpcServer takes a daemon_shutdown token; `Stop` now cancels
it instead of returning Ok and doing nothing. Add IpcClient with
`request`, `status`, and `stop` methods so the CLI can drive a
backgrounded daemon over the Unix socket.
- daemon/lifecycle: thread the token through run_daemon_loop and
run_session, pass a clone to IpcServer::new.
- lib.rs: re-export IpcClient/IpcCommand/IpcResponse so callers don't
have to reach into the daemon module.
- src/vpn_cmds.rs: `sunbeam disconnect` now actually talks to the daemon
via IpcClient::stop, and `sunbeam vpn status` queries IpcClient::status
and prints addresses + peer count + DERP home.
Adds the foreground VPN client commands. The daemon runs in-process
inside the CLI for the lifetime of `sunbeam connect`; there is no
separate background daemon yet, and one can come later if needed.
- Cargo.toml: add sunbeam-net as a workspace dep, plus hostname/whoami
for building a per-machine netmap label like "sienna@laptop"
- src/config.rs: new `vpn-url` and `vpn-auth-key` fields on Context
- src/cli.rs: `Connect`, `Disconnect`, and `Vpn { Status }` verbs
- src/vpn_cmds.rs: command handlers
- cmd_connect reads VPN config from the active context, starts the
daemon at ~/.sunbeam/vpn, polls for Running, then blocks on ^C
before calling DaemonHandle::shutdown
- cmd_disconnect / cmd_vpn_status are placeholders that report based
on the control socket; actually talking to a backgrounded daemon
needs an IPC client (not yet exposed from sunbeam-net)
- src/workflows/mod.rs: `..Default::default()` on Context literals so
the new fields don't break the existing tests
- docker-compose.yml: run peer-a and peer-b with TS_USERSPACE=false +
/dev/net/tun device + cap_add. Pin peer-a's WG listen port to 41641
via TS_TAILSCALED_EXTRA_ARGS and publish it to the host so direct
UDP from outside docker has somewhere to land.
- run.sh: use an ephemeral pre-auth key for the test client so
Headscale auto-deletes the test node when its map stream drops
(instead of accumulating hundreds of stale entries that eventually
slow netmap propagation to a crawl). Disable shields-up on both
peers so the kernel firewall doesn't drop inbound tailnet TCP. Tweak
the JSON key extraction to handle pretty-printed output.
- integration.rs: add `test_e2e_tcp_through_tunnel` that brings up
the daemon, dials peer-a's echo server through the proxy, and
asserts the echo body comes back. Currently `#[ignore]`d — the
docker stack runs Headscale over plain HTTP, but Tailscale's client
unconditionally tries TLS to DERP relays ("tls: first record does
not look like a TLS handshake"), so peer-a can never receive
packets we forward via the relay. Unblocking needs either TLS
termination on the docker DERP or running the test inside the same
docker network as peer-a. Test stays in the tree because everything
it tests up to the read timeout is real verified behavior.
Fixes a pile of correctness bugs, all of which stopped real Tailscale
peers from sending WireGuard packets back to us. Found while building
out the e2e test against the docker-compose stack.
1. WireGuard static key was wrong (lifecycle.rs)
We were initializing the WgTunnel with `keys.wg_private`, a separate
x25519 key from the one Tailscale advertises in netmaps. Peers know
us by `node_public` and compute mac1 against it; signing handshakes
with a different private key meant every init we sent was silently
dropped. Use `keys.node_private` instead — node_key IS the WG static
key in Tailscale.
2. DERP relay couldn't route packets to us (derp/client.rs)
Our DerpClient was sealing the ClientInfo frame with a fresh
ephemeral NaCl keypair and putting the ephemeral public in the frame
prefix. Tailscale's protocol expects the *long-term* node public key
in the prefix — that's how the relay knows where to forward packets
addressed to our node_key. With the ephemeral key, the relay
accepted the connection but never delivered our peers' responses.
Now seal with the long-term node key.
3. Headscale never persisted our DiscoKey (proto/types.rs, control/*)
The streaming /machine/map handler in Headscale ≥ capVer 68 doesn't
update DiscoKey on the node record — only the "Lite endpoint update"
path does, gated on Stream:false + OmitPeers:true + ReadOnly:false.
Without DiscoKey our nodes appeared in `headscale nodes list` with
`discokey:000…` and never propagated into peer netmaps. Add the
DiscoKey field to RegisterRequest, add OmitPeers/ReadOnly fields to
MapRequest, and call a new `lite_update` between register and the
streaming map. Also add `post_json_no_response` for endpoints that
reply with an empty body.
4. EncapAction is now a struct instead of an enum (wg/tunnel.rs)
Routing was forced to either UDP or DERP. With a peer whose
advertised UDP endpoint is on an unreachable RFC1918 network (e.g.
docker bridge IPs), we'd send via UDP, get nothing, and never fall
back. Send over every available transport — receivers dedupe via
the WireGuard replay window — and let dispatch_encap forward each
populated arm to its respective channel.
5. Drop the dead PacketRouter (wg/router.rs)
Skeleton from an earlier design that never got wired up; it's been
accumulating dead-code warnings.
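Fix 4 can be sketched as a struct with independent arms; field and type names here are assumptions, not the real wg/tunnel API:

```rust
// Before: an enum forced a single transport. After: every populated arm
// is forwarded; receivers dedupe duplicates via the WireGuard replay
// window, so sending on all available transports is safe.
struct EncapAction {
    send_udp: Option<Vec<u8>>,
    send_derp: Option<Vec<u8>>,
}

fn dispatch_encap(
    action: EncapAction,
    udp_out: &mut Vec<Vec<u8>>,
    derp_out: &mut Vec<Vec<u8>>,
) {
    if let Some(pkt) = action.send_udp {
        udp_out.push(pkt);
    }
    if let Some(pkt) = action.send_derp {
        derp_out.push(pkt);
    }
}

fn main() {
    let (mut udp_out, mut derp_out) = (Vec::new(), Vec::new());
    let pkt = vec![0x04, 0x00]; // stand-in for an encrypted WG datagram
    dispatch_encap(
        EncapAction { send_udp: Some(pkt.clone()), send_derp: Some(pkt) },
        &mut udp_out,
        &mut derp_out,
    );
    // Both transports get the packet; an unreachable UDP endpoint no
    // longer blocks delivery via DERP.
    assert_eq!(udp_out.len(), 1);
    assert_eq!(derp_out.len(), 1);
}
```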
DERP works for everything but adds relay latency. Add a parallel UDP
transport so peers with reachable endpoints can talk directly:
- wg/tunnel: track each peer's local boringtun index in PeerTunnel and
expose find_peer_by_local_index / find_peer_by_endpoint lookups
- daemon/lifecycle: bind a UdpSocket on 0.0.0.0:0 alongside DERP, run
the recv loop on a clone of an Arc<UdpSocket> so send and recv can
proceed concurrently
- run_wg_loop: new udp_in_rx select arm. For inbound UDP we identify
the source peer by parsing the WireGuard receiver_index out of the
packet header (msg types 2/3/4) and falling back to source-address
matching for type-1 handshake initiations
- dispatch_encap: SendUdp now actually forwards via the UDP channel
UDP failure is non-fatal — DERP can carry traffic alone if the bind
fails or packets are dropped.
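The receiver-index extraction follows the WireGuard message layout: a 1-byte type plus 3 reserved bytes, then little-endian u32 indices, with type 2 carrying the receiver index at byte offset 8 and types 3 and 4 at offset 4. A sketch:

```rust
// Pull the receiver_index out of a WireGuard message header so an
// inbound UDP packet can be matched to a known peer.
fn receiver_index(pkt: &[u8]) -> Option<u32> {
    let idx_at = |off: usize| -> Option<u32> {
        Some(u32::from_le_bytes(pkt.get(off..off + 4)?.try_into().ok()?))
    };
    match *pkt.first()? {
        2 => idx_at(8),     // handshake response: sender at 4, receiver at 8
        3 | 4 => idx_at(4), // cookie reply / transport data
        _ => None,          // type-1 initiations carry no receiver index
    }
}

fn main() {
    let mut data_pkt = vec![4u8, 0, 0, 0];
    data_pkt.extend_from_slice(&42u32.to_le_bytes());
    assert_eq!(receiver_index(&data_pkt), Some(42));

    let mut resp = vec![2u8, 0, 0, 0];
    resp.extend_from_slice(&1u32.to_le_bytes()); // sender index
    resp.extend_from_slice(&7u32.to_le_bytes()); // receiver index
    assert_eq!(receiver_index(&resp), Some(7));

    // Handshake initiations fall back to source-address matching.
    assert_eq!(receiver_index(&[1, 0, 0, 0]), None);
}
```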
Spins up Headscale 0.23 (with embedded DERP) plus two Tailscale peers
in docker compose, generates pre-auth keys, and runs three integration
tests behind the `integration` feature:
- test_register_and_receive_netmap: full TS2021 → register → first
netmap fetch
- test_proxy_listener_accepts: starts the daemon and waits for it to
reach the Running state
- test_daemon_lifecycle: full lifecycle including DERP connect, then
clean shutdown via the DaemonHandle
Run with `sunbeam-net/tests/run.sh` (handles compose up/down + auth
key provisioning) or manually via cargo nextest with the env vars
SUNBEAM_NET_TEST_AUTH_KEY and SUNBEAM_NET_TEST_COORD_URL set.
The daemon orchestrates everything: it owns reconnection backoff, the
WireGuard tunnel, the smoltcp engine, the DERP relay loop, the local
TCP proxy, and a Unix-socket IPC server for status queries.
- daemon/state: DaemonStatus state machine + DaemonHandle for shutdown
signaling and live status access
- daemon/ipc: newline-delimited JSON Unix socket server (Status,
Disconnect, Peers requests)
- daemon/lifecycle: VpnDaemon::start spawns run_daemon_loop, which pins
a session future and selects against shutdown_rx so shutdown breaks
out cleanly. run_session brings up the full pipeline:
control client → register → map stream → wg tunnel → engine →
proxy listener → wg encap/decap loop → DERP relay → IPC server.
DERP transport: when the netmap doesn't surface a usable DERP endpoint
(Headscale's embedded relay returns host_name="headscale", port=0),
fall back to deriving host:port from coordination_url. WG packets to
SendDerp peers go via a dedicated derp_out channel; inbound DERP frames
flow back through derp_in into the decap arm, which forwards Packet
results to the engine and Response results back to derp_out for the
handshake exchange.
- proxy/engine: NetworkEngine that owns the smoltcp VirtualNetwork and
bridges async TCP streams to virtual sockets via a 5ms poll loop.
Each ProxyConnection holds the local TcpStream + smoltcp socket
handle and shuttles data between them with try_read/try_write so the
engine never blocks.
- proxy/tcp: skeleton TcpProxy listener (currently unused; the daemon
inlines its own listener that hands off to the engine via mpsc)
- control/client: TS2021 connection setup — TCP, HTTP CONNECT-style
upgrade to /ts2021, full Noise IK handshake via NoiseStream, then
HTTP/2 client handshake on top via the h2 crate
- control/register: POST /machine/register with pre-auth key, PascalCase
JSON serde matching Tailscale's wire format
- control/netmap: streaming MapStream that reads length-prefixed JSON
messages from POST /machine/map, classifies them into Full/Delta/
PeersChanged/PeersRemoved/KeepAlive, and transparently zstd-decodes
by detecting the 0x28 0xB5 0x2F 0xFD magic (Headscale only compresses
if the client opts in)
- wg/tunnel: per-peer boringtun Tunn management with peer table sync
from netmap (add/remove/update endpoints, allowed_ips, DERP region)
and encapsulate/decapsulate/tick that route to UDP or DERP
- wg/socket: smoltcp Interface backed by an mpsc-channel Device that
bridges sync poll-based smoltcp with async tokio mpsc channels
- wg/router: skeleton PacketRouter (currently unused; reserved for the
unified UDP/DERP ingress path)
DERP is Tailscale's TCP relay protocol for peers that can't establish a
direct UDP path. Add the standalone client:
- derp/framing: 5-byte frame codec (1-byte type + 4-byte BE length)
- derp/client: HTTP /derp upgrade, Tailscale's NaCl SealedBox handshake
(ServerKey → ClientInfo → ServerInfo → NotePreferred), and
send_packet/recv_packet for forwarding WireGuard datagrams
Includes the 8-byte DERP\xf0\x9f\x94\x91 magic prefix in the ServerKey
payload and reads the HTTP upgrade response one byte at a time so the
inline first frame isn't swallowed by a buffered reader.
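The 5-byte frame codec is small enough to sketch in full (the frame-type value in the example is arbitrary, for illustration only):

```rust
// 1-byte frame type + 4-byte big-endian payload length, then payload.
fn encode_frame(frame_type: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + payload.len());
    out.push(frame_type);
    out.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    out.extend_from_slice(payload);
    out
}

fn decode_frame(buf: &[u8]) -> Option<(u8, &[u8])> {
    let hdr = buf.get(..5)?;
    let len = u32::from_be_bytes(hdr[1..5].try_into().ok()?) as usize;
    Some((hdr[0], buf.get(5..5 + len)?))
}

fn main() {
    let frame = encode_frame(0x05, b"wg-datagram");
    assert_eq!(decode_frame(&frame), Some((0x05, &b"wg-datagram"[..])));
    assert_eq!(decode_frame(&frame[..3]), None); // truncated header
}
```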
Tailscale's TS2021 protocol layers HTTP/2 over an encrypted Noise IK
channel reached via HTTP CONNECT-style upgrade. Add the lower half:
- noise/handshake: hand-rolled Noise_IK_25519_ChaChaPoly_BLAKE2s
initiator with HKDF + ChaCha20-Poly1305 (no snow dependency)
- noise/framing: 3-byte frame codec (1-byte type + 2-byte BE length)
- noise/stream: NoiseStream implementing AsyncRead + AsyncWrite over
the framed channel so the h2 crate can sit on top
Add the workspace crate that will host a pure Rust Headscale/Tailscale-
compatible VPN client. This first commit lands the crate skeleton plus
the leaf modules that the rest of the stack builds on:
- error: thiserror Error enum + Result alias
- config: VpnConfig
- keys: Curve25519 node/disco/wg key types with on-disk persistence
- proto/types: PascalCase serde wire types matching Tailscale's JSON
Replace hand-rolled OpenBao HTTP client with vaultrs 0.8.0, which
has official OpenBao support. BaoClient remains the public API so
callers are unchanged. KV patch uses raw HTTP since vaultrs doesn't
expose it yet.
On a clean cluster, the OpenBao pod can't start because it mounts
the openbao-keys secret as a volume, but that secret doesn't exist
until init runs. Create a placeholder secret in WaitPodRunning so
the pod can mount it and start. InitOrUnsealOpenBao overwrites it
with real values during initialization.
The migration from ~/.sunbeam.json to ~/.sunbeam/config.json
copied but never removed the legacy file, which could cause
confusion with older binaries still writing to the old path.
Delete the legacy file once the copy succeeds.
WFE now populates execution pointer step_name from the workflow
definition, so print_summary shows actual step names instead of
"step-0", "step-1", etc.
Move ensure_opensearch_ml and inject_opensearch_model_id out of
cmd_apply post-hooks into dedicated WFE steps that run in a
parallel branch alongside rollout waits. The ML model download
(10+ min on first run) no longer blocks the rest of the pipeline.
The port-forward background task retried infinitely on 500 errors
when the target pod wasn't ready. Add a 30-attempt limit with 2s
backoff between retries so the step eventually fails instead of
spinning forever.
Dispatch `sunbeam up`, `sunbeam seed`, `sunbeam verify`, and
`sunbeam bootstrap` through WFE workflows instead of monolithic
functions. Steps communicate via JSON workflow data and each
workflow is persisted in a per-context SQLite database.
- `sunbeam auth token` prints JSON headers for MCP headersHelper:
{"Authorization": "Bearer <token>"}
- Add penpot to PG_USERS, pg_db_map, KV seed, and all_paths
- Add cert-manager to VSO auth role bound namespaces
Reuse any existing model version (including DEPLOY_FAILED) instead of
registering a new copy. Prevents accumulation of stale model chunks
in .plugins-ml-model when OpenSearch restarts between applies.
Add rand_alphanum() using OsRng for generating fixed-length
alphanumeric secrets. Seed secrets-cipher (32 chars) into the
kratos KV path for at-rest encryption of OIDC tokens.
- bao: replaced by `sunbeam vault` with proper JWT auth
- docs: La Suite Docs not ready for production
- people: La Suite People not ready for production