Documents the missing lk-jwt-service, well-known URL fix, bare domain routing, DNS apex records, and IPv6 cert-manager self-check failure. Includes dual-stack K3s migration plan.
COE-2026-002: The One Where Nobody Could Call Anyone 📞
Date: 2026-03-25 · Severity: Medium (feature gap, not an outage) · Author: Sienna Meridian Satterwhite · Status: Resolved
What Happened
Matrix voice/video calls weren't working. Element Desktop could initiate calls, but Element X (iOS/Android) showed "Unsupported call. Ask if the caller can use the new Element X app." — which is ironic because they were using Element X.
After fixing that, Element X showed "Call is not supported. MISSING_MATRIX_RTC_TRANSPORT" — meaning the server wasn't advertising the LiveKit SFU to clients at all.
Root cause: three missing pieces in our Matrix + LiveKit integration, plus a bare-domain DNS/TLS gap, plus an IPv6 connectivity problem inside the cluster.
What Was Wrong
1. No lk-jwt-service
LiveKit was deployed as a TURN relay only. Element Call needs lk-jwt-service — a tiny service that exchanges Matrix OpenID tokens for LiveKit JWTs. Without it, clients have no way to authenticate with the SFU. We had a LiveKit server running, we had the well-known pointing at it, but the bridge between "Matrix user" and "LiveKit room participant" didn't exist.
2. Well-known URL was wss:// instead of https://
Tuwunel's config had livekit_url = "wss://livekit.sunbeam.pt". The livekit_service_url field in .well-known/matrix/client is supposed to be an HTTPS URL pointing at lk-jwt-service, not a WebSocket URL. Element Call hits this URL over HTTP to get a JWT, then connects to LiveKit via WSS separately.
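A quick way to catch this class of misconfiguration is to check the scheme of the advertised RTC focus. The sketch below validates a sample document inline; in production you would fetch the real one with `curl -s https://sunbeam.pt/.well-known/matrix/client`. The key name `org.matrix.msc4143.rtc_foci` is what current Element Call builds look for — treat the exact JSON shape as an assumption, not gospel.

```shell
# Hypothetical sanity check: livekit_service_url must be https://, not wss://.
# The JSON mirrors the shape tuwunel serves after the fix.
well_known='{"org.matrix.msc4143.rtc_foci":[{"type":"livekit","livekit_service_url":"https://livekit.sunbeam.pt"}]}'

case "$well_known" in
  *'"livekit_service_url":"wss://'*)   echo "BROKEN: wss:// where https:// is expected"; exit 1 ;;
  *'"livekit_service_url":"https://'*) echo "rtc focus OK" ;;
  *)                                   echo "no rtc focus advertised (MISSING_MATRIX_RTC_TRANSPORT)"; exit 1 ;;
esac
```

With the fixed config this prints `rtc focus OK`; with the old `wss://` value it exits non-zero.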
3. Bare domain sunbeam.pt didn't serve .well-known
Element X's Rust SDK resolves .well-known/matrix/client from the server_name first — that's sunbeam.pt, not messages.sunbeam.pt. The bare domain had no route in the proxy and no TLS cert. So Element X never discovered the RTC foci → MISSING_MATRIX_RTC_TRANSPORT.
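The discovery step that failed can be sketched in a couple of lines: the client derives the well-known host from the server_name in the user ID, not from the homeserver it ends up talking to (`@sienna` is an illustrative user, not a real account):

```shell
# Element X resolves .well-known from the user ID's server_name.
user_id="@sienna:sunbeam.pt"
server_name="${user_id##*:}"   # strip everything up to the last ':'
echo "GET https://${server_name}/.well-known/matrix/client"
# → GET https://sunbeam.pt/.well-known/matrix/client  (the bare domain, not messages.sunbeam.pt)
```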
4. DNS: wildcard records don't match the apex
We had `* A 62.210.145.138` and `* AAAA 2001:bc8:702:10d9::` in Scaleway DNS, but no explicit records for `sunbeam.pt` itself. Per RFC 4592, wildcard records don't match the zone apex. So `sunbeam.pt` resolved to nothing while `anything.sunbeam.pt` resolved fine.
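In zone-file terms (the notation is a sketch; the records themselves are the real ones from the incident, with `@` standing for the apex `sunbeam.pt.`):

```
; Before the fix: wildcard only — covers subdomains, never the apex (RFC 4592)
*   A     62.210.145.138
*   AAAA  2001:bc8:702:10d9::
; Added in the fix: explicit apex records
@   A     62.210.145.138
@   AAAA  2001:bc8:702:10d9::
```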
5. IPv6 broke cert-manager from inside the cluster
After adding the bare domain A + AAAA records and requesting a new cert, cert-manager's HTTP-01 self-check failed. The self-check resolves the domain from inside the cluster, gets both A and AAAA records, Go's HTTP client prefers IPv6, and the cluster has no internal IPv6 routing — K3s is single-stack IPv4. Connection to [2001:bc8:702:10d9::]:80 fails → self-check fails → challenge never submitted to Let's Encrypt.
Workaround: Temporarily removed the sunbeam.pt AAAA record, let the cert issue over IPv4, then re-added the AAAA. This works because cert renewals reuse the existing cert until it expires (90 days), and by then we should have dual-stack K3s.
The Fix
lk-jwt-service deployment (base/media/lk-jwt-service.yaml)
- Image: `ghcr.io/element-hq/lk-jwt-service:latest`
- Shares LiveKit API credentials via the existing `livekit-api-credentials` VSO secret
- `LIVEKIT_FULL_ACCESS_HOMESERVERS=sunbeam.pt`
- Health checks on `/healthz`
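A minimal sketch of the container spec those bullets imply. `LIVEKIT_KEY` / `LIVEKIT_SECRET` are the env vars lk-jwt-service reads for its API credentials per its README, and 8080 is its default listen port — but the exact mapping from the VSO secret's `api-key`/`api-secret` fields, and the probe port, are assumptions here, not a copy of the real manifest:

```yaml
# Illustrative fragment of base/media/lk-jwt-service.yaml (not verbatim).
containers:
  - name: lk-jwt-service
    image: ghcr.io/element-hq/lk-jwt-service:latest
    env:
      - name: LIVEKIT_URL
        value: "wss://livekit.sunbeam.pt"
      - name: LIVEKIT_FULL_ACCESS_HOMESERVERS
        value: "sunbeam.pt"
      - name: LIVEKIT_KEY                      # assumed mapping from VSO secret
        valueFrom:
          secretKeyRef: { name: livekit-api-credentials, key: api-key }
      - name: LIVEKIT_SECRET
        valueFrom:
          secretKeyRef: { name: livekit-api-credentials, key: api-secret }
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
```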
Proxy routing (base/ingress/pingora-config.yaml)
Added path-based routing under the livekit host prefix:
- `/sfu/get*`, `/healthz*`, `/get_token*` → `lk-jwt-service.media.svc.cluster.local:80`
- Everything else → `livekit-server.media.svc.cluster.local:80` (WebSocket)

Added a sunbeam host prefix route for the bare domain:

- `/.well-known/matrix/*` → `tuwunel.matrix.svc.cluster.local:6167`
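The routing decision for the livekit host can be sketched as a path match (the non-matching path below is illustrative, not a documented LiveKit endpoint):

```shell
# Sketch of the path-based split configured above for livekit.sunbeam.pt.
route() {
  case "$1" in
    /sfu/get*|/healthz*|/get_token*) echo "lk-jwt-service.media.svc.cluster.local:80" ;;
    *)                               echo "livekit-server.media.svc.cluster.local:80" ;;
  esac
}

route /sfu/get      # JWT exchange → lk-jwt-service
route /rtc/signal   # everything else → LiveKit SFU (WebSocket)
```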
Tuwunel config (base/matrix/tuwunel-config.yaml)
Changed livekit_url = "wss://livekit.sunbeam.pt" → livekit_url = "https://livekit.sunbeam.pt"
TLS cert (overlays/production/cert-manager.yaml)
Added sunbeam.pt (bare domain) to the certificate SANs.
DNS (Scaleway)
Added explicit A and AAAA records for sunbeam.pt (apex). The wildcard * records only cover subdomains.
VSO secret (base/media/vault-secrets.yaml)
Removed excludeRaw: true from the livekit-api-credentials transformation so lk-jwt-service can read the raw api-key and api-secret fields. Added lk-jwt-service to rolloutRestartTargets.
Call Flow (for future reference)
Element Call (in Element X or Desktop)
→ GET https://sunbeam.pt/.well-known/matrix/client
→ discovers livekit_service_url: https://livekit.sunbeam.pt
→ POST https://livekit.sunbeam.pt/sfu/get (with Matrix OpenID token)
→ lk-jwt-service validates token against tuwunel, mints LiveKit JWT
→ Client connects wss://livekit.sunbeam.pt (LiveKit SFU) with JWT
→ Audio/video flows through LiveKit SFU
→ TURN fallback via turn:meet.sunbeam.pt:3478 / turns:meet.sunbeam.pt:5349
The IPv6 Problem (and why we need dual-stack K3s)
Current state
The Dedibox has both IPv4 and IPv6 at the OS level. Pingora (the proxy) runs on hostNetwork and happily accepts connections on both families from the internet. External traffic works fine over IPv6.
But inside the K3s cluster, there's no IPv6. K3s was installed single-stack (IPv4 only). Pods get 10.42.x.x addresses, services get 10.43.x.x ClusterIPs. When a pod (like cert-manager) tries to reach a public IPv6 address, it fails — there's no route from the pod network to the IPv6 internet.
Why this matters beyond cert-manager
- Any pod that resolves a dual-stack hostname and prefers IPv6 (Go, curl, etc.) will fail to connect
- Cluster-internal DNS (CoreDNS) doesn't serve AAAA records for services
- Pod-to-pod communication is IPv4 only
- If we ever want to expose services natively on IPv6 (not just through the hostNetwork proxy), we need dual-stack
Migration plan: enable dual-stack in K3s
This is a cluster rebuild, not a flag flip. Kubernetes does not support changing cluster/service CIDRs on a running cluster. For our single-node setup, the blast radius is manageable.
Pre-flight
- Snapshot the Dedibox (Scaleway rescue mode or LVM snapshot)
- Back up all PVCs: `kubectl get pv -o yaml > pvs.yaml`
- Back up cert-manager certs: `kubectl get secret pingora-tls -n ingress -o yaml > tls-backup.yaml`
- Export all Vault secrets (they're already in OpenBao, but belt-and-suspenders)
- Verify `sunbeam platform seed` can fully re-seed from scratch
- Verify all manifests in `sbbb/` are the source of truth (no manual kubectl edits)
Rebuild
```shell
# 1. Uninstall K3s
/usr/local/bin/k3s-uninstall.sh

# 2. Reinstall with dual-stack
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server \
  --cluster-cidr=10.42.0.0/16,fd42::/48 \
  --service-cidr=10.43.0.0/16,fd43::/108 \
  --flannel-ipv6-masq \
  --disable=traefik \
  --disable=servicelb" sh -

# 3. Redeploy everything
sunbeam platform up
```
Post-rebuild changes
- Every Service needs `spec.ipFamilyPolicy: PreferDualStack` to get both ClusterIPs. Without this, services remain IPv4-only even on a dual-stack cluster. We should add this to all Service manifests in `sbbb/`.
- Pingora runs on `hostNetwork` — unaffected by cluster networking, already serves IPv6.
- LiveKit runs on `hostNetwork` — same, already reachable on IPv6 via the host interface.
- cert-manager — will be able to self-check on IPv6 once pods can route to public IPv6 addresses through flannel's IPv6 masquerade.
- CoreDNS — automatically serves AAAA records for dual-stack services. No config changes needed.
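What the first change above looks like in a manifest — tuwunel is shown as an illustrative example, and the selector/ports are assumptions, not copied from `sbbb/`:

```yaml
# Sketch: each Service in sbbb/ gains ipFamilyPolicy.
apiVersion: v1
kind: Service
metadata:
  name: tuwunel
  namespace: matrix
spec:
  ipFamilyPolicy: PreferDualStack   # dual ClusterIPs when the cluster supports it
  selector:
    app: tuwunel
  ports:
    - port: 6167
      targetPort: 6167
```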
What we're NOT doing
- We disabled Traefik and ServiceLB because Pingora is our ingress. No need for MetalLB or dual-stack Traefik config.
- We're not changing pod DNS to dual-stack (`--cluster-dns` stays IPv4). CoreDNS responds to AAAA queries regardless.
Remaining Issues
- Element Desktop legacy calling. Calls from Desktop → Element X show "Unsupported call" unless Desktop is configured with `"element_call": { "use_exclusively": true }`. Client-side config, not a server issue.
- MSC4140 (delayed events) not in tuwunel. Crashed clients leave phantom call participants. No server-side auto-expiry. Manual cleanup required. Tracked upstream.
- Element X caches `.well-known` for 7 days. Users who hit the error before the fix need to clear app cache or reinstall.
- Cert renewal in ~60 days. If the cluster is still single-stack IPv4 at renewal time and `sunbeam.pt` has an AAAA record, we'll need to temporarily remove it again. This is the forcing function for the dual-stack migration.