Documents the missing lk-jwt-service, well-known URL fix, bare domain routing, DNS apex records, and IPv6 cert-manager self-check failure. Includes dual-stack K3s migration plan.
COE-2026-002: The One Where Nobody Could Call Anyone 📞
Date: 2026-03-25 · Severity: Medium (feature gap, not an outage) · Author: Sienna Meridian Satterwhite · Status: Resolved
What Happened
Matrix voice/video calls weren't working. Element Desktop could initiate calls, but Element X (iOS/Android) showed "Unsupported call. Ask if the caller can use the new Element X app." — which is ironic because they were using Element X.
After fixing that, Element X showed "Call is not supported. MISSING_MATRIX_RTC_TRANSPORT" — meaning the server wasn't advertising the LiveKit SFU to clients at all.
Root cause: three missing pieces in our Matrix + LiveKit integration, plus a bare-domain DNS/TLS gap, plus an IPv6 connectivity problem inside the cluster.
What Was Wrong
1. No lk-jwt-service
LiveKit was deployed as a TURN relay only. Element Call needs lk-jwt-service — a tiny service that exchanges Matrix OpenID tokens for LiveKit JWTs. Without it, clients have no way to authenticate with the SFU. We had a LiveKit server running, we had the well-known pointing at it, but the bridge between "Matrix user" and "LiveKit room participant" didn't exist.
2. Well-known URL was wss:// instead of https://
Tuwunel's config had livekit_url = "wss://livekit.sunbeam.pt". The livekit_service_url field in .well-known/matrix/client is supposed to be an HTTPS URL pointing at lk-jwt-service, not a WebSocket URL. Element Call hits this URL over HTTP to get a JWT, then connects to LiveKit via WSS separately.
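A quick way to catch this class of misconfiguration is to check the scheme of the advertised RTC focus. The sketch below validates a sample document inline; in production you would fetch the real one with `curl -s https://sunbeam.pt/.well-known/matrix/client`. The key name `org.matrix.msc4143.rtc_foci` is what current Element Call builds look for — treat the exact JSON shape as an assumption, not gospel.

```shell
# Hypothetical sanity check: livekit_service_url must be https://, not wss://.
# The JSON mirrors the shape tuwunel serves after the fix.
well_known='{"org.matrix.msc4143.rtc_foci":[{"type":"livekit","livekit_service_url":"https://livekit.sunbeam.pt"}]}'

case "$well_known" in
  *'"livekit_service_url":"wss://'*)   echo "BROKEN: wss:// where https:// is expected"; exit 1 ;;
  *'"livekit_service_url":"https://'*) echo "rtc focus OK" ;;
  *)                                   echo "no rtc focus advertised (MISSING_MATRIX_RTC_TRANSPORT)"; exit 1 ;;
esac
```

With the fixed config this prints `rtc focus OK`; with the old `wss://` value it exits non-zero.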
3. Bare domain sunbeam.pt didn't serve .well-known
Element X's Rust SDK resolves .well-known/matrix/client from the server_name first — that's sunbeam.pt, not messages.sunbeam.pt. The bare domain had no route in the proxy and no TLS cert. So Element X never discovered the RTC foci → MISSING_MATRIX_RTC_TRANSPORT.
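The discovery step that failed can be sketched in a couple of lines: the client derives the well-known host from the server_name in the user ID, not from the homeserver it ends up talking to (`@sienna` is an illustrative user, not a real account):

```shell
# Element X resolves .well-known from the user ID's server_name.
user_id="@sienna:sunbeam.pt"
server_name="${user_id##*:}"   # strip everything up to the last ':'
echo "GET https://${server_name}/.well-known/matrix/client"
# → GET https://sunbeam.pt/.well-known/matrix/client  (the bare domain, not messages.sunbeam.pt)
```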
4. DNS: wildcard records don't match the apex
We had `* A 62.210.145.138` and `* AAAA 2001:bc8:702:10d9::` in Scaleway DNS, but no explicit records for `sunbeam.pt` itself. Per RFC 4592, wildcard records don't match the zone apex. So `sunbeam.pt` resolved to nothing while `anything.sunbeam.pt` resolved fine.
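In zone-file terms (the notation is a sketch; the records themselves are the real ones from the incident, with `@` standing for the apex `sunbeam.pt.`):

```
; Before the fix: wildcard only — covers subdomains, never the apex (RFC 4592)
*   A     62.210.145.138
*   AAAA  2001:bc8:702:10d9::
; Added in the fix: explicit apex records
@   A     62.210.145.138
@   AAAA  2001:bc8:702:10d9::
```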
5. IPv6 broke cert-manager from inside the cluster
After adding the bare domain A + AAAA records and requesting a new cert, cert-manager's HTTP-01 self-check failed. The self-check resolves the domain from inside the cluster, gets both A and AAAA records, Go's HTTP client prefers IPv6, and the cluster has no internal IPv6 routing — K3s is single-stack IPv4. Connection to [2001:bc8:702:10d9::]:80 fails → self-check fails → challenge never submitted to Let's Encrypt.
Workaround: Temporarily removed the sunbeam.pt AAAA record, let the cert issue over IPv4, then re-added the AAAA. This works because cert renewals reuse the existing cert until it expires (90 days), and by then we should have dual-stack K3s.
The Fix
lk-jwt-service deployment (base/media/lk-jwt-service.yaml)
- Image: `ghcr.io/element-hq/lk-jwt-service:latest`
- Shares LiveKit API credentials via the existing `livekit-api-credentials` VSO secret
- `LIVEKIT_FULL_ACCESS_HOMESERVERS=sunbeam.pt`
- Health checks on `/healthz`
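A minimal sketch of the container spec those bullets imply. `LIVEKIT_KEY` / `LIVEKIT_SECRET` are the env vars lk-jwt-service reads for its API credentials per its README, and 8080 is its default listen port — but the exact mapping from the VSO secret's `api-key`/`api-secret` fields, and the probe port, are assumptions here, not a copy of the real manifest:

```yaml
# Illustrative fragment of base/media/lk-jwt-service.yaml (not verbatim).
containers:
  - name: lk-jwt-service
    image: ghcr.io/element-hq/lk-jwt-service:latest
    env:
      - name: LIVEKIT_URL
        value: "wss://livekit.sunbeam.pt"
      - name: LIVEKIT_FULL_ACCESS_HOMESERVERS
        value: "sunbeam.pt"
      - name: LIVEKIT_KEY                      # assumed mapping from VSO secret
        valueFrom:
          secretKeyRef: { name: livekit-api-credentials, key: api-key }
      - name: LIVEKIT_SECRET
        valueFrom:
          secretKeyRef: { name: livekit-api-credentials, key: api-secret }
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
```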
Proxy routing (base/ingress/pingora-config.yaml)
Added path-based routing under the livekit host prefix:
- `/sfu/get*`, `/healthz*`, `/get_token*` → `lk-jwt-service.media.svc.cluster.local:80`
- Everything else → `livekit-server.media.svc.cluster.local:80` (WebSocket)

Added a sunbeam host prefix route for the bare domain:

- `/.well-known/matrix/*` → `tuwunel.matrix.svc.cluster.local:6167`
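The routing decision for the livekit host can be sketched as a path match (the non-matching path below is illustrative, not a documented LiveKit endpoint):

```shell
# Sketch of the path-based split configured above for livekit.sunbeam.pt.
route() {
  case "$1" in
    /sfu/get*|/healthz*|/get_token*) echo "lk-jwt-service.media.svc.cluster.local:80" ;;
    *)                               echo "livekit-server.media.svc.cluster.local:80" ;;
  esac
}

route /sfu/get      # JWT exchange → lk-jwt-service
route /rtc/signal   # everything else → LiveKit SFU (WebSocket)
```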
Tuwunel config (base/matrix/tuwunel-config.yaml)
Changed livekit_url = "wss://livekit.sunbeam.pt" → livekit_url = "https://livekit.sunbeam.pt"
TLS cert (overlays/production/cert-manager.yaml)
Added sunbeam.pt (bare domain) to the certificate SANs.
DNS (Scaleway)
Added explicit A and AAAA records for sunbeam.pt (apex). The wildcard * records only cover subdomains.
VSO secret (base/media/vault-secrets.yaml)
Removed excludeRaw: true from the livekit-api-credentials transformation so lk-jwt-service can read the raw api-key and api-secret fields. Added lk-jwt-service to rolloutRestartTargets.
Call Flow (for future reference)
Element Call (in Element X or Desktop)
→ GET https://sunbeam.pt/.well-known/matrix/client
→ discovers livekit_service_url: https://livekit.sunbeam.pt
→ POST https://livekit.sunbeam.pt/sfu/get (with Matrix OpenID token)
→ lk-jwt-service validates token against tuwunel, mints LiveKit JWT
→ Client connects wss://livekit.sunbeam.pt (LiveKit SFU) with JWT
→ Audio/video flows through LiveKit SFU
→ TURN fallback via turn:meet.sunbeam.pt:3478 / turns:meet.sunbeam.pt:5349
The IPv6 Problem (and why we need dual-stack K3s)
Current state
The Dedibox has both IPv4 and IPv6 at the OS level. Pingora (the proxy) runs on hostNetwork and happily accepts connections on both families from the internet. External traffic works fine over IPv6.
But inside the K3s cluster, there's no IPv6. K3s was installed single-stack (IPv4 only). Pods get 10.42.x.x addresses, services get 10.43.x.x ClusterIPs. When a pod (like cert-manager) tries to reach a public IPv6 address, it fails — there's no route from the pod network to the IPv6 internet.
Why this matters beyond cert-manager
- Any pod that resolves a dual-stack hostname and prefers IPv6 (Go, curl, etc.) will fail to connect
- Cluster-internal DNS (CoreDNS) doesn't serve AAAA records for services
- Pod-to-pod communication is IPv4 only
- If we ever want to expose services natively on IPv6 (not just through the hostNetwork proxy), we need dual-stack
Migration plan: enable dual-stack in K3s
This is a cluster rebuild, not a flag flip. Kubernetes does not support changing cluster/service CIDRs on a running cluster. For our single-node setup, the blast radius is manageable.
Pre-flight
- Snapshot the Dedibox (Scaleway rescue mode or LVM snapshot)
- Back up all PVCs: `kubectl get pv -o yaml > pvs.yaml`
- Back up cert-manager certs: `kubectl get secret pingora-tls -n ingress -o yaml > tls-backup.yaml`
- Export all Vault secrets (they're already in OpenBao, but belt-and-suspenders)
- Verify `sunbeam platform seed` can fully re-seed from scratch
- Verify all manifests in `sbbb/` are the source of truth (no manual kubectl edits)
Rebuild
```shell
# 1. Uninstall K3s
/usr/local/bin/k3s-uninstall.sh

# 2. Reinstall with dual-stack
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server \
  --cluster-cidr=10.42.0.0/16,fd42::/48 \
  --service-cidr=10.43.0.0/16,fd43::/108 \
  --flannel-ipv6-masq \
  --disable=traefik \
  --disable=servicelb" sh -

# 3. Redeploy everything
sunbeam platform up
```
Post-rebuild changes
- Every Service needs `spec.ipFamilyPolicy: PreferDualStack` to get both ClusterIPs. Without this, services remain IPv4-only even on a dual-stack cluster. We should add this to all Service manifests in `sbbb/`.
- Pingora runs on `hostNetwork` — unaffected by cluster networking, already serves IPv6.
- LiveKit runs on `hostNetwork` — same, already reachable on IPv6 via the host interface.
- cert-manager — will be able to self-check on IPv6 once pods can route to public IPv6 addresses through flannel's IPv6 masquerade.
- CoreDNS — automatically serves AAAA records for dual-stack services. No config changes needed.
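What the first change above looks like in a manifest — tuwunel is shown as an illustrative example, and the selector/ports are assumptions, not copied from `sbbb/`:

```yaml
# Sketch: each Service in sbbb/ gains ipFamilyPolicy.
apiVersion: v1
kind: Service
metadata:
  name: tuwunel
  namespace: matrix
spec:
  ipFamilyPolicy: PreferDualStack   # dual ClusterIPs when the cluster supports it
  selector:
    app: tuwunel
  ports:
    - port: 6167
      targetPort: 6167
```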
What we're NOT doing
- We disabled Traefik and ServiceLB because Pingora is our ingress. No need for MetalLB or dual-stack Traefik config.
- We're not changing pod DNS to dual-stack (`--cluster-dns` stays IPv4). CoreDNS responds to AAAA queries regardless.
Remaining Issues
- Element Desktop legacy calling. Calls from Desktop → Element X show "Unsupported call" unless Desktop is configured with `"element_call": { "use_exclusively": true }`. Client-side config, not a server issue.
- MSC4140 (delayed events) not in tuwunel. Crashed clients leave phantom call participants. No server-side auto-expiry. Manual cleanup required. Tracked upstream.
- Element X caches `.well-known` for 7 days. Users who hit the error before the fix need to clear app cache or reinstall.
- Cert renewal in ~60 days. If the cluster is still single-stack IPv4 at renewal time and `sunbeam.pt` has an AAAA record, we'll need to temporarily remove it again. This is the forcing function for the dual-stack migration.