# COE-2026-002: The One Where Nobody Could Call Anyone 📞
**Date:** 2026-03-25

**Severity:** Medium (feature gap, not an outage)

**Author:** Sienna Meridian Satterwhite

**Status:** Resolved

---

## What Happened
Matrix voice/video calls weren't working. Element Desktop could initiate calls, but Element X (iOS/Android) showed "Unsupported call. Ask if the caller can use the new Element X app." — which is ironic, because they *were* using Element X.

After fixing that, Element X showed "Call is not supported. MISSING_MATRIX_RTC_TRANSPORT" — meaning the server wasn't advertising the LiveKit SFU to clients at all.

Root cause: **three missing pieces** in our Matrix + LiveKit integration, plus a bare-domain DNS/TLS gap, plus an IPv6 connectivity problem inside the cluster.

---

## What Was Wrong

### 1. No lk-jwt-service

LiveKit was deployed as a TURN relay only. Element Call needs `lk-jwt-service` — a tiny service that exchanges Matrix OpenID tokens for LiveKit JWTs. Without it, clients have no way to authenticate with the SFU. We had a LiveKit server running and a well-known entry pointing at it, but the bridge between "Matrix user" and "LiveKit room participant" didn't exist.
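The heart of that bridge is token minting. The sketch below is illustrative, not lk-jwt-service's actual code: it mints an HS256 LiveKit-style access token the way the service does after it has validated the caller's Matrix OpenID token against the homeserver. The claim layout (`iss` = API key, `sub` = participant identity, `video` room grant) follows LiveKit's access-token format; the helper names are ours.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_livekit_jwt(api_key: str, api_secret: str, identity: str, room: str) -> str:
    """Sketch: mint an HS256 LiveKit-style access token, as lk-jwt-service
    does once the Matrix OpenID token has been validated."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                              # LiveKit API key
        "sub": identity,                             # e.g. the Matrix user ID
        "exp": int(time.time()) + 3600,              # short-lived token
        "video": {"room": room, "roomJoin": True},   # LiveKit room grant
    }
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}"
    signature = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(signature)}"

token = mint_livekit_jwt("lk-api-key", "lk-api-secret", "@alice:sunbeam.pt", "!call:sunbeam.pt")
```

The key point: only something holding the LiveKit API secret can mint these, which is why the service must sit server-side next to the SFU credentials.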
### 2. Well-known URL was `wss://` instead of `https://`

Tuwunel's config had `livekit_url = "wss://livekit.sunbeam.pt"`. The `livekit_service_url` field in `.well-known/matrix/client` is supposed to be an **HTTPS** URL pointing at `lk-jwt-service`, not a WebSocket URL. Element Call hits this URL over HTTP to get a JWT, then connects to LiveKit via WSS separately.
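For reference, the corrected `.well-known/matrix/client` ends up looking roughly like this. The `org.matrix.msc4143.rtc_foci` key is the MSC4143 discovery mechanism Element Call reads; the homeserver base URL shown is an assumption based on our domain layout.

```json
{
  "m.homeserver": { "base_url": "https://messages.sunbeam.pt" },
  "org.matrix.msc4143.rtc_foci": [
    {
      "type": "livekit",
      "livekit_service_url": "https://livekit.sunbeam.pt"
    }
  ]
}
```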
### 3. Bare domain `sunbeam.pt` didn't serve `.well-known`

Element X's Rust SDK resolves `.well-known/matrix/client` from the `server_name` first — that's `sunbeam.pt`, not `messages.sunbeam.pt`. The bare domain had no route in the proxy and no TLS cert, so Element X never discovered the RTC foci → `MISSING_MATRIX_RTC_TRANSPORT`.

### 4. DNS: wildcard records don't match the apex

We had `* A 62.210.145.138` and `* AAAA 2001:bc8:702:10d9::` in Scaleway DNS, but no explicit records for `sunbeam.pt` itself. Per RFC 4592, wildcard records don't match the zone apex, so `sunbeam.pt` resolved to nothing while `anything.sunbeam.pt` resolved fine.
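In zone-file terms (a sketch; our records actually live in Scaleway's DNS console), the fix is explicit apex records alongside the wildcard:

```
*    IN A     62.210.145.138
*    IN AAAA  2001:bc8:702:10d9::
; the wildcard records above never match the apex (RFC 4592),
; so these explicit apex records were added:
@    IN A     62.210.145.138
@    IN AAAA  2001:bc8:702:10d9::
```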
### 5. IPv6 broke cert-manager from inside the cluster

After adding the bare-domain A + AAAA records and requesting a new cert, cert-manager's HTTP-01 self-check failed. The self-check resolves the domain from inside the cluster, gets both A and AAAA records, Go's HTTP client **prefers IPv6**, and the cluster has **no internal IPv6 routing** — K3s is single-stack IPv4. The connection to `[2001:bc8:702:10d9::]:80` fails → self-check fails → the challenge is never submitted to Let's Encrypt.

**Workaround:** Temporarily removed the `sunbeam.pt` AAAA record, let the cert issue over IPv4, then re-added the AAAA record. This works because cert-manager won't attempt renewal until the cert nears expiry (~90-day lifetime), and by then we should have dual-stack K3s.
---

## The Fix

### lk-jwt-service deployment (`base/media/lk-jwt-service.yaml`)

- Image: `ghcr.io/element-hq/lk-jwt-service:latest`
- Shares LiveKit API credentials via the existing `livekit-api-credentials` VSO secret
- `LIVEKIT_FULL_ACCESS_HOMESERVERS=sunbeam.pt`
- Health checks on `/healthz`
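A condensed sketch of the container spec. The `LIVEKIT_KEY` / `LIVEKIT_SECRET` / `LIVEKIT_URL` names are lk-jwt-service's upstream environment variables; the secret wiring and the port (8080, believed to be the service's default) are assumptions here, and the real manifest is the source of truth.

```yaml
# Illustrative sketch; see base/media/lk-jwt-service.yaml for the real manifest
containers:
  - name: lk-jwt-service
    image: ghcr.io/element-hq/lk-jwt-service:latest
    env:
      - name: LIVEKIT_URL                  # SFU the minted JWTs point at
        value: wss://livekit.sunbeam.pt
      - name: LIVEKIT_FULL_ACCESS_HOMESERVERS
        value: sunbeam.pt
      - name: LIVEKIT_KEY                  # from the VSO-synced secret
        valueFrom:
          secretKeyRef: { name: livekit-api-credentials, key: api-key }
      - name: LIVEKIT_SECRET
        valueFrom:
          secretKeyRef: { name: livekit-api-credentials, key: api-secret }
    livenessProbe:
      httpGet: { path: /healthz, port: 8080 }
```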
### Proxy routing (`base/ingress/pingora-config.yaml`)

Added path-based routing under the `livekit` host prefix:

- `/sfu/get*`, `/healthz*`, `/get_token*` → `lk-jwt-service.media.svc.cluster.local:80`
- Everything else → `livekit-server.media.svc.cluster.local:80` (WebSocket)

Added a `sunbeam` host prefix route for the bare domain:

- `/.well-known/matrix/*` → `tuwunel.matrix.svc.cluster.local:6167`
### Tuwunel config (`base/matrix/tuwunel-config.yaml`)

Changed `livekit_url = "wss://livekit.sunbeam.pt"` → `livekit_url = "https://livekit.sunbeam.pt"`.

### TLS cert (`overlays/production/cert-manager.yaml`)

Added `sunbeam.pt` (bare domain) to the certificate SANs.

### DNS (Scaleway)

Added explicit A and AAAA records for `sunbeam.pt` (apex). The wildcard `*` records only cover subdomains.

### VSO secret (`base/media/vault-secrets.yaml`)

Removed `excludeRaw: true` from the `livekit-api-credentials` transformation so `lk-jwt-service` can read the raw `api-key` and `api-secret` fields. Added `lk-jwt-service` to `rolloutRestartTargets`.
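The relevant VaultStaticSecret fields, sketched below. `transformation.excludeRaw` and `rolloutRestartTargets` are real Vault Secrets Operator API fields; the metadata and destination names are taken from our setup, and the full spec (mount, path, refresh interval) is elided.

```yaml
apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: livekit-api-credentials
  namespace: media
spec:
  destination:
    name: livekit-api-credentials
    transformation: {}        # excludeRaw: true removed; raw api-key/api-secret now synced
  rolloutRestartTargets:
    - kind: Deployment
      name: lk-jwt-service    # restarted whenever the secret changes
```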
---

## Call Flow (for future reference)

```
Element Call (in Element X or Desktop)
  → GET https://sunbeam.pt/.well-known/matrix/client
  → discovers livekit_service_url: https://livekit.sunbeam.pt
  → POST https://livekit.sunbeam.pt/sfu/get (with Matrix OpenID token)
  → lk-jwt-service validates the token against tuwunel, mints a LiveKit JWT
  → client connects to wss://livekit.sunbeam.pt (LiveKit SFU) with the JWT
  → audio/video flows through the LiveKit SFU
  → TURN fallback via turn:meet.sunbeam.pt:3478 / turns:meet.sunbeam.pt:5349
```
---

## The IPv6 Problem (and why we need dual-stack K3s)

### Current state

The Dedibox has both IPv4 and IPv6 at the OS level. Pingora (the proxy) runs on `hostNetwork` and happily accepts connections on both families from the internet. External traffic works fine over IPv6.

But **inside the K3s cluster**, there's no IPv6. K3s was installed single-stack (IPv4 only): pods get `10.42.x.x` addresses, services get `10.43.x.x` ClusterIPs. When a pod (like cert-manager) tries to reach a public IPv6 address, it fails — there's no route from the pod network to the IPv6 internet.

### Why this matters beyond cert-manager

- Any pod that resolves a dual-stack hostname and prefers IPv6 (Go, curl, etc.) will fail to connect
- Cluster-internal DNS (CoreDNS) doesn't serve AAAA records for services
- Pod-to-pod communication is IPv4 only
- If we ever want to expose services natively on IPv6 (not just through the hostNetwork proxy), we need dual-stack
### Migration plan: enable dual-stack in K3s

**This is a cluster rebuild, not a flag flip.** Kubernetes does not support changing cluster/service CIDRs on a running cluster. For our single-node setup, the blast radius is manageable.

#### Pre-flight

1. Snapshot the Dedibox (Scaleway rescue mode or LVM snapshot)
2. Back up all PVCs: `kubectl get pv -o yaml > pvs.yaml`
3. Back up cert-manager certs: `kubectl get secret pingora-tls -n ingress -o yaml > tls-backup.yaml`
4. Export all Vault secrets (they're already in OpenBao, but belt-and-suspenders)
5. Verify `sunbeam platform seed` can fully re-seed from scratch
6. Verify all manifests in `sbbb/` are the source of truth (no manual kubectl edits)
#### Rebuild

```bash
# 1. Uninstall K3s
/usr/local/bin/k3s-uninstall.sh

# 2. Reinstall with dual-stack pod/service CIDRs (ULA ranges) and IPv6 egress NAT
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server \
  --cluster-cidr=10.42.0.0/16,fd42::/48 \
  --service-cidr=10.43.0.0/16,fd43::/108 \
  --flannel-ipv6-masq \
  --disable=traefik \
  --disable=servicelb" sh -

# 3. Redeploy everything
sunbeam platform up
```
#### Post-rebuild changes

- **Every Service** needs `spec.ipFamilyPolicy: PreferDualStack` to get both ClusterIPs. Without it, Services remain IPv4-only even on a dual-stack cluster. We should add this to all Service manifests in `sbbb/`.
- **Pingora** runs on `hostNetwork` — unaffected by cluster networking, already serves IPv6.
- **LiveKit** runs on `hostNetwork` — same, already reachable on IPv6 via the host interface.
- **cert-manager** — will be able to self-check over IPv6 once pods can route to public IPv6 addresses through flannel's IPv6 masquerade.
- **CoreDNS** — automatically serves AAAA records for dual-stack services. No config changes needed.
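A minimal example of the per-Service change. These are standard Kubernetes `Service` spec fields; the service shown is hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example                       # hypothetical service
spec:
  ipFamilyPolicy: PreferDualStack     # request both IPv4 and IPv6 ClusterIPs
  ipFamilies: [IPv4, IPv6]            # optional; first entry is the primary family
  selector:
    app: example
  ports:
    - port: 80
```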
#### What we're NOT doing

- We disabled Traefik and ServiceLB because Pingora is our ingress. No need for MetalLB or a dual-stack Traefik config.
- We're not changing pod DNS to dual-stack (`--cluster-dns` stays IPv4). CoreDNS responds to AAAA queries regardless.

---

## Remaining Issues

- **Element Desktop legacy calling.** Calls from Desktop → Element X show "Unsupported call" unless Desktop is configured with `"element_call": { "use_exclusively": true }`. Client-side config, not a server issue.
- **MSC4140 (delayed events) not in tuwunel.** Crashed clients leave phantom call participants; there's no server-side auto-expiry, so manual cleanup is required. Tracked upstream.
- **Element X caches `.well-known` for 7 days.** Users who hit the error before the fix need to clear the app cache or reinstall.
- **Cert renewal in ~60 days.** If the cluster is still single-stack IPv4 at renewal time and `sunbeam.pt` has an AAAA record, we'll need to temporarily remove it again. This is the forcing function for the dual-stack migration.
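For the Element Desktop item above, the setting is a `config.json` fragment (the `element_call` key is quoted in the text; placement at the top level follows Element Web/Desktop's config format):

```json
{
  "element_call": {
    "use_exclusively": true
  }
}
```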