feat(infra): production bootstrap — cert-manager, longhorn, monitoring

Add new bases for cert-manager (Let's Encrypt + wildcard cert), Longhorn
distributed storage, and monitoring (kube-prometheus-stack + Loki + Tempo
+ Grafana OIDC). Add cloud-init for Scaleway Elastic Metal provisioning.

Production overlay: add patches for postgres sizing, SeaweedFS volume,
OpenSearch storage, LiveKit service, Pingora host ports, resource limits,
and CNPG daily barman backups. Update cert-manager.yaml with full dnsNames
for all *.sunbeam.pt subdomains.
2026-03-06 12:06:27 +00:00
parent f7774558e9
commit 7ff35d3e0c
23 changed files with 855 additions and 35 deletions


@@ -1,18 +1,30 @@
-# cert-manager resources for production TLS.
+# cert-manager issuers and certificate for production TLS.
 #
-# Prerequisites:
-# cert-manager must be installed in the cluster before applying this overlay:
-# kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
+# WORKFLOW: start with letsencrypt-staging to verify the HTTP-01 challenge
+# flow works without burning production rate limits. Once the staging cert
+# is issued successfully, flip the Certificate issuerRef to letsencrypt-production
+# and delete the old Secret so cert-manager re-issues with a trusted cert.
 #
-# DOMAIN_SUFFIX and ACME_EMAIL are substituted by sed at deploy time.
-# See overlays/production/kustomization.yaml for the deploy command.
+# ACME_EMAIL is substituted by sunbeam apply.
 ---
-# ClusterIssuer: Let's Encrypt production via HTTP-01 challenge.
-#
-# cert-manager creates one Ingress per challenged domain. The pingora proxy
-# watches these Ingresses and routes /.well-known/acme-challenge/<token>
-# requests to the per-domain solver Service, so multi-SAN certificates are
-# issued correctly even when all domain challenges run in parallel.
+# Let's Encrypt staging — untrusted cert but no rate limits. Use for initial setup.
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-staging
+spec:
+  acme:
+    server: https://acme-staging-v02.api.letsencrypt.org/directory
+    email: ACME_EMAIL
+    privateKeySecretRef:
+      name: letsencrypt-staging-account-key
+    solvers:
+      - http01:
+          ingress:
+            serviceType: ClusterIP
+---
+# Let's Encrypt production — trusted cert, strict rate limits.
+# Switch to this once staging confirms challenges resolve correctly.
 apiVersion: cert-manager.io/v1
 kind: ClusterIssuer
 metadata:
@@ -26,16 +38,11 @@ spec:
     solvers:
       - http01:
           ingress:
-            # ingressClassName is intentionally blank: cert-manager still creates
-            # the Ingress object (which the proxy watches), but no ingress
-            # controller needs to act on it — the proxy handles routing itself.
-            ingressClassName: ""
+            serviceType: ClusterIP
 ---
-# Certificate: single multi-SAN cert covering all proxy subdomains.
-# cert-manager issues it via HTTP-01, stores it in pingora-tls Secret, and
-# renews it automatically ~30 days before expiry. The watcher in sunbeam-proxy
-# detects the Secret update and triggers a graceful upgrade so the new cert is
-# loaded without dropping any connections.
+# Certificate covering all proxy subdomains.
+# Start with letsencrypt-staging. Once verified, change issuerRef.name to
+# letsencrypt-production and delete the pingora-tls Secret to force re-issue.
 apiVersion: cert-manager.io/v1
 kind: Certificate
 metadata:
@@ -56,3 +63,6 @@ spec:
     - src.DOMAIN_SUFFIX
     - auth.DOMAIN_SUFFIX
     - s3.DOMAIN_SUFFIX
+    - grafana.DOMAIN_SUFFIX
+    - admin.DOMAIN_SUFFIX
+    - integration.DOMAIN_SUFFIX