feat(infra): production bootstrap — cert-manager, longhorn, monitoring

Add new bases for cert-manager (Let's Encrypt + wildcard cert), Longhorn distributed storage, and monitoring (kube-prometheus-stack + Loki + Tempo + Grafana OIDC). Add cloud-init for Scaleway Elastic Metal provisioning.

Production overlay: add patches for postgres sizing, SeaweedFS volume, OpenSearch storage, LiveKit service, Pingora host ports, resource limits, and CNPG daily barman backups. Update cert-manager.yaml with full dnsNames for all *.sunbeam.pt subdomains.
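
For reviewers: the staging-first workflow described in the cert-manager.yaml comments comes down to one field on the Certificate. The fragment below is a minimal sketch of the initial state, not a copy of the file. The issuer names and the pingora-tls Secret name come from this commit's comments; the Certificate's metadata and dnsNames are omitted because the diff does not show them in full, and the field layout assumes the standard cert-manager.io/v1 Certificate schema.

```yaml
# Sketch only, not part of this diff: the Certificate's issuer reference
# during initial setup. metadata and dnsNames are intentionally left out.
apiVersion: cert-manager.io/v1
kind: Certificate
spec:
  secretName: pingora-tls          # Secret named in the cert-manager.yaml comments
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-staging      # change to letsencrypt-production once staging issues
```

After changing issuerRef.name, deleting the pingora-tls Secret should prompt cert-manager to re-issue from the production issuer, as the file's comments describe.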
2026-03-06 12:06:27 +00:00
parent f7774558e9
commit 7ff35d3e0c
23 changed files with 855 additions and 35 deletions
--- a/overlays/production/cert-manager.yaml
+++ b/overlays/production/cert-manager.yaml
@@ -1,18 +1,30 @@
-# cert-manager resources for production TLS.
+# cert-manager issuers and certificate for production TLS.
 #
-# Prerequisites:
-#   cert-manager must be installed in the cluster before applying this overlay:
-#   kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
+# WORKFLOW: start with letsencrypt-staging to verify the HTTP-01 challenge
+# flow works without burning production rate limits. Once the staging cert
+# is issued successfully, flip the Certificate issuerRef to letsencrypt-production
+# and delete the old Secret so cert-manager re-issues with a trusted cert.
 #
-# DOMAIN_SUFFIX and ACME_EMAIL are substituted by sed at deploy time.
-# See overlays/production/kustomization.yaml for the deploy command.
+# ACME_EMAIL is substituted by sunbeam apply.
 ---
-# ClusterIssuer: Let's Encrypt production via HTTP-01 challenge.
-#
-# cert-manager creates one Ingress per challenged domain.  The pingora proxy
-# watches these Ingresses and routes /.well-known/acme-challenge/<token>
-# requests to the per-domain solver Service, so multi-SAN certificates are
-# issued correctly even when all domain challenges run in parallel.
+# Let's Encrypt staging — untrusted cert but no rate limits. Use for initial setup.
+apiVersion: cert-manager.io/v1
+kind: ClusterIssuer
+metadata:
+  name: letsencrypt-staging
+spec:
+  acme:
+    server: https://acme-staging-v02.api.letsencrypt.org/directory
+    email: ACME_EMAIL
+    privateKeySecretRef:
+      name: letsencrypt-staging-account-key
+    solvers:
+      - http01:
+          ingress:
+            serviceType: ClusterIP
+---
+# Let's Encrypt production — trusted cert, strict rate limits.
+# Switch to this once staging confirms challenges resolve correctly.
 apiVersion: cert-manager.io/v1
 kind: ClusterIssuer
 metadata:
@@ -26,16 +38,11 @@ spec:
    solvers:
      - http01:
          ingress:
-            # ingressClassName is intentionally blank: cert-manager still creates
-            # the Ingress object (which the proxy watches), but no ingress
-            # controller needs to act on it — the proxy handles routing itself.
-            ingressClassName: ""
+            serviceType: ClusterIP
 ---
-# Certificate: single multi-SAN cert covering all proxy subdomains.
-# cert-manager issues it via HTTP-01, stores it in pingora-tls Secret, and
-# renews it automatically ~30 days before expiry.  The watcher in sunbeam-proxy
-# detects the Secret update and triggers a graceful upgrade so the new cert is
-# loaded without dropping any connections.
+# Certificate covering all proxy subdomains.
+# Start with letsencrypt-staging. Once verified, change issuerRef.name to
+# letsencrypt-production and delete the pingora-tls Secret to force re-issue.
 apiVersion: cert-manager.io/v1
 kind: Certificate
 metadata:
@@ -56,3 +63,6 @@ spec:
    - src.DOMAIN_SUFFIX
    - auth.DOMAIN_SUFFIX
    - s3.DOMAIN_SUFFIX
+    - grafana.DOMAIN_SUFFIX
+    - admin.DOMAIN_SUFFIX
+    - integration.DOMAIN_SUFFIX