e4987b4c58
feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
...
The Longhorn memory leak went undetected for 14 days because alerting
was broken (failing email receiver, PrometheusRules missing the
`release` label the rule selector matches on, no node alerts).
This overhaul brings alerting to production grade.
Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting
New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (single-node aware)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
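
The inhibition rule above (a critical alert suppresses the warning with the same alert name and namespace) can be illustrated with a minimal Python sketch. This is not the AlertManager implementation or this repository's config; the alert name `ExampleDiskFull` and the namespaces are hypothetical, and the label keys `alertname`, `namespace`, and `severity` are assumed to be the ones the rules use.

```python
from typing import Dict, Iterable, List

# One firing alert, represented only by its label set.
Alert = Dict[str, str]


def is_inhibited(target: Alert, firing: Iterable[Alert]) -> bool:
    """Return True if a warning is muted by a matching critical alert.

    A warning is inhibited when some firing alert has severity=critical
    and carries the same alertname and namespace labels.
    """
    if target.get("severity") != "warning":
        return False
    return any(
        source.get("severity") == "critical"
        and source.get("alertname") == target.get("alertname")
        and source.get("namespace") == target.get("namespace")
        for source in firing
    )


def notifications(firing: List[Alert]) -> List[Alert]:
    """Filter the firing alerts down to those that would still notify."""
    return [alert for alert in firing if not is_inhibited(alert, firing)]


# Hypothetical alerts, for illustration only.
firing = [
    {"alertname": "ExampleDiskFull", "namespace": "storage", "severity": "critical"},
    {"alertname": "ExampleDiskFull", "namespace": "storage", "severity": "warning"},
    {"alertname": "ExampleDiskFull", "namespace": "data", "severity": "warning"},
]
delivered = notifications(firing)
```

Here the `storage` warning is dropped because its critical counterpart is firing, while the `data` warning still notifies because no critical alert shares its namespace.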
2026-04-06 15:52:06 +01:00
efe574f48e
feat(storage): sccache S3 build cache with scoped SeaweedFS identity
...
Add sunbeam-sccache bucket and a dedicated sccache S3 identity scoped
to Read/Write/List/Tagging on that bucket only. Bump the volume server
max volume count from 50 to 100 (it was full, blocking all new writes).
2026-04-05 21:50:46 +01:00
e8c64e6f18
feat: add ServiceMonitors and enable metrics scraping
...
- SeaweedFS: enable -metricsPort=9091 on master/volume/filer, add
  service labels, create ServiceMonitor
- Gitea: enable metrics in config, create ServiceMonitor
- Hydra/Kratos: standalone ServiceMonitors (chart templates require
  .Capabilities.APIVersions, which is unavailable in kustomize helm template)
- LiveKit: add prometheus_port=6789, standalone ServiceMonitor
  (disabled in kustomization; host firewall blocks port 6789)
- OpenSearch: revert prometheus-exporter attempt (no plugin for v3.x),
  add service label for future exporter sidecar
2026-03-24 12:21:18 +00:00
3fc54c8851
feat: add PrometheusRule alerts for all services
...
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).
Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
2026-03-24 12:20:55 +00:00
a086049de6
fix: harden SeaweedFS storage and fix Drive presigned uploads
...
- SeaweedFS filer: Recreate strategy (prevents LevelDB lock contention),
  60s termination grace period, memory limit 256Mi→2Gi
- SeaweedFS volume: 60s termination grace period, memory limit 256Mi→1Gi
- Drive: add AWS_S3_DOMAIN_REPLACE so presigned upload URLs use
  s3.sunbeam.pt instead of internal cluster DNS
- Drive: relax liveness/readiness probes (failureThreshold 1→3,
  period 1s→10s, timeout 1s→5s) to prevent crash loops under load
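
To make the effect of the relaxed probe settings concrete, the following Python sketch estimates how long a container may be unresponsive before a probe is declared failed. It assumes the usual rule of thumb that a probe fails after `failureThreshold` consecutive failed checks spaced `periodSeconds` apart; the `Probe` class and its method are hypothetical helpers, and the estimate ignores timeout overlap and initial delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Subset of probe settings named in the commit message."""

    failure_threshold: int
    period_seconds: int
    timeout_seconds: int

    def approx_seconds_to_failure(self) -> int:
        # Rough estimate: consecutive failed checks times the check period.
        return self.failure_threshold * self.period_seconds


# Values taken from the commit message: 1→3, 1s→10s, 1s→5s.
before = Probe(failure_threshold=1, period_seconds=1, timeout_seconds=1)
after = Probe(failure_threshold=3, period_seconds=10, timeout_seconds=5)

print(before.approx_seconds_to_failure(), after.approx_seconds_to_failure())
```

Under this approximation a single slow response after about 1 second was enough to fail the old probe, whereas the new settings tolerate roughly 30 seconds of unresponsiveness.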
2026-03-22 19:48:36 +00:00
ccfe8b877a
feat: La Suite email/messages, buildkitd, monitoring, vault and storage updates
...
- Add Messages (email) service: backend, frontend, MTA in/out, MPA, SOCKS
  proxy, worker, DKIM config, and theme customization
- Add Collabora deployment for document collaboration
- Add Drive frontend nginx config and values
- Add buildkitd namespace for in-cluster container builds
- Add SeaweedFS remote sync and additional S3 buckets
- Update vault secrets across namespaces (devtools, lasuite, media,
  monitoring, ory, storage) with expanded credential management
- Update monitoring: rename grafana→metrics OAuth2Client, add Prometheus
  remote write and additional scrape configs
- Update local/production overlays with resource patches
- Remove stale login-ui resource patch from production overlay
2026-03-10 19:00:57 +00:00
d32d1435f9
feat(infra): data, storage, devtools, and ory layer updates
...
- data: CNPG cluster tuning, OpenBao values, OpenSearch deployment fixes,
  OpenSearch PVC, barman vault secret for S3 backup credentials
- storage: SeaweedFS filer updates (s3.json via secret subPath), PVC for
  filer persistent storage
- devtools: Gitea values (SSH service, custom theme), gitea-theme-cm ConfigMap
- ory: add kratos-selfservice-urls.yaml for self-service flow URLs
- media: LiveKit values updated (TURN config, STUN, resource limits)
- vso: kustomization cleanup
2026-03-06 12:07:28 +00:00
580eb3983e
feat(storage): migrate SeaweedFS S3 credentials to VSO; mount s3.json from Secret
...
Previously s3.json was embedded in the seaweedfs-filer-config ConfigMap
with hardcoded minioadmin credentials, and the config volume was mounted
at /etc/seaweedfs/ (overwriting filer.toml with its own directory mount).
- Remove s3.json from ConfigMap; fix the config volume to mount only
  filer.toml via subPath so both files coexist under /etc/seaweedfs/.
- Add vault-secrets.yaml with VaultStaticSecrets that VSO syncs from
  OpenBao secret/seaweedfs: seaweedfs-s3-credentials (S3_ACCESS_KEY /
  S3_SECRET_KEY) and seaweedfs-s3-json (s3.json as a JSON template).
- Mount seaweedfs-s3-json Secret at /etc/seaweedfs/s3.json via subPath.
2026-03-02 18:32:16 +00:00
a589e6280d
feat: bring up local dev stack — all services running
...
- Ory Hydra + Kratos: fixed secret management, DSN config, DB migrations,
  OAuth2Client CRD (helm template skips crds/ dir), login-ui env vars
- SeaweedFS: added s3.json credentials file via -s3.config CLI flag
- OpenBao: standalone mode with auto-unseal sidecar, keys in K8s secret
- OpenSearch: increased memory to 1.5Gi / JVM 1g heap
- Gitea: SSL_MODE disable, S3 bucket creation fixed
- Hive: automountServiceAccountToken: false (Lima virtiofs read-only rootfs quirk)
- LiveKit: API keys in values, hostPort conflict resolved
- Linkerd: native sidecar (proxy.nativeSidecar=true) to avoid blocking Jobs
- All placeholder images replaced: pingora→nginx:alpine, login-ui→oryd/kratos-selfservice-ui-node
Full stack running: postgres, valkey, openbao, opensearch, seaweedfs,
kratos, hydra, gitea, livekit, hive (placeholder), login-ui
2026-02-28 22:08:38 +00:00
5d9bd7b067
chore: initial infrastructure scaffold
...
Kustomize base + overlays for the full Sunbeam k3s stack:
- base/mesh — Linkerd edge (crds + control-plane + viz)
- base/ingress — custom Pingora edge proxy
- base/ory — Kratos 0.60.1 + Hydra 0.60.1 + login-ui
- base/data — CloudNativePG 0.27.1, Valkey 8, OpenSearch 2
- base/storage — SeaweedFS master + volume + filer (S3 on :8333)
- base/lasuite — Hive sync daemon + La Suite app placeholders
- base/media — LiveKit livekit-server 1.9.0
- base/devtools — Gitea 12.5.0 (external PG + Valkey)
overlays/local — sslip.io domain, mkcert TLS, Lima hostPort
overlays/production — stub (TODOs for sunbeam.pt values)
scripts/ — local-up/down/certs/urls helpers
justfile — up / down / certs / urls targets
2026-02-28 13:42:27 +00:00