e4987b4c58
feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.
Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver, Matrix-only alerting
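The `release` label fix matters because an operator-managed Prometheus only loads PrometheusRule objects whose labels match its rule selector. A minimal sketch of that matching, assuming the selector is a plain `matchLabels` on `release: kube-prometheus-stack`; the rule names and the `is_selected` helper are hypothetical:

```python
# Sketch of matchLabels-style selection; all names here are illustrative.
RULE_SELECTOR = {"release": "kube-prometheus-stack"}  # assumed selector


def is_selected(rule_labels: dict, selector: dict = RULE_SELECTOR) -> bool:
    """Return True when every selector key/value pair is present on the rule."""
    return all(rule_labels.get(key) == value for key, value in selector.items())


# Hypothetical PrometheusRule metadata: one labelled, one not.
rules = {
    "longhorn-alerts": {"release": "kube-prometheus-stack"},
    "node-alerts": {"app": "node-exporter"},
}

# Rules that would be silently ignored because the label is missing.
ignored = [name for name, labels in rules.items() if not is_selected(labels)]
print(ignored)
```

A rule without the label is not rejected, it is simply never loaded, which is consistent with the leak going unnoticed.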
New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (single-node aware)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
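The inhibition item above can be read as: a warning is muted while a critical alert with the same alert name and namespace is firing. A small simulation of that rule, assuming alerts carry `alertname`, `namespace` and `severity` labels; the `Alert` type and the `PVCFull` example are hypothetical and this is not AlertManager's own code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    alertname: str
    namespace: str
    severity: str


def is_inhibited(target: Alert, firing: list) -> bool:
    """A warning is muted when a critical alert with the same
    alertname and namespace is firing at the same time."""
    if target.severity != "warning":
        return False
    return any(
        source.severity == "critical"
        and source.alertname == target.alertname
        and source.namespace == target.namespace
        for source in firing
    )


def notifications(firing: list) -> list:
    """Return the alerts that would still notify after inhibition."""
    return [alert for alert in firing if not is_inhibited(alert, firing)]


# Hypothetical firing set: the same alert at two severities in one namespace,
# plus an unrelated warning in another namespace.
firing = [
    Alert("PVCFull", "data", "critical"),
    Alert("PVCFull", "data", "warning"),
    Alert("PVCFull", "storage", "warning"),
]
sent = notifications(firing)
```

With this input only the critical alert in `data` and the warning in `storage` remain, so a single incident does not notify twice.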
2026-04-06 15:52:06 +01:00
7cb6bb1bd2
feat(data): OpenSearch prometheus-exporter sidecar
elasticsearch-exporter v1.7.0 runs as a sidecar, scrapes localhost:9200,
exposes elasticsearch_* metrics on :9114. ServiceMonitor re-enabled.
Alert rules updated to use elasticsearch_* metric names.
Flags: --es.all --es.indices --es.shards --collector.clustersettings
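To sanity-check that the sidecar exposes `elasticsearch_*` series, the text exposition format can be parsed with a few lines of Python. This is a rough sketch: the `SAMPLE` payload is made up for illustration (it is not captured output from :9114), and the parser handles only simple `name{labels} value` lines:

```python
import re

# Made-up sample of what a scrape of :9114/metrics might contain.
SAMPLE = """\
# TYPE elasticsearch_cluster_health_status gauge
elasticsearch_cluster_health_status{cluster="demo",color="green"} 0
elasticsearch_cluster_health_status{cluster="demo",color="yellow"} 1
elasticsearch_cluster_health_number_of_data_nodes{cluster="demo"} 1
"""

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse(SAMPLE)
# Every series should carry the prefix the alert rules now expect.
all_prefixed = all(name.startswith("elasticsearch_") for name, _, _ in samples)
```

In practice the same check would run against the live endpoint instead of the inline sample.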
2026-03-25 17:53:59 +00:00
3fc54c8851
feat: add PrometheusRule alerts for all services
28 alert rules across 9 PrometheusRule files covering infrastructure
(Longhorn, cert-manager), data (PostgreSQL, OpenBao, OpenSearch),
storage (SeaweedFS), devtools (Gitea), identity (Hydra, Kratos),
media (LiveKit), and mesh (Linkerd golden signals for all services).
Severity routing: critical alerts fire to Matrix + email, warnings
to Matrix only (AlertManager config updated in separate commit).
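The severity routing described here reduces to a small lookup. A sketch, assuming receivers named `matrix` and `email` (hypothetical names) and assuming that any other severity goes nowhere, which the commit does not specify:

```python
MATRIX = "matrix"  # hypothetical receiver name
EMAIL = "email"    # hypothetical receiver name


def receivers_for(severity: str) -> list:
    """Map an alert's severity label to the receivers that get notified."""
    if severity == "critical":
        return [MATRIX, EMAIL]
    if severity == "warning":
        return [MATRIX]
    # Assumption: severities the commit does not mention are not routed.
    return []


for severity in ("critical", "warning", "info"):
    print(severity, receivers_for(severity))
```

This reflects routing as of this commit; the later commit e4987b4c58 removed the email receiver and left Matrix as the only target.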
2026-03-24 12:20:55 +00:00
dc95e1d8ec
sol v1.1.0: SearXNG web search, evaluator redesign, research agents
- SearXNG deployment in data namespace (free, no-tracking web search)
- sol-config: SearXNG URL, research config, identity agent, updated
system prompt (DM search rules, research mode, silence, hard rules)
- sol-deployment: debug logging (RUST_LOG=sol=debug), full image path
- opensearch: tolerate missing prometheus-exporter plugin on OS 3
2026-03-23 09:54:56 +00:00
d32d1435f9
feat(infra): data, storage, devtools, and ory layer updates
- data: CNPG cluster tuning, OpenBao values, OpenSearch deployment fixes,
OpenSearch PVC, barman vault secret for S3 backup credentials
- storage: SeaweedFS filer updates (s3.json via secret subPath), PVC for
filer persistent storage
- devtools: Gitea values (SSH service, custom theme), gitea-theme-cm ConfigMap
- ory: add kratos-selfservice-urls.yaml for self-service flow URLs
- media: LiveKit values updated (TURN config, STUN, resource limits)
- vso: kustomization cleanup
2026-03-06 12:07:28 +00:00
a589e6280d
feat: bring up local dev stack — all services running
- Ory Hydra + Kratos: fixed secret management, DSN config, DB migrations,
OAuth2Client CRD (helm template skips crds/ dir), login-ui env vars
- SeaweedFS: added s3.json credentials file via -s3.config CLI flag
- OpenBao: standalone mode with auto-unseal sidecar, keys in K8s secret
- OpenSearch: increased memory to 1.5Gi / JVM 1g heap
- Gitea: SSL_MODE disable, S3 bucket creation fixed
- Hive: automountServiceAccountToken: false (Lima virtiofs read-only rootfs quirk)
- LiveKit: API keys in values, hostPort conflict resolved
- Linkerd: native sidecar (proxy.nativeSidecar=true) to avoid blocking Jobs
- All placeholder images replaced: pingora→nginx:alpine, login-ui→oryd/kratos-selfservice-ui-node
Full stack running: postgres, valkey, openbao, opensearch, seaweedfs,
kratos, hydra, gitea, livekit, hive (placeholder), login-ui
2026-02-28 22:08:38 +00:00
886c4221b2
fix(local): kustomize render passes cleanly
- Remove base/mesh from local overlay (Linkerd installed via CLI in local-up.sh)
- Fix LiveKit namespace: chart doesn't set .Release.Namespace, add explicit patches
- Fix release names: livekit-server and cloudnative-pg match chart names (avoid double-prefix)
- Disable hydra-maester (not needed for local dev)
- Add memory limits for cloudnative-pg operator and livekit-server deployments
- Remove non-functional values-ory.yaml patch (DOMAIN_SUFFIX handled by sed in local-up.sh)
- Gitignore **/charts/ (kustomize helm cache, generated artifact)
2026-02-28 14:00:31 +00:00
5d9bd7b067
chore: initial infrastructure scaffold
Kustomize base + overlays for the full Sunbeam k3s stack:
- base/mesh — Linkerd edge (crds + control-plane + viz)
- base/ingress — custom Pingora edge proxy
- base/ory — Kratos 0.60.1 + Hydra 0.60.1 + login-ui
- base/data — CloudNativePG 0.27.1, Valkey 8, OpenSearch 2
- base/storage — SeaweedFS master + volume + filer (S3 on :8333)
- base/lasuite — Hive sync daemon + La Suite app placeholders
- base/media — LiveKit livekit-server 1.9.0
- base/devtools — Gitea 12.5.0 (external PG + Valkey)
overlays/local — sslip.io domain, mkcert TLS, Lima hostPort
overlays/production — stub (TODOs for sunbeam.pt values)
scripts/ — local-up/down/certs/urls helpers
justfile — up / down / certs / urls targets
2026-02-28 13:42:27 +00:00