Commit Graph

9 Commits

Author SHA1 Message Date
e4987b4c58 feat(monitoring): comprehensive alerting overhaul, 66 rules across 14 PrometheusRules
The Longhorn memory leak went undetected for 14 days because alerting
was broken (email receiver, missing label selector, no node alerts).
This overhaul brings alerting to production grade.

Fixes:
- Alloy Loki URL pointed to deleted loki-gateway, now loki:3100
- seaweedfs-bucket-init crash on unsupported `mc versioning` command
- All PrometheusRules now have `release: kube-prometheus-stack` label
- Removed broken email receiver; alerting is now Matrix-only

New alert coverage:
- Node: memory, CPU, swap, filesystem, inodes, network, clock skew, OOM
- Kubernetes: deployment down, CronJob failed, pod crash-looping, PVC full
- Backups: Postgres barman stale/failed, WAL archiving, SeaweedFS mirror
- Observability: Prometheus WAL/storage/rules, Loki/Tempo/AlertManager down
- Services: Stalwart, Bulwark, Tuwunel, Sol, Valkey, OpenSearch (smart)
- SLOs: auth stack 99.9% burn rate, Matrix 99.5%, latency p95 < 2s
- Recording rules for Linkerd RED metrics and node aggregates
- Watchdog heartbeat → Matrix every 12h (dead pipeline detection)
- Inhibition: critical suppresses warning for same alert+namespace
- OpenSearchClusterYellow only fires with >1 data node (single-node aware)
2026-04-06 15:52:06 +01:00
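
Two of the items in this commit lend themselves to a short illustration. The first fragment is a minimal sketch of a PrometheusRule carrying the `release: kube-prometheus-stack` label. It assumes the chart's default rule selector, which only picks up rules labelled with the Helm release name. The resource name, namespace, alert name, threshold, and `for` duration are hypothetical and not taken from this commit.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-node-alerts        # hypothetical name
  namespace: monitoring            # hypothetical namespace
  labels:
    release: kube-prometheus-stack # without this label the rule is not selected
spec:
  groups:
    - name: node.example
      rules:
        - alert: ExampleNodeMemoryHigh   # hypothetical alert
          # Fraction of memory in use, from standard node_exporter metrics.
          expr: |
            (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node memory usage above 90%"
```

The second fragment sketches the "critical suppresses warning for same alert+namespace" inhibition as an Alertmanager `inhibit_rules` entry. It assumes alerts carry a `severity` label with the values `critical` and `warning`; the actual label scheme in this repository may differ.

```yaml
inhibit_rules:
  - source_matchers:
      - 'severity = "critical"'   # a firing critical alert...
    target_matchers:
      - 'severity = "warning"'    # ...mutes matching warning alerts
    equal:
      - alertname                 # only when both share the same alert name
      - namespace                 # and the same namespace
```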
9f15f5099e fix: meet external-api route, drive media proxy, alertbot, misc tweaks
- Meet: add external-api backend path, CSRF trusted origins
- Drive: fix media proxy regex for preview URLs and S3 key signing
- OpenBao: enable Prometheus telemetry
- Postgres alerts: fix metric name (cnpg_backends_total)
- Gitea: bump memory limits for mirror workloads
- Alertbot: expanded deployment config
- Kratos: add find/cal/projects to allowed return URLs, settings path
- Pingora: meet external-api route fix
- Sol: config update
2026-03-25 18:01:15 +00:00
fdcc15080f fix(matrix): use https:// for livekit_url in well-known
Element Call expects livekit_service_url to be an HTTPS endpoint
(lk-jwt-service), not a WebSocket URL. The client connects to LiveKit
via WSS separately after getting a JWT.
2026-03-25 13:24:12 +00:00
dc95e1d8ec sol v1.1.0: SearXNG web search, evaluator redesign, research agents
- SearXNG deployment in data namespace (free, no-tracking web search)
- sol-config: SearXNG URL, research config, identity agent, updated
  system prompt (DM search rules, research mode, silence, hard rules)
- sol-deployment: debug logging (RUST_LOG=sol=debug), full image path
- opensearch: tolerate missing prometheus-exporter plugin on OpenSearch 3
2026-03-23 09:54:56 +00:00
d7ff1da729 sol: identity agent, research mode, evaluator redesign, DM search
sol-config.yaml:
- added [services.kratos] with admin URL
- added research config (model, max_iterations, max_agents, max_depth)
- tool iterations bumped to 250
- updated system prompt: research mode guidance, DM search rules,
  run_script docs, room overlap explanation, silence mechanic
- time context uses {time_block} with midnight-based boundaries
- evaluator returns response_type (message/thread/react/ignore)
2026-03-23 08:47:40 +00:00
fb91fcd284 sol: vault auth, gitea integration, search fixes
sol-config: added [vault] and [services.gitea] sections, fetch
allowlist (wttr.in, open-meteo, github), bumped context windows
to 200, updated system prompt with run_script docs and tool rules.

sol-deployment: added gitea admin credential env vars from
sol-secrets, automountServiceAccountToken for vault k8s auth.

vault-secrets: added gitea-admin-username and gitea-admin-password
templates to the sol-secrets VaultStaticSecret.
2026-03-22 15:16:22 +00:00
e1e6a6bc31 update sol configmap: multi-agent architecture + conversations API
- Add db_path (/data/sol.db) for SQLite persistence
- Add memory_index, script_*, memory_extraction_enabled fields
- Add [agents] section: orchestrator model, compaction threshold, conversations API enabled
- Rewrite system prompt (687 → 150 lines): dense, few-shot, hard rules
- Add {room_context_rules} placeholder for group vs DM behavior
2026-03-21 22:25:54 +00:00
5f923d14f9 feat(matrix): add Sol virtual librarian deployment manifests
Sol is a Matrix bot with E2EE that archives conversations to OpenSearch
and responds via Mistral AI function calling. Adds deployment, PVC,
ConfigMap (sol.toml + system prompt), VaultStaticSecret for credentials,
and production overlay image entry.
2026-03-20 21:38:48 +00:00
d2148335de feat(matrix): add tuwunel Matrix homeserver deployment manifests
Kubernetes manifests for tuwunel — a Rust Matrix homeserver using RocksDB
for storage. Includes deployment, service, PVC, ConfigMap (tuwunel.toml),
Hydra OAuth2Client for SSO, and Vault secrets for credentials injection.

Key design decisions:
- enableServiceLinks: false to prevent K8s TUWUNEL_* env var conflicts
- strategy: Recreate for RocksDB exclusive lock (no rolling updates)
- Identity provider configured entirely via env vars (client_id/secret
  from hydra-maester Secret, not hardcoded)
- OpenSearch model_id injected via ConfigMap from CLI post-apply hook
- SSO-only auth (login_with_password=false, single_sso=true)
- OpenSearch hybrid neural+BM25 search (768-dim, all-mpnet-base-v2)
2026-03-10 18:52:21 +00:00
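
The first two design decisions in this commit can be sketched as a Deployment fragment. This is an illustration under stated assumptions, not the manifest from the commit: the labels, image reference, and replica count are placeholders. With service links enabled, Kubernetes injects variables such as `TUWUNEL_SERVICE_HOST` and `TUWUNEL_PORT` for a Service named `tuwunel`, which is the kind of `TUWUNEL_*` collision that `enableServiceLinks: false` avoids.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tuwunel
spec:
  replicas: 1                  # assumed: one writer, since RocksDB takes an exclusive lock
  strategy:
    type: Recreate             # stop the old pod before starting the new one
  selector:
    matchLabels:
      app: tuwunel             # hypothetical label
  template:
    metadata:
      labels:
        app: tuwunel
    spec:
      enableServiceLinks: false   # no auto-injected TUWUNEL_* Service variables
      containers:
        - name: tuwunel
          image: registry.example/tuwunel:TAG   # placeholder image reference
```

A rolling update would briefly run the old and new pods side by side against the same volume, and the second process would fail to acquire the RocksDB lock. `Recreate` trades a short downtime for a clean handover.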