From 977972d9f395fd31fe6dba7868a5fb2accb46246 Mon Sep 17 00:00:00 2001 From: Sienna Meridian Satterwhite Date: Tue, 24 Mar 2026 11:46:33 +0000 Subject: [PATCH] =?UTF-8?q?docs:=20add=20observability=20documentation=20?= =?UTF-8?q?=E2=80=94=20Keeping=20an=20Eye=20on=20the=20Girlies=20?= =?UTF-8?q?=F0=9F=91=81=EF=B8=8F?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prometheus, Grafana (10 dashboards), Loki, Tempo, AlertManager → Matrix, ServiceMonitors, PrometheusRules per component. --- docs/monitoring.md | 158 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 docs/monitoring.md diff --git a/docs/monitoring.md b/docs/monitoring.md new file mode 100644 index 0000000..81d04ea --- /dev/null +++ b/docs/monitoring.md @@ -0,0 +1,158 @@ +# Keeping an Eye on the Girlies 👁️ + +You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains. + +--- + +## The Stack + +| Component | What it does | Where it lives | +|-----------|-------------|----------------| +| **Prometheus** | Scrapes metrics from everything, every 30 seconds | `systemmetrics.DOMAIN` | +| **Grafana** | Dashboards, visualizations, the pretty pictures | `metrics.DOMAIN` | +| **Loki** | Log aggregation — all container logs, searchable | `systemlogs.DOMAIN` | +| **Tempo** | Distributed tracing — follow a request across services | `systemtracing.DOMAIN` | +| **AlertManager** | Routes alerts to the right place | Internal | +| **Alloy** | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes | + +All deployed via Helm in the `monitoring` namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything. 
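In Helm values terms, that Hydra hookup is Grafana's generic OAuth block. A rough sketch — the URLs, client ID, and role mapping here are assumptions for illustration, not the actual deployed values:

```yaml
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Hydra
      client_id: grafana                        # assumed client registered in Hydra
      auth_url: https://auth.DOMAIN/oauth2/auth # Hydra's standard authorize endpoint
      token_url: https://auth.DOMAIN/oauth2/token
      api_url: https://auth.DOMAIN/userinfo
      scopes: openid profile email
      # JMESPath string literal: every authenticated user lands as Admin
      role_attribute_path: "'Admin'"
```

The `role_attribute_path: "'Admin'"` trick is what makes "if you're on the team, you see everything" work — a constant JMESPath expression instead of a real claim lookup.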
+ +--- + +## Dashboards + +Ten Grafana dashboard ConfigMaps, organized by domain: + +| Dashboard | What it covers | +|-----------|---------------| +| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage | +| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions | +| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates | +| `dashboards-lasuite` | La Suite apps — per-app health, response times | +| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity | +| `dashboards-devtools` | Gitea — repo activity, CI builds | +| `dashboards-storage` | SeaweedFS — volume health, S3 operations | +| `dashboards-search` | OpenSearch — index health, query performance | +| `dashboards-media` | LiveKit — active rooms, participants, media quality | +| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring | + +Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically. + +--- + +## What Gets Scraped + +ServiceMonitors tell Prometheus what to scrape. 
Currently active: + +| Target | Namespace | Notes | +|--------|-----------|-------| +| Pingora proxy | ingress | Custom metrics on :9090 | +| Ory Kratos | ory | Identity operations | +| Ory Hydra | ory | OAuth2/OIDC metrics | +| Gitea | devtools | Repo + CI metrics | +| SeaweedFS | storage | Master, volume, filer | +| Linkerd | mesh | Control plane + data plane | +| cert-manager | cert-manager | Certificate lifecycle | +| OpenBao | data | Seal status, audit events | +| kube-state-metrics | monitoring | Kubernetes object state | +| node-exporter | monitoring | Host-level metrics | +| kube-apiserver | — | Kubernetes API | +| kubelet | — | Container runtime | + +**Disabled** (not available or firewalled — we'll get to them): +- OpenSearch (no prometheus-exporter in v3.x) +- LiveKit (hostNetwork, behind firewall) +- kube-proxy (replaced by Cilium) +- etcd/scheduler/controller-manager (k3s quirks) + +--- + +## Alert Rules + +PrometheusRule resources per component, firing when things need attention: + +| File | Covers | +|------|--------| +| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage | +| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness | +| `opensearch-alertrules.yaml` | Search — cluster health, index issues | +| `openbao-alertrules.yaml` | Vault — seal status, auth failures | +| `gitea-alertrules.yaml` | Git — service health, repo errors | +| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability | +| `livekit-alertrules.yaml` | Media — server health, room issues | +| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry | +| `ory-alertrules.yaml` | Identity — auth failures, token errors | + +--- + +## Alerts → Matrix + +When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix. 
+ +``` +Alert fires in Prometheus + → AlertManager evaluates routing rules + → Webhook to matrix-alertmanager-receiver + → Bot posts to Matrix room + → Team sees it in chat ✨ +``` + +The `matrix-alertmanager-receiver` is a small deployment in the monitoring namespace with a bot account (`matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations — no separate pager app, no email noise. + +--- + +## Logs — Loki + +Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL: + +```bash +# Via sunbeam CLI +sunbeam mon loki logs '{namespace="matrix"}' +sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}' + +# Via Grafana Explore (systemlogs.DOMAIN) +{namespace="matrix"} |= "error" +{namespace="lasuite", container="messages-backend"} | json | level="ERROR" +``` + +Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅 + +--- + +## Traces — Tempo + +Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional, requires dedicated Tokio runtime build), you can follow a request from the proxy through the service mesh to the backend and back. 
+ +```bash +# Via Grafana Explore (systemtracing.DOMAIN) +# Search by trace ID, service name, or duration +``` + +--- + +## Datasources + +Grafana comes pre-configured with all three backends: + +| Datasource | URL | Notes | +|------------|-----|-------| +| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default | +| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation | +| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend | + +--- + +## Quick Reference + +```bash +# Check everything +sunbeam mon prometheus query 'up' +sunbeam mon grafana alerts + +# Check specific service +sunbeam mon prometheus query 'up{job="pingora"}' +sunbeam mon loki logs '{namespace="matrix", container="sol"}' + +# Open dashboards +# → metrics.DOMAIN in your browser +```
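One last sketch for completeness: the alert rule files listed earlier are plain PrometheusRule resources. The group, alert name, and threshold below are illustrative, not taken from the real files:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  namespace: monitoring
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: PodRestartingTooOften
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 10m             # must hold for 10 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} keeps restarting"
```

Anything matching a rule like this follows the pipeline above: AlertManager → webhook → Matrix. ✨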