From 977972d9f395fd31fe6dba7868a5fb2accb46246 Mon Sep 17 00:00:00 2001 From: Sienna Meridian Satterwhite Date: Tue, 24 Mar 2026 11:46:33 +0000 Subject: [PATCH] =?UTF-8?q?docs:=20add=20observability=20documentation=20?= =?UTF-8?q?=E2=80=94=20Keeping=20an=20Eye=20on=20the=20Girlies=20?= =?UTF-8?q?=F0=9F=91=81=EF=B8=8F?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prometheus, Grafana (10 dashboards), Loki, Tempo, AlertManager → Matrix, ServiceMonitors, PrometheusRules per component. --- docs/monitoring.md | 158 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 158 insertions(+) create mode 100644 docs/monitoring.md diff --git a/docs/monitoring.md b/docs/monitoring.md new file mode 100644 index 0000000..81d04ea --- /dev/null +++ b/docs/monitoring.md @@ -0,0 +1,158 @@ +# Keeping an Eye on the Girlies 👁️ + +You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains. + +--- + +## The Stack + +| Component | What it does | Where it lives | +|-----------|-------------|----------------| +| **Prometheus** | Scrapes metrics from everything, every 30 seconds | `systemmetrics.DOMAIN` | +| **Grafana** | Dashboards, visualizations, the pretty pictures | `metrics.DOMAIN` | +| **Loki** | Log aggregation — all container logs, searchable | `systemlogs.DOMAIN` | +| **Tempo** | Distributed tracing — follow a request across services | `systemtracing.DOMAIN` | +| **AlertManager** | Routes alerts to the right place | Internal | +| **Alloy** | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes | + +All deployed via Helm in the `monitoring` namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything. 
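In Helm values terms, that Hydra hookup is Grafana's generic OAuth block. A rough sketch — the URLs, client ID, and role mapping here are assumptions for illustration, not the actual deployed values:

```yaml
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Hydra
      client_id: grafana                        # assumed client registered in Hydra
      auth_url: https://auth.DOMAIN/oauth2/auth # Hydra's standard authorize endpoint
      token_url: https://auth.DOMAIN/oauth2/token
      api_url: https://auth.DOMAIN/userinfo
      scopes: openid profile email
      # JMESPath string literal: every authenticated user lands as Admin
      role_attribute_path: "'Admin'"
```

The `role_attribute_path: "'Admin'"` trick is what makes "if you're on the team, you see everything" work — a constant JMESPath expression instead of a real claim lookup.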
+ +--- + +## Dashboards + +Ten Grafana dashboard ConfigMaps, organized by domain: + +| Dashboard | What it covers | +|-----------|---------------| +| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage | +| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions | +| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates | +| `dashboards-lasuite` | La Suite apps — per-app health, response times | +| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity | +| `dashboards-devtools` | Gitea — repo activity, CI builds | +| `dashboards-storage` | SeaweedFS — volume health, S3 operations | +| `dashboards-search` | OpenSearch — index health, query performance | +| `dashboards-media` | LiveKit — active rooms, participants, media quality | +| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring | + +Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically. + +--- + +## What Gets Scraped + +ServiceMonitors tell Prometheus what to scrape. 
Currently active: + +| Target | Namespace | Notes | +|--------|-----------|-------| +| Pingora proxy | ingress | Custom metrics on :9090 | +| Ory Kratos | ory | Identity operations | +| Ory Hydra | ory | OAuth2/OIDC metrics | +| Gitea | devtools | Repo + CI metrics | +| SeaweedFS | storage | Master, volume, filer | +| Linkerd | mesh | Control plane + data plane | +| cert-manager | cert-manager | Certificate lifecycle | +| OpenBao | data | Seal status, audit events | +| kube-state-metrics | monitoring | Kubernetes object state | +| node-exporter | monitoring | Host-level metrics | +| kube-apiserver | — | Kubernetes API | +| kubelet | — | Container runtime | + +**Disabled** (not available or firewalled — we'll get to them): +- OpenSearch (no prometheus-exporter in v3.x) +- LiveKit (hostNetwork, behind firewall) +- kube-proxy (replaced by Cilium) +- etcd/scheduler/controller-manager (k3s quirks) + +--- + +## Alert Rules + +PrometheusRule resources per component, firing when things need attention: + +| File | Covers | +|------|--------| +| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage | +| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness | +| `opensearch-alertrules.yaml` | Search — cluster health, index issues | +| `openbao-alertrules.yaml` | Vault — seal status, auth failures | +| `gitea-alertrules.yaml` | Git — service health, repo errors | +| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability | +| `livekit-alertrules.yaml` | Media — server health, room issues | +| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry | +| `ory-alertrules.yaml` | Identity — auth failures, token errors | + +--- + +## Alerts → Matrix + +When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix. 
+ +``` +Alert fires in Prometheus + → AlertManager evaluates routing rules + → Webhook to matrix-alertmanager-receiver + → Bot posts to Matrix room + → Team sees it in chat ✨ +``` + +The `matrix-alertmanager-receiver` is a small deployment in the monitoring namespace with a bot account (`matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations — no separate pager app, no email noise. + +--- + +## Logs — Loki + +Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL: + +```bash +# Via sunbeam CLI +sunbeam mon loki logs '{namespace="matrix"}' +sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}' + +# Via Grafana Explore (systemlogs.DOMAIN) +{namespace="matrix"} |= "error" +{namespace="lasuite", container="messages-backend"} | json | level="ERROR" +``` + +Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅 + +--- + +## Traces — Tempo + +Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional, requires dedicated Tokio runtime build), you can follow a request from the proxy through the service mesh to the backend and back. 
+ +```bash +# Via Grafana Explore (systemtracing.DOMAIN) +# Search by trace ID, service name, or duration +``` + +--- + +## Datasources + +Grafana comes pre-configured with all three backends: + +| Datasource | URL | Notes | +|------------|-----|-------| +| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default | +| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation | +| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend | + +--- + +## Quick Reference + +```bash +# Check everything +sunbeam mon prometheus query 'up' +sunbeam mon grafana alerts + +# Check specific service +sunbeam mon prometheus query 'up{job="pingora"}' +sunbeam mon loki logs '{namespace="matrix", container="sol"}' + +# Open dashboards +# → metrics.DOMAIN in your browser +```
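One last sketch for completeness: the alert rule files listed earlier are plain PrometheusRule resources. The group, alert name, and threshold below are illustrative, not taken from the real files:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts
  namespace: monitoring
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: PodRestartingTooOften
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 10m             # must hold for 10 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} keeps restarting"
```

Anything matching a rule like this follows the pipeline above: AlertManager → webhook → Matrix. ✨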