sbbb/docs/monitoring.md
Sienna Meridian Satterwhite 977972d9f3 docs: add observability documentation — Keeping an Eye on the Girlies 👁️
Prometheus, Grafana (10 dashboards), Loki, Tempo, AlertManager → Matrix,
ServiceMonitors, PrometheusRules per component.
2026-03-24 11:46:33 +00:00

# Keeping an Eye on the Girlies 👁️
You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains.
---
## The Stack
| Component | What it does | Where it lives |
|-----------|-------------|----------------|
| **Prometheus** | Scrapes metrics from everything, every 30 seconds | `systemmetrics.DOMAIN` |
| **Grafana** | Dashboards, visualizations, the pretty pictures | `metrics.DOMAIN` |
| **Loki** | Log aggregation — all container logs, searchable | `systemlogs.DOMAIN` |
| **Tempo** | Distributed tracing — follow a request across services | `systemtracing.DOMAIN` |
| **AlertManager** | Routes alerts to the right place | Internal |
| **Alloy** | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes |
All deployed via Helm in the `monitoring` namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything.
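The "everyone authenticated is Admin" behavior can be expressed with Grafana's generic OAuth settings. The fragment below is a minimal sketch, assuming Grafana is configured through a `grafana.ini` block in Helm values. The client ID, the environment variable name, and the `auth.DOMAIN` hostname are placeholders rather than values taken from this repo.

```yaml
# Hypothetical Helm values fragment; hostnames, IDs, and env var names are placeholders.
grafana.ini:
  server:
    root_url: https://metrics.DOMAIN
  auth.generic_oauth:
    enabled: true
    name: Hydra
    client_id: grafana                              # assumed OAuth2 client ID
    client_secret: $__env{GRAFANA_OIDC_SECRET}      # assumed env var holding the secret
    scopes: openid profile email
    auth_url: https://auth.DOMAIN/oauth2/auth       # assumed Hydra public hostname
    token_url: https://auth.DOMAIN/oauth2/token
    api_url: https://auth.DOMAIN/userinfo
    # JMESPath expression that always evaluates to the literal string 'Admin',
    # so every user who can log in receives the Admin role.
    role_attribute_path: "'Admin'"
```

A constant `role_attribute_path` keeps role mapping out of the identity provider entirely, which fits a small team where access to the stack already implies trust.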
---
## Dashboards
Ten Grafana dashboard ConfigMaps, organized by domain:
| Dashboard | What it covers |
|-----------|---------------|
| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage |
| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions |
| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates |
| `dashboards-lasuite` | La Suite apps — per-app health, response times |
| `dashboards-comms` | Matrix/Tuwunel + Sol☀ — message rates, bot activity |
| `dashboards-devtools` | Gitea — repo activity, CI builds |
| `dashboards-storage` | SeaweedFS — volume health, S3 operations |
| `dashboards-search` | OpenSearch — index health, query performance |
| `dashboards-media` | LiveKit — active rooms, participants, media quality |
| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring |
Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically.
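A dashboard ConfigMap of that shape might look like the sketch below. The `grafana_dashboard: "1"` label and the ConfigMap name come from this page; the file name inside `data` and the dashboard JSON are made up for illustration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboards-infrastructure
  namespace: monitoring
  labels:
    grafana_dashboard: "1"    # the label the Grafana sidecar watches for
data:
  # Hypothetical file name and a deliberately empty dashboard, for illustration only.
  nodes-overview.json: |
    {
      "title": "Nodes Overview",
      "uid": "nodes-overview",
      "panels": []
    }
```

Each key under `data` becomes one dashboard file, so a single ConfigMap can carry several related dashboards.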
---
## What Gets Scraped
ServiceMonitors tell Prometheus what to scrape. Currently active:
| Target | Namespace | Notes |
|--------|-----------|-------|
| Pingora proxy | ingress | Custom metrics on :9090 |
| Ory Kratos | ory | Identity operations |
| Ory Hydra | ory | OAuth2/OIDC metrics |
| Gitea | devtools | Repo + CI metrics |
| SeaweedFS | storage | Master, volume, filer |
| Linkerd | mesh | Control plane + data plane |
| cert-manager | cert-manager | Certificate lifecycle |
| OpenBao | data | Seal status, audit events |
| kube-state-metrics | monitoring | Kubernetes object state |
| node-exporter | monitoring | Host-level metrics |
| kube-apiserver | — | Kubernetes API |
| kubelet | — | Container runtime |
**Disabled** (not available or firewalled — we'll get to them):
- OpenSearch (no prometheus-exporter in v3.x)
- LiveKit (hostNetwork, behind firewall)
- kube-proxy (replaced by Cilium)
- etcd/scheduler/controller-manager (k3s quirks)
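
For orientation, a ServiceMonitor for one of the active targets could look roughly like this. It is a sketch, not the manifest from this repo: the resource name, the `app: pingora` label, and the `metrics` port name are assumptions, while the namespace, the `:9090` port, and the 30-second interval come from the tables above.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pingora               # hypothetical resource name
  namespace: ingress
spec:
  selector:
    matchLabels:
      app: pingora            # assumed label on the Pingora Service
  endpoints:
    - port: metrics           # assumed name of the Service port that maps to :9090
      path: /metrics
      interval: 30s           # matches the 30-second scrape interval
```

Depending on how the Prometheus Operator's selectors are configured, the ServiceMonitor may also need a matching label (for example a Helm `release` label) before Prometheus picks it up.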
---
## Alert Rules
PrometheusRule resources per component, firing when things need attention:
| File | Covers |
|------|--------|
| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage |
| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness |
| `opensearch-alertrules.yaml` | Search — cluster health, index issues |
| `openbao-alertrules.yaml` | Vault — seal status, auth failures |
| `gitea-alertrules.yaml` | Git — service health, repo errors |
| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability |
| `livekit-alertrules.yaml` | Media — server health, room issues |
| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry |
| `ory-alertrules.yaml` | Identity — auth failures, token errors |
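
As an illustration of what lives in those files, here is a minimal PrometheusRule sketch for the "pod restarts" case. The rule name, group name, threshold, and durations are assumptions chosen for the example; only the metric name, which kube-state-metrics exposes, is a known quantity.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: infrastructure-alerts          # hypothetical resource name
  namespace: monitoring
spec:
  groups:
    - name: infrastructure.pods        # hypothetical group name
      rules:
        - alert: PodRestartingFrequently   # hypothetical alert name
          # Restart counter is exported by kube-state-metrics.
          # Threshold and window are illustrative, not the repo's values.
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

The `for` clause makes the alert wait until the condition has held for five minutes, which filters out a single crash during a rollout.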
---
## Alerts → Matrix
When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix.
```
Alert fires in Prometheus
→ AlertManager evaluates routing rules
→ Webhook to matrix-alertmanager-receiver
→ Bot posts to Matrix room
→ Team sees it in chat ✨
```
The `matrix-alertmanager-receiver` is a small deployment in the `monitoring` namespace that posts through a bot account (credentials in `matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations: no separate pager app, no email noise.
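
The AlertManager side of that flow is a plain webhook receiver. The fragment below is a sketch under assumptions: the Service name mirrors the deployment name, but the port, the URL path, the room placeholder, and the grouping and timing values are illustrative and should be checked against the receiver's actual configuration.

```yaml
# Hypothetical Alertmanager configuration fragment.
route:
  receiver: matrix
  group_by: [alertname, namespace]
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: matrix
    webhook_configs:
      # Assumed Service port and path; ROOM_ID stands in for the target Matrix room.
      - url: http://matrix-alertmanager-receiver.monitoring.svc:3000/alerts/ROOM_ID
        send_resolved: true    # also post when the alert clears
```

Grouping by `alertname` and `namespace` keeps a noisy incident to one chat message per group instead of one per pod.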
---
## Logs — Loki
Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL:
```bash
# Via sunbeam CLI
sunbeam mon loki logs '{namespace="matrix"}'
sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}'
# Via Grafana Explore (systemlogs.DOMAIN): LogQL for the query editor, not shell commands
{namespace="matrix"} |= "error"
{namespace="lasuite", container="messages-backend"} | json | level="ERROR"
```
Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅
---
## Traces — Tempo
Distributed tracing runs over OTLP: Alloy receives traces and ships them to Tempo. When tracing is enabled on Pingora (optional, and it requires a build with a dedicated Tokio runtime), you can follow a request from the proxy through the service mesh to the backend and back.
```bash
# Via Grafana Explore (systemtracing.DOMAIN)
# Search by trace ID, service name, or duration
```
---
## Datasources
Grafana comes pre-configured with all three backends:
| Datasource | URL | Notes |
|------------|-----|-------|
| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default |
| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation |
| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend |
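
In Grafana's provisioning format, the table above translates to something like the following sketch. The URLs are the ones listed; the `uid` values and the `traceID=` log pattern in `matcherRegex` are assumptions and need to match how the services actually log trace IDs.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus            # assumed uid
    access: proxy
    url: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki                  # assumed uid
    access: proxy
    url: http://loki-gateway.monitoring.svc:80
    jsonData:
      derivedFields:
        - name: TraceID
          # Assumed log format: a key=value pair such as traceID=abc123.
          matcherRegex: 'traceID=(\w+)'
          datasourceUid: tempo         # must equal the Tempo datasource uid below
          # The doubled $ escapes the dollar sign in provisioning files.
          url: '$${__value.raw}'
  - name: Tempo
    type: tempo
    uid: tempo                 # assumed uid
    access: proxy
    url: http://tempo.monitoring.svc:3200
```

The derived field is what turns a trace ID in a log line into a clickable link: the regex captures the ID, and `datasourceUid` tells Grafana to open it in Tempo.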
---
## Quick Reference
```bash
# Check everything
sunbeam mon prometheus query 'up'
sunbeam mon grafana alerts
# Check specific service
sunbeam mon prometheus query 'up{job="pingora"}'
sunbeam mon loki logs '{namespace="matrix", container="sol"}'
# Open dashboards
# → metrics.DOMAIN in your browser
```