# Keeping an Eye on the Girlies 👁️

You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains.

---

## The Stack

| Component | What it does | Where it lives |
|-----------|-------------|----------------|
| **Prometheus** | Scrapes metrics from everything, every 30 seconds | `systemmetrics.DOMAIN` |
| **Grafana** | Dashboards, visualizations, the pretty pictures | `metrics.DOMAIN` |
| **Loki** | Log aggregation — all container logs, searchable | `systemlogs.DOMAIN` |
| **Tempo** | Distributed tracing — follow a request across services | `systemtracing.DOMAIN` |
| **AlertManager** | Routes alerts to the right place | Internal |
| **Alloy** | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes |

All deployed via Helm in the `monitoring` namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything.
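
A rough sketch of how the "everyone on the team is Admin" behaviour could be expressed as Helm values for Grafana's generic OAuth login. The `grafana:` top-level key (kube-prometheus-stack subchart layout), the `HYDRA_HOST` placeholder, and the `grafana` client ID are assumptions for illustration, not the box's actual values. `role_attribute_path` is a JMESPath expression, so the constant `'Admin'` maps every successful login to the Admin role.

```bash
# Write hypothetical Grafana Helm values to a scratch file for review.
workdir="$(mktemp -d)"
values="$workdir/grafana-oidc-values.yaml"

cat > "$values" <<'EOF'
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Hydra
      client_id: grafana                          # hypothetical client ID
      scopes: openid profile email
      auth_url: https://HYDRA_HOST/oauth2/auth    # placeholder host
      token_url: https://HYDRA_HOST/oauth2/token
      api_url: https://HYDRA_HOST/userinfo
      role_attribute_path: "'Admin'"              # constant: every user becomes Admin
EOF

echo "wrote $values"
```

The client secret is deliberately left out of the sketch; it would normally come from a Kubernetes Secret rather than from values checked into a repo.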

---

## Dashboards

Ten Grafana dashboard ConfigMaps, organized by domain:

| Dashboard | What it covers |
|-----------|---------------|
| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage |
| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions |
| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates |
| `dashboards-lasuite` | La Suite apps — per-app health, response times |
| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity |
| `dashboards-devtools` | Gitea — repo activity, CI builds |
| `dashboards-storage` | SeaweedFS — volume health, S3 operations |
| `dashboards-search` | OpenSearch — index health, query performance |
| `dashboards-media` | LiveKit — active rooms, participants, media quality |
| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring |

Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically.
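
As a minimal sketch of that convention, the snippet below writes a ConfigMap that the sidecar would pick up. The name `dashboards-example`, the tiny dashboard JSON, and the assumption that the sidecar watches the `monitoring` namespace are all illustrative; only the `grafana_dashboard: "1"` label comes from the setup described above.

```bash
# Write a hypothetical dashboard ConfigMap to a scratch file.
workdir="$(mktemp -d)"
manifest="$workdir/dashboards-example.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboards-example      # hypothetical name
  namespace: monitoring         # assumed: where the sidecar looks
  labels:
    grafana_dashboard: "1"      # the label the Grafana sidecar watches for
data:
  example.json: |
    {
      "title": "Example",
      "uid": "example",
      "panels": []
    }
EOF

echo "wrote $manifest"
# To try it against a cluster:
# kubectl apply -f "$manifest"
```

Note the quotes around `"1"`: label values must be strings, and an unquoted `1` would be parsed as a number and rejected.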

---

## What Gets Scraped

ServiceMonitors tell Prometheus what to scrape. Currently active:

| Target | Namespace | Notes |
|--------|-----------|-------|
| Pingora proxy | ingress | Custom metrics on :9090 |
| Ory Kratos | ory | Identity operations |
| Ory Hydra | ory | OAuth2/OIDC metrics |
| Gitea | devtools | Repo + CI metrics |
| SeaweedFS | storage | Master, volume, filer |
| Linkerd | mesh | Control plane + data plane |
| cert-manager | cert-manager | Certificate lifecycle |
| OpenBao | data | Seal status, audit events |
| kube-state-metrics | monitoring | Kubernetes object state |
| node-exporter | monitoring | Host-level metrics |
| kube-apiserver | — | Kubernetes API |
| kubelet | — | Node agent + per-container resource metrics (cAdvisor) |
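
For orientation, here is roughly what one of those ServiceMonitors could look like, using the Pingora row as the example. The `app: pingora` selector and the port name `metrics` are hypothetical; the sketch assumes the Service exposes :9090 under that name. The 30-second interval matches the scrape cadence mentioned for Prometheus.

```bash
# Write a hypothetical ServiceMonitor for the Pingora proxy to a scratch file.
workdir="$(mktemp -d)"
manifest="$workdir/pingora-servicemonitor.yaml"

cat > "$manifest" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pingora
  namespace: ingress
spec:
  selector:
    matchLabels:
      app: pingora          # hypothetical Service label
  endpoints:
    - port: metrics         # Service port *name*, assumed to map to 9090
      path: /metrics
      interval: 30s
EOF

echo "wrote $manifest"
# kubectl apply -f "$manifest"
```

Depending on how the Prometheus Operator is configured, the ServiceMonitor may also need a label matching the Prometheus `serviceMonitorSelector` before it gets scraped.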

**Disabled** (not available or firewalled — we'll get to them):
- OpenSearch (no prometheus-exporter in v3.x)
- LiveKit (hostNetwork, behind firewall)
- kube-proxy (replaced by Cilium)
- etcd/scheduler/controller-manager (k3s quirks)

---

## Alert Rules

PrometheusRule resources per component, firing when things need attention:

| File | Covers |
|------|--------|
| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage |
| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness |
| `opensearch-alertrules.yaml` | Search — cluster health, index issues |
| `openbao-alertrules.yaml` | Vault — seal status, auth failures |
| `gitea-alertrules.yaml` | Git — service health, repo errors |
| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability |
| `livekit-alertrules.yaml` | Media — server health, room issues |
| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry |
| `ory-alertrules.yaml` | Identity — auth failures, token errors |
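
A sketch of what a single rule in one of these files might contain, here a pod-restart alert of the kind `alertrules-infrastructure.yaml` covers. The alert name, threshold, and severity label are made up for illustration; the metric `kube_pod_container_status_restarts_total` comes from kube-state-metrics.

```bash
# Write a hypothetical PrometheusRule to a scratch file.
workdir="$(mktemp -d)"
manifest="$workdir/example-alertrules.yaml"

# The quoted 'EOF' keeps $labels literal so the shell does not expand it.
cat > "$manifest" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alertrules            # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: example.pods
      rules:
        - alert: PodRestartingFrequently
          # More than 3 container restarts within 15 minutes (illustrative threshold)
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
EOF

echo "wrote $manifest"
# kubectl apply -f "$manifest"
```

The `for: 5m` clause keeps the alert pending until the condition has held for five minutes, which filters out one-off blips before anything reaches chat.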

---

## Alerts → Matrix

When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix.

```
Alert fires in Prometheus
  → AlertManager evaluates routing rules
  → Webhook to matrix-alertmanager-receiver
  → Bot posts to Matrix room
  → Team sees it in chat ✨
```

The `matrix-alertmanager-receiver` is a small deployment in the `monitoring` namespace that posts as a bot account (credentials in `matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations: no separate pager app, no email noise.
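
The webhook hop in that flow could be wired up with an AlertManager config along these lines. This is a sketch under assumptions: the Service hostname, port `12345`, the `/alerts/ROOM_ID` path, and the grouping timings are placeholders, so check the receiver's own configuration for the real values.

```bash
# Write a hypothetical AlertManager routing config to a scratch file.
workdir="$(mktemp -d)"
config="$workdir/alertmanager.yaml"

cat > "$config" <<'EOF'
route:
  receiver: matrix                      # default receiver for every alert
  group_by: [alertname, namespace]      # batch related alerts into one message
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: matrix
    webhook_configs:
      # Assumed URL: service name, port, and path are placeholders
      - url: http://matrix-alertmanager-receiver.monitoring.svc:12345/alerts/ROOM_ID
        send_resolved: true             # also post when the alert clears
EOF

echo "wrote $config"
```

With `send_resolved: true`, the room gets a follow-up message when an alert recovers, so nobody has to go check whether the fire is out.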

---

## Logs — Loki

Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL:

```bash
# Via sunbeam CLI
sunbeam mon loki logs '{namespace="matrix"}'
sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}'

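# Via Grafana Explore at metrics.DOMAIN (Loki datasource, backed by systemlogs.DOMAIN)
# Paste these as LogQL in the query box, not in your shell:
#   {namespace="matrix"} |= "error"
#   {namespace="lasuite", container="messages-backend"} | json | level="ERROR"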
```

Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅

---

## Traces — Tempo

Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional, requires dedicated Tokio runtime build), you can follow a request from the proxy through the service mesh to the backend and back.

```bash
# Via Grafana Explore at metrics.DOMAIN (Tempo datasource, backed by systemtracing.DOMAIN)
# Search by trace ID, service name, or duration
```
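
The Alloy leg of that pipeline (receive OTLP, forward to Tempo) can be sketched as the Alloy config below. The component labels `default` and `tempo`, the listen ports, and the assumption that Tempo accepts OTLP gRPC on `tempo.monitoring.svc:4317` without TLS are illustrative, not taken from the box's actual config.

```bash
# Write a hypothetical Alloy trace pipeline to a scratch file.
workdir="$(mktemp -d)"
config="$workdir/traces.alloy"

cat > "$config" <<'EOF'
// Accept OTLP traces from instrumented services over gRPC and HTTP.
otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// Forward everything to Tempo (assumed in-cluster OTLP gRPC endpoint).
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.monitoring.svc:4317"
    tls { insecure = true }   // assumed: plain gRPC inside the cluster
  }
}
EOF

echo "wrote $config"
```

A production pipeline would usually put a batching step between the receiver and the exporter; it is left out here to keep the two hops visible.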

---

## Datasources

Grafana comes pre-configured with all three backends:

| Datasource | URL | Notes |
|------------|-----|-------|
| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default |
| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation |
| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend |
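
In Grafana's provisioning format, the table above might translate to something like the following. The URLs come from the table; the `uid` values and the `matcherRegex` (which assumes JSON logs with a `traceID` field) are hypothetical. The doubled `$$` stops Grafana's provisioning loader from treating `${__value.raw}` as an environment variable.

```bash
# Write a hypothetical Grafana datasource provisioning file to a scratch path.
workdir="$(mktemp -d)"
provisioning="$workdir/datasources.yaml"

cat > "$provisioning" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    url: http://loki-gateway.monitoring.svc:80
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'   # assumed log format
          url: '$${__value.raw}'              # the captured ID becomes the Tempo query
          datasourceUid: tempo                # internal link to the Tempo datasource
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo.monitoring.svc:3200
EOF

echo "wrote $provisioning"
```

The `datasourceUid` on the derived field is what turns a matched trace ID into a clickable link to Tempo instead of a plain external URL.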

---

## Quick Reference

```bash
# Check everything
sunbeam mon prometheus query 'up'
sunbeam mon grafana alerts

# Check specific service
sunbeam mon prometheus query 'up{job="pingora"}'
sunbeam mon loki logs '{namespace="matrix", container="sol"}'

# Open dashboards
# → metrics.DOMAIN in your browser
```