`sbbb/docs/monitoring.md` · commit `977972d9f3` by Sienna Meridian Satterwhite, 2026-03-24 11:46:33 +00:00
docs: add observability documentation — Keeping an Eye on the Girlies 👁️
Prometheus, Grafana (10 dashboards), Loki, Tempo, AlertManager → Matrix, ServiceMonitors, PrometheusRules per component.


# Keeping an Eye on the Girlies 👁️

You don't run a production stack without watching it. The Super Boujee Business Box has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains.


## The Stack

| Component | What it does | Where it lives |
| --- | --- | --- |
| Prometheus | Scrapes metrics from everything, every 30 seconds | `systemmetrics.DOMAIN` |
| Grafana | Dashboards, visualizations, the pretty pictures | `metrics.DOMAIN` |
| Loki | Log aggregation — all container logs, searchable | `systemlogs.DOMAIN` |
| Tempo | Distributed tracing — follow a request across services | `systemtracing.DOMAIN` |
| AlertManager | Routes alerts to the right place | Internal |
| Alloy | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes |

All deployed via Helm in the monitoring namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything.
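In Helm-chart terms, that auth behaviour can be sketched roughly like this (the client ID and Hydra URLs below are placeholders, not the real values, which live in the actual chart config):

```yaml
# grafana chart values — illustrative sketch, not the deployed config
grafana.ini:
  users:
    # every authenticated user lands in the org as Admin
    auto_assign_org_role: Admin
  auth.generic_oauth:
    enabled: true
    name: Hydra
    client_id: grafana                            # placeholder
    scopes: openid profile email
    auth_url: https://hydra.DOMAIN/oauth2/auth    # placeholder host
    token_url: https://hydra.DOMAIN/oauth2/token
    api_url: https://hydra.DOMAIN/userinfo
```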


## Dashboards

Ten Grafana dashboard ConfigMaps, organized by domain:

| Dashboard | What it covers |
| --- | --- |
| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage |
| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions |
| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates |
| `dashboards-lasuite` | La Suite apps — per-app health, response times |
| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity |
| `dashboards-devtools` | Gitea — repo activity, CI builds |
| `dashboards-storage` | SeaweedFS — volume health, S3 operations |
| `dashboards-search` | OpenSearch — index health, query performance |
| `dashboards-media` | LiveKit — active rooms, participants, media quality |
| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring |

Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically.
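A dashboard ConfigMap then just needs that label and a JSON payload. A minimal illustrative sketch (the dashboard body here is a stub, not a real export):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboards-ingress
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the sidecar watches for exactly this label
data:
  # the value is whatever Grafana's dashboard JSON export produces
  pingora.json: |
    {"title": "Pingora Proxy", "panels": []}
```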


## What Gets Scraped

ServiceMonitors tell Prometheus what to scrape. Currently active:

| Target | Namespace | Notes |
| --- | --- | --- |
| Pingora proxy | `ingress` | Custom metrics on `:9090` |
| Ory Kratos | `ory` | Identity operations |
| Ory Hydra | `ory` | OAuth2/OIDC metrics |
| Gitea | `devtools` | Repo + CI metrics |
| SeaweedFS | `storage` | Master, volume, filer |
| Linkerd | `mesh` | Control plane + data plane |
| cert-manager | `cert-manager` | Certificate lifecycle |
| OpenBao | `data` | Seal status, audit events |
| kube-state-metrics | `monitoring` | Kubernetes object state |
| node-exporter | `monitoring` | Host-level metrics |
| kube-apiserver | | Kubernetes API |
| kubelet | | Container runtime |
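As a sketch, the Pingora entry probably looks something like this (the selector labels, the port name, and the `release` label the operator matches on are assumptions, not copied from the real manifest):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pingora
  namespace: ingress
  labels:
    release: kube-prometheus-stack   # so the Prometheus operator selects it (assumed)
spec:
  selector:
    matchLabels:
      app: pingora                   # assumed Service label
  endpoints:
    - port: metrics                  # the :9090 metrics port, name assumed
      interval: 30s                  # matches the 30-second scrape cadence above
```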

Disabled (not available or firewalled — we'll get to them):

- OpenSearch (no prometheus-exporter in v3.x)
- LiveKit (hostNetwork, behind firewall)
- kube-proxy (replaced by Cilium)
- etcd/scheduler/controller-manager (k3s quirks)

## Alert Rules

PrometheusRule resources per component, firing when things need attention:

| File | Covers |
| --- | --- |
| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage |
| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness |
| `opensearch-alertrules.yaml` | Search — cluster health, index issues |
| `openbao-alertrules.yaml` | Vault — seal status, auth failures |
| `gitea-alertrules.yaml` | Git — service health, repo errors |
| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability |
| `livekit-alertrules.yaml` | Media — server health, room issues |
| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry |
| `ory-alertrules.yaml` | Identity — auth failures, token errors |
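A hedged sketch of what one of these rules might look like, using the OpenBao seal-status case (the metric name, threshold, and labels are assumptions, not lifted from the real file):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-alertrules
  namespace: monitoring
spec:
  groups:
    - name: openbao
      rules:
        - alert: OpenBaoSealed
          # metric name assumed from Vault-style telemetry
          expr: vault_core_unsealed == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "OpenBao is sealed and cannot serve secrets"
```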

## Alerts → Matrix

When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix.

```
Alert fires in Prometheus
  → AlertManager evaluates routing rules
  → Webhook to matrix-alertmanager-receiver
  → Bot posts to Matrix room
  → Team sees it in chat ✨
```

The `matrix-alertmanager-receiver` is a small deployment in the `monitoring` namespace with a bot account (`matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations — no separate pager app, no email noise.
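The AlertManager side of that flow is plain webhook routing. A minimal sketch (the receiver name, service port, and path are assumptions):

```yaml
# alertmanager.yml fragment — illustrative, not the deployed routing tree
route:
  receiver: matrix               # default route: everything goes to chat
receivers:
  - name: matrix
    webhook_configs:
      # port and path on the receiver deployment are assumed
      - url: http://matrix-alertmanager-receiver.monitoring.svc:8080/alerts
```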


## Logs — Loki

Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL:

```shell
# Via sunbeam CLI
sunbeam mon loki logs '{namespace="matrix"}'
sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}'
```

```logql
# Via Grafana Explore (systemlogs.DOMAIN)
{namespace="matrix"} |= "error"
{namespace="lasuite", container="messages-backend"} | json | level="ERROR"
```

Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅
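Those derived fields live on the Loki datasource. A provisioning sketch (the matcher regex and the Tempo datasource UID are assumptions about how the logs and datasources are named):

```yaml
# Grafana datasource provisioning — illustrative fragment
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway.monitoring.svc:80
    jsonData:
      derivedFields:
        - name: traceID
          matcherRegex: '"traceID":"(\w+)"'   # assumed log field shape
          datasourceUid: tempo                # UID of the Tempo datasource (assumed)
          url: "$${__value.raw}"              # $$ escapes Grafana's env interpolation
```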


## Traces — Tempo

Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional, requires a dedicated Tokio runtime build), you can follow a request from the proxy through the service mesh to the backend and back.

```
# Via Grafana Explore (systemtracing.DOMAIN)
# Search by trace ID, service name, or duration
```
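The Alloy side of this is a small OTLP pipeline. A sketch in Alloy's configuration syntax (component labels, the Tempo port, and the TLS choice are assumptions):

```river
// Alloy trace pipeline — illustrative, not the deployed config
otelcol.receiver.otlp "default" {
  grpc {}   // listen for OTLP/gRPC from instrumented services
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.monitoring.svc:4317"   // Tempo's OTLP port (assumed)
    tls {
      insecure = true   // in-cluster traffic; encryption choice is an assumption
    }
  }
}
```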

## Datasources

Grafana comes pre-configured with all three backends:

| Datasource | URL | Notes |
| --- | --- | --- |
| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default |
| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation |
| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend |

## Quick Reference

```shell
# Check everything
sunbeam mon prometheus query 'up'
sunbeam mon grafana alerts

# Check specific service
sunbeam mon prometheus query 'up{job="pingora"}'
sunbeam mon loki logs '{namespace="matrix", container="sol"}'

# Open dashboards
# → metrics.DOMAIN in your browser
```