# Keeping an Eye on the Girlies 👁️
You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains.
## The Stack
| Component | What it does | Where it lives |
|---|---|---|
| Prometheus | Scrapes metrics from everything, every 30 seconds | systemmetrics.DOMAIN |
| Grafana | Dashboards, visualizations, the pretty pictures | metrics.DOMAIN |
| Loki | Log aggregation — all container logs, searchable | systemlogs.DOMAIN |
| Tempo | Distributed tracing — follow a request across services | systemtracing.DOMAIN |
| AlertManager | Routes alerts to the right place | Internal |
| Alloy | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes |
All deployed via Helm in the monitoring namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything.
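To make that concrete, here's a rough sketch of what the Grafana chart values for OIDC-against-Hydra could look like. The client ID, URLs, and exact section layout are assumptions for illustration, not the stack's actual configuration:

```yaml
# Hypothetical Grafana Helm values excerpt — client_id and auth URLs are
# illustrative placeholders, not the real deployment's values
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Hydra
      client_id: grafana
      scopes: openid profile email
      auth_url: https://auth.DOMAIN/oauth2/auth
      token_url: https://auth.DOMAIN/oauth2/token
      api_url: https://auth.DOMAIN/userinfo
    users:
      auto_assign_org_role: Admin   # every authenticated user gets Admin
```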
## Dashboards
Ten Grafana dashboard ConfigMaps, organized by domain:
| Dashboard | What it covers |
|---|---|
| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage |
| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions |
| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates |
| `dashboards-lasuite` | La Suite apps — per-app health, response times |
| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity |
| `dashboards-devtools` | Gitea — repo activity, CI builds |
| `dashboards-storage` | SeaweedFS — volume health, S3 operations |
| `dashboards-search` | OpenSearch — index health, query performance |
| `dashboards-media` | LiveKit — active rooms, participants, media quality |
| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring |
Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically.
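The shape of such a ConfigMap, sketched (the name, key, and dashboard JSON are placeholders — real dashboards are full JSON exports):

```yaml
# Sketch of a dashboard ConfigMap the sidecar would discover
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboards-ingress
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the sidecar watches for this label
data:
  pingora.json: |
    { "title": "Pingora Proxy", "panels": [] }   # dashboard JSON elided
```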
## What Gets Scraped
ServiceMonitors tell Prometheus what to scrape. Currently active:
| Target | Namespace | Notes |
|---|---|---|
| Pingora proxy | ingress | Custom metrics on :9090 |
| Ory Kratos | ory | Identity operations |
| Ory Hydra | ory | OAuth2/OIDC metrics |
| Gitea | devtools | Repo + CI metrics |
| SeaweedFS | storage | Master, volume, filer |
| Linkerd | mesh | Control plane + data plane |
| cert-manager | cert-manager | Certificate lifecycle |
| OpenBao | data | Seal status, audit events |
| kube-state-metrics | monitoring | Kubernetes object state |
| node-exporter | monitoring | Host-level metrics |
| kube-apiserver | — | Kubernetes API |
| kubelet | — | Container runtime |
Disabled (not available or firewalled — we'll get to them):
- OpenSearch (no prometheus-exporter in v3.x)
- LiveKit (hostNetwork, behind firewall)
- kube-proxy (replaced by Cilium)
- etcd/scheduler/controller-manager (k3s quirks)
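For reference, a ServiceMonitor for one of the active targets might look roughly like this — the label selector and port name are assumptions, not the deployed manifest:

```yaml
# Illustrative ServiceMonitor for the Pingora proxy
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pingora
  namespace: ingress
spec:
  selector:
    matchLabels:
      app: pingora          # assumed label on the proxy's Service
  endpoints:
    - port: metrics         # the :9090 custom-metrics port from the table
      interval: 30s         # matches the 30-second scrape cadence above
```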
## Alert Rules
PrometheusRule resources per component, firing when things need attention:
| File | Covers |
|---|---|
| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage |
| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness |
| `opensearch-alertrules.yaml` | Search — cluster health, index issues |
| `openbao-alertrules.yaml` | Vault — seal status, auth failures |
| `gitea-alertrules.yaml` | Git — service health, repo errors |
| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability |
| `livekit-alertrules.yaml` | Media — server health, room issues |
| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry |
| `ory-alertrules.yaml` | Identity — auth failures, token errors |
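A PrometheusRule follows the same CRD pattern as a ServiceMonitor. A hedged sketch — the alert name, metric, and threshold below are placeholders, not the actual rules in the repo:

```yaml
# Illustrative PrometheusRule for OpenBao seal status
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-alertrules
  namespace: monitoring
spec:
  groups:
    - name: openbao
      rules:
        - alert: OpenBaoSealed
          expr: openbao_core_unsealed == 0   # metric name is an assumption
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "OpenBao is sealed — secrets are unavailable"
```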
## Alerts → Matrix
When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix.
```
Alert fires in Prometheus
  → AlertManager evaluates routing rules
  → Webhook to matrix-alertmanager-receiver
  → Bot posts to Matrix room
  → Team sees it in chat ✨
```
The `matrix-alertmanager-receiver` is a small deployment in the monitoring namespace with a bot account (`matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations — no separate pager app, no email noise.
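The AlertManager side of that pipeline is a plain webhook receiver. A minimal sketch, assuming the receiver listens on port 8080 at `/alerts` (both are guesses, not the deployed config):

```yaml
# Sketch of the AlertManager config routing everything to the Matrix webhook
route:
  receiver: matrix
  group_by: [alertname, namespace]
receivers:
  - name: matrix
    webhook_configs:
      - url: http://matrix-alertmanager-receiver.monitoring.svc:8080/alerts
```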
## Logs — Loki
Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL:
```shell
# Via sunbeam CLI
sunbeam mon loki logs '{namespace="matrix"}'
sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}'
```

```logql
# Via Grafana Explore (systemlogs.DOMAIN)
{namespace="matrix"} |= "error"
{namespace="lasuite", container="messages-backend"} | json | level="ERROR"
```
Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅
## Traces — Tempo
Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional, requires dedicated Tokio runtime build), you can follow a request from the proxy through the service mesh to the backend and back.
```
# Via Grafana Explore (systemtracing.DOMAIN)
# Search by trace ID, service name, or duration
```
## Datasources
Grafana comes pre-configured with all three backends:
| Datasource | URL | Notes |
|---|---|---|
| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default |
| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation |
| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend |
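The logs→traces jump mentioned earlier is wired up in the Loki datasource's provisioning. A sketch of what that could look like — the matcher regex and Tempo datasource UID are assumptions:

```yaml
# Sketch of Grafana datasource provisioning: a derived field that turns a
# logged traceID into a clickable Tempo link
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway.monitoring.svc:80
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'traceID[":=]+(\w+)'   # assumed log-line shape
          datasourceUid: tempo
          url: '${__value.raw}'
```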
## Quick Reference
```shell
# Check everything
sunbeam mon prometheus query 'up'
sunbeam mon grafana alerts

# Check specific service
sunbeam mon prometheus query 'up{job="pingora"}'
sunbeam mon loki logs '{namespace="matrix", container="sol"}'

# Open dashboards
# → metrics.DOMAIN in your browser
```