# Keeping an Eye on the Girlies 👁️
You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains.
## The Stack
| Component | What it does | Where it lives |
|---|---|---|
| Prometheus | Scrapes metrics from everything, every 30 seconds | systemmetrics.DOMAIN |
| Grafana | Dashboards, visualizations, the pretty pictures | metrics.DOMAIN |
| Loki | Log aggregation — all container logs, searchable | systemlogs.DOMAIN |
| Tempo | Distributed tracing — follow a request across services | systemtracing.DOMAIN |
| AlertManager | Routes alerts to the right place | Internal |
| Alloy | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes |
All deployed via Helm in the monitoring namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything.
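To make that concrete, here's a rough sketch of what the Grafana chart values for OIDC-against-Hydra could look like. The client ID, URLs, and exact section layout are assumptions for illustration, not the stack's actual configuration:

```yaml
# Hypothetical Grafana Helm values excerpt — client_id and auth URLs are
# illustrative placeholders, not the real deployment's values
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Hydra
      client_id: grafana
      scopes: openid profile email
      auth_url: https://auth.DOMAIN/oauth2/auth
      token_url: https://auth.DOMAIN/oauth2/token
      api_url: https://auth.DOMAIN/userinfo
    users:
      auto_assign_org_role: Admin   # every authenticated user gets Admin
```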
## Dashboards
Ten Grafana dashboard ConfigMaps, organized by domain:
| Dashboard | What it covers |
|---|---|
| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage |
| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions |
| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates |
| `dashboards-lasuite` | La Suite apps — per-app health, response times |
| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity |
| `dashboards-devtools` | Gitea — repo activity, CI builds |
| `dashboards-storage` | SeaweedFS — volume health, S3 operations |
| `dashboards-search` | OpenSearch — index health, query performance |
| `dashboards-media` | LiveKit — active rooms, participants, media quality |
| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring |
Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically.
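The shape of such a ConfigMap, sketched (the name, key, and dashboard JSON are placeholders — real dashboards are full JSON exports):

```yaml
# Sketch of a dashboard ConfigMap the sidecar would discover
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboards-ingress
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # the sidecar watches for this label
data:
  pingora.json: |
    { "title": "Pingora Proxy", "panels": [] }   # dashboard JSON elided
```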
## What Gets Scraped
ServiceMonitors tell Prometheus what to scrape. Currently active:
| Target | Namespace | Notes |
|---|---|---|
| Pingora proxy | ingress | Custom metrics on :9090 |
| Ory Kratos | ory | Identity operations |
| Ory Hydra | ory | OAuth2/OIDC metrics |
| Gitea | devtools | Repo + CI metrics |
| SeaweedFS | storage | Master, volume, filer |
| Linkerd | mesh | Control plane + data plane |
| cert-manager | cert-manager | Certificate lifecycle |
| OpenBao | data | Seal status, audit events |
| kube-state-metrics | monitoring | Kubernetes object state |
| node-exporter | monitoring | Host-level metrics |
| kube-apiserver | — | Kubernetes API |
| kubelet | — | Container runtime |
Disabled (not available or firewalled — we'll get to them):
- OpenSearch (no prometheus-exporter in v3.x)
- LiveKit (hostNetwork, behind firewall)
- kube-proxy (replaced by Cilium)
- etcd/scheduler/controller-manager (k3s quirks)
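For reference, a ServiceMonitor for one of the active targets might look roughly like this — the label selector and port name are assumptions, not the deployed manifest:

```yaml
# Illustrative ServiceMonitor for the Pingora proxy
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pingora
  namespace: ingress
spec:
  selector:
    matchLabels:
      app: pingora          # assumed label on the proxy's Service
  endpoints:
    - port: metrics         # the :9090 custom-metrics port from the table
      interval: 30s         # matches the 30-second scrape cadence above
```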
## Alert Rules
PrometheusRule resources per component, firing when things need attention:
| File | Covers |
|---|---|
| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage |
| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness |
| `opensearch-alertrules.yaml` | Search — cluster health, index issues |
| `openbao-alertrules.yaml` | Vault — seal status, auth failures |
| `gitea-alertrules.yaml` | Git — service health, repo errors |
| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability |
| `livekit-alertrules.yaml` | Media — server health, room issues |
| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry |
| `ory-alertrules.yaml` | Identity — auth failures, token errors |
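A PrometheusRule follows the same CRD pattern as a ServiceMonitor. A hedged sketch — the alert name, metric, and threshold below are placeholders, not the actual rules in the repo:

```yaml
# Illustrative PrometheusRule for OpenBao seal status
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: openbao-alertrules
  namespace: monitoring
spec:
  groups:
    - name: openbao
      rules:
        - alert: OpenBaoSealed
          expr: openbao_core_unsealed == 0   # metric name is an assumption
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "OpenBao is sealed — secrets are unavailable"
```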
## Alerts → Matrix
When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix.
```
Alert fires in Prometheus
  → AlertManager evaluates routing rules
  → Webhook to matrix-alertmanager-receiver
  → Bot posts to Matrix room
  → Team sees it in chat ✨
```
The `matrix-alertmanager-receiver` is a small deployment in the monitoring namespace with a bot account (`matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations — no separate pager app, no email noise.
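The AlertManager side of that pipeline is a plain webhook receiver. A minimal sketch, assuming the receiver listens on port 8080 at `/alerts` (both are guesses, not the deployed config):

```yaml
# Sketch of the AlertManager config routing everything to the Matrix webhook
route:
  receiver: matrix
  group_by: [alertname, namespace]
receivers:
  - name: matrix
    webhook_configs:
      - url: http://matrix-alertmanager-receiver.monitoring.svc:8080/alerts
```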
## Logs — Loki
Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL:
```shell
# Via sunbeam CLI
sunbeam mon loki logs '{namespace="matrix"}'
sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}'
```

```logql
# Via Grafana Explore (systemlogs.DOMAIN)
{namespace="matrix"} |= "error"
{namespace="lasuite", container="messages-backend"} | json | level="ERROR"
```
Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅
## Traces — Tempo
Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional, requires dedicated Tokio runtime build), you can follow a request from the proxy through the service mesh to the backend and back.
```
# Via Grafana Explore (systemtracing.DOMAIN)
# Search by trace ID, service name, or duration
```
## Datasources
Grafana comes pre-configured with all three backends:
| Datasource | URL | Notes |
|---|---|---|
| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default |
| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation |
| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend |
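The logs→traces jump mentioned earlier is wired up in the Loki datasource's provisioning. A sketch of what that could look like — the matcher regex and Tempo datasource UID are assumptions:

```yaml
# Sketch of Grafana datasource provisioning: a derived field that turns a
# logged traceID into a clickable Tempo link
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway.monitoring.svc:80
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'traceID[":=]+(\w+)'   # assumed log-line shape
          datasourceUid: tempo
          url: '${__value.raw}'
```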
## Quick Reference
```shell
# Check everything
sunbeam mon prometheus query 'up'
sunbeam mon grafana alerts

# Check specific service
sunbeam mon prometheus query 'up{job="pingora"}'
sunbeam mon loki logs '{namespace="matrix", container="sol"}'

# Open dashboards
# → metrics.DOMAIN in your browser
```