# Keeping an Eye on the Girlies 👁️

You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains.

---

## The Stack

| Component | What it does | Where it lives |
|-----------|--------------|----------------|
| **Prometheus** | Scrapes metrics from everything, every 30 seconds | `systemmetrics.DOMAIN` |
| **Grafana** | Dashboards, visualizations, the pretty pictures | `metrics.DOMAIN` |
| **Loki** | Log aggregation — all container logs, searchable | `systemlogs.DOMAIN` |
| **Tempo** | Distributed tracing — follow a request across services | `systemtracing.DOMAIN` |
| **AlertManager** | Routes alerts to the right place | Internal |
| **Alloy** | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes |

All deployed via Helm in the `monitoring` namespace. Grafana authenticates via Hydra (the same OIDC provider as everything else) and auto-assigns the Admin role to all authenticated users — if you're on the team, you see everything.

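The Grafana-side wiring for that could look roughly like the following chart values. This is a minimal sketch: the client ID, the `auth.DOMAIN` endpoints, and the secret handling are illustrative assumptions, not the deployed config.

```yaml
# Hypothetical excerpt from the kube-prometheus-stack Helm values.
grafana:
  grafana.ini:
    server:
      root_url: https://metrics.DOMAIN
    users:
      auto_assign_org_role: Admin     # every authenticated user gets Admin
    auth.generic_oauth:
      enabled: true
      name: Hydra
      client_id: grafana              # assumed client registered in Hydra
      scopes: openid profile email
      auth_url: https://auth.DOMAIN/oauth2/auth    # Hydra authorize endpoint (host assumed)
      token_url: https://auth.DOMAIN/oauth2/token  # Hydra token endpoint (host assumed)
```

In a real deployment the client secret would come from a Kubernetes Secret rather than sitting in values.
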
---

## Dashboards

Ten Grafana dashboard ConfigMaps, organized by domain:

| Dashboard | What it covers |
|-----------|----------------|
| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage |
| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions |
| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates |
| `dashboards-lasuite` | La Suite apps — per-app health, response times |
| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity |
| `dashboards-devtools` | Gitea — repo activity, CI builds |
| `dashboards-storage` | SeaweedFS — volume health, S3 operations |
| `dashboards-search` | OpenSearch — index health, query performance |
| `dashboards-media` | LiveKit — active rooms, participants, media quality |
| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring |

Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically.

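One of those ConfigMaps could be sketched like this. The dashboard JSON here is a stub; real ones carry full panel definitions.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dashboards-infrastructure
  namespace: monitoring
  labels:
    grafana_dashboard: "1"    # the sidecar watches for exactly this label
data:
  infrastructure.json: |
    { "title": "Infrastructure", "panels": [] }
```

Dropping a new ConfigMap with that label into the namespace is all it takes; no Grafana restart needed.
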
---

## What Gets Scraped

ServiceMonitors tell Prometheus what to scrape. Currently active:

| Target | Namespace | Notes |
|--------|-----------|-------|
| Pingora proxy | ingress | Custom metrics on :9090 |
| Ory Kratos | ory | Identity operations |
| Ory Hydra | ory | OAuth2/OIDC metrics |
| Gitea | devtools | Repo + CI metrics |
| SeaweedFS | storage | Master, volume, filer |
| Linkerd | mesh | Control plane + data plane |
| cert-manager | cert-manager | Certificate lifecycle |
| OpenBao | data | Seal status, audit events |
| kube-state-metrics | monitoring | Kubernetes object state |
| node-exporter | monitoring | Host-level metrics |
| kube-apiserver | — | Kubernetes API |
| kubelet | — | Container runtime |

**Disabled** (not available or firewalled — we'll get to them):

- OpenSearch (no prometheus-exporter in v3.x)
- LiveKit (hostNetwork, behind firewall)
- kube-proxy (replaced by Cilium)
- etcd/scheduler/controller-manager (k3s quirks)

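A ServiceMonitor for, say, the Pingora proxy could be sketched like this; the label selector and port name are assumptions, not the actual manifest.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pingora-proxy
  namespace: ingress
spec:
  selector:
    matchLabels:
      app: pingora        # assumed Service label
  endpoints:
    - port: metrics       # named Service port exposing :9090
      path: /metrics
      interval: 30s       # matches the 30s scrape cadence above
```

Prometheus (via the operator) discovers any Service matching the selector and starts scraping its pods automatically.
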
---

## Alert Rules

PrometheusRule resources per component, firing when things need attention:

| File | Covers |
|------|--------|
| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage |
| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness |
| `opensearch-alertrules.yaml` | Search — cluster health, index issues |
| `openbao-alertrules.yaml` | Vault — seal status, auth failures |
| `gitea-alertrules.yaml` | Git — service health, repo errors |
| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability |
| `livekit-alertrules.yaml` | Media — server health, room issues |
| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry |
| `ory-alertrules.yaml` | Identity — auth failures, token errors |

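A minimal sketch of what one of these files might contain. The alert name, threshold, and expression are illustrative, not lifted from the real rules.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alertrules-infrastructure
  namespace: monitoring
spec:
  groups:
    - name: infrastructure
      rules:
        - alert: PodRestartingOften          # hypothetical example rule
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 10m                           # must hold for 10 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```
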
---

## Alerts → Matrix

When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix.

```
Alert fires in Prometheus
→ AlertManager evaluates routing rules
→ Webhook to matrix-alertmanager-receiver
→ Bot posts to Matrix room
→ Team sees it in chat ✨
```

The `matrix-alertmanager-receiver` is a small deployment in the monitoring namespace with a bot account (`matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations — no separate pager app, no email noise.

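The AlertManager side of that flow is just a route pointing at the receiver's webhook. A minimal sketch, where the port and path are assumptions about the receiver's listener:

```yaml
# Hypothetical AlertManager config fragment.
route:
  receiver: matrix
  group_by: [alertname, namespace]   # batch related alerts into one message
receivers:
  - name: matrix
    webhook_configs:
      - url: http://matrix-alertmanager-receiver.monitoring.svc:8080/alerts  # port/path assumed
        send_resolved: true          # also post when the alert clears
```
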
---

## Logs — Loki

Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL:

```bash
# Via sunbeam CLI
sunbeam mon loki logs '{namespace="matrix"}'
sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}'

# Via Grafana Explore (systemlogs.DOMAIN)
{namespace="matrix"} |= "error"
{namespace="lasuite", container="messages-backend"} | json | level="ERROR"
```

Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅

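That log-to-trace jump comes from a derived field on the Loki datasource. Provisioned, it could look roughly like this sketch; the `matcherRegex` assumes JSON logs carrying a `traceID` field, and the datasource UID is illustrative.

```yaml
# Hypothetical Grafana datasource provisioning fragment.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway.monitoring.svc:80
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'   # pulls the ID out of each log line
          datasourceUid: tempo                # assumed UID of the Tempo datasource
          url: '${__value.raw}'               # link target: the matched trace ID
```
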
---

## Traces — Tempo

Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional; it requires a build with a dedicated Tokio runtime), you can follow a request from the proxy through the service mesh to the backend and back.

Explore traces in Grafana (`systemtracing.DOMAIN`) — search by trace ID, service name, or duration.

---

## Datasources

Grafana comes pre-configured with all three backends:

| Datasource | URL | Notes |
|------------|-----|-------|
| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default |
| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation |
| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend |

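As Grafana provisioning, that table could translate to something like this sketch; exact datasource options may differ from the deployed config.

```yaml
# Hypothetical Grafana datasource provisioning fragment.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://kube-prometheus-stack-prometheus.monitoring.svc:9090
    isDefault: true    # queries without an explicit datasource go here
  - name: Tempo
    type: tempo
    url: http://tempo.monitoring.svc:3200
```
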
---

## Quick Reference

```bash
# Check everything
sunbeam mon prometheus query 'up'
sunbeam mon grafana alerts

# Check a specific service
sunbeam mon prometheus query 'up{job="pingora"}'
sunbeam mon loki logs '{namespace="matrix", container="sol"}'

# Open dashboards → metrics.DOMAIN in your browser
```