# Keeping an Eye on the Girlies 👁️

You don't run a production stack without watching it. The Super Boujee Business Box ✨ has a full observability stack — metrics, logs, traces, dashboards, and alerts that go straight to your Matrix chat. If something's off, you'll know before anyone complains.

---

## The Stack

| Component | What it does | Where it lives |
|-----------|-------------|----------------|
| **Prometheus** | Scrapes metrics from everything, every 30 seconds | `systemmetrics.DOMAIN` |
| **Grafana** | Dashboards, visualizations, the pretty pictures | `metrics.DOMAIN` |
| **Loki** | Log aggregation — all container logs, searchable | `systemlogs.DOMAIN` |
| **Tempo** | Distributed tracing — follow a request across services | `systemtracing.DOMAIN` |
| **AlertManager** | Routes alerts to the right place | Internal |
| **Alloy** | Collection agent — ships metrics, logs, and traces | DaemonSet on all nodes |

All deployed via Helm in the `monitoring` namespace. Grafana authenticates via Hydra (same OIDC as everything else) and auto-assigns Admin to all authenticated users — if you're on the team, you see everything.
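In Helm-values terms, the "everyone gets Admin" behavior can be expressed with Grafana's generic OAuth settings. A minimal sketch, assuming Hydra's standard `/oauth2/*` endpoints — the client ID and URLs here are illustrative, not the deployed values:

```yaml
grafana:
  grafana.ini:
    auth.generic_oauth:
      enabled: true
      name: Hydra
      client_id: grafana                            # assumption: actual client ID may differ
      auth_url: https://auth.DOMAIN/oauth2/auth     # assumption: Hydra public endpoint
      token_url: https://auth.DOMAIN/oauth2/token   # assumption: Hydra public endpoint
      # JMESPath string literal: every authenticated user maps to Admin
      role_attribute_path: "'Admin'"
```

The `role_attribute_path` trick is the key bit: a quoted JMESPath literal evaluates to `Admin` for every user, regardless of claims.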
---

## Dashboards

Ten Grafana dashboard ConfigMaps, organized by domain:

| Dashboard | What it covers |
|-----------|---------------|
| `dashboards-infrastructure` | Kubernetes nodes, pods, resource usage |
| `dashboards-ingress` | Pingora proxy — request rates, latencies, cache hits, security decisions |
| `dashboards-identity` | Ory Kratos + Hydra — auth flows, token issuance, error rates |
| `dashboards-lasuite` | La Suite apps — per-app health, response times |
| `dashboards-comms` | Matrix/Tuwunel + Sol☀️ — message rates, bot activity |
| `dashboards-devtools` | Gitea — repo activity, CI builds |
| `dashboards-storage` | SeaweedFS — volume health, S3 operations |
| `dashboards-search` | OpenSearch — index health, query performance |
| `dashboards-media` | LiveKit — active rooms, participants, media quality |
| `dashboards-observability` | Meta — Prometheus/Loki/Tempo self-monitoring |

Dashboards are stored as ConfigMaps with the `grafana_dashboard: "1"` label. Grafana's sidecar picks them up automatically.

---

## What Gets Scraped

ServiceMonitors tell Prometheus what to scrape.
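A ServiceMonitor is essentially a label selector plus an endpoint definition. A hedged sketch for the Pingora target — the Service label and port name are assumptions, though the `:9090` port and 30-second cadence come from the tables here:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: pingora
  namespace: ingress
  labels:
    release: kube-prometheus-stack   # assumption: label Prometheus uses to discover monitors
spec:
  selector:
    matchLabels:
      app: pingora                   # assumption: actual Service labels may differ
  endpoints:
    - port: metrics                  # assumption: named port backing :9090
      interval: 30s                  # matches the 30-second scrape cadence above
```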
Currently active:

| Target | Namespace | Notes |
|--------|-----------|-------|
| Pingora proxy | ingress | Custom metrics on :9090 |
| Ory Kratos | ory | Identity operations |
| Ory Hydra | ory | OAuth2/OIDC metrics |
| Gitea | devtools | Repo + CI metrics |
| SeaweedFS | storage | Master, volume, filer |
| Linkerd | mesh | Control plane + data plane |
| cert-manager | cert-manager | Certificate lifecycle |
| OpenBao | data | Seal status, audit events |
| kube-state-metrics | monitoring | Kubernetes object state |
| node-exporter | monitoring | Host-level metrics |
| kube-apiserver | — | Kubernetes API |
| kubelet | — | Container runtime |

**Disabled** (not available or firewalled — we'll get to them):

- OpenSearch (no prometheus-exporter in v3.x)
- LiveKit (hostNetwork, behind firewall)
- kube-proxy (replaced by Cilium)
- etcd/scheduler/controller-manager (k3s quirks)

---

## Alert Rules

PrometheusRule resources per component, firing when things need attention:

| File | Covers |
|------|--------|
| `alertrules-infrastructure.yaml` | General K8s health — pod restarts, node pressure, PVC usage |
| `postgres-alertrules.yaml` | Database — connection limits, replication lag, backup freshness |
| `opensearch-alertrules.yaml` | Search — cluster health, index issues |
| `openbao-alertrules.yaml` | Vault — seal status, auth failures |
| `gitea-alertrules.yaml` | Git — service health, repo errors |
| `seaweedfs-alertrules.yaml` | Storage — volume health, filer availability |
| `livekit-alertrules.yaml` | Media — server health, room issues |
| `linkerd-alertrules.yaml` | Mesh — proxy health, certificate expiry |
| `ory-alertrules.yaml` | Identity — auth failures, token errors |

---

## Alerts → Matrix

When an alert fires, it doesn't just sit in AlertManager waiting for someone to check a dashboard. It goes to Matrix.
```
Alert fires in Prometheus
  → AlertManager evaluates routing rules
  → Webhook to matrix-alertmanager-receiver
  → Bot posts to Matrix room
  → Team sees it in chat ✨
```

The `matrix-alertmanager-receiver` is a small deployment in the monitoring namespace with a bot account (`matrix-bot-secret`). Alerts show up in your Matrix chat alongside your regular conversations — no separate pager app, no email noise.

---

## Logs — Loki

Every container's stdout/stderr gets shipped to Loki by the Alloy DaemonSet. Query with LogQL:

```bash
# Via sunbeam CLI
sunbeam mon loki logs '{namespace="matrix"}'
sunbeam mon loki logs '{namespace="lasuite", container="drive-backend"}'

# Via Grafana Explore (systemlogs.DOMAIN)
{namespace="matrix"} |= "error"
{namespace="lasuite", container="messages-backend"} | json | level="ERROR"
```

Grafana's derived fields let you click a traceID in a log line and jump straight to the trace in Tempo. Logs → traces in one click. Chef's kiss. 💅

---

## Traces — Tempo

Distributed tracing via OTLP. Alloy receives traces and ships them to Tempo. When enabled on Pingora (optional, requires dedicated Tokio runtime build), you can follow a request from the proxy through the service mesh to the backend and back.
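The Alloy side of that trace pipeline is a short pipeline of components: an OTLP receiver wired to an OTLP exporter. A minimal sketch in Alloy's config syntax, assuming Tempo listens for OTLP gRPC on the standard `4317` port (the endpoint is an assumption, not the deployed value):

```river
// Accept OTLP traces from workloads (gRPC and HTTP)
otelcol.receiver.otlp "default" {
  grpc {}
  http {}

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

// Forward everything to Tempo
otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.monitoring.svc:4317"  // assumption: standard OTLP gRPC port
  }
}
```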
```bash
# Via Grafana Explore (systemtracing.DOMAIN)
# Search by trace ID, service name, or duration
```

---

## Datasources

Grafana comes pre-configured with all three backends:

| Datasource | URL | Notes |
|------------|-----|-------|
| Prometheus | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` | Default |
| Loki | `http://loki-gateway.monitoring.svc:80` | With derived fields for trace correlation |
| Tempo | `http://tempo.monitoring.svc:3200` | Trace backend |

---

## Quick Reference

```bash
# Check everything
sunbeam mon prometheus query 'up'
sunbeam mon grafana alerts

# Check specific service
sunbeam mon prometheus query 'up{job="pingora"}'
sunbeam mon loki logs '{namespace="matrix", container="sol"}'

# Open dashboards
# → metrics.DOMAIN in your browser
```
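For reference, the AlertManager → Matrix hop described earlier is an ordinary webhook receiver in the AlertManager config. A hedged sketch — the service port and path are assumptions, only the deployment name comes from this doc:

```yaml
route:
  receiver: matrix            # send everything to Matrix by default
receivers:
  - name: matrix
    webhook_configs:
      # assumption: port and /alerts path depend on the receiver's own config
      - url: http://matrix-alertmanager-receiver.monitoring.svc:8080/alerts
        send_resolved: true   # also post when an alert clears
```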