feat(monitoring): wire up full LGTM observability stack

- Prometheus: discover ServiceMonitors/PodMonitors in all namespaces,
  enable remote write receiver for Tempo metrics generator
- Tempo: enable metrics generator (service-graphs + span-metrics)
  with remote write to Prometheus
- Loki: add Grafana Alloy DaemonSet to ship container logs
- Grafana: enable dashboard sidecar, add Pingora/Loki/Tempo/OpenBao
  dashboards, add stable UIDs and cross-linking between datasources
  (Loki↔Tempo derived fields, traces→logs, traces→metrics, service map)
- Linkerd: enable proxy tracing to Alloy OTLP collector, point
  linkerd-viz at existing Prometheus instead of deploying its own
- Pingora: add OTLP rollout plan (endpoint commented out until proxy
  telemetry panic fix is deployed and Alloy is verified healthy)
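
For readers without the full diff, the Prometheus and Tempo bullets above could be sketched as the following Helm values. This is an illustration only: it assumes the kube-prometheus-stack chart and the single-binary grafana/tempo chart, and the remote write URL is inferred from the Prometheus Service name used elsewhere in this commit. The actual values files may differ.

```yaml
# kube-prometheus-stack values (assumed chart; not copied from this commit)
prometheus:
  prometheusSpec:
    # Accept remote write pushes, e.g. from the Tempo metrics generator
    enableRemoteWriteReceiver: true
    # Select ServiceMonitors/PodMonitors regardless of Helm release labels
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
---
# grafana/tempo values (assumed chart; not copied from this commit)
tempo:
  metricsGenerator:
    enabled: true
    # Assumed URL, built from the Prometheus Service referenced in linkerd-viz-values.yaml
    remoteWriteUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090/api/v1/write"
```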
2026-03-21 17:36:54 +00:00
parent 5f923d14f9
commit d3943c9a84
9 changed files with 523 additions and 0 deletions


@@ -29,9 +29,11 @@ helmCharts:
    version: "2025.12.3"
    releaseName: linkerd-control-plane
    namespace: mesh
    valuesFile: linkerd-control-plane-values.yaml
  - name: linkerd-viz
    repo: https://helm.linkerd.io/edge
    version: "2026.1.4"
    releaseName: linkerd-viz
    namespace: mesh
    valuesFile: linkerd-viz-values.yaml


@@ -0,0 +1,19 @@
# Linkerd control-plane overrides — enable proxy tracing to Tempo.
#
# Every meshed pod's Linkerd sidecar will export OTLP traces to the
# Alloy collector in the monitoring namespace, which forwards to Tempo.
# Controller-level tracing (identity, destination controllers)
controller:
  tracing:
    enabled: true
    collector:
      endpoint: "alloy.monitoring.svc.cluster.local:4317"
# Proxy-level tracing (every meshed sidecar)
proxy:
  tracing:
    enabled: true
    traceServiceName: linkerd-proxy
    collector:
      endpoint: "alloy.monitoring.svc.cluster.local:4317"
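
The endpoints above only work if the Alloy Service exposes port 4317 and Alloy runs an OTLP receiver that forwards to Tempo. A minimal sketch of matching values, assuming the grafana/alloy Helm chart with release name `alloy` in the `monitoring` namespace; the Tempo address `tempo.monitoring.svc.cluster.local:4317` is a hypothetical name, not taken from this commit.

```yaml
# Hypothetical grafana/alloy chart values; adjust names to the real deployment.
alloy:
  # Expose the OTLP gRPC port on the Alloy Service so meshed proxies can reach it
  extraPorts:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
  configMap:
    content: |
      // Receive OTLP traces from Linkerd proxies and controllers
      otelcol.receiver.otlp "linkerd" {
        grpc {
          endpoint = "0.0.0.0:4317"
        }
        output {
          traces = [otelcol.exporter.otlp.tempo.input]
        }
      }

      // Forward traces to Tempo (assumed Service address, plaintext in-cluster)
      otelcol.exporter.otlp "tempo" {
        client {
          endpoint = "tempo.monitoring.svc.cluster.local:4317"
          tls {
            insecure = true
          }
        }
      }
```

If the existing Alloy DaemonSet already carries a log pipeline, these two components would be appended to the same configuration rather than replacing it.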


@@ -0,0 +1,9 @@
# Linkerd-viz overrides — use existing Prometheus instead of deploying a second one.
#
# By default linkerd-viz ships its own Prometheus, which wastes resources
# and creates a second scrape loop. Point it at kube-prometheus-stack instead.
prometheus:
  enabled: false
prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local:9090"