erasmo-dominguez-stuff/grafana-observability-stack

Grafana Observability Playground

A batteries-included, local observability sandbox built on the Grafana OSS stack. Spin it up in two commands and explore metrics, logs, traces, alerting, SLOs, and load testing — all wired together with realistic FastAPI services.

No cloud account needed. Everything runs locally via Docker Compose.


What you'll learn

This playground covers the three pillars of observability plus the connective tissue between them:

| Topic | Tools | What to explore |
| --- | --- | --- |
| Metrics | Prometheus + Grafana Alloy | Scraping, PromQL, recording rules, SLIs/SLOs |
| Logs | Loki + Grafana Alloy | Structured logs, LogQL, label design, log→trace links |
| Traces | Tempo + OpenTelemetry | Distributed tracing, span metrics, TraceQL, trace→log links |
| Alerting | Prometheus + Alertmanager | Threshold alerts, multi-window burn-rate SLO alerts |
| Dashboards | Grafana | Correlation: metrics ↔ logs ↔ traces in a single click |
| Load testing | k6 | Realistic traffic patterns, metrics in Prometheus |
| Collection agent | Grafana Alloy | Unified pipeline: scrape + collect logs + receive OTLP |

Architecture

Two simulated servers connected by a shared Docker network:

graph LR
  subgraph server1["Server 1 — docker-compose.app.yml"]
    api["api :8080<br/>/payments · /health · /metrics"]
    worker["worker :8081<br/>/process · /health · /metrics"]
  end

  subgraph server2["Server 2 — docker-compose.observability.yml"]
    alloy["Grafana Alloy :12345"]
    prometheus["Prometheus :9090"]
    loki["Loki :3100"]
    tempo["Tempo :3200"]
    alertmanager["Alertmanager :9093"]
    grafana["Grafana :3000"]
    cadvisor["cAdvisor :8082"]
  end

  api -- "/metrics" --> alloy
  worker -- "/metrics" --> alloy
  api -- "OTLP gRPC<br/>traces" --> alloy
  worker -- "OTLP gRPC<br/>traces" --> alloy
  cadvisor -- "/metrics" --> alloy
  alloy -- "remote_write" --> prometheus
  alloy -- "Docker stdout<br/>JSON → labels" --> loki
  alloy -- "OTLP" --> tempo
  tempo -- "span metrics<br/>remote_write" --> prometheus
  prometheus -- "alert rules" --> alertmanager
  grafana -- "PromQL" --> prometheus
  grafana -- "LogQL" --> loki
  grafana -- "TraceQL" --> tempo

  style server1 fill:#1a1a2e,stroke:#e94560,color:#fff
  style server2 fill:#1a1a2e,stroke:#0f3460,color:#fff

Data flow

sequenceDiagram
  participant App as api / worker
  participant Alloy as Grafana Alloy
  participant Prom as Prometheus
  participant Loki as Loki
  participant Tempo as Tempo
  participant AM as Alertmanager
  participant G as Grafana

  loop every 15s
    Alloy->>App: GET /metrics
    Alloy->>Prom: remote_write (metrics)
  end

  App->>Alloy: OTLP gRPC (traces)
  Alloy->>Tempo: forward traces
  Tempo->>Prom: span metrics (remote_write)

  App-->>Alloy: stdout (JSON logs + trace_id)
  Alloy->>Loki: push log streams

  loop every 15s
    Prom->>Prom: evaluate alert rules
    Prom-->>AM: firing alerts
  end

  G->>Prom: PromQL queries
  G->>Loki: LogQL queries
  G->>Tempo: TraceQL queries

Quick Start

Requirements: Docker + Docker Compose

git clone <this-repo>
cd <this-repo>

make up       # Build images and start both servers
make check    # Smoke-test every service endpoint
make open     # Open Grafana in the browser (admin / admin)
make load     # Run k6 load test — triggers the HighErrorRate alert

All Makefile targets

| Target | Description |
| --- | --- |
| make help | List all commands |
| make build | Build only the application images |
| make up | Build + start both servers (observability first, then apps) |
| make down | Stop all containers in both servers |
| make check | Smoke-test every service endpoint |
| make logs-api | Tail API logs only |
| make logs-worker | Tail worker logs only |
| make load | Run k6 load test (both servers must be up) |
| make open | Open Grafana in the browser |
| make clean | Stop containers, delete volumes, and remove the shared network |
| make screenshots | Open all UIs for taking fresh screenshots into docs/images/ |

Service Endpoints

| Service | URL | Notes |
| --- | --- | --- |
| API | http://localhost:8080 | GET /payments, POST /payments, GET /health, GET /metrics |
| Worker | http://localhost:8081 | POST /process, GET /health, GET /metrics |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | Metrics + alert rules |
| Alertmanager | http://localhost:9093 | Alert routing |
| Loki | http://localhost:3100 | Log storage |
| Tempo | http://localhost:3200 | Trace storage + service graph |
| Alloy UI | http://localhost:12345 | Live pipeline inspector |
| cAdvisor | http://localhost:8082 | Container resource metrics |
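
A smoke test in the spirit of make check can be approximated in a few lines of Python. This is only a sketch, not the repository's Makefile logic: the paths in ENDPOINTS are assumptions based on each tool's commonly documented health or readiness route, and fetch_status and smoke_test are hypothetical helpers.

```python
from typing import Callable, Dict
from urllib.error import URLError
from urllib.request import urlopen

# Assumed health/readiness paths; adjust them to whatever `make check` probes.
ENDPOINTS: Dict[str, str] = {
    "api": "http://localhost:8080/health",
    "worker": "http://localhost:8081/health",
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "loki": "http://localhost:3100/ready",
    "tempo": "http://localhost:3200/ready",
}


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code, or 0 if the request fails.

    urlopen raises for connection errors and for 4xx/5xx responses,
    so both cases end up as 0 here.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except (URLError, OSError):
        return 0


def smoke_test(
    endpoints: Dict[str, str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answered 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

With both servers up, smoke_test(ENDPOINTS) should return True for every service. Passing fetch as a parameter keeps the check testable without a running stack.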

What's Running

Applications

api — Payments API (:8080)

| Method | Path | Description |
| --- | --- | --- |
| GET | /payments | List payments |
| GET | /payments/{id} | Get payment by ID |
| POST | /payments | Create payment (3% simulated error rate) |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |

worker — Background Processor (:8081)

| Method | Path | Description |
| --- | --- | --- |
| POST | /process | Process a task (5% simulated error rate) |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |

Both services expose:

| Metric | Type | Labels |
| --- | --- | --- |
| http_requests_total | Counter | method, endpoint, status_code |
| http_request_duration_seconds | Histogram | method, endpoint |
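
To make the label model concrete, here is a dependency-free sketch of a labelled counter and of how its samples could be rendered in the Prometheus text exposition format. The services presumably rely on a Prometheus client library instead. LabelledCounter is a hypothetical class written only for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


class LabelledCounter:
    """Hypothetical counter: one running total per label combination."""

    def __init__(self, name: str, label_names: Tuple[str, ...]) -> None:
        self.name = name
        self.label_names = label_names
        self._values: Dict[Tuple[str, ...], float] = defaultdict(float)

    def inc(self, *label_values: str, amount: float = 1.0) -> None:
        if len(label_values) != len(self.label_names):
            raise ValueError("one value is required per label name")
        self._values[label_values] += amount

    def render(self) -> str:
        """Render the samples as text exposition lines."""
        lines = [f"# TYPE {self.name} counter"]
        for label_values, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{key}="{val}"' for key, val in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)


requests_total = LabelledCounter(
    "http_requests_total", ("method", "endpoint", "status_code")
)
requests_total.inc("POST", "/payments", "200")
requests_total.inc("POST", "/payments", "200")
requests_total.inc("POST", "/payments", "500")
print(requests_total.render())
```

Each distinct label combination becomes its own time series, which is why status_code is useful for error-rate queries and why unbounded values such as payment IDs should stay out of labels.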

Observability Features in Detail

Structured logs with trace context

Every request emits a JSON log line with trace_id and span_id injected by a custom TraceContextFormatter:

{
  "message": "request",
  "method": "POST",
  "path": "/payments",
  "status": 200,
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "1a2b3c4d..."
}

Alloy parses these: level, service, and container become Loki labels (low cardinality); trace_id and span_id become structured metadata (high cardinality, not indexed) — enabling log→trace navigation without exploding the Loki index.
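
The idea behind such a formatter can be sketched with the standard library alone. This is not the repository's TraceContextFormatter: JsonTraceFormatter is a hypothetical stand-in, and the two context variables are an assumption that replaces the lookup of the active OpenTelemetry span.

```python
import contextvars
import json
import logging

# Stand-ins for the active span context; a real service would read these
# values from OpenTelemetry rather than from context variables.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")
current_span_id = contextvars.ContextVar("current_span_id", default="")


class JsonTraceFormatter(logging.Formatter):
    """Hypothetical formatter that emits one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
            "span_id": current_span_id.get(),
        }
        # Copy selected request fields that were passed through `extra=`.
        for key in ("method", "path", "status"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)
```

Attach it with handler.setFormatter(JsonTraceFormatter()) and log with logger.info("request", extra={"method": "POST", "path": "/payments", "status": 200}). Every line then carries the IDs needed for log→trace navigation.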

Loki logs with trace_id structured metadata

Distributed tracing (OpenTelemetry)

Both services auto-instrument with FastAPIInstrumentor — zero changes to endpoint code:

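# Excerpt: the imports, `_resource`, and `app` are not shown here.
_provider = TracerProvider(resource=_resource)
# Batch finished spans and export them over OTLP.
_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(_provider)
# Wrap every FastAPI route in a server span.
FastAPIInstrumentor.instrument_app(app)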

OTEL_EXPORTER_OTLP_ENDPOINT points to Alloy (:4317), which forwards to Tempo.

Tempo also generates span metrics (rate, error, duration histograms) and a service graph — both remote-written to Prometheus so they appear in Grafana dashboards automatically.

Tempo traces explorer

Metrics ↔ Logs ↔ Traces correlation

Full bidirectional navigation configured in observability/grafana/provisioning/datasources/datasources.yml:

| Navigation | Mechanism |
| --- | --- |
| Metrics → Traces | Prometheus exemplar annotations link to the trace that caused the data point |
| Traces → Logs | Tempo "Logs for this trace" queries Loki by trace_id structured metadata |
| Traces → Metrics | Tempo links to Prometheus span metrics for the same service |
| Logs → Traces | derivedFields on trace_id field opens the trace in Tempo |

Try it: open the Application Overview dashboard, click a data point on the error rate panel, click the exemplar → you jump to the exact trace. From the trace, click "Logs" → you see the log lines for that request.

Grafana datasources with cross-links

Dashboards

| Dashboard | URL | Shows |
| --- | --- | --- |
| Application Overview | http://localhost:3000/d/app-overview | Request rate, error rate, P95 latency, live logs panel |
| Infrastructure Overview | http://localhost:3000/d/infra-overview | CPU, memory, network I/O, restarts per container |

Both dashboards auto-provision on startup via Grafana provisioning. No manual import needed.

Application Overview dashboard

Infrastructure Overview dashboard

Alloy pipeline inspector

Open http://localhost:12345 to see the live component graph — every metrics scraper, log processor, and trace pipeline visualised as a DAG:

Alloy UI — component graph

Alerting

Four alert rules in observability/prometheus/rules/alerts.yml:

| Alert | Condition | For | Severity |
| --- | --- | --- | --- |
| ServiceDown | up{job="prometheus.scrape.apps"} == 0 | 1 min | critical |
| HighErrorRate | 5xx / total > 5% | 2 min | warning |
| HighCpuUsage | container CPU > 0.8 cores | 5 min | warning |
| HighMemoryUsage | container memory > 512 MiB | 5 min | warning |
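
The "For" column explains why HighErrorRate fires roughly two minutes after the error rate crosses 5%: Prometheus keeps an alert pending until the condition has held for the whole duration. The sketch below simulates that behaviour in a simplified form. alert_states is a hypothetical helper, and the 15 s evaluation interval is taken from the data flow diagram above.

```python
from typing import Iterable, List, Optional


def alert_states(
    error_ratios: Iterable[float],
    threshold: float = 0.05,
    for_seconds: int = 120,
    eval_interval: int = 15,
) -> List[str]:
    """Return the alert state after each rule evaluation."""
    states: List[str] = []
    active_since: Optional[int] = None
    for step, ratio in enumerate(error_ratios):
        now = step * eval_interval
        if ratio > threshold:
            if active_since is None:
                active_since = now  # condition just became true
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any healthy evaluation resets the timer
            states.append("inactive")
    return states


# One healthy evaluation, nine above the threshold, then recovery.
print(alert_states([0.01] + [0.10] * 9 + [0.01]))
```

In this run the alert stays pending for eight evaluations and fires on the ninth, 120 seconds after the first bad sample. A single healthy evaluation resets the timer, which is what filters out short blips.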

Prometheus alerts

Trigger alerts manually

# HighErrorRate — run load test; spike scenario fires it in ~2 min
make load

# ServiceDown — stop the api container
docker compose -f docker-compose.app.yml -p server-1 stop api
# Restore:
docker compose -f docker-compose.app.yml -p server-1 start api

SLOs and error budgets

observability/prometheus/rules/slo.yml implements full SLI/SLO tracking with multi-window burn-rate alerts (Google SRE Workbook pattern):

| Service | SLI | Target |
| --- | --- | --- |
| api | Availability | 99.0% non-5xx requests |
| api | Latency | 99.0% requests < 300 ms |
| worker | Availability | 98.0% non-5xx requests |
| worker | Latency | 95.0% requests < 500 ms |

Grafana alerting — SLO burn-rate

Health and /metrics endpoints are excluded from SLIs (synthetic traffic, not user requests).
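
The arithmetic behind burn-rate alerting is short enough to sketch. The window pairs and factors below (1h/5m at 14.4, 6h/30m at 6) follow the example in the Google SRE Workbook and are an assumption here; slo.yml may use different values. burn_rate and should_page are hypothetical helper names.

```python
from typing import Dict


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    error_budget = 1.0 - slo_target  # e.g. 0.01 for a 99.0% target
    return error_ratio / error_budget


def should_page(ratios: Dict[str, float], slo_target: float) -> bool:
    """Multi-window check: a long and a short window must both burn fast.

    `ratios` maps a window name ("5m", "30m", "1h", "6h") to the error
    ratio measured over that window.
    """
    def both_above(long_w: str, short_w: str, factor: float) -> bool:
        return (
            burn_rate(ratios[long_w], slo_target) > factor
            and burn_rate(ratios[short_w], slo_target) > factor
        )

    return both_above("1h", "5m", 14.4) or both_above("6h", "30m", 6.0)


# api availability target from the table above: 99.0%
spike = {"5m": 0.20, "30m": 0.20, "1h": 0.20, "6h": 0.05}
print(burn_rate(spike["1h"], 0.99), should_page(spike, 0.99))
```

A burn rate of 1 spends the budget exactly over the SLO period. At 14.4, a 30-day budget would be gone in about 2.1 days. Requiring the short window as well lets the alert clear quickly once errors stop.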


Load Testing (k6)

make load runs observability/k6/load.js with four concurrent scenarios:

| Scenario | Service | Pattern | Purpose |
| --- | --- | --- | --- |
| api_baseline | api | 2 VUs × 3 min | Stable baseline traffic |
| api_ramp | api | 0 → 10 → 0 over ~2 min | Latency under load |
| api_spike | api | 0 → 30 → 0 over 35 s | Triggers HighErrorRate |
| worker_baseline | worker | 3 VUs × 3 min | Exercises worker service |

k6 streams results to Prometheus via remote_write — VU count, request rate, and error rate appear in Grafana dashboards alongside application metrics.


Useful Queries

PromQL — Application metrics

# Request rate per service (req/s)
sum(rate(http_requests_total[1m])) by (service)

# Error rate %
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service) * 100

# P95 latency per service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
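
The P95 query returns an estimate, not an exact percentile: histogram_quantile only sees cumulative bucket counts and interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that idea in plain Python. It is a simplified illustration rather than Prometheus's implementation, and the bucket bounds in the example are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Cumulative counts for le = 0.1 s, 0.3 s, 0.5 s and +Inf (illustrative values).
example = [(0.1, 50), (0.3, 90), (0.5, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Here the 95th observation lands in the 0.3 s to 0.5 s bucket, so the estimate is about 0.425 s. Accuracy depends on how closely the bucket boundaries bracket the latencies of interest, such as the 300 ms and 500 ms SLO thresholds.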

PromQL — Infrastructure (cAdvisor)

# CPU per container (cores)
sum(rate(container_cpu_usage_seconds_total{name=~"obs-.*"}[5m])) by (name)

# Memory per container
container_memory_usage_bytes{name=~"obs-.*"}

# Network I/O
sum(rate(container_network_receive_bytes_total{name=~"obs-.*"}[5m])) by (name)

LogQL — Logs

# All logs from api and worker
{service=~"api|worker"}

# Only errors
{service=~"api|worker"} | json | level = "ERROR"

# Logs for a specific trace
{service=~"api|worker"} | trace_id = "<paste-trace-id>"

# Error rate from logs (metric query)
sum(rate({service=~"api|worker"} | json | level="ERROR" [5m])) by (service)
/ sum(rate({service=~"api|worker"} [5m])) by (service)

TraceQL — Traces

# All traces for a service
{ resource.service.name = "api" }

# Traces with errors only
{ status = error }

# Slow requests with errors
{ resource.service.name = "api" && status = error && duration > 50ms }

# Spans from either service, grouped by root service name
{ resource.service.name =~ "api|worker" } | by(rootServiceName)

Learning Guide

This repo is designed as a hands-on learning resource. Here's a suggested exploration path:

1. Start with the pipeline (Alloy)

Open http://localhost:12345/graph and study how data flows:

  • Which components collect metrics? Logs? Traces?
  • How are Docker containers discovered automatically?
  • Open the component detail for loki.process.containers — what stages run on each log line?

Read: Alloy Reference Cards

2. Explore metrics (Prometheus + Grafana)

  • Go to http://localhost:9090/targets — which targets are UP? Why are they discovered?
  • Run sum(rate(http_requests_total[1m])) by (service) in Prometheus
  • Open the Application Overview dashboard and understand each panel
  • Then run make load and watch the error rate spike

Read: Alloy Metrics Pipeline

3. Understand logs (Loki)

  • In Grafana → Explore → select Loki datasource
  • Query {service="api"} — see structured JSON logs
  • Now query {service="api"} | json | level="ERROR" — filtered by parsed field
  • Click a log line, look for the trace_id link — this jumps to Tempo

Read: Loki Architecture · Labels Design · LogQL Reference

4. Follow a trace end-to-end (Tempo)

  • In Grafana → Explore → select Tempo datasource
  • Search for traces by service api with status = error
  • Open a trace — see the span waterfall across the request lifecycle
  • Click "Logs for this trace" → jumps to Loki with the exact trace_id

Read: Alloy Traces Pipeline

5. Trigger and investigate an alert

  • Run make load to generate spike traffic
  • Wait ~2 minutes, then check http://localhost:9093; HighErrorRate should be firing
  • In Grafana, go to Alerting → open the firing alert → click the panel link
  • From the panel, click an exemplar on the error rate graph → opens the trace
  • From the trace, click Logs → see the exact log lines for that request

Read: Loki Alerting & Recording Rules

6. Study the SLO implementation

  • Open observability/prometheus/rules/slo.yml
  • Understand how SLIs are defined as recording rules
  • See how multi-window burn-rate alerts work (1h + 6h windows)
  • Understand the error budget calculation

Reference: Google SRE Workbook — Alerting on SLOs

7. Inspect the label strategy

  • Open omd.yaml — the centralized label registry
  • Understand why service, container, level are labels in Loki
  • Understand why trace_id is not a label (structured metadata instead)
  • Look at observability/alloy/config.alloy — see how labels are assigned
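
The reason trace_id stays out of the label set can be shown with a quick count. Loki creates one stream per unique label combination, so the sketch below counts the combinations for the same log entries under two label sets. count_streams and the generated entries are hypothetical, and the container naming is an assumption.

```python
from typing import Dict, Iterable, Tuple

STREAM_LABELS: Tuple[str, ...] = ("service", "container", "level")


def count_streams(
    entries: Iterable[Dict[str, str]],
    label_names: Tuple[str, ...] = STREAM_LABELS,
) -> int:
    """Count the distinct label combinations, i.e. the number of Loki streams."""
    streams = set()
    for entry in entries:
        streams.add(tuple(entry.get(name, "") for name in label_names))
    return len(streams)


# 1000 synthetic requests, each with its own trace_id.
entries = []
for i in range(1000):
    service = "api" if i % 2 else "worker"
    entries.append(
        {
            "service": service,
            "container": f"obs-{service}",  # assumed naming, for illustration
            "level": "ERROR" if i % 3 == 0 else "INFO",
            "trace_id": f"{i:032x}",
        }
    )

print(count_streams(entries))
print(count_streams(entries, STREAM_LABELS + ("trace_id",)))
```

With service, container, and level as labels the entries collapse into 4 streams. Promoting trace_id to a label yields one stream per request, 1000 in this case, which is the index explosion that structured metadata avoids.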

Read: Loki Labels Design


Reference Cards

Detailed reference documentation for each tool in the stack:

docs/observability-mastery/
├── README.md                          ← start here — overview + quick lookup
├── alloy/
│   ├── 01-core-concepts.md            component model, River syntax, CLI
│   ├── 02-metrics-pipeline.md         discovery, relabeling, scrape, remote_write
│   ├── 03-logs-pipeline.md            loki.source, all processing stages, loki.write
│   ├── 04-traces-pipeline.md          otelcol components, tail sampling, Tempo export
│   └── 05-production-advanced.md      K8s, hot reload, clustering, debugging
└── loki/
    ├── 01-architecture.md             mental model, write/read path, deployment modes
    ├── 02-labels-design.md            cardinality, labels vs structured metadata
    ├── 03-logql-reference.md          full query language with examples
    ├── 04-production-config.md        loki.yaml, schema, storage, ruler, Helm
    └── 05-operations-cost.md          troubleshooting, cost reduction, sizing

Repository Structure

.
├── src/
│   ├── api/                        # Payments API — FastAPI + OTel + Prometheus
│   │   ├── main.py
│   │   ├── requirements.txt
│   │   └── Dockerfile
│   └── worker/                     # Background processor — same instrumentation
│       ├── main.py
│       ├── requirements.txt
│       └── Dockerfile
├── observability/
│   ├── alloy/
│   │   └── config.alloy            # Metrics + logs (Docker) + OTLP traces pipeline
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── alerts.yml          # ServiceDown, HighErrorRate, CPU, Memory
│   │       └── slo.yml             # SLI recording rules + error budget + burn-rate
│   ├── alertmanager/
│   │   └── config.yml
│   ├── tempo/
│   │   └── config.yml              # Tracing + span metrics generator
│   ├── loki/
│   │   └── config.yml
│   ├── grafana/
│   │   └── provisioning/
│   │       ├── datasources/        # Prometheus + Loki + Tempo with cross-links
│   │       └── dashboards/
│   │           ├── hello-api.json  # Application Overview dashboard
│   │           └── infra.json      # Infrastructure Overview dashboard
│   └── k6/
│       └── load.js                 # Load test: 4 scenarios across both services
├── docs/
│   ├── images/                     # Screenshots of the running stack
│   ├── decisions.md                # Architecture Decision Records (ADRs)
│   └── observability-mastery/      # Reference cards for the stack
│       ├── README.md
│       ├── alloy/                  # 5 reference cards for Grafana Alloy
│       └── loki/                   # 5 reference cards for Grafana Loki
├── docker-compose.app.yml          # Server 1 — applications
├── docker-compose.observability.yml # Server 2 — observability stack
├── omd.yaml                        # Observability Metadata Definition — label standards
├── Makefile
└── README.md

Design Decisions

Full rationale in docs/decisions.md. Summary:

| Decision | Summary | ADR |
| --- | --- | --- |
| Grafana Alloy | Single agent for metrics + logs + traces | ADR-001 |
| OpenTelemetry SDK | CNCF-standard auto-instrumentation, backend-agnostic | ADR-002 |
| omd.yaml | Centralized label registry to prevent drift and cardinality issues | ADR-003 |
| Two Compose files | Faithful two-server simulation with independent lifecycles | ADR-004 |
| Prometheus single-node | No object storage needed; one-line migration to Mimir | ADR-005 |
| k6 + remote_write | Load test metrics in the same Grafana dashboards | ADR-006 |
| Tempo span metrics | Automatic RED metrics from traces, service graph for free | ADR-007 |
| Datasource cross-links | Full bidirectional navigation: metrics ↔ logs ↔ traces | ADR-008 |

Production Path

This stack runs on Docker Compose for simplicity. The natural evolution:

graph TB
  subgraph local["Current — Local Playground"]
    dc1["docker-compose.app.yml"]
    dc2["docker-compose.observability.yml"]
  end

  subgraph prod["Production — Kubernetes"]
    helm["Helm Charts"]
    apps["api / worker<br/>Deployments + HPA"]
    mimir["Grafana Mimir<br/>(replaces Prometheus)"]
    loki_prod["Grafana Loki<br/>SimpleScalable + S3"]
    tempo_prod["Grafana Tempo<br/>(distributed)"]
    alloy_prod["Alloy DaemonSet"]
    grafana_prod["Grafana (HA + SSO)"]
  end

  dc1 -->|"helm install"| helm
  helm --> apps
  helm --> mimir
  helm --> loki_prod
  helm --> tempo_prod
  helm --> alloy_prod
  helm --> grafana_prod

  style local fill:#1a1a2e,stroke:#e94560,color:#fff
  style prod fill:#1a1a2e,stroke:#16c79a,color:#fff

| Concern | Local (current) | Production |
| --- | --- | --- |
| Orchestration | Docker Compose | Kubernetes (EKS / GKE / AKS) |
| Metrics | Single Prometheus, 7 d | Grafana Mimir, S3/GCS, multi-tenant |
| Logs | Loki single-binary | Loki SimpleScalable + S3 |
| Traces | Tempo single-binary | Tempo distributed |
| Agent | Alloy container | Alloy DaemonSet per node |
| Alerting | Single Alertmanager | 3-node HA cluster + PagerDuty/Slack |
| Labels | omd.yaml manual convention | CI lint enforcing omd.yaml at PR time |
