Grafana Observability Playground

A batteries-included, local observability sandbox built on the Grafana OSS stack. Spin it up in two commands and explore metrics, logs, traces, alerting, SLOs, and load testing — all wired together with realistic FastAPI services.

No cloud account needed. Everything runs locally via Docker Compose.

What you'll learn

This playground covers the three pillars of observability plus the connective tissue between them:

Topic	Tools	What to explore
Metrics	Prometheus + Grafana Alloy	Scraping, PromQL, recording rules, SLIs/SLOs
Logs	Loki + Grafana Alloy	Structured logs, LogQL, label design, log→trace links
Traces	Tempo + OpenTelemetry	Distributed tracing, span metrics, TraceQL, trace→log links
Alerting	Prometheus + Alertmanager	Threshold alerts, multi-window burn-rate SLO alerts
Dashboards	Grafana	Correlation: metrics ↔ logs ↔ traces in a single click
Load testing	k6	Realistic traffic patterns, metrics in Prometheus
Collection agent	Grafana Alloy	Unified pipeline: scrape + collect logs + receive OTLP

Architecture

Two simulated servers connected by a shared Docker network:

graph LR
  subgraph server1["Server 1 — docker-compose.app.yml"]
    api["api :8080<br/>/payments · /health · /metrics"]
    worker["worker :8081<br/>/process · /health · /metrics"]
  end

  subgraph server2["Server 2 — docker-compose.observability.yml"]
    alloy["Grafana Alloy :12345"]
    prometheus["Prometheus :9090"]
    loki["Loki :3100"]
    tempo["Tempo :3200"]
    alertmanager["Alertmanager :9093"]
    grafana["Grafana :3000"]
    cadvisor["cAdvisor :8082"]
  end

  api -- "/metrics" --> alloy
  worker -- "/metrics" --> alloy
  api -- "OTLP gRPC<br/>traces" --> alloy
  worker -- "OTLP gRPC<br/>traces" --> alloy
  cadvisor -- "/metrics" --> alloy
  alloy -- "remote_write" --> prometheus
  alloy -- "Docker stdout<br/>JSON → labels" --> loki
  alloy -- "OTLP" --> tempo
  tempo -- "span metrics<br/>remote_write" --> prometheus
  prometheus -- "alert rules" --> alertmanager
  grafana -- "PromQL" --> prometheus
  grafana -- "LogQL" --> loki
  grafana -- "TraceQL" --> tempo

  style server1 fill:#1a1a2e,stroke:#e94560,color:#fff
  style server2 fill:#1a1a2e,stroke:#0f3460,color:#fff

Data flow

sequenceDiagram
  participant App as api / worker
  participant Alloy as Grafana Alloy
  participant Prom as Prometheus
  participant Loki as Loki
  participant Tempo as Tempo
  participant AM as Alertmanager
  participant G as Grafana

  loop every 15s
    Alloy->>App: GET /metrics
    Alloy->>Prom: remote_write (metrics)
  end

  App->>Alloy: OTLP gRPC (traces)
  Alloy->>Tempo: forward traces
  Tempo->>Prom: span metrics (remote_write)

  App-->>Alloy: stdout (JSON logs + trace_id)
  Alloy->>Loki: push log streams

  loop every 15s
    Prom->>Prom: evaluate alert rules
    Prom-->>AM: firing alerts
  end

  G->>Prom: PromQL queries
  G->>Loki: LogQL queries
  G->>Tempo: TraceQL queries

Quick Start

Requirements: Docker + Docker Compose

git clone <this-repo>
cd <this-repo>

make up       # Build images and start both servers
make check    # Smoke-test every service endpoint
make open     # Open Grafana in the browser (admin / admin)
make load     # Run k6 load test — triggers the HighErrorRate alert

All Makefile targets

Target	Description
`make help`	List all commands
`make build`	Build only the application images
`make up`	Build + start both servers (observability first, then apps)
`make down`	Stop all containers in both servers
`make check`	Smoke-test every service endpoint
`make logs-api`	Tail API logs only
`make logs-worker`	Tail worker logs only
`make load`	Run k6 load test (both servers must be up)
`make open`	Open Grafana in the browser
`make clean`	Stop containers, delete volumes, and remove the shared network
`make screenshots`	Open all UIs for taking fresh screenshots into `docs/images/`

Service Endpoints

Service	URL	Notes
API	http://localhost:8080	GET /payments, POST /payments, GET /health, GET /metrics
Worker	http://localhost:8081	POST /process, GET /health, GET /metrics
Grafana	http://localhost:3000	admin / admin
Prometheus	http://localhost:9090	Metrics + alert rules
Alertmanager	http://localhost:9093	Alert routing
Loki	http://localhost:3100	Log storage
Tempo	http://localhost:3200	Trace storage + service graph
Alloy UI	http://localhost:12345	Live pipeline inspector
cAdvisor	http://localhost:8082	Container resource metrics
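
A smoke test along the lines of `make check` can be sketched in Python. The base URLs come from the table above; the readiness paths for Grafana, Prometheus, Alertmanager, Loki, and Tempo are assumptions based on each tool's usual health endpoint, not taken from this repo's Makefile, and `http_status` / `check_all` are hypothetical names.

```python
from typing import Callable, Dict
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Assumption: readiness paths for the observability services are each
# tool's standard health endpoint, not necessarily what `make check` hits.
ENDPOINTS: Dict[str, str] = {
    "api": "http://localhost:8080/health",
    "worker": "http://localhost:8081/health",
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/ready",
    "alertmanager": "http://localhost:9093/-/ready",
    "loki": "http://localhost:3100/ready",
    "tempo": "http://localhost:3200/ready",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if it is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (URLError, OSError):
        return 0


def check_all(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers 200."""
    return {name: fetch(url) == 200 for name, url in ENDPOINTS.items()}
```

With both servers up, `check_all()` should return `True` for every service. The `fetch` parameter exists only so the check can be exercised without a running stack.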

What's Running

Applications

`api` — Payments API (`:8080`)

Method	Path	Description
`GET`	`/payments`	List payments
`GET`	`/payments/{id}`	Get payment by ID
`POST`	`/payments`	Create payment (3% simulated error rate)
`GET`	`/health`	Health check
`GET`	`/metrics`	Prometheus metrics

`worker` — Background Processor (`:8081`)

Method	Path	Description
`POST`	`/process`	Process a task (5% simulated error rate)
`GET`	`/health`	Health check
`GET`	`/metrics`	Prometheus metrics

Both services expose:

Metric	Type	Labels
`http_requests_total`	Counter	`method`, `endpoint`, `status_code`
`http_request_duration_seconds`	Histogram	`method`, `endpoint`
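
To make the label model concrete, here is a minimal in-memory sketch of a counter keyed by the same three labels as `http_requests_total`. It is an illustration only, not the services' actual instrumentation; `RequestCounter` is a hypothetical name.

```python
from collections import Counter


class RequestCounter:
    """Toy stand-in for a Prometheus counter with three labels."""

    def __init__(self) -> None:
        # One time series per unique (method, endpoint, status_code) tuple.
        self._series: Counter = Counter()

    def inc(self, method: str, endpoint: str, status_code: int) -> None:
        self._series[(method, endpoint, str(status_code))] += 1

    def error_ratio(self) -> float:
        """Share of requests whose status code is 5xx."""
        total = sum(self._series.values())
        errors = sum(
            count
            for (_, _, code), count in self._series.items()
            if code.startswith("5")
        )
        return errors / total if total else 0.0
```

Each distinct label combination becomes its own series, which is why the label set is kept small and low-cardinality.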

Observability Features in Detail

Structured logs with trace context

Every request emits a JSON log line with trace_id and span_id injected by a custom TraceContextFormatter:

{
  "message": "request",
  "method": "POST",
  "path": "/payments",
  "status": 200,
  "trace_id": "a1b2c3d4e5f6...",
  "span_id": "1a2b3c4d..."
}

Alloy parses these: level, service, and container become Loki labels (low cardinality); trace_id and span_id become structured metadata (high cardinality, not indexed) — enabling log→trace navigation without exploding the Loki index.
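
The idea behind a trace-aware JSON formatter can be sketched with the standard library alone. This is not the repo's `TraceContextFormatter`: as a stated assumption, a `ContextVar` stands in for OpenTelemetry's active span context, and `JsonTraceFormatter` and `current_trace` are hypothetical names.

```python
import contextvars
import json
import logging

# Assumption: a real formatter would read these IDs from the active
# OpenTelemetry span; a ContextVar keeps this sketch dependency-free.
current_trace = contextvars.ContextVar("current_trace", default=("", ""))


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object carrying trace_id and span_id."""

    def format(self, record: logging.LogRecord) -> str:
        trace_id, span_id = current_trace.get()
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": trace_id,
            "span_id": span_id,
        }
        # Fields passed via `extra=` end up as attributes on the record.
        for key in ("method", "path", "status"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)
```

Attached to a handler with `handler.setFormatter(JsonTraceFormatter())`, a call such as `log.info("request", extra={"method": "POST", "path": "/payments", "status": 200})` produces a line shaped like the example above.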

Distributed tracing (OpenTelemetry)

Both services auto-instrument with FastAPIInstrumentor — zero changes to endpoint code:

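# _resource and app are assumed to be defined earlier in the module
_provider = TracerProvider(resource=_resource)
# Batch spans and export them over OTLP (endpoint read from OTEL_EXPORTER_OTLP_ENDPOINT)
_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(_provider)
# Wrap every route in a server span without touching endpoint code
FastAPIInstrumentor.instrument_app(app)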

OTEL_EXPORTER_OTLP_ENDPOINT points to Alloy (:4317), which forwards to Tempo.

Tempo also generates span metrics (rate, error, duration histograms) and a service graph — both remote-written to Prometheus so they appear in Grafana dashboards automatically.
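
What "span metrics" means can be illustrated with a small aggregation: spans go in, per-service RED counters come out. This is a conceptual sketch, not Tempo's metrics generator; the `Span` shape, the bucket bounds, and `red_metrics` are assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Span:
    service: str
    duration_s: float
    is_error: bool


# Histogram bucket upper bounds in seconds (illustrative values).
BUCKETS: List[float] = [0.05, 0.1, 0.3, 0.5, 1.0, float("inf")]


def red_metrics(spans: Iterable[Span]) -> Dict[str, dict]:
    """Aggregate spans into per-service calls, errors, and latency buckets."""
    out = defaultdict(
        lambda: {"calls": 0, "errors": 0, "buckets": [0] * len(BUCKETS)}
    )
    for span in spans:
        metrics = out[span.service]
        metrics["calls"] += 1
        metrics["errors"] += int(span.is_error)
        # Prometheus histograms are cumulative: a span counts in every
        # bucket whose upper bound is at or above its duration.
        for i, upper in enumerate(BUCKETS):
            if span.duration_s <= upper:
                metrics["buckets"][i] += 1
    return dict(out)
```

Because these counters are derived from traces, services get rate, error, and duration series without emitting them explicitly.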

Metrics ↔ Logs ↔ Traces correlation

Full bidirectional navigation configured in observability/grafana/provisioning/datasources/datasources.yml:

Navigation	Mechanism
Metrics → Traces	Prometheus exemplars attach a trace ID to a sample, so a data point links to a trace that contributed to it
Traces → Logs	Tempo "Logs for this trace" queries Loki by `trace_id` structured metadata
Traces → Metrics	Tempo links to Prometheus span metrics for the same service
Logs → Traces	`derivedFields` on `trace_id` field opens the trace in Tempo

Try it: open the Application Overview dashboard, click a data point on the error rate panel, click the exemplar → you jump to the exact trace. From the trace, click "Logs" → you see the log lines for that request.

Dashboards

Dashboard	URL	Shows
Application Overview	http://localhost:3000/d/app-overview	Request rate, error rate, P95 latency, live logs panel
Infrastructure Overview	http://localhost:3000/d/infra-overview	CPU, memory, network I/O, restarts per container

Both dashboards auto-provision on startup via Grafana provisioning. No manual import needed.

Alloy pipeline inspector

Open http://localhost:12345 to see the live component graph: every metrics scraper, log processor, and trace pipeline is visualised as a DAG.

Alerting

Four alert rules in observability/prometheus/rules/alerts.yml:

Alert	Condition	For	Severity
`ServiceDown`	`up{job="prometheus.scrape.apps"} == 0`	1 min	critical
`HighErrorRate`	5xx / total > 5%	2 min	warning
`HighCpuUsage`	container CPU > 0.8 cores	5 min	warning
`HighMemoryUsage`	container memory > 512 MiB	5 min	warning
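
The "For" column means a condition has to hold continuously before the alert fires. A minimal sketch of that pending/firing behaviour, assuming one evaluation per call (`AlertRule` is a hypothetical name, not a Prometheus API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float     # condition is true when value > threshold
    for_seconds: float   # how long the condition must hold before firing
    active_since: Optional[float] = None

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Any evaluation below the threshold resets the timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

For `HighErrorRate`, this would be `AlertRule(threshold=0.05, for_seconds=120)`: a short blip above 5% stays pending and never reaches Alertmanager.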

Trigger alerts manually

# HighErrorRate — run load test; spike scenario fires it in ~2 min
make load

# ServiceDown — stop the api container
docker compose -f docker-compose.app.yml -p server-1 stop api
# Restore:
docker compose -f docker-compose.app.yml -p server-1 start api

SLOs and error budgets

observability/prometheus/rules/slo.yml implements full SLI/SLO tracking with multi-window burn-rate alerts (Google SRE Workbook pattern):

Service	SLI	Target
`api`	Availability	99.0% non-5xx requests
`api`	Latency	99.0% requests < 300 ms
`worker`	Availability	98.0% non-5xx requests
`worker`	Latency	95.0% requests < 500 ms

Health and /metrics endpoints are excluded from SLIs (synthetic traffic, not user requests).
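
The arithmetic behind burn-rate alerting is short enough to sketch. The function names are hypothetical, and the factor used in the usage note is an illustrative value from the SRE Workbook's 6h tier; the thresholds actually used in slo.yml are not shown here.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.01 for a 99% target."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Budget consumption speed; 1.0 spends the budget exactly over the SLO period."""
    return error_ratio / error_budget(slo_target)


def multiwindow_alert(
    long_ratio: float, short_ratio: float, slo_target: float, factor: float
) -> bool:
    """Fire only when both windows burn the budget faster than `factor`."""
    return (
        burn_rate(long_ratio, slo_target) > factor
        and burn_rate(short_ratio, slo_target) > factor
    )
```

For the `api` availability SLO (99.0%), an 8% error ratio over 6h and 10% over 1h gives burn rates of roughly 8 and 10, so `multiwindow_alert(0.08, 0.10, 0.99, factor=6.0)` is true. If the short window has already recovered to 1%, the alert stays quiet, which is the point of requiring both windows.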

Load Testing (k6)

make load runs observability/k6/load.js with four concurrent scenarios:

Scenario	Service	Pattern	Purpose
`api_baseline`	api	2 VUs × 3 min	Stable baseline traffic
`api_ramp`	api	0 → 10 → 0 over ~2 min	Latency under load
`api_spike`	api	0 → 30 → 0 over 35 s	Triggers `HighErrorRate`
`worker_baseline`	worker	3 VUs × 3 min	Exercises worker service

k6 streams results to Prometheus via remote_write — VU count, request rate, and error rate appear in Grafana dashboards alongside application metrics.

Useful Queries

PromQL — Application metrics

# Request rate per service (req/s)
sum(rate(http_requests_total[1m])) by (service)

# Error rate %
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service) * 100

# P95 latency per service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
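
As a rough illustration of what `histogram_quantile` does with the bucket rates, the sketch below estimates a quantile from cumulative buckets by linear interpolation. It is a simplified re-implementation for learning purposes, assuming `0 < q <= 1`, sorted buckets, at least one finite bucket, and a final `+Inf` bucket; the bucket values in the usage note are made up.

```python
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == float("inf"):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)
```

With buckets `[(0.1, 50), (0.3, 90), (0.5, 98), (float("inf"), 100)]`, the 95th percentile falls in the 0.3 to 0.5 bucket and is estimated at about 0.425 s. The result is only as precise as the bucket layout, so bucket bounds near the SLO thresholds matter.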

PromQL — Infrastructure (cAdvisor)

# CPU per container (cores)
sum(rate(container_cpu_usage_seconds_total{name=~"obs-.*"}[5m])) by (name)

# Memory per container
container_memory_usage_bytes{name=~"obs-.*"}

# Network I/O
sum(rate(container_network_receive_bytes_total{name=~"obs-.*"}[5m])) by (name)

LogQL — Logs

# All logs from api and worker
{service=~"api|worker"}

# Only errors
{service=~"api|worker"} | json | level = "ERROR"

# Logs for a specific trace
{service=~"api|worker"} | trace_id = "<paste-trace-id>"

# Error rate from logs (metric query)
sum(rate({service=~"api|worker"} | json | level="ERROR" [5m])) by (service)
/ sum(rate({service=~"api|worker"} [5m])) by (service)

TraceQL — Traces

# All traces for a service
{ resource.service.name = "api" }

# Traces with errors only
{ status = error }

# Slow requests with errors
{ resource.service.name = "api" && status = error && duration > 50ms }

# Traces touching either service, grouped by root service
{ resource.service.name =~ "api|worker" } | by(rootServiceName)

Learning Guide

This repo is designed as a hands-on learning resource. Here's a suggested exploration path:

1. Start with the pipeline (Alloy)

Open http://localhost:12345/graph and study how data flows:

Which components collect metrics? Logs? Traces?
How are Docker containers discovered automatically?
Open the component detail for loki.process.containers — what stages run on each log line?

Read: Alloy Reference Cards

2. Explore metrics (Prometheus + Grafana)

Go to http://localhost:9090/targets: which targets are UP, and how were they discovered?
Run sum(rate(http_requests_total[1m])) by (service) in Prometheus
Open the Application Overview dashboard and understand each panel
Then run make load and watch the error rate spike

Read: Alloy Metrics Pipeline

3. Understand logs (Loki)

In Grafana → Explore → select Loki datasource
Query {service="api"} — see structured JSON logs
Now query {service="api"} | json | level="ERROR" — filtered by parsed field
Click a log line, look for the trace_id link — this jumps to Tempo

Read: Loki Architecture · Labels Design · LogQL Reference

4. Follow a trace end-to-end (Tempo)

In Grafana → Explore → select Tempo datasource
Search for traces by service api with status = error
Open a trace — see the span waterfall across the request lifecycle
Click "Logs for this trace" → jumps to Loki with the exact trace_id

Read: Alloy Traces Pipeline

5. Trigger and investigate an alert

Run make load to generate spike traffic
Wait ~2 minutes, then check http://localhost:9093 — HighErrorRate should fire
In Grafana, go to Alerting → open the firing alert → click the panel link
From the panel, click an exemplar on the error rate graph → opens the trace
From the trace, click Logs → see the exact log lines for that request

Read: Loki Alerting & Recording Rules

6. Study the SLO implementation

Open observability/prometheus/rules/slo.yml
Understand how SLIs are defined as recording rules
See how multi-window burn-rate alerts work (1h + 6h windows)
Understand the error budget calculation

Reference: Google SRE Workbook — Alerting on SLOs

7. Inspect the label strategy

Open omd.yaml — the centralized label registry
Understand why service, container, level are labels in Loki
Understand why trace_id is not a label (structured metadata instead)
Look at observability/alloy/config.alloy — see how labels are assigned

Read: Loki Labels Design
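
The cardinality argument can be checked with simple arithmetic: Loki creates at most one stream per unique combination of label values. The sketch below computes that upper bound; the value counts in the usage note are illustrative assumptions, and `max_streams` is a hypothetical helper.

```python
from typing import Dict


def max_streams(distinct_values: Dict[str, int]) -> int:
    """Upper bound on streams: the product of distinct values per label."""
    total = 1
    for count in distinct_values.values():
        total *= count
    return total
```

With two services, two containers, and four log levels, `max_streams({"service": 2, "container": 2, "level": 4})` is 16. Adding `trace_id` as a label with 1,000 distinct values raises the bound to 16,000, and it keeps growing with every new request, which is why `trace_id` belongs in structured metadata.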

Reference Cards

Detailed reference documentation for each tool in the stack:

docs/observability-mastery/
├── README.md                          ← start here — overview + quick lookup
├── alloy/
│   ├── 01-core-concepts.md            component model, River syntax, CLI
│   ├── 02-metrics-pipeline.md         discovery, relabeling, scrape, remote_write
│   ├── 03-logs-pipeline.md            loki.source, all processing stages, loki.write
│   ├── 04-traces-pipeline.md          otelcol components, tail sampling, Tempo export
│   └── 05-production-advanced.md      K8s, hot reload, clustering, debugging
└── loki/
    ├── 01-architecture.md             mental model, write/read path, deployment modes
    ├── 02-labels-design.md            cardinality, labels vs structured metadata
    ├── 03-logql-reference.md          full query language with examples
    ├── 04-production-config.md        loki.yaml, schema, storage, ruler, Helm
    └── 05-operations-cost.md          troubleshooting, cost reduction, sizing

Repository Structure

.
├── src/
│   ├── api/                        # Payments API — FastAPI + OTel + Prometheus
│   │   ├── main.py
│   │   ├── requirements.txt
│   │   └── Dockerfile
│   └── worker/                     # Background processor — same instrumentation
│       ├── main.py
│       ├── requirements.txt
│       └── Dockerfile
├── observability/
│   ├── alloy/
│   │   └── config.alloy            # Metrics + logs (Docker) + OTLP traces pipeline
│   ├── prometheus/
│   │   ├── prometheus.yml
│   │   └── rules/
│   │       ├── alerts.yml          # ServiceDown, HighErrorRate, CPU, Memory
│   │       └── slo.yml             # SLI recording rules + error budget + burn-rate
│   ├── alertmanager/
│   │   └── config.yml
│   ├── tempo/
│   │   └── config.yml              # Tracing + span metrics generator
│   ├── loki/
│   │   └── config.yml
│   ├── grafana/
│   │   └── provisioning/
│   │       ├── datasources/        # Prometheus + Loki + Tempo with cross-links
│   │       └── dashboards/
│   │           ├── hello-api.json  # Application Overview dashboard
│   │           └── infra.json      # Infrastructure Overview dashboard
│   └── k6/
│       └── load.js                 # Load test: 4 scenarios across both services
├── docs/
│   ├── images/                     # Screenshots of the running stack
│   ├── decisions.md                # Architecture Decision Records (ADRs)
│   └── observability-mastery/      # Reference cards for the stack
│       ├── README.md
│       ├── alloy/                  # 5 reference cards for Grafana Alloy
│       └── loki/                   # 5 reference cards for Grafana Loki
├── docker-compose.app.yml          # Server 1 — applications
├── docker-compose.observability.yml # Server 2 — observability stack
├── omd.yaml                        # Observability Metadata Definition — label standards
├── Makefile
└── README.md

Design Decisions

Full rationale in docs/decisions.md. Summary:

Decision	Summary	ADR
Grafana Alloy	Single agent for metrics + logs + traces	ADR-001
OpenTelemetry SDK	CNCF-standard auto-instrumentation, backend-agnostic	ADR-002
`omd.yaml`	Centralized label registry to prevent drift and cardinality issues	ADR-003
Two Compose files	Faithful two-server simulation with independent lifecycles	ADR-004
Prometheus single-node	No object storage needed; one-line migration to Mimir	ADR-005
k6 + remote_write	Load test metrics in the same Grafana dashboards	ADR-006
Tempo span metrics	Automatic RED metrics from traces, service graph for free	ADR-007
Datasource cross-links	Full bidirectional navigation: metrics ↔ logs ↔ traces	ADR-008

Production Path

This stack runs on Docker Compose for simplicity. The natural evolution:

graph TB
  subgraph local["Current — Local Playground"]
    dc1["docker-compose.app.yml"]
    dc2["docker-compose.observability.yml"]
  end

  subgraph prod["Production — Kubernetes"]
    helm["Helm Charts"]
    apps["api / worker<br/>Deployments + HPA"]
    mimir["Grafana Mimir<br/>(replaces Prometheus)"]
    loki_prod["Grafana Loki<br/>SimpleScalable + S3"]
    tempo_prod["Grafana Tempo<br/>(distributed)"]
    alloy_prod["Alloy DaemonSet"]
    grafana_prod["Grafana (HA + SSO)"]
  end

  dc1 -->|"helm install"| helm
  helm --> apps
  helm --> mimir
  helm --> loki_prod
  helm --> tempo_prod
  helm --> alloy_prod
  helm --> grafana_prod

  style local fill:#1a1a2e,stroke:#e94560,color:#fff
  style prod fill:#1a1a2e,stroke:#16c79a,color:#fff

Concern	Local (current)	Production
Orchestration	Docker Compose	Kubernetes (EKS / GKE / AKS)
Metrics	Single Prometheus, 7 d retention	Grafana Mimir, S3/GCS, multi-tenant
Logs	Loki single-binary	Loki SimpleScalable + S3
Traces	Tempo single-binary	Tempo distributed
Agent	Alloy container	Alloy DaemonSet per node
Alerting	Single Alertmanager	3-node HA cluster + PagerDuty/Slack
Labels	`omd.yaml` manual convention	CI lint enforcing `omd.yaml` at PR time
