A batteries-included, local observability sandbox built on the Grafana OSS stack. Spin it up in two commands and explore metrics, logs, traces, alerting, SLOs, and load testing — all wired together with realistic FastAPI services.
No cloud account needed. Everything runs locally via Docker Compose.
This playground covers the three pillars of observability plus the connective tissue between them:
| Topic | Tools | What to explore |
|---|---|---|
| Metrics | Prometheus + Grafana Alloy | Scraping, PromQL, recording rules, SLIs/SLOs |
| Logs | Loki + Grafana Alloy | Structured logs, LogQL, label design, log→trace links |
| Traces | Tempo + OpenTelemetry | Distributed tracing, span metrics, TraceQL, trace→log links |
| Alerting | Prometheus + Alertmanager | Threshold alerts, multi-window burn-rate SLO alerts |
| Dashboards | Grafana | Correlation: metrics ↔ logs ↔ traces in a single click |
| Load testing | k6 | Realistic traffic patterns, metrics in Prometheus |
| Collection agent | Grafana Alloy | Unified pipeline: scrape + collect logs + receive OTLP |
Two simulated servers connected by a shared Docker network:
graph LR
subgraph server1["Server 1 — docker-compose.app.yml"]
api["api :8080<br/>/payments · /health · /metrics"]
worker["worker :8081<br/>/process · /health · /metrics"]
end
subgraph server2["Server 2 — docker-compose.observability.yml"]
alloy["Grafana Alloy :12345"]
prometheus["Prometheus :9090"]
loki["Loki :3100"]
tempo["Tempo :3200"]
alertmanager["Alertmanager :9093"]
grafana["Grafana :3000"]
cadvisor["cAdvisor :8082"]
end
api -- "/metrics" --> alloy
worker -- "/metrics" --> alloy
api -- "OTLP gRPC<br/>traces" --> alloy
worker -- "OTLP gRPC<br/>traces" --> alloy
cadvisor -- "/metrics" --> alloy
alloy -- "remote_write" --> prometheus
alloy -- "Docker stdout<br/>JSON → labels" --> loki
alloy -- "OTLP" --> tempo
tempo -- "span metrics<br/>remote_write" --> prometheus
prometheus -- "alert rules" --> alertmanager
grafana -- "PromQL" --> prometheus
grafana -- "LogQL" --> loki
grafana -- "TraceQL" --> tempo
style server1 fill:#1a1a2e,stroke:#e94560,color:#fff
style server2 fill:#1a1a2e,stroke:#0f3460,color:#fff
sequenceDiagram
participant App as api / worker
participant Alloy as Grafana Alloy
participant Prom as Prometheus
participant Loki as Loki
participant Tempo as Tempo
participant AM as Alertmanager
participant G as Grafana
loop every 15s
Alloy->>App: GET /metrics
Alloy->>Prom: remote_write (metrics)
end
App->>Alloy: OTLP gRPC (traces)
Alloy->>Tempo: forward traces
Tempo->>Prom: span metrics (remote_write)
App-->>Alloy: stdout (JSON logs + trace_id)
Alloy->>Loki: push log streams
loop every 15s
Prom->>Prom: evaluate alert rules
Prom-->>AM: firing alerts
end
G->>Prom: PromQL queries
G->>Loki: LogQL queries
G->>Tempo: TraceQL queries
Requirements: Docker + Docker Compose
git clone <this-repo>
cd <this-repo>
make up # Build images and start both servers
make check # Smoke-test every service endpoint
make open # Open Grafana in the browser (admin / admin)
make load # Run k6 load test, which triggers the HighErrorRate alert

| Target | Description |
|---|---|
| `make help` | List all commands |
| `make build` | Build only the application images |
| `make up` | Build + start both servers (observability first, then apps) |
| `make down` | Stop all containers in both servers |
| `make check` | Smoke-test every service endpoint |
| `make logs-api` | Tail API logs only |
| `make logs-worker` | Tail worker logs only |
| `make load` | Run k6 load test (both servers must be up) |
| `make open` | Open Grafana in the browser |
| `make clean` | Stop containers, delete volumes, and remove the shared network |
| `make screenshots` | Open all UIs for taking fresh screenshots into docs/images/ |
| Service | URL | Notes |
|---|---|---|
| API | http://localhost:8080 | GET /payments, POST /payments, GET /health, GET /metrics |
| Worker | http://localhost:8081 | POST /process, GET /health, GET /metrics |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | Metrics + alert rules |
| Alertmanager | http://localhost:9093 | Alert routing |
| Loki | http://localhost:3100 | Log storage |
| Tempo | http://localhost:3200 | Trace storage + service graph |
| Alloy UI | http://localhost:12345 | Live pipeline inspector |
| cAdvisor | http://localhost:8082 | Container resource metrics |
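
The URLs in the table above are what a smoke test such as `make check` needs to probe. The Makefile target itself is not shown here, so the following is only a minimal Python sketch of the idea. The names `ENDPOINTS`, `http_status`, and `check_endpoints` are hypothetical, and the health paths for Grafana, Prometheus, and Loki are assumptions based on those tools' usual health endpoints rather than on this repo.

```python
from typing import Callable, Dict
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Hosts and ports come from the services table; the health paths for the
# non-application services are assumptions, not taken from the Makefile.
ENDPOINTS: Dict[str, str] = {
    "api": "http://localhost:8080/health",
    "worker": "http://localhost:8081/health",
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
    "loki": "http://localhost:3100/ready",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if it fails."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def check_endpoints(
    endpoints: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

With both servers up, `check_endpoints(ENDPOINTS)` should report `True` for every service. The `fetch` parameter exists only so the check can be exercised without a running stack.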
API endpoints:

| Method | Path | Description |
|---|---|---|
| `GET` | `/payments` | List payments |
| `GET` | `/payments/{id}` | Get payment by ID |
| `POST` | `/payments` | Create payment (3% simulated error rate) |
| `GET` | `/health` | Health check |
| `GET` | `/metrics` | Prometheus metrics |
Worker endpoints:

| Method | Path | Description |
|---|---|---|
| `POST` | `/process` | Process a task (5% simulated error rate) |
| `GET` | `/health` | Health check |
| `GET` | `/metrics` | Prometheus metrics |
Both services expose:
| Metric | Type | Labels |
|---|---|---|
| `http_requests_total` | Counter | `method`, `endpoint`, `status_code` |
| `http_request_duration_seconds` | Histogram | `method`, `endpoint` |
Every request emits a JSON log line with trace_id and span_id injected by a custom TraceContextFormatter:
{
"message": "request",
"method": "POST",
"path": "/payments",
"status": 200,
"trace_id": "a1b2c3d4e5f6...",
"span_id": "1a2b3c4d..."
}

Alloy parses these lines: `level`, `service`, and `container` become Loki labels (low cardinality), while `trace_id` and `span_id` become structured metadata (high cardinality, not indexed). This enables log→trace navigation without exploding the Loki index.
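
The repo's `TraceContextFormatter` is not reproduced here, so the following is only a dependency-free sketch of how such a formatter might look. `JsonTraceFormatter`, `current_trace`, and `demo_line` are hypothetical names. A real service would presumably read the IDs from the active OpenTelemetry span; a `contextvars.ContextVar` stands in for that here.

```python
import contextvars
import json
import logging

# Stand-in for the active OpenTelemetry span context (assumption: the real
# formatter would obtain these IDs from the OTel SDK instead).
current_trace = contextvars.ContextVar("current_trace", default=("", ""))


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object carrying trace_id and span_id."""

    def format(self, record: logging.LogRecord) -> str:
        trace_id, span_id = current_trace.get()
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": trace_id,
            "span_id": span_id,
        }
        # Fields passed via `extra=` end up as attributes on the record.
        for key in ("method", "path", "status"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def demo_line() -> str:
    """Format one request record the way a log handler would emit it."""
    record = logging.makeLogRecord({
        "msg": "request",
        "levelno": logging.INFO,
        "levelname": "INFO",
        "method": "POST",
        "path": "/payments",
        "status": 200,
    })
    return JsonTraceFormatter().format(record)
```

In a service, the formatter would be attached with `handler.setFormatter(JsonTraceFormatter())` on a `logging.StreamHandler` writing to stdout, which is where Alloy picks up the container logs.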
Both services auto-instrument with FastAPIInstrumentor — zero changes to endpoint code:
_provider = TracerProvider(resource=_resource)
_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(_provider)
FastAPIInstrumentor.instrument_app(app)

`OTEL_EXPORTER_OTLP_ENDPOINT` points to Alloy (`:4317`), which forwards to Tempo.
Tempo also generates span metrics (rate, error, duration histograms) and a service graph — both remote-written to Prometheus so they appear in Grafana dashboards automatically.
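
Conceptually, span metrics are aggregates over finished spans. The sketch below illustrates that idea only and is not how Tempo's metrics generator is implemented. `Span` and `red_metrics` are hypothetical names, and a mean duration stands in for the duration histograms Tempo actually produces.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    service: str
    duration_s: float
    error: bool


def red_metrics(spans: List[Span], window_s: float) -> Dict[str, Dict[str, float]]:
    """Aggregate spans seen in one window into per-service RED figures."""
    grouped: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        grouped[span.service].append(span)

    result: Dict[str, Dict[str, float]] = {}
    for service, items in grouped.items():
        result[service] = {
            # Rate: spans per second over the window.
            "rate_per_s": len(items) / window_s,
            # Errors: share of spans that ended with an error status.
            "error_ratio": sum(1 for s in items if s.error) / len(items),
            # Duration: simplified to a mean instead of histogram buckets.
            "mean_duration_s": sum(s.duration_s for s in items) / len(items),
        }
    return result
```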
Full bidirectional navigation configured in observability/grafana/provisioning/datasources/datasources.yml:
| Navigation | Mechanism |
|---|---|
| Metrics → Traces | Prometheus exemplar annotations link to the trace that caused the data point |
| Traces → Logs | Tempo "Logs for this trace" queries Loki by trace_id structured metadata |
| Traces → Metrics | Tempo links to Prometheus span metrics for the same service |
| Logs → Traces | derivedFields on trace_id field opens the trace in Tempo |
Try it: open the Application Overview dashboard, click a data point on the error rate panel, click the exemplar → you jump to the exact trace. From the trace, click "Logs" → you see the log lines for that request.
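
The Logs → Traces hop boils down to extracting a `trace_id` from a log entry and turning it into a link to the trace. The sketch below illustrates that mechanism in Python. It assumes a regex matcher over the JSON line, whereas the provisioned datasource may match on the structured-metadata field instead. `TRACE_ID_RE`, `TEMPO_URL`, and `trace_link` are hypothetical names, and the link targets Tempo's trace-by-ID HTTP endpoint rather than a Grafana Explore URL.

```python
import re
from typing import Optional

# Assumed matcher: captures the value of a "trace_id" key in a JSON log line.
TRACE_ID_RE = re.compile(r'"trace_id":\s*"([0-9a-f]+)"')

# Tempo address from the services table.
TEMPO_URL = "http://localhost:3200"


def trace_link(log_line: str) -> Optional[str]:
    """Return a Tempo trace-by-ID URL for the log line, or None if it has no trace_id."""
    match = TRACE_ID_RE.search(log_line)
    if match is None:
        return None
    return f"{TEMPO_URL}/api/traces/{match.group(1)}"
```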
| Dashboard | URL | Shows |
|---|---|---|
| Application Overview | http://localhost:3000/d/app-overview | Request rate, error rate, P95 latency, live logs panel |
| Infrastructure Overview | http://localhost:3000/d/infra-overview | CPU, memory, network I/O, restarts per container |
Both dashboards auto-provision on startup via Grafana provisioning. No manual import needed.
Open http://localhost:12345 to see the live component graph: every metrics scraper, log processor, and trace pipeline visualised as a DAG.
Four alert rules in observability/prometheus/rules/alerts.yml:
| Alert | Condition | For | Severity |
|---|---|---|---|
| `ServiceDown` | `up{job="prometheus.scrape.apps"} == 0` | 1 min | critical |
| `HighErrorRate` | 5xx / total > 5% | 2 min | warning |
| `HighCpuUsage` | container CPU > 0.8 cores | 5 min | warning |
| `HighMemoryUsage` | container memory > 512 MiB | 5 min | warning |
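
The "For" column means the condition must hold continuously for that long before the alert fires; until then it is only pending. A minimal Python sketch of that state machine follows. It is a simplification of how Prometheus evaluates alerting rules, and `AlertRule` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float                      # e.g. 0.05 for "5xx / total > 5%"
    for_seconds: float                    # e.g. 120 for "2 min"
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, value: float, now: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Condition no longer holds: the timer resets.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

A brief dip below the threshold resets the timer, which is why a short error burst does not page anyone while a sustained one does.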
# HighErrorRate — run load test; spike scenario fires it in ~2 min
make load
# ServiceDown — stop the api container
docker compose -f docker-compose.app.yml -p server-1 stop api
# Restore:
docker compose -f docker-compose.app.yml -p server-1 start api

`observability/prometheus/rules/slo.yml` implements full SLI/SLO tracking with multi-window burn-rate alerts (Google SRE Workbook pattern):
| Service | SLI | Target |
|---|---|---|
| `api` | Availability | 99.0% non-5xx requests |
| `api` | Latency | 99.0% requests < 300 ms |
| `worker` | Availability | 98.0% non-5xx requests |
| `worker` | Latency | 95.0% requests < 500 ms |
Health and /metrics endpoints are excluded from SLIs (synthetic traffic, not user requests).
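
The arithmetic behind a burn-rate alert is short enough to sketch in Python. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multi-window alert requires both a long and a short window to exceed the same threshold. The threshold of 14.4 for the 1-hour window is the SRE Workbook's example value; the thresholds actually used in `slo.yml` may differ. `burn_rate` and `should_alert` are hypothetical names.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    error_budget = 1.0 - slo_target  # e.g. 0.01 for a 99.0% target
    return error_ratio / error_budget


def should_alert(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float,
    threshold: float,
) -> bool:
    """Fire only when both windows burn budget faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )
```

For the `api` availability SLO (99.0%), a sustained 20% error ratio burns budget about 20 times faster than allowed, so `should_alert(0.20, 0.20, 0.99, 14.4)` is true. If the short window has already recovered to a 1% error ratio, the same call with `short_window_ratio=0.01` is false.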
make load runs observability/k6/load.js with four concurrent scenarios:
| Scenario | Service | Pattern | Purpose |
|---|---|---|---|
| `api_baseline` | api | 2 VUs × 3 min | Stable baseline traffic |
| `api_ramp` | api | 0 → 10 → 0 over ~2 min | Latency under load |
| `api_spike` | api | 0 → 30 → 0 over 35 s | Triggers HighErrorRate |
| `worker_baseline` | worker | 3 VUs × 3 min | Exercises worker service |
k6 streams results to Prometheus via remote_write — VU count, request rate, and error rate appear in Grafana dashboards alongside application metrics.
# Request rate per service (req/s)
sum(rate(http_requests_total[1m])) by (service)
# Error rate %
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service) * 100
# P95 latency per service
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# CPU per container (cores)
sum(rate(container_cpu_usage_seconds_total{name=~"obs-.*"}[5m])) by (name)
# Memory per container
container_memory_usage_bytes{name=~"obs-.*"}
# Network I/O
sum(rate(container_network_receive_bytes_total{name=~"obs-.*"}[5m])) by (name)
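
The P95 query above estimates a quantile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimation for classic (non-native) histograms in Python. It is a simplified illustration, not Prometheus's code, and it assumes `0 < q <= 1`, at least one finite bucket, and a lower bound of 0 for the first bucket.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must end with the +Inf bucket, as Prometheus
    histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return buckets[-2][0]
            # Interpolate linearly within the bucket holding the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

With buckets `[(0.1, 50), (0.3, 90), (0.5, 98), (inf, 100)]`, the 95th percentile lands in the 0.3 to 0.5 bucket at rank 95 of 100, giving roughly 0.425 s. The estimate is only as precise as the bucket boundaries, which is why bucket layout matters for latency SLIs such as "requests < 300 ms".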
# All logs from api and worker
{service=~"api|worker"}
# Only errors
{service=~"api|worker"} | json | level = "ERROR"
# Logs for a specific trace
{service=~"api|worker"} | trace_id = "<paste-trace-id>"
# Error rate from logs (metric query)
sum(rate({service=~"api|worker"} | json | level="ERROR" [5m])) by (service)
/ sum(rate({service=~"api|worker"} [5m])) by (service)
# All traces for a service
{ resource.service.name = "api" }
# Traces with errors only
{ status = error }
# Slow requests with errors
{ resource.service.name = "api" && status = error && duration > 50ms }
# Spans from either service, grouped by the trace's root service
{ resource.service.name =~ "api|worker" } | by(rootServiceName)
This repo is designed as a hands-on learning resource. Here's a suggested exploration path:
Open http://localhost:12345/graph and study how data flows:
- Which components collect metrics? Logs? Traces?
- How are Docker containers discovered automatically?
- Open the component detail for `loki.process.containers`. What stages run on each log line?
Read: Alloy Reference Cards
- Go to http://localhost:9090/targets — which targets are UP? Why are they discovered?
- Run `sum(rate(http_requests_total[1m])) by (service)` in Prometheus
- Open the Application Overview dashboard and understand each panel
- Then run `make load` and watch the error rate spike
Read: Alloy Metrics Pipeline
- In Grafana → Explore → select Loki datasource
- Query `{service="api"}` to see structured JSON logs
- Now query `{service="api"} | json | level="ERROR"` to filter by a parsed field
- Click a log line and look for the trace_id link, which jumps to Tempo
Read: Loki Architecture · Labels Design · LogQL Reference
- In Grafana → Explore → select Tempo datasource
- Search for traces by service `api` with `status = error`
- Open a trace to see the span waterfall across the request lifecycle
- Click "Logs for this trace" → jumps to Loki with the exact trace_id
Read: Alloy Traces Pipeline
- Run `make load` to generate spike traffic
- Wait ~2 minutes, then check http://localhost:9093, where `HighErrorRate` should fire
- In Grafana, go to Alerting → open the firing alert → click the panel link
- From the panel, click an exemplar on the error rate graph → opens the trace
- From the trace, click Logs → see the exact log lines for that request
Read: Loki Alerting & Recording Rules
- Open `observability/prometheus/rules/slo.yml`
- Understand how SLIs are defined as recording rules
- See how multi-window burn-rate alerts work (1h + 6h windows)
- Understand the error budget calculation
Reference: Google SRE Workbook — Alerting on SLOs
- Open `omd.yaml`, the centralized label registry
- Understand why `service`, `container`, `level` are labels in Loki
- Understand why `trace_id` is not a label (structured metadata instead)
- Look at `observability/alloy/config.alloy` to see how labels are assigned
Read: Loki Labels Design
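
Why `trace_id` must not be a label becomes concrete once streams are counted: Loki creates one stream per unique combination of label values. The Python sketch below counts those combinations for a set of log entries. `count_streams` is a hypothetical helper for illustration and does not query Loki.

```python
from typing import Dict, Iterable, List, Set, Tuple


def count_streams(entries: Iterable[Dict[str, str]], label_keys: List[str]) -> int:
    """Count distinct label-value combinations, i.e. Loki streams."""
    streams: Set[Tuple[str, ...]] = set()
    for entry in entries:
        streams.add(tuple(entry.get(key, "") for key in label_keys))
    return len(streams)


def sample_entries(n: int) -> List[Dict[str, str]]:
    """Build n synthetic log entries, each with its own trace_id."""
    return [
        {
            "service": ("api", "worker")[i % 2],
            "level": ("INFO", "ERROR")[(i // 2) % 2],
            "trace_id": f"{i:032x}",
        }
        for i in range(n)
    ]
```

For 1,000 synthetic requests, `count_streams(sample_entries(1000), ["service", "level"])` stays at 4, while adding `"trace_id"` to the label keys yields 1,000 streams, one per request.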
Detailed reference documentation for each tool in the stack:
docs/observability-mastery/
├── README.md ← start here — overview + quick lookup
├── alloy/
│ ├── 01-core-concepts.md component model, River syntax, CLI
│ ├── 02-metrics-pipeline.md discovery, relabeling, scrape, remote_write
│ ├── 03-logs-pipeline.md loki.source, all processing stages, loki.write
│ ├── 04-traces-pipeline.md otelcol components, tail sampling, Tempo export
│ └── 05-production-advanced.md K8s, hot reload, clustering, debugging
└── loki/
├── 01-architecture.md mental model, write/read path, deployment modes
├── 02-labels-design.md cardinality, labels vs structured metadata
├── 03-logql-reference.md full query language with examples
├── 04-production-config.md loki.yaml, schema, storage, ruler, Helm
└── 05-operations-cost.md troubleshooting, cost reduction, sizing
.
├── src/
│ ├── api/ # Payments API — FastAPI + OTel + Prometheus
│ │ ├── main.py
│ │ ├── requirements.txt
│ │ └── Dockerfile
│ └── worker/ # Background processor — same instrumentation
│ ├── main.py
│ ├── requirements.txt
│ └── Dockerfile
├── observability/
│ ├── alloy/
│ │ └── config.alloy # Metrics + logs (Docker) + OTLP traces pipeline
│ ├── prometheus/
│ │ ├── prometheus.yml
│ │ └── rules/
│ │ ├── alerts.yml # ServiceDown, HighErrorRate, CPU, Memory
│ │ └── slo.yml # SLI recording rules + error budget + burn-rate
│ ├── alertmanager/
│ │ └── config.yml
│ ├── tempo/
│ │ └── config.yml # Tracing + span metrics generator
│ ├── loki/
│ │ └── config.yml
│ ├── grafana/
│ │ └── provisioning/
│ │ ├── datasources/ # Prometheus + Loki + Tempo with cross-links
│ │ └── dashboards/
│ │ ├── hello-api.json # Application Overview dashboard
│ │ └── infra.json # Infrastructure Overview dashboard
│ └── k6/
│ └── load.js # Load test: 4 scenarios across both services
├── docs/
│ ├── images/ # Screenshots of the running stack
│ ├── decisions.md # Architecture Decision Records (ADRs)
│ └── observability-mastery/ # Reference cards for the stack
│ ├── README.md
│ ├── alloy/ # 5 reference cards for Grafana Alloy
│ └── loki/ # 5 reference cards for Grafana Loki
├── docker-compose.app.yml # Server 1 — applications
├── docker-compose.observability.yml # Server 2 — observability stack
├── omd.yaml # Observability Metadata Definition — label standards
├── Makefile
└── README.md
Full rationale in docs/decisions.md. Summary:
| Decision | Summary | ADR |
|---|---|---|
| Grafana Alloy | Single agent for metrics + logs + traces | ADR-001 |
| OpenTelemetry SDK | CNCF-standard auto-instrumentation, backend-agnostic | ADR-002 |
| `omd.yaml` | Centralized label registry to prevent drift and cardinality issues | ADR-003 |
| Two Compose files | Faithful two-server simulation with independent lifecycles | ADR-004 |
| Prometheus single-node | No object storage needed; one-line migration to Mimir | ADR-005 |
| k6 + remote_write | Load test metrics in the same Grafana dashboards | ADR-006 |
| Tempo span metrics | Automatic RED metrics from traces, service graph for free | ADR-007 |
| Datasource cross-links | Full bidirectional navigation: metrics ↔ logs ↔ traces | ADR-008 |
This stack runs on Docker Compose for simplicity. The natural evolution:
graph TB
subgraph local["Current — Local Playground"]
dc1["docker-compose.app.yml"]
dc2["docker-compose.observability.yml"]
end
subgraph prod["Production — Kubernetes"]
helm["Helm Charts"]
apps["api / worker<br/>Deployments + HPA"]
mimir["Grafana Mimir<br/>(replaces Prometheus)"]
loki_prod["Grafana Loki<br/>SimpleScalable + S3"]
tempo_prod["Grafana Tempo<br/>(distributed)"]
alloy_prod["Alloy DaemonSet"]
grafana_prod["Grafana (HA + SSO)"]
end
dc1 -->|"helm install"| helm
helm --> apps
helm --> mimir
helm --> loki_prod
helm --> tempo_prod
helm --> alloy_prod
helm --> grafana_prod
style local fill:#1a1a2e,stroke:#e94560,color:#fff
style prod fill:#1a1a2e,stroke:#16c79a,color:#fff
| Concern | Local (current) | Production |
|---|---|---|
| Orchestration | Docker Compose | Kubernetes (EKS / GKE / AKS) |
| Metrics | Single Prometheus, 7 d | Grafana Mimir, S3/GCS, multi-tenant |
| Logs | Loki single-binary | Loki SimpleScalable + S3 |
| Traces | Tempo single-binary | Tempo distributed |
| Agent | Alloy container | Alloy DaemonSet per node |
| Alerting | Single Alertmanager | 3-node HA cluster + PagerDuty/Slack |
| Labels | `omd.yaml` manual convention | CI lint enforcing `omd.yaml` at PR time |







