An observability stack based on OpenTelemetry (OTel) and the Grafana "LGTM" suite.
- Loki: Log aggregation.
- Grafana: Visualization and dashboards.
- Tempo: Distributed tracing.
- Mimir: Scalable long-term storage for Prometheus metrics.
- OTel Collector: Central gateway for receiving and routing telemetry data.
- MinIO/S3: Object storage backend for long-term data retention.
- Docker and Docker Compose.
- uv (for running the example app).
Copy the example environment file and adjust if necessary:
cp .env.example .env
Follow the MinIO setup instructions below if you want to use MinIO for local development.
docker-compose up -d
This starts Loki, Tempo, Mimir, Grafana, the OTel Collector, and a local MinIO instance.
cd example/fastapi-app
uv sync
uv run python main.py
Trigger some data by visiting http://localhost:8000/process.
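A single visit produces very little data for the dashboards. As a minimal sketch, the following standard-library script calls the endpoint repeatedly and tallies the status codes. The helper names `fetch_status` and `generate_traffic` are illustrative and not part of the example app.

```python
from __future__ import annotations

import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still count as traffic and feed the error panels.
        return exc.code


def generate_traffic(
    url: str, count: int, fetch: Callable[[str], int] = fetch_status
) -> Counter:
    """Call `url` `count` times and tally the status codes that came back."""
    return Counter(fetch(url) for _ in range(count))


# Example (assumes the example app is listening on port 8000):
# print(generate_traffic("http://localhost:8000/process", 50))
```

The `fetch` parameter exists only so the loop can be exercised without a running server.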
The stack is currently configured to use MinIO for local development.
In your .env file:
S3_ENDPOINT=host.docker.internal:9000
S3_INSECURE=true
S3_FORCE_PATH_STYLE=true
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
Run docker compose up -d from example/minio to start the MinIO instance.
The MinIO instance is running at http://localhost:9000 with the default credentials minioadmin/minioadmin.
Go to the MinIO dashboard, and create the buckets loki-logs, tempo-traces, and mimir-metrics.
To switch to production AWS S3:
- Update .env:
  - S3_ENDPOINT: s3.us-east-1.amazonaws.com (or your region's endpoint).
  - S3_INSECURE: false.
  - S3_FORCE_PATH_STYLE: false.
  - AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY: Your AWS credentials.
- Ensure the buckets (loki-logs, tempo-traces, mimir-metrics) exist in your AWS account, or update the bucket name variables in .env.
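Before restarting the stack against AWS, it can help to sanity-check the .env file. The sketch below parses simple KEY=VALUE lines and flags settings that still look like the MinIO defaults. The function names and the specific checks are assumptions for illustration; the stack itself does not ship such a tool.

```python
from __future__ import annotations

REQUIRED_KEYS = (
    "S3_ENDPOINT",
    "S3_INSECURE",
    "S3_FORCE_PATH_STYLE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_s3_settings(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if not env.get(key)]
    if env.get("S3_ENDPOINT", "").endswith("amazonaws.com"):
        # These three checks mirror the production values listed above.
        if env.get("S3_INSECURE", "").lower() != "false":
            problems.append("S3_INSECURE should be false for AWS S3")
        if env.get("S3_FORCE_PATH_STYLE", "").lower() != "false":
            problems.append("S3_FORCE_PATH_STYLE should be false for AWS S3")
        if env.get("AWS_ACCESS_KEY_ID") == "minioadmin":
            problems.append("AWS_ACCESS_KEY_ID still has the MinIO default")
    return problems
```

A typical use would be `check_s3_settings(parse_env(open(".env").read()))`, printing any returned messages before running docker compose.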
Use the following table to set up your primary observability dashboard. These metrics are exported by the FastAPI application.
| Panel Name | Visualization | Query (PromQL) | Description |
|---|---|---|---|
| Total Request Rate | Time series | `sum(rate(http_requests_total[$__rate_interval])) by (http_target)` | Real-time traffic per endpoint (requests/sec). |
| Error Rate (%) | Stat | `100 * sum(rate(http_errors_total[$__range])) / sum(rate(http_requests_total[$__range]))` | Percentage of requests resulting in 4xx/5xx errors over the selected time range. |
| P95 Latency | Time series | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))` | 95th percentile response time across all endpoints. |
| Active Requests | Gauge | `sum(http_server_active_requests)` | Number of concurrent requests being processed. |
| Errors by Endpoint | Bar chart | `sum(increase(http_errors_total[$__range])) by (http_target)` | Total errors grouped by path over the selected time range. |
| Top 5 Slowest Paths | Table | `topk(5, sum(rate(http_request_duration_seconds_sum[$__range])) by (http_target) / sum(rate(http_request_duration_seconds_count[$__range])) by (http_target))` | The five endpoints with the highest average latency. |
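The P95 Latency panel relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below mirrors that idea in plain Python for classic histograms; it is a simplified illustration, not Prometheus's or Mimir's actual implementation, and the bucket boundaries are made-up sample values.

```python
from __future__ import annotations

import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls in the overflow bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, sample))  # -> 0.8125
```

With 100 requests the 95th falls in the 0.5 to 1.0 second bucket, 5 of 8 requests into it, so the estimate is 0.5 + 0.5 * 5/8. Accuracy therefore depends on how finely the buckets are spaced around the latencies of interest.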
- Click + Add in the top right of your dashboard -> Visualization.
- Select Mimir as the data source.
- Paste the Query from the table above.
- Set the Title to the Panel Name.
- Select the Visualization type from the right sidebar.
- Click Save or Apply.
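The manual steps above could also be scripted by generating panel definitions for Grafana's dashboard JSON model. The following is a rough sketch under several assumptions: the data source UID `mimir` is hypothetical, the grid layout is arbitrary, and only the fields shown are set. Mimir is queried through Grafana's Prometheus data source type, which is why the type is `prometheus`.

```python
from __future__ import annotations

import json

# Grafana panel type ids for the visualization names used in the table.
PANEL_TYPES = {
    "Time series": "timeseries",
    "Stat": "stat",
    "Gauge": "gauge",
    "Bar chart": "barchart",
    "Table": "table",
}


def build_panel(
    panel_id: int, title: str, visualization: str, expr: str, datasource_uid: str
) -> dict:
    """Build one panel dict; panels are laid out two per row."""
    index = panel_id - 1
    return {
        "id": panel_id,
        "title": title,
        "type": PANEL_TYPES[visualization],
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        "targets": [{"refId": "A", "expr": expr}],
        "gridPos": {"h": 8, "w": 12, "x": 12 * (index % 2), "y": 8 * (index // 2)},
    }


def dashboard_json(title: str, panels: list[dict]) -> str:
    """Serialize a minimal dashboard body containing the given panels."""
    return json.dumps({"title": title, "panels": panels}, indent=2)


panels = [
    build_panel(1, "Active Requests", "Gauge",
                "sum(http_server_active_requests)", "mimir"),
]
print(dashboard_json("FastAPI Overview", panels))
```

The resulting JSON can be pasted into Grafana's dashboard import dialog; check the data source UID against your own Grafana instance first.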
Loki allows you to query logs using LogQL. The stack is configured to automatically label logs with metadata like service_name and deployment_environment.
| Panel Name | Visualization | Query (LogQL) | Description |
|---|---|---|---|
| Application Logs | Logs | `{service_name="fastapi-service"}` | Live stream of all logs from the FastAPI app. |
| Error Log Stream | Logs | `{service_name="fastapi-service"} \|= "error"` | Filtered stream showing only lines containing "error" (case-sensitive; use `\|~ "(?i)error"` to match any case). |
| Log Volume | Time series | `count_over_time({service_name="fastapi-service"}[$__interval])` | Number of log lines produced per interval. |
| Severity Distribution | Pie chart | `sum by (level) (count_over_time({service_name="fastapi-service"}[$__range]))` | Breakdown of log levels (INFO, ERROR, WARN) for the selected time range. |
| Error Frequency | Time series | `count_over_time({service_name="fastapi-service"} \|= "error" [$__interval])` | Number of log lines containing "error" per interval. |
- Click + Add -> Visualization.
- Select Loki as the data source.
- Paste one of the Queries above.
- Select the matching Visualization type from the right sidebar.
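The same LogQL queries can be run outside Grafana through Loki's HTTP API, which exposes range queries at /loki/api/v1/query_range. The sketch below only builds the request URL. The base URL assumes Loki's default HTTP port 3100 is published to the host, which may not match this stack's Compose file, and `loki_query_range_url` is an illustrative helper name.

```python
from __future__ import annotations

from urllib.parse import urlencode


def loki_query_range_url(
    base_url: str, query: str, start_ns: int, end_ns: int, limit: int = 100
) -> str:
    """Build a query_range URL; start and end are Unix timestamps in nanoseconds."""
    params = urlencode(
        {
            "query": query,
            "start": start_ns,
            "end": end_ns,
            "limit": limit,
            "direction": "backward",  # newest entries first
        }
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


url = loki_query_range_url(
    "http://localhost:3100",  # assumption: Loki reachable on its default port
    '{service_name="fastapi-service"}',
    start_ns=1,
    end_ns=2,
)
print(url)
```

Fetching the URL with any HTTP client returns JSON whose matching log streams are under `data.result`.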
When viewing logs in the Explore tab or a Logs panel:
- Click on a log line to expand it.
- Look for the trace_id field.
- Click the Tempo button next to the ID to see the full distributed trace for that specific log entry.
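This correlation only works if each log record carries the trace ID of the request that produced it. The sketch below shows the general idea with the standard library alone: a logging filter stamps every record with the current trace ID, rendered as 32 lowercase hex characters (the usual text form of a 128-bit OTel trace ID). The `current_trace_id` context variable is a stand-in for the active span context that the OpenTelemetry SDK would normally track; the example app's real instrumentation is not shown here and may differ.

```python
from __future__ import annotations

import contextvars
import io
import logging

# Stand-in for the active span context (normally managed by the OTel SDK).
current_trace_id: contextvars.ContextVar[int] = contextvars.ContextVar(
    "current_trace_id", default=0
)


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record as `trace_id`."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = format(current_trace_id.get(), "032x")
        return True


def make_logger(stream) -> logging.Logger:
    """Return a logger whose lines include trace_id=<hex> for correlation."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("trace-demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
current_trace_id.set(0xABC)  # pretend a request with this trace is in flight
log.info("processed")
print(buffer.getvalue(), end="")
```

Because the ID lives in a context variable, concurrent requests handled in separate tasks or threads each log their own trace ID.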
Beyond the basic panels, the following patterns cover RED monitoring (Rate, Errors, Duration), latency distribution, log severity, and user-satisfaction scoring:
| Pattern / Metric | Visualization | Query | Description |
|---|---|---|---|
| RED: Rate | Time series | `sum(rate(http_requests_total[$__rate_interval]))` | Requests per second. |
| RED: Errors | Time series | `sum(rate(http_errors_total[$__rate_interval]))` | Errors per second. |
| RED: Duration | Time series | `histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))` | 90th percentile response time. |
| Latency Heatmap | Heatmap | `sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le)` | Visual distribution of latency buckets. |
| Log Severity | Time series / Bar gauge | `sum by (level) (count_over_time({service_name="fastapi-service"}[$__interval]))` | Log volume by severity over time (LogQL; use the Loki data source). |
| Apdex Score | Stat | `(sum(rate(http_request_duration_seconds_bucket{le="0.5"}[$__range])) + sum(rate(http_request_duration_seconds_bucket{le="1.0"}[$__range]))) / 2 / sum(rate(http_request_duration_seconds_count[$__range]))` | Single score (0-1) for user satisfaction, with 0.5 s as the satisfied threshold and 1.0 s as the tolerated threshold. |
| Resource Grouping | Time series | `sum(rate(http_requests_total[$__rate_interval])) by (service_version, deployment_environment)` | Compare traffic across versions and environments. |
Tip: Replace fastapi-service in the queries above with your application's service name.
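The Apdex score is defined as (satisfied + tolerating / 2) / total. Histogram buckets are cumulative, so the bucket at the tolerated threshold already contains the satisfied requests; averaging the two bucket counts yields the same value. The sketch below works through that arithmetic with made-up counts, assuming thresholds of 0.5 s and 1.0 s.

```python
from __future__ import annotations


def apdex_from_buckets(le_satisfied: float, le_tolerated: float, total: float) -> float:
    """Apdex from cumulative bucket counts (le_tolerated includes le_satisfied)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (le_satisfied + le_tolerated) / 2 / total


def apdex_from_counts(satisfied: float, tolerating: float, total: float) -> float:
    """Textbook Apdex from disjoint satisfied and tolerating counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (satisfied + tolerating / 2) / total


# 100 requests: 70 finished within 0.5 s, 90 within 1.0 s (so 20 tolerating).
print(apdex_from_buckets(70, 90, 100))  # -> 0.8
print(apdex_from_counts(70, 20, 100))   # -> 0.8
```

Both forms agree because (S + (S + T)) / 2 equals S + T / 2. Halving only the second bucket instead would give 70 + 45 = 115 out of 100, a score above 1.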
- Dynamic Time Ranges: Instead of hardcoding [5m], use Grafana global variables:
  - [$__range]: Adjusts to the exact time period selected in the dashboard picker (e.g., Last 1 hour). Use this for total counts (with increase()) or "Stat" panels.
  - [$__rate_interval]: Automatically calculates the best interval for rate() based on the graph's time range and resolution. Use this for Time series graphs.
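Grafana documents $__rate_interval as max($__interval + scrape interval, 4 * scrape interval), which guarantees that every rate() window spans several samples. The sketch below reproduces that formula; the 15-second default is the Prometheus data source's default scrape interval setting and is an assumption about how this stack's data source is configured.

```python
from __future__ import annotations


def rate_interval(interval_s: float, scrape_interval_s: float = 15.0) -> float:
    """Compute Grafana's $__rate_interval in seconds from $__interval."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: $__interval is small, so the 4x scrape interval floor applies.
print(rate_interval(10))   # -> 60.0
# Zoomed out: $__interval dominates and the window grows with it.
print(rate_interval(120))  # -> 135.0
```

The floor is what prevents empty graphs when zooming in: a rate() window shorter than two scrapes has no pair of samples to compute a rate from.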
- Unhealthy Ring: If Mimir/Loki report ring issues, ensure replication_factor is set to 1 in the YAML configs for single-node setups.
- Log Ingestion: Check the OTel Collector logs (docker logs otel-collector) to see if data is being received and exported correctly.
- S3 Connectivity: Ensure the S3 endpoint is reachable from within the Docker containers. On macOS, host.docker.internal is used to reach the host's port 9000.
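For the S3 connectivity check, a plain TCP connection attempt is often enough to tell a networking problem from a credentials problem. The sketch below is a minimal reachability probe; to test from the containers' point of view it would need to run inside one of them, which assumes a Python interpreter is available in that image.

```python
from __future__ import annotations

import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example, using the endpoint from the .env settings above:
# print(can_connect("host.docker.internal", 9000))
```

A False result points at DNS or networking (for example, the host name not resolving inside the container); a True result with failing uploads points at credentials, bucket names, or the path-style setting instead.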