ikramhasan/lgtm-stack


LGTM Stack POC (Loki, Grafana, Tempo, Mimir)

An observability stack based on OpenTelemetry (OTel) and the Grafana "LGTM" suite.

Architecture

  • Loki: Log aggregation.
  • Grafana: Visualization and dashboards.
  • Tempo: Distributed tracing.
  • Mimir: Scalable long-term storage for Prometheus metrics.
  • OTel Collector: Central gateway for receiving and routing telemetry data.
  • MinIO/S3: Object storage backend for long-term data retention.

Quick Start

1. Prerequisites

  • Docker and Docker Compose.
  • uv (for running the example app).

2. Environment Setup

Copy the example environment file and adjust if necessary:

cp .env.example .env

2.1 MinIO Setup (Optional)

Follow the MinIO setup instructions below if you want to use MinIO for local development.

3. Start the Stack

docker-compose up -d

This starts Loki, Tempo, Mimir, Grafana, the OTel Collector, and a local MinIO instance.

4. Run the Example Application

cd example/fastapi-app
uv sync
uv run python main.py

Trigger some data by visiting http://localhost:8000/process.
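A single request produces very little telemetry, so it can help to call the endpoint repeatedly. The following is a minimal sketch using only the Python standard library; the helper names `fetch_status` and `generate_traffic` are hypothetical and not part of the example app.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the app is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as exc:
        return exc.code  # 4xx/5xx responses still count as traffic
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def generate_traffic(url: str, count: int, fetch=fetch_status) -> dict[int, int]:
    """Call `url` `count` times and tally how often each status code was seen."""
    tally: dict[int, int] = {}
    for _ in range(count):
        status = fetch(url)
        tally[status] = tally.get(status, 0) + 1
    return tally


# Example (requires the example app to be running on port 8000):
# print(generate_traffic("http://localhost:8000/process", 20))
```

The `fetch` parameter is injectable so the tally logic can be exercised without a running server.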

Configuration: MinIO vs. AWS S3

The stack is currently configured to use MinIO for local development.

Using Local MinIO (Default)

In your .env file:

S3_ENDPOINT=host.docker.internal:9000
S3_INSECURE=true
S3_FORCE_PATH_STYLE=true
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin

Run docker compose up -d from the example/minio directory to start the MinIO instance.

Once started, MinIO is available at http://localhost:9000 with the default credentials minioadmin/minioadmin.

In the MinIO dashboard, create the buckets loki-logs, tempo-traces, and mimir-metrics.

Using AWS S3

To switch to production AWS S3:

  1. Update .env:
    • S3_ENDPOINT: s3.us-east-1.amazonaws.com (or your region's endpoint).
    • S3_INSECURE: false.
    • S3_FORCE_PATH_STYLE: false.
    • AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY: Your AWS credentials.
  2. Ensure the buckets (loki-logs, tempo-traces, mimir-metrics) exist in your AWS account or update the bucket name variables in .env.
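To make the difference between the two configurations concrete, here is a small sketch that parses `.env`-style text and derives the URL an S3 client would end up using. It assumes that `S3_INSECURE=true` means plain HTTP and anything else means HTTPS; the helper names `parse_env` and `s3_endpoint_url` are hypothetical.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def s3_endpoint_url(env: dict[str, str]) -> str:
    """Build the endpoint URL; assumes S3_INSECURE=true selects http, otherwise https."""
    insecure = env.get("S3_INSECURE", "false").lower() == "true"
    scheme = "http" if insecure else "https"
    return f"{scheme}://{env['S3_ENDPOINT']}"


MINIO_ENV = """
S3_ENDPOINT=host.docker.internal:9000
S3_INSECURE=true
S3_FORCE_PATH_STYLE=true
"""

AWS_ENV = """
S3_ENDPOINT=s3.us-east-1.amazonaws.com
S3_INSECURE=false
S3_FORCE_PATH_STYLE=false
"""

print(s3_endpoint_url(parse_env(MINIO_ENV)))  # http://host.docker.internal:9000
print(s3_endpoint_url(parse_env(AWS_ENV)))    # https://s3.us-east-1.amazonaws.com
```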

Mimir Metrics & Dashboards

Use the following table to set up your primary observability dashboard. These metrics are exported by the FastAPI application.

| Panel Name | Visualization | Query (PromQL) | Description |
| --- | --- | --- | --- |
| Total Request Rate | Time series | `sum(rate(http_requests_total[$__rate_interval])) by (http_target)` | Real-time traffic per endpoint (requests/sec). |
| Error Rate (%) | Stat | `100 * sum(rate(http_errors_total[$__range])) / sum(rate(http_requests_total[$__range]))` | Percentage of requests resulting in 4xx/5xx errors over the selected time range. |
| P95 Latency | Time series | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))` | 95th percentile response time across all endpoints. |
| Active Requests | Gauge | `sum(http_server_active_requests)` | Number of concurrent requests being processed. |
| Errors by Endpoint | Bar chart | `sum(increase(http_errors_total[$__range])) by (http_target)` | Total errors grouped by path over the selected time range. |
| Top 5 Slowest Paths | Table | `topk(5, sum(rate(http_request_duration_seconds_sum[$__range])) by (http_target) / sum(rate(http_request_duration_seconds_count[$__range])) by (http_target))` | Endpoints with the highest average latency. |
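The P95 Latency query relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside cumulative buckets. The sketch below mirrors that idea in plain Python to show where the number comes from. It is a simplified illustration with made-up bucket counts, not Prometheus or Mimir source code.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` follows the `le` convention: each count includes all
    observations at or below the bound, and the last bound is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the requested observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into +Inf
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical data: 40 requests <= 0.1 s, 90 <= 0.5 s, all 100 <= 1.0 s.
example = [(0.1, 40), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(round(histogram_quantile(0.95, example), 3))  # 0.75
print(round(histogram_quantile(0.50, example), 3))  # 0.18
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.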

How to Add a Panel

  1. Click + Add in the top right of your dashboard -> Visualization.
  2. Select Mimir as the data source.
  3. Paste the Query from the table above.
  4. Set the Title to the Panel Name.
  5. Select the Visualization type from the right sidebar.
  6. Click Save or Apply.

Loki Logs & Analysis

Loki allows you to query logs using LogQL. The stack is configured to automatically label logs with metadata like service_name and deployment_environment.

Key Queries

| Panel Name | Visualization | Query (LogQL) | Description |
| --- | --- | --- | --- |
| Application Logs | Logs | `{service_name="fastapi-service"}` | Live stream of all logs from the FastAPI app. |
| Error Log Stream | Logs | `{service_name="fastapi-service"} \|= "error"` | Filtered stream showing only lines containing "error" (case-sensitive; use a `(?i)` regex filter for case-insensitive matching). |
| Log Volume | Time series | `count_over_time({service_name="fastapi-service"}[$__interval])` | Number of log lines produced per interval (works well with the bar draw style). |
| Severity Distribution | Pie chart | `sum by (level) (count_over_time({service_name="fastapi-service"}[$__range]))` | Breakdown of log levels (INFO, ERROR, WARN) for the selected time range. |
| Error Frequency | Time series | `count_over_time({service_name="fastapi-service"} \|= "error" [$__interval])` | Number of log lines containing "error" per interval. |
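The `|=` line filter is a case-sensitive substring match, while a regex filter such as `|~ "(?i)error"` ignores case. The sketch below imitates both filters and a simplified `count_over_time` in plain Python. It uses fixed, non-overlapping intervals and hypothetical log lines, so treat it as an illustration of the semantics rather than of Loki's implementation.

```python
import re
from collections import Counter


def count_matching(lines, interval_s, predicate):
    """Count (timestamp_s, text) log lines per fixed interval if `predicate` matches."""
    counts = Counter()
    for ts, text in lines:
        if predicate(text):
            bucket_start = int(ts // interval_s) * interval_s
            counts[bucket_start] += 1
    return dict(counts)


def contains_error(text: str) -> bool:
    """Like |= "error": case-sensitive substring match."""
    return "error" in text


def contains_error_ci(text: str) -> bool:
    """Like |~ "(?i)error": case-insensitive regex match."""
    return re.search(r"(?i)error", text) is not None


# Hypothetical log lines as (seconds since start, message).
LOGS = [
    (0, "INFO request started"),
    (10, "ERROR database unavailable"),
    (70, "error: upstream timeout"),
    (80, "Error while retrying"),
]

print(count_matching(LOGS, 60, contains_error))     # {60: 1}
print(count_matching(LOGS, 60, contains_error_ci))  # {0: 1, 60: 2}
```

If the application logs levels in upper case, a case-sensitive filter on "error" misses every `ERROR` line, so filtering on the `level` label or using the case-insensitive form may be the safer choice.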

How to Add a Log Panel

  1. Click + Add -> Visualization.
  2. Select Loki as the data source.
  3. Paste one of the Queries above.
  4. Select the matching Visualization type from the right sidebar.

Trace Correlation (Loki -> Tempo)

When viewing logs in the Explore tab or a Logs panel:

  1. Click on a log line to expand it.
  2. Look for the trace_id field.
  3. Click the Tempo button next to the ID to instantly see the full distributed trace for that specific log entry.
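Correlation of this kind depends on the trace ID being recoverable from each log entry. As a rough sketch of that extraction step, the following pulls a 32-character hexadecimal trace ID (the W3C Trace Context format) out of a log line. The log layouts and the `extract_trace_id` helper are assumptions for illustration; the actual format emitted by the example app may differ.

```python
import re
from typing import Optional

# Matches trace_id=<id>, trace_id: <id>, or "trace_id": "<id>" (assumed layouts).
TRACE_ID_PATTERN = re.compile(r'trace_id"?\s*[=:]\s*"?([0-9a-f]{32})')


def extract_trace_id(line: str) -> Optional[str]:
    """Return the first trace ID found in a log line, or None if there is none."""
    match = TRACE_ID_PATTERN.search(line)
    return match.group(1) if match else None


logfmt_line = 'level=info trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="done"'
json_line = '{"level": "info", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}'

print(extract_trace_id(logfmt_line))  # 4bf92f3577b34da6a3ce929d0e0e4736
print(extract_trace_id(json_line))    # 4bf92f3577b34da6a3ce929d0e0e4736
print(extract_trace_id("level=info msg=no-trace"))  # None
```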

Advanced Observability Patterns

Beyond the basic panels, the following queries cover common patterns such as RED metrics, latency heatmaps, and an Apdex score:

| Pattern / Metric | Visualization | Query | Description |
| --- | --- | --- | --- |
| RED: Rate | Time series | `sum(rate(http_requests_total[$__rate_interval]))` | Request rate per second. |
| RED: Errors | Time series | `sum(rate(http_errors_total[$__rate_interval]))` | Error rate per second. |
| RED: Duration | Time series | `histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))` | 90th percentile response time. |
| Latency Heatmap | Heatmap | `sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le)` | Visual distribution of latency buckets. |
| Log Severity | Time series / Bar gauge | `sum by (level) (count_over_time({service_name="fastapi-service"}[$__range]))` | Monitor log health by severity over time. |
| Apdex Score | Stat | `(sum(rate(http_request_duration_seconds_bucket{le="0.5"}[$__range])) + sum(rate(http_request_duration_seconds_bucket{le="1.0"}[$__range]))) / 2 / sum(rate(http_request_duration_seconds_count[$__range]))` | Single score (0-1) for user satisfaction, with 0.5 s as the satisfied threshold and 1.0 s as the tolerated threshold. |
| Resource Grouping | Time series | `sum(rate(http_requests_total[$__rate_interval])) by (service_version, deployment_environment)` | Compare performance across versions/environments. |
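The Apdex calculation is sensitive to operator precedence: in PromQL, as in most languages, division binds tighter than addition, so the two bucket rates have to be summed inside their own parentheses before dividing by 2. The sketch below works through the arithmetic with hypothetical request counts. Histogram buckets are cumulative, so the `le="1.0"` count already includes the `le="0.5"` requests.

```python
def apdex(satisfied_le: float, tolerated_le: float, total: float) -> float:
    """Apdex from cumulative bucket counts.

    satisfied_le: requests at or below the satisfied threshold (le="0.5").
    tolerated_le: requests at or below the tolerated threshold (le="1.0"),
                  which includes the satisfied ones.
    (s + t) / 2 equals s + (t - s) / 2, i.e. satisfied + half of the tolerating.
    """
    return (satisfied_le + tolerated_le) / 2 / total


def apdex_wrong_precedence(satisfied_le: float, tolerated_le: float, total: float) -> float:
    """What the formula yields when only the second term is divided by 2."""
    return (satisfied_le + tolerated_le / 2) / total


# Hypothetical window: 100 requests, 70 within 0.5 s, 90 within 1.0 s.
print(apdex(70, 90, 100))                   # 0.8
print(apdex_wrong_precedence(70, 90, 100))  # 1.15, outside the valid 0-1 range
```

A score above 1 in a Stat panel is a quick sign that the parentheses in the query are misplaced.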

Tip

Update the fastapi-service service name to your application name.

  • Dynamic Time Ranges: Instead of hardcoding [5m], use Grafana global variables:
    • [$__range]: Adjusts to the exact time period selected in the dashboard picker (e.g., Last 1 hour). Use this for total counts (with increase()) or "Stat" panels.
    • [$__rate_interval]: Automatically calculates the best interval for rate() based on the graph's time range and resolution. Use this for Time series graphs.
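For intuition, Grafana documents `$__rate_interval` as the larger of `$__interval` plus the scrape interval and four times the scrape interval. The sketch below reproduces that rule; the 15-second default is an assumption and should match the scrape interval configured on your data source.

```python
def rate_interval(interval_s: float, scrape_interval_s: float = 15.0) -> float:
    """Approximate $__rate_interval in seconds.

    interval_s: Grafana's $__interval for the panel (depends on zoom and width).
    scrape_interval_s: scrape interval set on the data source (assumed 15 s here).
    """
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: the window never drops below four scrape intervals.
print(rate_interval(15))   # 60.0
# Zoomed out: the window grows with the panel's step.
print(rate_interval(120))  # 135.0
```

The lower bound of four scrape intervals keeps enough samples inside the window for `rate()` to return data even when the panel is zoomed in.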

Debugging Tips

  • Unhealthy Ring: If Mimir/Loki report ring issues, ensure replication_factor is set to 1 in the YAML configs for single-node setups.
  • Log Ingestion: Check the OTel Collector logs (docker logs otel-collector) to see if data is being received and exported correctly.
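  • S3 Connectivity: Ensure the S3 endpoint is reachable from within the Docker containers. With Docker Desktop (macOS, Windows), host.docker.internal resolves to the host, so containers reach MinIO on the host's port 9000. On Linux, the name may need to be mapped explicitly, for example with an extra_hosts entry pointing to host-gateway.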
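A quick way to narrow down connectivity problems is to test whether a TCP connection to the endpoint can be opened at all. The following is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper, and it has to run somewhere that shares the containers' view of the network (for example, a Python-capable container on the same Docker network) to be meaningful.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


# Example usage (values taken from the default MinIO configuration):
# print(can_connect("host.docker.internal", 9000))
```

A successful connection only shows that the port is open; wrong credentials or missing buckets would still surface in the Loki, Tempo, or Mimir logs.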

About

An observability stack based on OpenTelemetry (OTel) and the Grafana "LGTM" suite, with an example app provided to demonstrate orchestration.
