A simple Python-based microservice project demonstrating how to instrument a simple deployment and use telemetry to visualize data from traces, profiling, and derived metrics, using OpenTelemetry, Grafana Tempo, Mimir, and Pyroscope.
The example includes a FastAPI-based client that triggers telemetry traffic and a FastAPI server that calls both a mock LLM backend (simulating inference with latency and failures) and the OpenAI API to verify or enrich results.
This forms a tracing chain as follows:
Client ➜ Server ➜ Mock LLM ➜ OpenAI
```mermaid
flowchart LR
    subgraph Application[Client & Server Apps]
        C1[Client App<br/>OTEL SDK + Pyroscope]
        S1[Server App<br/>OTEL SDK + Pyroscope]
        LLM[Mock LLM Backend<br/>Traceloop SDK]
        OAI[External OpenAI API]
        C1 -->|HTTP /calculate| S1
        S1 -->|POST| LLM
        S1 -->|API call| OAI
    end

    subgraph TelemetryStack[Telemetry Pipeline]
        A1[Grafana Alloy<br/>OTEL Collector]
        T1[Grafana Tempo<br/>Trace Storage]
        M1[Metrics Generator<br/>Service Graphs, Spanmetrics]
        M2[Grafana Mimir<br/>Metric Storage]
        P1[Pyroscope Server]
        C1 -->|OTLP| A1
        S1 -->|OTLP| A1
        LLM -->|OTLP| A1
        A1 -->|Traces| T1
        T1 -->|Span Metrics| M1
        M1 -->|Remote Write| M2
        C1 -->|Profiling| P1
        S1 -->|Profiling| P1
    end

    subgraph UI[Grafana UI]
        G1[Dashboards & RED Metrics]
        G2[Pyroscope Profiling View]
        M2 --> G1
        P1 --> G2
    end
```
You must create a `.env` file at the root of the project with your OpenAI credentials:

```
OPENAI_API_KEY=your-api-key-here
```
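The applications can then read this key from the environment at startup. A minimal sketch using `python-dotenv` is shown below; using that library here is an illustrative assumption, and any mechanism that exports `OPENAI_API_KEY` into the process environment (such as docker-compose `env_file`) works the same way.

```python
# config_sketch.py - minimal sketch of loading the OpenAI key from .env.
# Using python-dotenv is an illustrative assumption, not necessarily what the project does.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set; create a .env file at the project root")
```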
To get started, use the provided `Makefile` targets. Run `make` to see available options:
```
➜ make
Usage:
  make <target>

Targets:
  help   Show this help
  up     docker-compose up
  stop   docker-compose stop
  ps     docker-compose ps
  logs   docker-compose logs
```
Start the stack:

```
make up
```
Then access:
- Grafana: http://localhost:3000 — no login required
Stop the stack:

```
make stop
```
This PoC sets up a self-contained observability stack that includes Grafana Alloy, Tempo, Mimir, Pyroscope, and Grafana. Each component plays a specific role in the observability pipeline:
Grafana Alloy serves as the collector and forwarder for telemetry data:
- Receives traces from the instrumented Python FastAPI client and server via the OTLP gRPC exporter.
- Forwards spans to Tempo for storage and processing.
- Can be extended to send logs or metrics to other backends (not used here yet).
Alloy simplifies ingestion, handles batching and retries, and offers a central point for configuring telemetry pipelines.
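On the application side, pointing a FastAPI service's traces at Alloy over OTLP gRPC typically looks like the sketch below. The service name and the `alloy:4317` endpoint are assumptions based on this compose-style setup, not verbatim project code.

```python
# tracing_sketch.py - minimal sketch of exporting FastAPI traces to Grafana Alloy via OTLP gRPC.
# The service name and endpoint are illustrative assumptions.
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "simple.python.server"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://alloy:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # auto-creates server spans and propagates trace context
```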
Mimir acts as the Prometheus-compatible metrics backend. In this setup, it stores metrics generated by Tempo's `metrics_generator`, such as:

- `traces_spanmetrics_calls_total` — counts, durations, and errors derived from spans.
- `traces_service_graph_request_total` — service graph edge counts.

These are written to Mimir via `remote_write` and queried from Grafana to build RED dashboards.
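Once these series land in Mimir, they can be queried through its Prometheus-compatible HTTP API like any other metric. The sketch below issues a per-service rate query over `traces_spanmetrics_calls_total`; the Mimir URL, the `/prometheus` prefix, and the tenant header value are assumptions that depend on how Mimir is exposed in the compose file.

```python
# red_query_sketch.py - minimal sketch of querying span-derived RED metrics from Mimir.
# The base URL, "/prometheus" prefix, and tenant header value are illustrative assumptions.
import requests

MIMIR_QUERY_URL = "http://localhost:9009/prometheus/api/v1/query"

# Request rate per service over the last 5 minutes, derived purely from spans.
promql = "sum by (service) (rate(traces_spanmetrics_calls_total[5m]))"

resp = requests.get(
    MIMIR_QUERY_URL,
    params={"query": promql},
    headers={"X-Scope-OrgID": "anonymous"},  # Mimir tenancy header; value is an assumption
    timeout=5,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("service"), series["value"][1])
```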
Tempo is the distributed tracing backend:
- Stores and indexes spans received from Alloy.
- Includes a metrics generator that turns traces into Prometheus-style metrics.
- Pushes those metrics to Mimir, enabling alerting and RED dashboarding based on tracing data alone.
This means you can get observability without changing your code to emit Prometheus metrics directly.
Pyroscope continuously captures CPU profiles of both client and server via native Python integration. This enables:
- High-resolution profiling of requests.
- Correlation of heavy code paths with spans and trace IDs.
Profiles are visualized in Grafana using the Pyroscope plugin.
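Hooking an application into Pyroscope from Python is a one-time call at startup, roughly as sketched below. The application name, server address, and tags are assumptions chosen to mirror the service names used elsewhere in this demo.

```python
# profiling_sketch.py - minimal sketch of enabling continuous CPU profiling with the pyroscope SDK.
# Application name, server address, and tags are illustrative assumptions.
import pyroscope

pyroscope.configure(
    application_name="simple.python.server",  # shows up as the service in Grafana's Pyroscope view
    server_address="http://pyroscope:4040",   # assumed address of the Pyroscope container
    tags={"env": "demo"},
)
```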
Grafana is the single pane of glass for:
- Viewing traces via the Tempo data source.
- Exploring CPU profiles via the Pyroscope plugin.
- Building dashboards using Mimir metrics.
- Viewing service graphs with RED indicators based on span metrics.
Note: Grafana runs without authentication in this demo and is available on port `3000`.
Traces flow through Alloy → Tempo → Mimir and Grafana, enabling observability with minimal setup.
This PoC includes continuous profiling using Pyroscope, integrated into both the client and server Python applications. Pyroscope is connected via the `pyroscope` Python SDK, and profiling data is visualized through Grafana.

The profiling mode used is `process_cpu`, capturing CPU usage over time. This is useful for identifying bottlenecks, high CPU usage paths, and inefficiencies in application execution.
In this view, we see the client's CPU usage over time while making repeated requests to the server. The flame graph shows CPU consumption broken down by function calls. Notable highlights:
- The largest CPU-consuming function is the `ping_server` method, which issues a health check (a rough sketch follows below).
- The `requests` library stack is visible under `requests/api.py:get` and `requests/sessions.py:request`, showing the HTTP GET flow.
- Sampling and gRPC exporter spans from the OTLP SDK are visible, confirming OpenTelemetry overhead is being profiled as well.
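Such a health check would amount to a blocking `requests` call along these lines; the server URL and endpoint path are assumptions, not the actual client code.

```python
# ping_sketch.py - rough sketch of a ping_server-style health check; URL and path are assumptions.
import requests

SERVER_URL = "http://server:8000"


def ping_server() -> bool:
    """Blocking GET that shows up under requests/api.py:get in the CPU profile."""
    try:
        return requests.get(f"{SERVER_URL}/health", timeout=2).ok
    except requests.RequestException:
        return False
```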
A variation of the previous graph showing CPU load concentrated during a call to `call_test_openai`. This provides insight into how much processing is involved when tracing spans across an external dependency like the OpenAI API.
This profile shows the FastAPI server under typical request load. Key insights:
- The `uvicorn/server.py` event loop is responsible for the majority of processing time.
- The `GET /test-openai` endpoint leads into the `openai` Python package, showing how external synchronous calls are processed (a minimal sketch of such an endpoint follows below).
- Sampling and export spans indicate OpenTelemetry export activity.
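A handler shaped like the sketch below would produce exactly this profile, with the synchronous `openai` client call dominating the endpoint's time. The model name, prompt, and client setup are assumptions for illustration, not the project's actual handler.

```python
# openai_endpoint_sketch.py - rough sketch of a /test-openai style endpoint; model and prompt are assumptions.
import os

from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


@app.get("/test-openai")
def test_openai():
    # Synchronous call: the request blocks here, which is why the openai package
    # dominates this endpoint's time in the CPU profile.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Reply with the word 'pong'."}],
    )
    return {"reply": completion.choices[0].message.content}
```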
This snapshot shows a `GET /calculate` request profile. You can observe time spent parsing the prompt, generating the request to OpenAI, and exporting traces via gRPC.
The final flame graph provides a broader sample window and highlights multiple overlapping requests. Notable here is the CPU consumption across async frameworks (`asyncio`, `uvicorn`) and OpenTelemetry's exporter.
This view compares two different CPU profiles collected from the `simple.python.server` service.
- Left panel (purple): Baseline time range
- Right panel (blue): Comparison time range
In the flame graph below:
- Green represents functions where CPU time decreased in the comparison.
- Red represents functions where CPU time increased.
- Gray sections are unchanged.
Notable observations:
- `main.py` and the `uvicorn` server event loop show significant increases in CPU time during the comparison window.
- Async I/O operations such as `run_once`, `run_forever`, and `worker_thread` are prominently red, indicating increased activity—likely due to a spike in traffic or more expensive operations during that timeframe.
- The right-side trace export and OpenAI blocks are still present but mostly unchanged (gray).
This visualization is extremely useful for identifying regressions, performance issues, or improvements introduced by recent code or workload changes.
This PoC includes OpenTelemetry tracing with both client and server instrumented in Python. Traces are exported via OTLP to Grafana Alloy, which then forwards them to Grafana Tempo for storage and Grafana Mimir for generating RED metrics and service graphs.
- Automatic trace context propagation between services using OTLP
- Error-aware spans using `StatusCode.ERROR` for clear failure visibility (see the sketch after this list)
- Real-time visualizations of service dependencies and latencies via Tempo
- RED metrics (Rate, Error, Duration) derived from spans using Tempo's metrics generator
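As a concrete illustration of the error-aware spans mentioned above, marking a failure usually looks like the sketch below. The span name and the `error.message` attribute mirror values that appear in the trace views later in this document, while the wrapper function and its `do_call` argument are hypothetical.

```python
# error_span_sketch.py - minimal sketch of an error-aware span; the wrapped operation is hypothetical.
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("simple.python.client")


def call_llm_backend(do_call) -> str:
    with tracer.start_as_current_span("client_request.call_fast") as span:
        try:
            return do_call()
        except Exception as exc:
            span.record_exception(exc)                  # attaches the stack trace as a span event
            span.set_attribute("error.message", str(exc))
            span.set_status(StatusCode.ERROR)           # surfaces the span as failed in Tempo/Grafana
            raise
```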
This dashboard gives an overview of trace activity over time. The top panel displays span rate, error rate, and latency distribution across all spans. The table below lists recent traces initiated by `simple.python.client`, including their duration and associated operations.
This view is helpful for:
- Detecting spikes in latency or error rate
- Investigating outlier requests
- Quickly accessing individual traces for analysis
The Service Structure view shows how individual spans are structured across services. This aggregated representation helps you understand:
- Time distribution across components
- Downstream dependencies
- Costly or slow operations within the trace
In this example, two representative trace groups are shown:
- `client_request.call_fast` traces with a total duration of ~3.18s
- `client_request.call_slow` traces with a duration of ~3.69s

Each call flows through:

- `server-api` handling `/calculate`
- `mock-llm-backend` POSTing to `/v1/chat/completions`
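A rough sketch of the middle hop in this chain, the `server-api` handling `/calculate` and forwarding to the mock LLM's OpenAI-compatible endpoint, is shown below. The mock backend's hostname, port, and payload fields are assumptions for illustration.

```python
# calculate_sketch.py - rough sketch of the server-api -> mock-llm-backend hop; host, port, and payload are assumptions.
import requests
from fastapi import FastAPI

app = FastAPI()
MOCK_LLM_URL = "http://mock-llm-backend:8080/v1/chat/completions"


@app.get("/calculate")
def calculate(prompt: str = "2+2"):
    # With FastAPI and requests instrumentation enabled, this outbound POST becomes
    # a child span of the incoming /calculate server span, producing the chain shown above.
    resp = requests.post(
        MOCK_LLM_URL,
        json={"model": "mock-model", "messages": [{"role": "user", "content": prompt}]},
        timeout=10,
    )
    resp.raise_for_status()
    return {"result": resp.json()["choices"][0]["message"]["content"]}
```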
The Breakdown tab in Grafana shows the RED metrics derived from spans, grouped by `service.name`. Here you can:
- Monitor request rate trends
- Identify services with errors (in red)
- Track latency distribution visually
This panel helps spot regressions or identify underperforming services quickly.
This panel combines RED metrics and a Tempo-generated service graph.
- The top table lists span names with their rate, error rate, and p90 latency.
- The bottom graph visualizes service dependencies, highlighting successful (green) and failed (red) calls.
In this example, calls to the mock LLM and OpenAI are clearly visible, including failed spans marked in red.
This graph shows the full execution flow for a selected trace (`client_request.call_fast`). The Node Graph visualizes span relationships, including timing annotations like self-time and percentage of total trace time.
The Timeline view below shows:
- Each span's duration and offset
- Specific failure points, such as simulated LLM failures
- Span attributes like `error.message`, `service.name`, and host info
This detailed breakdown helps debug performance bottlenecks or trace specific failures across services.