Epic: OpenLLMetry Integration for LLM Observability
Goal: Integrate OpenLLMetry to provide OpenTelemetry-based observability for MCP Gateway's LLM operations, enabling comprehensive tracing, metrics, and monitoring across all tools, prompts, and resources.
Why now: As MCP Gateway scales to production deployments, we need standardized observability that integrates with existing enterprise monitoring stacks. OpenLLMetry provides vendor-neutral, OpenTelemetry-native instrumentation specifically designed for LLM applications.
Related: #727
Type of Feature
- New Observability Integration
- Plugin Framework Extension
- Security Enhancement
- Performance Optimization
User Story 1: Platform Administrator Observability
As a: Platform Administrator managing MCP Gateway in production
I want: Comprehensive visibility into LLM operations including token usage, costs, and performance metrics
So that: I can optimize resource usage, control costs, and ensure SLA compliance across all MCP servers and tools
Acceptance Criteria
Given I have OpenLLMetry plugin configured and enabled
When tools are invoked through MCP Gateway
Then I should see traces with:
- Tool name, duration, and status
- Token usage (prompt, completion, total)
- Cost calculations based on configured pricing
- User and tenant attribution
- Server and gateway identifiers
Given I have configured token pricing for different models
When a tool using GPT-4 processes 1000 prompt tokens and 500 completion tokens
Then the cost metric should show:
- Prompt cost: 1000 * $0.03/1K = $0.03
- Completion cost: 500 * $0.06/1K = $0.03
- Total cost: $0.06
Given I have set up alerting rules
When tool response time exceeds 5 seconds
Then an alert should be triggered with trace context for debugging
User Story 2: Developer Workflow Tracing
As a: Developer building applications on MCP Gateway
I want: End-to-end tracing of complex multi-tool workflows with distributed trace correlation
So that: I can debug issues, optimize performance, and understand the flow of data through federated gateways
Acceptance Criteria
Given I have a workflow using multiple tools across federated gateways
When I execute the workflow
Then I should see:
- A single trace ID connecting all operations
- Parent-child span relationships
- Cross-gateway trace propagation via W3C headers
- Tool input/output in span attributes (sanitized)
Given I'm using the @workflow decorator
When I mark my function with @workflow(name="data_pipeline")
Then OpenLLMetry should:
- Create a root span for the workflow
- Automatically instrument child operations
- Preserve trace context across async calls
Given I have OpenTelemetry collector configured
When traces are exported
Then they should be available in:
- Jaeger UI for development
- Datadog/Honeycomb/New Relic for production
- With all MCP-specific semantic conventions
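To make the @workflow scenario above concrete, here is a minimal usage sketch of the traceloop-sdk decorators; the function names and query string are illustrative:

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task, workflow

Traceloop.init(app_name="mcp-gateway")


@task(name="fetch_records")
def fetch_records(query: str) -> list:
    # Instrumented as a child span of the enclosing workflow.
    return [f"record for {query}"]


@workflow(name="data_pipeline")
def data_pipeline(query: str) -> list:
    # Root span for the workflow; trace context is preserved across nested calls.
    return fetch_records(query)


data_pipeline("recent incidents")
```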
Design Sketch
traceloop-sdk Integration
```python
# plugins/openllmetry_observability/openllmetry_plugin.py
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task, agent, tool
from opentelemetry import trace, metrics

from mcpgateway.plugins.framework.base import Plugin


class OpenLLMetryPlugin(Plugin):
    """OpenLLMetry observability plugin implementing MCP hooks."""

    async def on_startup(self) -> None:
        """Initialize OpenLLMetry SDK with MCP-specific configuration."""
        Traceloop.init(
            app_name="mcp-gateway",
            api_endpoint=self._config.config.get("api_endpoint"),
            api_key=self._config.config.get("api_key"),
            disable_batch=False,  # Enable batching for performance
            telemetry_enabled=False,  # Disable anonymous telemetry
            resource_attributes={
                "service.name": "mcp-gateway",
                "service.version": self._config.config.get("gateway_version"),
                "deployment.environment": self._config.config.get("environment"),
            },
        )

        # Initialize metrics
        self.meter = metrics.get_meter("mcp-gateway.openllmetry")
        self._setup_metrics()

    def _setup_metrics(self):
        """Configure MCP-specific metrics."""
        self.tool_invocations = self.meter.create_counter(
            "mcp.tool.invocations",
            description="Number of tool invocations",
            unit="1",
        )
        self.tool_duration = self.meter.create_histogram(
            "mcp.tool.duration",
            description="Tool invocation duration",
            unit="ms",
        )
        self.token_usage = self.meter.create_counter(
            "mcp.tokens.used",
            description="Token usage across tools",
            unit="tokens",
        )
        self.operation_cost = self.meter.create_counter(
            "mcp.cost.total",
            description="Total cost of operations",
            unit="usd",
        )
```
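The listing above stops at metric setup; the hook bodies would feed these instruments. As a rough sketch only (the hook signatures and the payload/context field names below are assumptions, not the plugin framework's actual API):

```python
# Hypothetical sketch: hook signatures and payload/context fields are
# assumptions, not the actual MCP Gateway plugin framework API.
import time


class OpenLLMetryPlugin(Plugin):  # continuing the class sketched above
    async def tool_pre_invoke(self, payload, context):
        # Remember when the tool call started so the post hook can time it.
        context.state["otel_start"] = time.monotonic()
        return payload

    async def tool_post_invoke(self, payload, context):
        duration_ms = (time.monotonic() - context.state["otel_start"]) * 1000.0
        attrs = {"mcp.tool.name": payload.name}
        self.tool_invocations.add(1, attrs)            # counter from _setup_metrics
        self.tool_duration.record(duration_ms, attrs)  # histogram from _setup_metrics
        return payload
```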
Trace Context Propagation
```python
# mcpgateway/observability/context.py
from opentelemetry import trace, baggage
from opentelemetry.propagate import inject, extract


class MCPTraceContext:
    """Manage trace context for distributed tracing."""

    @staticmethod
    def inject_context(headers: dict) -> dict:
        """Inject W3C trace context into headers for federation."""
        inject(headers)
        return headers

    @staticmethod
    def extract_context(headers: dict):
        """Extract trace context from incoming requests."""
        return extract(headers)

    @staticmethod
    def add_baggage(key: str, value: str):
        """Add baggage for cross-service context propagation."""
        ctx = baggage.set_baggage(key, value)
        return ctx
```
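A usage sketch for federation calls, assuming an httpx client and a placeholder peer-gateway URL (both illustrative):

```python
import httpx

from mcpgateway.observability.context import MCPTraceContext


async def call_peer_gateway(payload: dict) -> dict:
    # Outbound: inject W3C traceparent/tracestate headers so downstream
    # spans join the same distributed trace.
    headers = MCPTraceContext.inject_context({})
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://peer-gateway.example.com/rpc",  # placeholder URL
            json=payload,
            headers=headers,
        )
    return response.json()


def handle_incoming(headers: dict):
    # Inbound: extract the caller's context and use it as the parent
    # when starting spans for this request.
    return MCPTraceContext.extract_context(headers)
```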
Semantic Conventions for MCP
```python
# mcpgateway/observability/semantic_conventions.py
class MCPSemanticConventions:
    """MCP-specific semantic conventions extending OpenTelemetry standards."""

    # Span names
    TOOL_INVOKE = "mcp.tool.invoke"
    PROMPT_RENDER = "mcp.prompt.render"
    RESOURCE_FETCH = "mcp.resource.fetch"
    FEDERATION_CALL = "mcp.federation.call"

    # Attributes
    MCP_TOOL_NAME = "mcp.tool.name"
    MCP_TOOL_NAMESPACE = "mcp.tool.namespace"
    MCP_SERVER_ID = "mcp.server.id"
    MCP_SERVER_TYPE = "mcp.server.type"  # virtual, federated, local
    MCP_GATEWAY_ID = "mcp.gateway.id"
    MCP_TENANT_ID = "mcp.tenant.id"
    MCP_USER_ID = "mcp.user.id"
    MCP_INTEGRATION_TYPE = "mcp.integration.type"  # REST, MCP
    MCP_CACHE_HIT = "mcp.cache.hit"
    MCP_FEDERATION_DEPTH = "mcp.federation.depth"

    # LLM-specific (extending OpenLLMetry)
    LLM_PROMPT_TOKENS = "llm.usage.prompt_tokens"
    LLM_COMPLETION_TOKENS = "llm.usage.completion_tokens"
    LLM_TOTAL_TOKENS = "llm.usage.total_tokens"
    LLM_COST_USD = "llm.cost.usd"
    LLM_MODEL = "llm.model"
    LLM_PROVIDER = "llm.provider"
```
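These constants would be attached to spans through the standard OpenTelemetry API; for example (the tool name and values below are illustrative):

```python
from opentelemetry import trace

from mcpgateway.observability.semantic_conventions import MCPSemanticConventions

tracer = trace.get_tracer("mcp-gateway")

with tracer.start_as_current_span(MCPSemanticConventions.TOOL_INVOKE) as span:
    span.set_attribute(MCPSemanticConventions.MCP_TOOL_NAME, "weather_lookup")  # illustrative
    span.set_attribute(MCPSemanticConventions.MCP_SERVER_TYPE, "federated")
    span.set_attribute(MCPSemanticConventions.LLM_PROMPT_TOKENS, 1000)
    span.set_attribute(MCPSemanticConventions.LLM_COMPLETION_TOKENS, 500)
```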
Configuration
New .env Variables
| Variable | Description | Default | Example |
|---|---|---|---|
| OPENLLMETRY_ENABLED | Enable OpenLLMetry plugin | false | true |
| TRACELOOP_BASE_URL | Traceloop backend URL (optional) | - | https://api.traceloop.com |
| TRACELOOP_API_KEY | API key for Traceloop (optional) | - | tl_xxxx |
| TRACELOOP_TELEMETRY | Enable anonymous telemetry | false | false |
| OTEL_SERVICE_NAME | Service name for traces | mcp-gateway | mcp-gateway-prod |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | http://localhost:4317 | http://otel-collector:4317 |
| OTEL_EXPORTER_OTLP_HEADERS | Headers for OTLP exporter | - | api-key=xxx |
| OTEL_RESOURCE_ATTRIBUTES | Resource attributes | - | env=prod,region=us-east |
| OTEL_TRACES_SAMPLER | Sampling strategy | traceidratio | traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampling rate (0-1) | 1.0 | 0.1 |
| MCP_TRACE_TOOL_IO | Include tool I/O in traces | false | true |
| MCP_TRACE_SANITIZE_PII | Sanitize PII in traces | true | true |
| MCP_TOKEN_PRICING_CONFIG | Path to pricing config | - | /etc/mcp/pricing.yaml |
Plugin Configuration
```yaml
# plugins/config.yaml
- name: "OpenLLMetryObservability"
  kind: "plugins.openllmetry_observability.openllmetry_plugin.OpenLLMetryPlugin"
  description: "OpenTelemetry-based LLM observability with OpenLLMetry SDK"
  version: "1.0.0"
  author: "MCP Gateway Team"
  hooks:
    - "tool_pre_invoke"
    - "tool_post_invoke"
    - "prompt_pre_fetch"
    - "prompt_post_fetch"
    - "resource_pre_fetch"
    - "resource_post_fetch"
  tags: ["observability", "tracing", "metrics", "opentelemetry", "llm"]
  mode: "permissive"
  priority: 210  # After security plugins
  config:
    # Service identification
    environment: "${DEPLOYMENT_ENV:-production}"
    gateway_version: "${MCPGATEWAY_VERSION:-unknown}"

    # Sampling
    sample_rate: ${OTEL_TRACES_SAMPLER_ARG:-1.0}

    # Data sanitization
    sanitize_pii: ${MCP_TRACE_SANITIZE_PII:-true}
    include_tool_io: ${MCP_TRACE_TOOL_IO:-false}
    max_attribute_length: 1000

    # Token pricing (per 1K tokens)
    pricing:
      default:
        prompt_token_cost: 0.0001
        completion_token_cost: 0.0002
      models:
        "gpt-4":
          prompt_token_cost: 0.03
          completion_token_cost: 0.06
        "gpt-3.5-turbo":
          prompt_token_cost: 0.0015
          completion_token_cost: 0.002
        "claude-3-opus":
          prompt_token_cost: 0.015
          completion_token_cost: 0.075
        "claude-3-sonnet":
          prompt_token_cost: 0.003
          completion_token_cost: 0.015
```
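A minimal sketch of how per-call cost could be derived from this pricing block; the helper below is illustrative rather than an existing gateway function, but it reproduces the GPT-4 figures from User Story 1:

```python
def calculate_cost(pricing: dict, model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute USD cost from per-1K-token rates, falling back to the default entry."""
    rates = pricing.get("models", {}).get(model, pricing["default"])
    prompt_cost = prompt_tokens / 1000 * rates["prompt_token_cost"]
    completion_cost = completion_tokens / 1000 * rates["completion_token_cost"]
    return prompt_cost + completion_cost


# 1000 prompt + 500 completion tokens on gpt-4 -> $0.03 + $0.03 = $0.06
pricing = {
    "default": {"prompt_token_cost": 0.0001, "completion_token_cost": 0.0002},
    "models": {"gpt-4": {"prompt_token_cost": 0.03, "completion_token_cost": 0.06}},
}
assert round(calculate_cost(pricing, "gpt-4", 1000, 500), 6) == 0.06
```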
Test Scenarios
- Unit Tests: Mock Traceloop SDK, verify span creation and attributes
- Integration Tests: Test with real OpenTelemetry collector
- Federation Tests: Verify trace context propagation across gateways
- Performance Tests: Ensure < 5% overhead with full instrumentation
- Sampling Tests: Verify sampling strategies work correctly
- Cost Calculation Tests: Validate token pricing calculations
- PII Sanitization Tests: Ensure sensitive data is redacted
- Backend Integration Tests: Test with Jaeger, Tempo, Datadog
- Decorator Tests: Verify @workflow, @task decorators function
- Metric Aggregation Tests: Validate metric calculations and exports
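As a starting point for the unit tests above, spans can be captured with the OpenTelemetry SDK's in-memory exporter instead of a live collector (the span name and attributes below are illustrative):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter


def test_tool_invoke_span_attributes():
    # Capture finished spans in memory so assertions need no external backend.
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("mcp-gateway.test")

    with tracer.start_as_current_span("mcp.tool.invoke") as span:
        span.set_attribute("mcp.tool.name", "example_tool")

    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    assert spans[0].attributes["mcp.tool.name"] == "example_tool"
```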
Tasks
| Area | Task | Notes |
|---|---|---|
| Dependencies | Add traceloop-sdk and OpenTelemetry packages | Update pyproject.toml |
| Plugin Development | Create OpenLLMetryPlugin class | Implement all MCP hooks |
| Semantic Conventions | Define MCP-specific conventions | Extend OpenTelemetry standards |
| Context Propagation | Implement W3C trace context | For federation support |
| Metrics | Set up counters and histograms | Token usage, cost, duration |
| Configuration | Add environment variables | Support multiple backends |
| Docker | Create OpenTelemetry collector setup | Include Jaeger for dev |
| Dashboards | Create Grafana dashboard templates | Metrics and trace visualization |
| Documentation | Write setup and usage guides | Include backend examples |
| Testing | Implement comprehensive test suite | Unit, integration, load tests |
Standards Check
- Follows OpenTelemetry semantic conventions
- Compatible with W3C Trace Context standard
- Implements OpenTelemetry metrics API
- Uses standard OTLP export protocol
- Follows MCP Gateway plugin framework
- Respects privacy regulations (PII sanitization)
- Supports standard observability backends
- Implements proper error handling and fallbacks
- Provides configurable sampling strategies
- Includes comprehensive documentation
Bonus: Advanced Features
LLM-Specific Enhancements
Consider implementing OpenLLMetry's advanced features in Phase 2:
- Prompt Template Analytics
  - Track prompt template effectiveness
  - A/B testing support for prompt variations
  - Template performance metrics
- Model Comparison Metrics
  - Side-by-side model performance tracking
  - Cost/performance trade-off analysis
  - Model selection recommendations
- Anomaly Detection
  - Detect unusual token usage patterns
  - Alert on cost spikes
  - Identify performance degradation
- Workflow Optimization
  - Identify redundant tool calls
  - Suggest caching opportunities
  - Recommend parallel execution paths
Integration with AI Observability Platforms
```python
# Future enhancement: Direct integration with AI platforms
class AIObservabilityBridge:
    """Bridge OpenLLMetry data to AI-specific platforms."""

    async def export_to_langfuse(self, spans):
        """Export to Langfuse for LLM-specific analysis."""
        pass

    async def export_to_helicone(self, spans):
        """Export to Helicone for cost optimization."""
        pass

    async def export_to_phoenix(self, spans):
        """Export to Phoenix for evaluation."""
        pass
```
Implementation Timeline
Phase 1: Core Integration (Week 1)
- Install OpenLLMetry SDK
- Create base plugin structure
- Implement tool instrumentation
- Basic span and metric creation
Phase 2: Advanced Features (Week 2)
- Semantic conventions implementation
- Federation trace propagation
- Cost tracking and calculations
- PII sanitization
Phase 3: Observability Stack (Week 3)
- OpenTelemetry collector setup
- Backend integrations (Jaeger, Datadog, etc.)
- Grafana dashboards
- Alerting rules
Phase 4: Production Readiness (Week 4)
- Performance optimization
- Comprehensive testing
- Documentation completion
- Load testing and tuning
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Performance Overhead | < 5% | Load test comparison |
| Trace Completeness | 100% | All operations traced |
| Backend Compatibility | 3+ backends | Integration tests |
| Test Coverage | > 90% | pytest coverage |
| Documentation | Complete | Review checklist |
| PII Leakage | 0 incidents | Security audit |
| Federation Support | Full | Cross-gateway tests |
| Cost Accuracy | ±1% | Billing comparison |
Related Issues
- #175 - [Feature Request]: Add OpenLLMetry Integration for Observability (this issue)
- #218 - [Feature Request]: Prometheus Metrics Instrumentation using prometheus-fastapi-instrumentator
- #300 - [Feature Request]: Structured JSON Logging with Correlation IDs
- #319 - [Feature Request]: AI Middleware Integration / Plugin Framework for extensible gateway capabilities
- #727 - [Feature]: Phoenix Observability Integration plugin