
[Feature Request]: Add OpenLLMetry Integration for Observability #175

@crivetimihai

Description


Epic: OpenLLMetry Integration for LLM Observability

Goal: Integrate OpenLLMetry to provide OpenTelemetry-based observability for MCP Gateway's LLM operations, enabling comprehensive tracing, metrics, and monitoring across all tools, prompts, and resources.

Why now: As MCP Gateway scales to production deployments, we need standardized observability that integrates with existing enterprise monitoring stacks. OpenLLMetry provides vendor-neutral, OpenTelemetry-native instrumentation specifically designed for LLM applications.

Related: #727

Type of Feature

  • New Observability Integration
  • Plugin Framework Extension
  • Security Enhancement
  • Performance Optimization

User Story 1: Platform Administrator Observability

As a: Platform Administrator managing MCP Gateway in production

I want: Comprehensive visibility into LLM operations including token usage, costs, and performance metrics

So that: I can optimize resource usage, control costs, and ensure SLA compliance across all MCP servers and tools

Acceptance Criteria

Given I have OpenLLMetry plugin configured and enabled
When tools are invoked through MCP Gateway
Then I should see traces with:
  - Tool name, duration, and status
  - Token usage (prompt, completion, total)
  - Cost calculations based on configured pricing
  - User and tenant attribution
  - Server and gateway identifiers

Given I have configured token pricing for different models
When a tool using GPT-4 processes 1000 prompt tokens and 500 completion tokens
Then the cost metric should show:
  - Prompt cost: 1000 * $0.03/1K = $0.03
  - Completion cost: 500 * $0.06/1K = $0.03
  - Total cost: $0.06

Given I have set up alerting rules
When tool response time exceeds 5 seconds
Then an alert should be triggered with trace context for debugging

User Story 2: Developer Workflow Tracing

As a: Developer building applications on MCP Gateway

I want: End-to-end tracing of complex multi-tool workflows with distributed trace correlation

So that: I can debug issues, optimize performance, and understand the flow of data through federated gateways

Acceptance Criteria

Given I have a workflow using multiple tools across federated gateways
When I execute the workflow
Then I should see:
  - A single trace ID connecting all operations
  - Parent-child span relationships
  - Cross-gateway trace propagation via W3C headers
  - Tool input/output in span attributes (sanitized)

Given I'm using the @workflow decorator
When I mark my function with @workflow(name="data_pipeline")
Then OpenLLMetry should:
  - Create a root span for the workflow
  - Automatically instrument child operations
  - Preserve trace context across async calls
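
As a rough illustration of the decorator flow described in this criterion (a sketch only: the pipeline and function names are placeholders, the console exporter is for local development, and the `exporter` override is assumed to be supported by the SDK):

```python
# Hypothetical example of the traceloop-sdk decorators; "data_pipeline" and
# "normalize" are placeholder names, and the console exporter is for local dev only.
import asyncio

from opentelemetry.sdk.trace.export import ConsoleSpanExporter
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task, workflow

Traceloop.init(app_name="mcp-gateway-dev", disable_batch=True, exporter=ConsoleSpanExporter())

@task(name="normalize")
async def normalize(records: list[dict]) -> list[dict]:
    # Child span, automatically parented to the workflow's root span.
    return [{k.lower(): v for k, v in r.items()} for r in records]

@workflow(name="data_pipeline")
async def data_pipeline(records: list[dict]) -> int:
    # Root span for the workflow; trace context survives the await below.
    cleaned = await normalize(records)
    return len(cleaned)

if __name__ == "__main__":
    print(asyncio.run(data_pipeline([{"Name": "a"}, {"Name": "b"}])))
```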

Given I have OpenTelemetry collector configured
When traces are exported
Then they should be available in:
  - Jaeger UI for development
  - Datadog/Honeycomb/New Relic for production
  - With all MCP-specific semantic conventions

Design Sketch

traceloop-sdk Integration

```python
# plugins/openllmetry_observability/openllmetry_plugin.py

from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task, agent, tool
from opentelemetry import trace, metrics
from mcpgateway.plugins.framework.base import Plugin

class OpenLLMetryPlugin(Plugin):
    """OpenLLMetry observability plugin implementing MCP hooks."""
    
    async def on_startup(self) -> None:
        """Initialize OpenLLMetry SDK with MCP-specific configuration."""
        Traceloop.init(
            app_name="mcp-gateway",
            api_endpoint=self._config.config.get("api_endpoint"),
            api_key=self._config.config.get("api_key"),
            disable_batch=False,  # Enable batching for performance
            telemetry_enabled=False,  # Disable anonymous telemetry
            resource_attributes={
                "service.name": "mcp-gateway",
                "service.version": self._config.config.get("gateway_version"),
                "deployment.environment": self._config.config.get("environment")
            }
        )
        
        # Initialize metrics
        self.meter = metrics.get_meter("mcp-gateway.openllmetry")
        self._setup_metrics()
    
    def _setup_metrics(self):
        """Configure MCP-specific metrics."""
        self.tool_invocations = self.meter.create_counter(
            "mcp.tool.invocations",
            description="Number of tool invocations",
            unit="1"
        )
        
        self.tool_duration = self.meter.create_histogram(
            "mcp.tool.duration",
            description="Tool invocation duration",
            unit="ms"
        )
        
        self.token_usage = self.meter.create_counter(
            "mcp.tokens.used",
            description="Token usage across tools",
            unit="tokens"
        )
        
        self.operation_cost = self.meter.create_counter(
            "mcp.cost.total",
            description="Total cost of operations",
            unit="usd"
        )
```
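
The sketch above only covers startup and metric registration; a rough idea of how the tool hooks could feed those instruments is shown below. The `(payload, context)` dict signatures are simplified placeholders, since the real hook types come from the MCP Gateway plugin framework.

```python
# Hypothetical sketch of the tool hooks; the dict-based signatures are placeholders
# for the real MCP Gateway plugin framework payload/result types.
import time

from opentelemetry import metrics, trace

class ToolHookSketch:
    """Illustrative hook logic only, not the actual OpenLLMetryPlugin."""

    def __init__(self) -> None:
        meter = metrics.get_meter("mcp-gateway.openllmetry")
        self.tool_invocations = meter.create_counter("mcp.tool.invocations", unit="1")
        self.tool_duration = meter.create_histogram("mcp.tool.duration", unit="ms")
        self.token_usage = meter.create_counter("mcp.tokens.used", unit="tokens")
        self._tracer = trace.get_tracer("mcp-gateway.openllmetry")

    async def tool_pre_invoke(self, payload: dict, context: dict) -> dict:
        """Open a span before the tool runs and stash timing info on the shared context."""
        span = self._tracer.start_span("mcp.tool.invoke")
        span.set_attribute("mcp.tool.name", payload.get("name", "unknown"))
        context["otel_span"] = span
        context["tool_name"] = payload.get("name", "unknown")
        context["start"] = time.monotonic()
        return payload

    async def tool_post_invoke(self, result: dict, context: dict) -> dict:
        """Record duration and token usage, then close the span."""
        attrs = {"mcp.tool.name": context["tool_name"]}
        self.tool_invocations.add(1, attrs)
        self.tool_duration.record((time.monotonic() - context["start"]) * 1000.0, attrs)
        self.token_usage.add(result.get("usage", {}).get("total_tokens", 0), attrs)
        context["otel_span"].end()
        return result
```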

Trace Context Propagation

```python
# mcpgateway/observability/context.py

from opentelemetry import trace, baggage
from opentelemetry.propagate import inject, extract

class MCPTraceContext:
    """Manage trace context for distributed tracing."""
    
    @staticmethod
    def inject_context(headers: dict) -> dict:
        """Inject W3C trace context into headers for federation."""
        inject(headers)
        return headers
    
    @staticmethod
    def extract_context(headers: dict):
        """Extract trace context from incoming requests."""
        return extract(headers)
    
    @staticmethod
    def add_baggage(key: str, value: str):
        """Add baggage for cross-service context propagation."""
        ctx = baggage.set_baggage(key, value)
        return ctx
```
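
A possible usage sketch for federation calls; the httpx client and peer URL are illustrative assumptions, not prescribed by this epic:

```python
# Hypothetical usage of MCPTraceContext for a federated call; httpx and the
# peer URL are illustrative placeholders.
import httpx
from opentelemetry import trace

from mcpgateway.observability.context import MCPTraceContext

tracer = trace.get_tracer("mcp-gateway.federation")

async def call_peer_gateway(payload: dict) -> dict:
    """Outbound: inject W3C traceparent/tracestate headers before forwarding."""
    with tracer.start_as_current_span("mcp.federation.call"):
        headers = MCPTraceContext.inject_context({})
        async with httpx.AsyncClient() as client:
            resp = await client.post("https://peer-gateway.example/rpc", json=payload, headers=headers)
        return resp.json()

def handle_incoming(headers: dict) -> None:
    """Inbound: the extracted context becomes the parent of spans on this gateway."""
    ctx = MCPTraceContext.extract_context(headers)
    with tracer.start_as_current_span("mcp.tool.invoke", context=ctx):
        ...  # continue the distributed trace here
```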

Semantic Conventions for MCP

```python
# mcpgateway/observability/semantic_conventions.py

class MCPSemanticConventions:
    """MCP-specific semantic conventions extending OpenTelemetry standards."""
    
    # Span names
    TOOL_INVOKE = "mcp.tool.invoke"
    PROMPT_RENDER = "mcp.prompt.render"
    RESOURCE_FETCH = "mcp.resource.fetch"
    FEDERATION_CALL = "mcp.federation.call"
    
    # Attributes
    MCP_TOOL_NAME = "mcp.tool.name"
    MCP_TOOL_NAMESPACE = "mcp.tool.namespace"
    MCP_SERVER_ID = "mcp.server.id"
    MCP_SERVER_TYPE = "mcp.server.type"  # virtual, federated, local
    MCP_GATEWAY_ID = "mcp.gateway.id"
    MCP_TENANT_ID = "mcp.tenant.id"
    MCP_USER_ID = "mcp.user.id"
    MCP_INTEGRATION_TYPE = "mcp.integration.type"  # REST, MCP
    MCP_CACHE_HIT = "mcp.cache.hit"
    MCP_FEDERATION_DEPTH = "mcp.federation.depth"
    
    # LLM-specific (extending OpenLLMetry)
    LLM_PROMPT_TOKENS = "llm.usage.prompt_tokens"
    LLM_COMPLETION_TOKENS = "llm.usage.completion_tokens"
    LLM_TOTAL_TOKENS = "llm.usage.total_tokens"
    LLM_COST_USD = "llm.cost.usd"
    LLM_MODEL = "llm.model"
    LLM_PROVIDER = "llm.provider"
```
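
A minimal sketch of how these conventions would be applied to a tool-invocation span (the attribute values are illustrative):

```python
# Illustrative use of the MCP semantic conventions on a span; values are placeholders.
from opentelemetry import trace

from mcpgateway.observability.semantic_conventions import MCPSemanticConventions as SC

tracer = trace.get_tracer("mcp-gateway")

with tracer.start_as_current_span(SC.TOOL_INVOKE) as span:
    span.set_attribute(SC.MCP_TOOL_NAME, "web_search")
    span.set_attribute(SC.MCP_SERVER_TYPE, "federated")
    span.set_attribute(SC.MCP_TENANT_ID, "tenant-42")
    span.set_attribute(SC.LLM_MODEL, "gpt-4")
    span.set_attribute(SC.LLM_PROMPT_TOKENS, 1000)
    span.set_attribute(SC.LLM_COMPLETION_TOKENS, 500)
    span.set_attribute(SC.LLM_COST_USD, 0.06)
```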

Configuration

New .env Variables

| Variable | Description | Default | Example |
|---|---|---|---|
| `OPENLLMETRY_ENABLED` | Enable OpenLLMetry plugin | `false` | `true` |
| `TRACELOOP_BASE_URL` | Traceloop backend URL (optional) | - | `https://api.traceloop.com` |
| `TRACELOOP_API_KEY` | API key for Traceloop (optional) | - | `tl_xxxx` |
| `TRACELOOP_TELEMETRY` | Enable anonymous telemetry | `false` | `false` |
| `OTEL_SERVICE_NAME` | Service name for traces | `mcp-gateway` | `mcp-gateway-prod` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint | `http://localhost:4317` | `http://otel-collector:4317` |
| `OTEL_EXPORTER_OTLP_HEADERS` | Headers for OTLP exporter | - | `api-key=xxx` |
| `OTEL_RESOURCE_ATTRIBUTES` | Resource attributes | - | `env=prod,region=us-east` |
| `OTEL_TRACES_SAMPLER` | Sampling strategy | `traceidratio` | `traceidratio` |
| `OTEL_TRACES_SAMPLER_ARG` | Sampling rate (0-1) | `1.0` | `0.1` |
| `MCP_TRACE_TOOL_IO` | Include tool I/O in traces | `false` | `true` |
| `MCP_TRACE_SANITIZE_PII` | Sanitize PII in traces | `true` | `true` |
| `MCP_TOKEN_PRICING_CONFIG` | Path to pricing config | - | `/etc/mcp/pricing.yaml` |
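
For reference, a rough manual equivalent of what the OTEL_* variables above configure (a sketch only; in practice Traceloop.init and the OpenTelemetry SDK read most of these variables from the environment automatically):

```python
# Rough manual equivalent of the OTEL_* variables; normally Traceloop.init and
# the OpenTelemetry SDK pick these up from the environment automatically.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

provider = TracerProvider(
    resource=Resource.create({"service.name": os.getenv("OTEL_SERVICE_NAME", "mcp-gateway")}),
    sampler=TraceIdRatioBased(float(os.getenv("OTEL_TRACES_SAMPLER_ARG", "1.0"))),
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"))
    )
)
trace.set_tracer_provider(provider)
```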

Plugin Configuration

```yaml
# plugins/config.yaml
- name: "OpenLLMetryObservability"
  kind: "plugins.openllmetry_observability.openllmetry_plugin.OpenLLMetryPlugin"
  description: "OpenTelemetry-based LLM observability with OpenLLMetry SDK"
  version: "1.0.0"
  author: "MCP Gateway Team"
  hooks:
    - "tool_pre_invoke"
    - "tool_post_invoke"
    - "prompt_pre_fetch"
    - "prompt_post_fetch"
    - "resource_pre_fetch"
    - "resource_post_fetch"
  tags: ["observability", "tracing", "metrics", "opentelemetry", "llm"]
  mode: "permissive"
  priority: 210  # After security plugins
  config:
    # Service identification
    environment: "${DEPLOYMENT_ENV:-production}"
    gateway_version: "${MCPGATEWAY_VERSION:-unknown}"
    
    # Sampling
    sample_rate: ${OTEL_TRACES_SAMPLER_ARG:-1.0}
    
    # Data sanitization
    sanitize_pii: ${MCP_TRACE_SANITIZE_PII:-true}
    include_tool_io: ${MCP_TRACE_TOOL_IO:-false}
    max_attribute_length: 1000
    
    # Token pricing (per 1K tokens)
    pricing:
      default:
        prompt_token_cost: 0.0001
        completion_token_cost: 0.0002
      models:
        "gpt-4":
          prompt_token_cost: 0.03
          completion_token_cost: 0.06
        "gpt-3.5-turbo":
          prompt_token_cost: 0.0015
          completion_token_cost: 0.002
        "claude-3-opus":
          prompt_token_cost: 0.015
          completion_token_cost: 0.075
        "claude-3-sonnet":
          prompt_token_cost: 0.003
          completion_token_cost: 0.015
```
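
A minimal sketch of the cost calculation this pricing table implies (rates are per 1K tokens), matching the GPT-4 numbers in User Story 1; the function and dict names are placeholders:

```python
# Illustrative cost calculation from the pricing table above (per 1K tokens);
# PRICING mirrors the plugin config, and estimate_cost is a placeholder name.
PRICING = {
    "default": {"prompt_token_cost": 0.0001, "completion_token_cost": 0.0002},
    "gpt-4": {"prompt_token_cost": 0.03, "completion_token_cost": 0.06},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    rates = PRICING.get(model, PRICING["default"])
    prompt_cost = prompt_tokens / 1000 * rates["prompt_token_cost"]
    completion_cost = completion_tokens / 1000 * rates["completion_token_cost"]
    return round(prompt_cost + completion_cost, 6)

print(estimate_cost("gpt-4", 1000, 500))  # 0.06 -> $0.03 prompt + $0.03 completion
```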

Test Scenarios

  • Unit Tests: Mock Traceloop SDK, verify span creation and attributes (see the sketch after this list)
  • Integration Tests: Test with real OpenTelemetry collector
  • Federation Tests: Verify trace context propagation across gateways
  • Performance Tests: Ensure < 5% overhead with full instrumentation
  • Sampling Tests: Verify sampling strategies work correctly
  • Cost Calculation Tests: Validate token pricing calculations
  • PII Sanitization Tests: Ensure sensitive data is redacted
  • Backend Integration Tests: Test with Jaeger, Tempo, Datadog
  • Decorator Tests: Verify @workflow, @task decorators function
  • Metric Aggregation Tests: Validate metric calculations and exports
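
For the unit-test scenario above, a hedged sketch using OpenTelemetry's in-memory exporter; it exercises span attributes directly rather than the plugin itself:

```python
# Hypothetical unit-test sketch: capture spans with the in-memory exporter and
# assert on MCP attribute names; the tested values are placeholders.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_tool_span_carries_mcp_attributes():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("test")

    with tracer.start_as_current_span("mcp.tool.invoke") as span:
        span.set_attribute("mcp.tool.name", "web_search")
        span.set_attribute("llm.usage.total_tokens", 1500)

    finished = exporter.get_finished_spans()
    assert finished[0].name == "mcp.tool.invoke"
    assert finished[0].attributes["mcp.tool.name"] == "web_search"
```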

Tasks

| Area | Task | Notes |
|---|---|---|
| Dependencies | Add traceloop-sdk and OpenTelemetry packages | Update pyproject.toml |
| Plugin Development | Create OpenLLMetryPlugin class | Implement all MCP hooks |
| Semantic Conventions | Define MCP-specific conventions | Extend OpenTelemetry standards |
| Context Propagation | Implement W3C trace context | For federation support |
| Metrics | Set up counters and histograms | Token usage, cost, duration |
| Configuration | Add environment variables | Support multiple backends |
| Docker | Create OpenTelemetry collector setup | Include Jaeger for dev |
| Dashboards | Create Grafana dashboard templates | Metrics and trace visualization |
| Documentation | Write setup and usage guides | Include backend examples |
| Testing | Implement comprehensive test suite | Unit, integration, load tests |

Standards Check

  • Follows OpenTelemetry semantic conventions
  • Compatible with W3C Trace Context standard
  • Implements OpenTelemetry metrics API
  • Uses standard OTLP export protocol
  • Follows MCP Gateway plugin framework
  • Respects privacy regulations (PII sanitization)
  • Supports standard observability backends
  • Implements proper error handling and fallbacks
  • Provides configurable sampling strategies
  • Includes comprehensive documentation

Bonus: Advanced Features

LLM-Specific Enhancements

Consider implementing OpenLLMetry's advanced features in Phase 2:

  1. Prompt Template Analytics

    • Track prompt template effectiveness
    • A/B testing support for prompt variations
    • Template performance metrics
  2. Model Comparison Metrics

    • Side-by-side model performance tracking
    • Cost/performance trade-off analysis
    • Model selection recommendations
  3. Anomaly Detection

    • Detect unusual token usage patterns
    • Alert on cost spikes
    • Identify performance degradation
  4. Workflow Optimization

    • Identify redundant tool calls
    • Suggest caching opportunities
    • Recommend parallel execution paths

Integration with AI Observability Platforms

```python
# Future enhancement: Direct integration with AI platforms
class AIObservabilityBridge:
    """Bridge OpenLLMetry data to AI-specific platforms."""
    
    async def export_to_langfuse(self, spans):
        """Export to Langfuse for LLM-specific analysis."""
        pass
    
    async def export_to_helicone(self, spans):
        """Export to Helicone for cost optimization."""
        pass
    
    async def export_to_phoenix(self, spans):
        """Export to Phoenix for evaluation."""
        pass
```

Implementation Timeline

Phase 1: Core Integration (Week 1)

  • Install OpenLLMetry SDK
  • Create base plugin structure
  • Implement tool instrumentation
  • Basic span and metric creation

Phase 2: Advanced Features (Week 2)

  • Semantic conventions implementation
  • Federation trace propagation
  • Cost tracking and calculations
  • PII sanitization

Phase 3: Observability Stack (Week 3)

  • OpenTelemetry collector setup
  • Backend integrations (Jaeger, Datadog, etc.)
  • Grafana dashboards
  • Alerting rules

Phase 4: Production Readiness (Week 4)

  • Performance optimization
  • Comprehensive testing
  • Documentation completion
  • Load testing and tuning

Success Metrics

| Metric | Target | Measurement |
|---|---|---|
| Performance Overhead | < 5% | Load test comparison |
| Trace Completeness | 100% | All operations traced |
| Backend Compatibility | 3+ backends | Integration tests |
| Test Coverage | > 90% | pytest coverage |
| Documentation | Complete | Review checklist |
| PII Leakage | 0 incidents | Security audit |
| Federation Support | Full | Cross-gateway tests |
| Cost Accuracy | ±1% | Billing comparison |
