Epic: OpenLLMetry Integration for LLM Observability
Goal: Integrate OpenLLMetry to provide OpenTelemetry-based observability for MCP Gateway's LLM operations, enabling comprehensive tracing, metrics, and monitoring across all tools, prompts, and resources.
Why now: As MCP Gateway scales to production deployments, we need standardized observability that integrates with existing enterprise monitoring stacks. OpenLLMetry provides vendor-neutral, OpenTelemetry-native instrumentation specifically designed for LLM applications.
Related: #727
Type of Feature
- New Observability Integration
- Plugin Framework Extension
- Security Enhancement
- Performance Optimization
User Story 1: Platform Administrator Observability
As a: Platform Administrator managing MCP Gateway in production
I want: Comprehensive visibility into LLM operations including token usage, costs, and performance metrics
So that: I can optimize resource usage, control costs, and ensure SLA compliance across all MCP servers and tools
Acceptance Criteria
Given I have OpenLLMetry plugin configured and enabled
When tools are invoked through MCP Gateway
Then I should see traces with:
- Tool name, duration, and status
- Token usage (prompt, completion, total)
- Cost calculations based on configured pricing
- User and tenant attribution
- Server and gateway identifiers
Given I have configured token pricing for different models
When a tool using GPT-4 processes 1000 prompt tokens and 500 completion tokens
Then the cost metric should show:
- Prompt cost: 1000 * $0.03/1K = $0.03
- Completion cost: 500 * $0.06/1K = $0.03
- Total cost: $0.06
Given I have set up alerting rules
When tool response time exceeds 5 seconds
Then an alert should be triggered with trace context for debugging
User Story 2: Developer Workflow Tracing
As a: Developer building applications on MCP Gateway
I want: End-to-end tracing of complex multi-tool workflows with distributed trace correlation
So that: I can debug issues, optimize performance, and understand the flow of data through federated gateways
Acceptance Criteria
Given I have a workflow using multiple tools across federated gateways
When I execute the workflow
Then I should see:
- A single trace ID connecting all operations
- Parent-child span relationships
- Cross-gateway trace propagation via W3C headers
- Tool input/output in span attributes (sanitized)
Given I'm using the @workflow decorator
When I mark my function with @workflow(name="data_pipeline")
Then OpenLLMetry should:
- Create a root span for the workflow
- Automatically instrument child operations
- Preserve trace context across async calls
Given I have OpenTelemetry collector configured
When traces are exported
Then they should be available in:
- Jaeger UI for development
- Datadog/Honeycomb/New Relic for production
- With all MCP-specific semantic conventions
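To make the @workflow scenario above concrete, here is a minimal usage sketch of the traceloop-sdk decorators; the function names and query string are illustrative:

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import task, workflow

Traceloop.init(app_name="mcp-gateway")


@task(name="fetch_records")
def fetch_records(query: str) -> list:
    # Instrumented as a child span of the enclosing workflow.
    return [f"record for {query}"]


@workflow(name="data_pipeline")
def data_pipeline(query: str) -> list:
    # Root span for the workflow; trace context is preserved across nested calls.
    return fetch_records(query)


data_pipeline("recent incidents")
```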
Design Sketch
traceloop-sdk Integration
```python
# plugins/openllmetry_observability/openllmetry_plugin.py
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task, agent, tool
from opentelemetry import trace, metrics

from mcpgateway.plugins.framework.base import Plugin


class OpenLLMetryPlugin(Plugin):
    """OpenLLMetry observability plugin implementing MCP hooks."""

    async def on_startup(self) -> None:
        """Initialize OpenLLMetry SDK with MCP-specific configuration."""
        Traceloop.init(
            app_name="mcp-gateway",
            api_endpoint=self._config.config.get("api_endpoint"),
            api_key=self._config.config.get("api_key"),
            disable_batch=False,  # Enable batching for performance
            telemetry_enabled=False,  # Disable anonymous telemetry
            resource_attributes={
                "service.name": "mcp-gateway",
                "service.version": self._config.config.get("gateway_version"),
                "deployment.environment": self._config.config.get("environment"),
            },
        )

        # Initialize metrics
        self.meter = metrics.get_meter("mcp-gateway.openllmetry")
        self._setup_metrics()

    def _setup_metrics(self):
        """Configure MCP-specific metrics."""
        self.tool_invocations = self.meter.create_counter(
            "mcp.tool.invocations",
            description="Number of tool invocations",
            unit="1",
        )
        self.tool_duration = self.meter.create_histogram(
            "mcp.tool.duration",
            description="Tool invocation duration",
            unit="ms",
        )
        self.token_usage = self.meter.create_counter(
            "mcp.tokens.used",
            description="Token usage across tools",
            unit="tokens",
        )
        self.operation_cost = self.meter.create_counter(
            "mcp.cost.total",
            description="Total cost of operations",
            unit="usd",
        )
```
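The listing above stops at metric setup; the hook bodies would feed these instruments. As a rough sketch only (the hook signatures and the payload/context field names below are assumptions, not the plugin framework's actual API):

```python
# Hypothetical sketch: hook signatures and payload/context fields are
# assumptions, not the actual MCP Gateway plugin framework API.
import time


class OpenLLMetryPlugin(Plugin):  # continuing the class sketched above
    async def tool_pre_invoke(self, payload, context):
        # Remember when the tool call started so the post hook can time it.
        context.state["otel_start"] = time.monotonic()
        return payload

    async def tool_post_invoke(self, payload, context):
        duration_ms = (time.monotonic() - context.state["otel_start"]) * 1000.0
        attrs = {"mcp.tool.name": payload.name}
        self.tool_invocations.add(1, attrs)            # counter from _setup_metrics
        self.tool_duration.record(duration_ms, attrs)  # histogram from _setup_metrics
        return payload
```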
Trace Context Propagation
```python
# mcpgateway/observability/context.py
from opentelemetry import trace, baggage
from opentelemetry.propagate import inject, extract


class MCPTraceContext:
    """Manage trace context for distributed tracing."""

    @staticmethod
    def inject_context(headers: dict) -> dict:
        """Inject W3C trace context into headers for federation."""
        inject(headers)
        return headers

    @staticmethod
    def extract_context(headers: dict):
        """Extract trace context from incoming requests."""
        return extract(headers)

    @staticmethod
    def add_baggage(key: str, value: str):
        """Add baggage for cross-service context propagation."""
        ctx = baggage.set_baggage(key, value)
        return ctx
```
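A usage sketch for federation calls, assuming an httpx client and a placeholder peer-gateway URL (both illustrative):

```python
import httpx

from mcpgateway.observability.context import MCPTraceContext


async def call_peer_gateway(payload: dict) -> dict:
    # Outbound: inject W3C traceparent/tracestate headers so downstream
    # spans join the same distributed trace.
    headers = MCPTraceContext.inject_context({})
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://peer-gateway.example.com/rpc",  # placeholder URL
            json=payload,
            headers=headers,
        )
    return response.json()


def handle_incoming(headers: dict):
    # Inbound: extract the caller's context and use it as the parent
    # when starting spans for this request.
    return MCPTraceContext.extract_context(headers)
```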
Semantic Conventions for MCP
```python
# mcpgateway/observability/semantic_conventions.py
class MCPSemanticConventions:
    """MCP-specific semantic conventions extending OpenTelemetry standards."""

    # Span names
    TOOL_INVOKE = "mcp.tool.invoke"
    PROMPT_RENDER = "mcp.prompt.render"
    RESOURCE_FETCH = "mcp.resource.fetch"
    FEDERATION_CALL = "mcp.federation.call"

    # Attributes
    MCP_TOOL_NAME = "mcp.tool.name"
    MCP_TOOL_NAMESPACE = "mcp.tool.namespace"
    MCP_SERVER_ID = "mcp.server.id"
    MCP_SERVER_TYPE = "mcp.server.type"  # virtual, federated, local
    MCP_GATEWAY_ID = "mcp.gateway.id"
    MCP_TENANT_ID = "mcp.tenant.id"
    MCP_USER_ID = "mcp.user.id"
    MCP_INTEGRATION_TYPE = "mcp.integration.type"  # REST, MCP
    MCP_CACHE_HIT = "mcp.cache.hit"
    MCP_FEDERATION_DEPTH = "mcp.federation.depth"

    # LLM-specific (extending OpenLLMetry)
    LLM_PROMPT_TOKENS = "llm.usage.prompt_tokens"
    LLM_COMPLETION_TOKENS = "llm.usage.completion_tokens"
    LLM_TOTAL_TOKENS = "llm.usage.total_tokens"
    LLM_COST_USD = "llm.cost.usd"
    LLM_MODEL = "llm.model"
    LLM_PROVIDER = "llm.provider"
```
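These constants would be attached to spans through the standard OpenTelemetry API; for example (the tool name and values below are illustrative):

```python
from opentelemetry import trace

from mcpgateway.observability.semantic_conventions import MCPSemanticConventions

tracer = trace.get_tracer("mcp-gateway")

with tracer.start_as_current_span(MCPSemanticConventions.TOOL_INVOKE) as span:
    span.set_attribute(MCPSemanticConventions.MCP_TOOL_NAME, "weather_lookup")  # illustrative
    span.set_attribute(MCPSemanticConventions.MCP_SERVER_TYPE, "federated")
    span.set_attribute(MCPSemanticConventions.LLM_PROMPT_TOKENS, 1000)
    span.set_attribute(MCPSemanticConventions.LLM_COMPLETION_TOKENS, 500)
```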
Configuration
New .env Variables
| Variable | Description | Default | Example |
|---|---|---|---|
| OPENLLMETRY_ENABLED | Enable OpenLLMetry plugin | false | true |
| TRACELOOP_BASE_URL | Traceloop backend URL (optional) | - | https://api.traceloop.com |
| TRACELOOP_API_KEY | API key for Traceloop (optional) | - | tl_xxxx |
| TRACELOOP_TELEMETRY | Enable anonymous telemetry | false | false |
| OTEL_SERVICE_NAME | Service name for traces | mcp-gateway | mcp-gateway-prod |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP collector endpoint | http://localhost:4317 | http://otel-collector:4317 |
| OTEL_EXPORTER_OTLP_HEADERS | Headers for OTLP exporter | - | api-key=xxx |
| OTEL_RESOURCE_ATTRIBUTES | Resource attributes | - | env=prod,region=us-east |
| OTEL_TRACES_SAMPLER | Sampling strategy | traceidratio | traceidratio |
| OTEL_TRACES_SAMPLER_ARG | Sampling rate (0-1) | 1.0 | 0.1 |
| MCP_TRACE_TOOL_IO | Include tool I/O in traces | false | true |
| MCP_TRACE_SANITIZE_PII | Sanitize PII in traces | true | true |
| MCP_TOKEN_PRICING_CONFIG | Path to pricing config | - | /etc/mcp/pricing.yaml |
Plugin Configuration
```yaml
# plugins/config.yaml
- name: "OpenLLMetryObservability"
  kind: "plugins.openllmetry_observability.openllmetry_plugin.OpenLLMetryPlugin"
  description: "OpenTelemetry-based LLM observability with OpenLLMetry SDK"
  version: "1.0.0"
  author: "MCP Gateway Team"
  hooks:
    - "tool_pre_invoke"
    - "tool_post_invoke"
    - "prompt_pre_fetch"
    - "prompt_post_fetch"
    - "resource_pre_fetch"
    - "resource_post_fetch"
  tags: ["observability", "tracing", "metrics", "opentelemetry", "llm"]
  mode: "permissive"
  priority: 210  # After security plugins
  config:
    # Service identification
    environment: "${DEPLOYMENT_ENV:-production}"
    gateway_version: "${MCPGATEWAY_VERSION:-unknown}"

    # Sampling
    sample_rate: ${OTEL_TRACES_SAMPLER_ARG:-1.0}

    # Data sanitization
    sanitize_pii: ${MCP_TRACE_SANITIZE_PII:-true}
    include_tool_io: ${MCP_TRACE_TOOL_IO:-false}
    max_attribute_length: 1000

    # Token pricing (per 1K tokens)
    pricing:
      default:
        prompt_token_cost: 0.0001
        completion_token_cost: 0.0002
      models:
        "gpt-4":
          prompt_token_cost: 0.03
          completion_token_cost: 0.06
        "gpt-3.5-turbo":
          prompt_token_cost: 0.0015
          completion_token_cost: 0.002
        "claude-3-opus":
          prompt_token_cost: 0.015
          completion_token_cost: 0.075
        "claude-3-sonnet":
          prompt_token_cost: 0.003
          completion_token_cost: 0.015
```
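A minimal sketch of how per-call cost could be derived from this pricing block; the helper below is illustrative rather than an existing gateway function, but it reproduces the GPT-4 figures from User Story 1:

```python
def calculate_cost(pricing: dict, model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute USD cost from per-1K-token rates, falling back to the default entry."""
    rates = pricing.get("models", {}).get(model, pricing["default"])
    prompt_cost = prompt_tokens / 1000 * rates["prompt_token_cost"]
    completion_cost = completion_tokens / 1000 * rates["completion_token_cost"]
    return prompt_cost + completion_cost


# 1000 prompt + 500 completion tokens on gpt-4 -> $0.03 + $0.03 = $0.06
pricing = {
    "default": {"prompt_token_cost": 0.0001, "completion_token_cost": 0.0002},
    "models": {"gpt-4": {"prompt_token_cost": 0.03, "completion_token_cost": 0.06}},
}
assert round(calculate_cost(pricing, "gpt-4", 1000, 500), 6) == 0.06
```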
Test Scenarios
- Unit Tests: Mock Traceloop SDK, verify span creation and attributes
- Integration Tests: Test with real OpenTelemetry collector
- Federation Tests: Verify trace context propagation across gateways
- Performance Tests: Ensure < 5% overhead with full instrumentation
- Sampling Tests: Verify sampling strategies work correctly
- Cost Calculation Tests: Validate token pricing calculations
- PII Sanitization Tests: Ensure sensitive data is redacted
- Backend Integration Tests: Test with Jaeger, Tempo, Datadog
- Decorator Tests: Verify @workflow, @task decorators function
- Metric Aggregation Tests: Validate metric calculations and exports
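As a starting point for the unit tests above, spans can be captured with the OpenTelemetry SDK's in-memory exporter instead of a live collector (the span name and attributes below are illustrative):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter


def test_tool_invoke_span_attributes():
    # Capture finished spans in memory so assertions need no external backend.
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("mcp-gateway.test")

    with tracer.start_as_current_span("mcp.tool.invoke") as span:
        span.set_attribute("mcp.tool.name", "example_tool")

    spans = exporter.get_finished_spans()
    assert len(spans) == 1
    assert spans[0].attributes["mcp.tool.name"] == "example_tool"
```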
Tasks
| Area | Task | Notes |
|---|---|---|
| Dependencies | Add traceloop-sdk and OpenTelemetry packages | Update pyproject.toml |
| Plugin Development | Create OpenLLMetryPlugin class | Implement all MCP hooks |
| Semantic Conventions | Define MCP-specific conventions | Extend OpenTelemetry standards |
| Context Propagation | Implement W3C trace context | For federation support |
| Metrics | Set up counters and histograms | Token usage, cost, duration |
| Configuration | Add environment variables | Support multiple backends |
| Docker | Create OpenTelemetry collector setup | Include Jaeger for dev |
| Dashboards | Create Grafana dashboard templates | Metrics and trace visualization |
| Documentation | Write setup and usage guides | Include backend examples |
| Testing | Implement comprehensive test suite | Unit, integration, load tests |
Standards Check
- Follows OpenTelemetry semantic conventions
- Compatible with W3C Trace Context standard
- Implements OpenTelemetry metrics API
- Uses standard OTLP export protocol
- Follows MCP Gateway plugin framework
- Respects privacy regulations (PII sanitization)
- Supports standard observability backends
- Implements proper error handling and fallbacks
- Provides configurable sampling strategies
- Includes comprehensive documentation
Bonus: Advanced Features
LLM-Specific Enhancements
Consider implementing OpenLLMetry's advanced features in Phase 2:
- Prompt Template Analytics
  - Track prompt template effectiveness
  - A/B testing support for prompt variations
  - Template performance metrics
- Model Comparison Metrics
  - Side-by-side model performance tracking
  - Cost/performance trade-off analysis
  - Model selection recommendations
- Anomaly Detection
  - Detect unusual token usage patterns
  - Alert on cost spikes
  - Identify performance degradation
- Workflow Optimization
  - Identify redundant tool calls
  - Suggest caching opportunities
  - Recommend parallel execution paths
Integration with AI Observability Platforms
```python
# Future enhancement: Direct integration with AI platforms
class AIObservabilityBridge:
    """Bridge OpenLLMetry data to AI-specific platforms."""

    async def export_to_langfuse(self, spans):
        """Export to Langfuse for LLM-specific analysis."""
        pass

    async def export_to_helicone(self, spans):
        """Export to Helicone for cost optimization."""
        pass

    async def export_to_phoenix(self, spans):
        """Export to Phoenix for evaluation."""
        pass
```
Implementation Timeline
Phase 1: Core Integration (Week 1)
- Install OpenLLMetry SDK
- Create base plugin structure
- Implement tool instrumentation
- Basic span and metric creation
Phase 2: Advanced Features (Week 2)
- Semantic conventions implementation
- Federation trace propagation
- Cost tracking and calculations
- PII sanitization
Phase 3: Observability Stack (Week 3)
- OpenTelemetry collector setup
- Backend integrations (Jaeger, Datadog, etc.)
- Grafana dashboards
- Alerting rules
Phase 4: Production Readiness (Week 4)
- Performance optimization
- Comprehensive testing
- Documentation completion
- Load testing and tuning
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Performance Overhead | < 5% | Load test comparison |
| Trace Completeness | 100% | All operations traced |
| Backend Compatibility | 3+ backends | Integration tests |
| Test Coverage | > 90% | pytest coverage |
| Documentation | Complete | Review checklist |
| PII Leakage | 0 incidents | Security audit |
| Federation Support | Full | Cross-gateway tests |
| Cost Accuracy | ±1% | Billing comparison |
Related Issues
- #175 - [Feature Request]: Add OpenLLMetry Integration for Observability (this issue)
- #218 - [Feature Request]: Prometheus Metrics Instrumentation using prometheus-fastapi-instrumentator
- #300 - [Feature Request]: Structured JSON Logging with Correlation IDs
- #319 - [Feature Request]: AI Middleware Integration / Plugin Framework for extensible gateway capabilities
- #727 - [Feature]: Phoenix Observability Integration plugin