Enterprise LLM Evals with Kafka & Schema Registry

This repository ships a runnable example of a customer-support LLM agent that proves three things:

  • Natural language prompts can be turned into schema-governed Kafka events without manual glue code.
  • Every response is validated against a JSON Schema and dead-lettered if it drifts, giving you deterministic observability (sketched in the example after this list).
  • Evaluation services consume from Kafka, validate against the Schema Registry, and publish scored results so model performance is traceable.
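
The second bullet is the core pattern: a response reaches the response topic only if it validates; otherwise it is dead-lettered with its validation errors attached. Below is a minimal sketch of that flow, assuming the jsonschema package and a plain confluent-kafka producer (the topic names come from this README; the schema path, payload shape, and bootstrap address are illustrative, not the repository's actual code):

    # Sketch only: schema-gated publishing with a dead-letter fallback.
    # Topic names come from this README; everything else is illustrative.
    import json
    from confluent_kafka import Producer
    from jsonschema import Draft7Validator

    with open("schemas/customer-query-response.jsonschema") as fh:
        validator = Draft7Validator(json.load(fh))

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def publish_response(response: dict) -> None:
        """Publish a validated response, or dead-letter it with its errors."""
        errors = [e.message for e in validator.iter_errors(response)]
        if errors:
            payload = json.dumps({"response": response, "errors": errors})
            producer.produce("customer-query-dlq", payload.encode())
        else:
            producer.produce("customer-query-responses", json.dumps(response).encode())
        producer.flush()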

Quick Start

  1. Clone the repo and create a virtual environment:
    python -m venv .venv
    source .venv/bin/activate
    pip install -e .[dev]
  2. Copy the environment template and update the values:
    cp .env.example .env
    • Always set OPENAI_API_KEY when USE_OPENAI_CLIENT=true. Leave USE_OPENAI_CLIENT=false to use the deterministic built-in model.
  3. Bring up Kafka, Schema Registry, and the UI:
    docker compose up -d
  4. Register schemas and publish sample prompts:
    make bootstrap-schemas
    make demo-prompts
    make publish-demo-prompts  # pushes three valid prompts and one DLQ trigger
  5. Start the agent:
    make run-agent
    Watch the three valid records appear on the customer-query-responses topic and the failing payload land in customer-query-dlq (a minimal consumer sketch follows these steps).
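
If you want to tail those topics outside the UI, a small console consumer along these lines works as well. This is a sketch using confluent-kafka; the bootstrap address and group id are assumptions, so adjust them to whatever your docker compose setup and .env expose:

    # Sketch: tail the response and DLQ topics from the console.
    # Assumes Kafka is reachable on localhost:9092; adjust to your setup.
    import json
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "readme-demo-watcher",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["customer-query-responses", "customer-query-dlq"])

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            print(f"[{msg.topic()}] {json.loads(msg.value())}")
    finally:
        consumer.close()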

What You Get

  • Agent loop – src/agent/service.py consumes prompts, calls the LLM client, validates against schemas/customer-query-response.jsonschema, and publishes to Kafka.
  • Structured outputs – src/agent/models.py defines the Pydantic schemas that keep the LLM honest; the synthetic client in src/agent/llm.py will emit an invalid record whenever the prompt includes [invalid-schema-demo] (see the sketch after this list).
  • Evaluation hooks – evaluators under src/evaluators/ illustrate how to score responses and push results to customer-query-evaluations.
  • Docs & samples – The docs/ folder explains the architecture, while make demo-prompts saves ready-to-use JSON prompts under demo/prompts/.
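
To make the structured-output and evaluation pieces concrete, here is a hedged sketch of how a Pydantic response model and a simple evaluator could fit together. The field names, scoring rule, and helper are illustrative assumptions, not the models defined in src/agent/models.py or the evaluators under src/evaluators/; only the customer-query-evaluations topic name comes from this README.

    # Sketch only: hypothetical response model and evaluator.
    # The real models live in src/agent/models.py and src/evaluators/.
    import json
    from confluent_kafka import Producer
    from pydantic import BaseModel, ValidationError

    class CustomerQueryResponse(BaseModel):  # hypothetical field names
        query_id: str
        answer: str
        confidence: float

    def evaluate(raw: dict, producer: Producer) -> None:
        """Score a raw LLM payload and publish the result to Kafka."""
        try:
            response = CustomerQueryResponse.model_validate(raw)
            score = {"query_id": response.query_id, "schema_valid": True,
                     "confidence": response.confidence}
        except ValidationError as exc:
            score = {"query_id": raw.get("query_id"), "schema_valid": False,
                     "errors": [e["msg"] for e in exc.errors()]}
        producer.produce("customer-query-evaluations", json.dumps(score).encode())
        producer.flush()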

Common Tasks

  • Run formatting and static checks: make format
  • Execute the full test suite with coverage: make test
  • Rehydrate the demo prompt set: make demo-prompts
  • Produce prompts onto Kafka (includes the DLQ trigger): make publish-demo-prompts

The combination of Kafka, Schema Registry, and structured LLM responses is meant to show that generative systems can be governed and versioned the same way as any other enterprise data pipeline.

About

End-to-End Evaluation Pipeline for LLM Agents using Apache Kafka & Confluent Schema Registry
