Enterprise LLM Evals with Kafka & Schema Registry

This repository ships a runnable example of a customer-support LLM agent that proves three things:

  • Natural language prompts can be turned into schema-governed Kafka events without manual glue code.
  • Every response is validated against a JSON Schema and dead-lettered if it drifts, giving you deterministic observability (sketched in the example after this list).
  • Evaluation services consume from Kafka, validate against the Schema Registry, and publish scored results so model performance is traceable.
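
The second bullet is the core pattern: a response reaches the response topic only if it validates; otherwise it is dead-lettered with its validation errors attached. Below is a minimal sketch of that flow, assuming the jsonschema package and a plain confluent-kafka producer (the topic names come from this README; the schema path, payload shape, and bootstrap address are illustrative, not the repository's actual code):

    # Sketch only: schema-gated publishing with a dead-letter fallback.
    # Topic names come from this README; everything else is illustrative.
    import json
    from confluent_kafka import Producer
    from jsonschema import Draft7Validator

    with open("schemas/customer-query-response.jsonschema") as fh:
        validator = Draft7Validator(json.load(fh))

    producer = Producer({"bootstrap.servers": "localhost:9092"})

    def publish_response(response: dict) -> None:
        """Publish a validated response, or dead-letter it with its errors."""
        errors = [e.message for e in validator.iter_errors(response)]
        if errors:
            payload = json.dumps({"response": response, "errors": errors})
            producer.produce("customer-query-dlq", payload.encode())
        else:
            producer.produce("customer-query-responses", json.dumps(response).encode())
        producer.flush()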

Quick Start

  1. Clone the repo and create a virtual environment:
    python -m venv .venv
    source .venv/bin/activate
    pip install -e .[dev]
  2. Copy the environment template and update the values:
    cp .env.example .env
    • Always set OPENAI_API_KEY when USE_OPENAI_CLIENT=true. Leave USE_OPENAI_CLIENT=false to use the deterministic built-in model.
  3. Bring up Kafka, Schema Registry, and the UI:
    docker compose up -d
  4. Register schemas and publish sample prompts:
    make bootstrap-schemas
    make demo-prompts
    make publish-demo-prompts  # pushes three valid prompts and one DLQ trigger
  5. Start the agent:
    make run-agent
    Watch the three valid records appear on the customer-query-responses topic and the failing payload land in customer-query-dlq (a minimal consumer sketch follows these steps).
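
If you want to tail those topics outside the UI, a small console consumer along these lines works as well. This is a sketch using confluent-kafka; the bootstrap address and group id are assumptions, so adjust them to whatever your docker compose setup and .env expose:

    # Sketch: tail the response and DLQ topics from the console.
    # Assumes Kafka is reachable on localhost:9092; adjust to your setup.
    import json
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "readme-demo-watcher",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["customer-query-responses", "customer-query-dlq"])

    try:
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            print(f"[{msg.topic()}] {json.loads(msg.value())}")
    finally:
        consumer.close()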

What You Get

  • Agent loop – src/agent/service.py consumes prompts, calls the LLM client, validates against schemas/customer-query-response.jsonschema, and publishes to Kafka.
  • Structured outputs – src/agent/models.py defines the Pydantic schemas that keep the LLM honest; the synthetic client in src/agent/llm.py will emit an invalid record whenever the prompt includes [invalid-schema-demo] (see the sketch after this list).
  • Evaluation hooks – evaluators under src/evaluators/ illustrate how to score responses and push results to customer-query-evaluations.
  • Docs & samples – The docs/ folder explains the architecture, while make demo-prompts saves ready-to-use JSON prompts under demo/prompts/.
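
To make the structured-output and evaluation pieces concrete, here is a hedged sketch of how a Pydantic response model and a simple evaluator could fit together. The field names, scoring rule, and helper are illustrative assumptions, not the models defined in src/agent/models.py or the evaluators under src/evaluators/; only the customer-query-evaluations topic name comes from this README.

    # Sketch only: hypothetical response model and evaluator.
    # The real models live in src/agent/models.py and src/evaluators/.
    import json
    from confluent_kafka import Producer
    from pydantic import BaseModel, ValidationError

    class CustomerQueryResponse(BaseModel):  # hypothetical field names
        query_id: str
        answer: str
        confidence: float

    def evaluate(raw: dict, producer: Producer) -> None:
        """Score a raw LLM payload and publish the result to Kafka."""
        try:
            response = CustomerQueryResponse.model_validate(raw)
            score = {"query_id": response.query_id, "schema_valid": True,
                     "confidence": response.confidence}
        except ValidationError as exc:
            score = {"query_id": raw.get("query_id"), "schema_valid": False,
                     "errors": [e["msg"] for e in exc.errors()]}
        producer.produce("customer-query-evaluations", json.dumps(score).encode())
        producer.flush()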

Common Tasks

  • Run formatting and static checks: make format
  • Execute the full test suite with coverage: make test
  • Rehydrate the demo prompt set: make demo-prompts
  • Produce prompts onto Kafka (includes the DLQ trigger): make publish-demo-prompts

The combination of Kafka, Schema Registry, and structured LLM responses is meant to show that generative systems can be governed and versioned the same way as any other enterprise data pipeline.

About

End-to-End Evaluation Pipeline for LLM Agents using Apache Kafka & Confluent Schema Registry
