llm-serving-lab

A long-term lab for LLM serving experiments: from baselines to multi-GPU scaling with comprehensive monitoring.

Purpose

This repository provides production-ready infrastructure for LLM serving experiments with comprehensive monitoring:

Automated Infrastructure: Terraform + Ansible for GPU VMs and observability stack
Pre-configured vLLM: Ready-to-use inference server with Mistral-7B-Instruct model
Full Observability: GPU metrics, system monitoring, and API analytics in Grafana
Scalable Foundation: From single-node baselines to multi-GPU experiments
Research-Ready: Structured environment for AI infrastructure engineering experiments

Reliability Features

Both infrastructure stacks include reliability features:

Automated Health Checks:

Services are started in dependency order with health checks
Built-in retry logic and timeout handling
Automatic validation after deployment
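
The behaviour listed above (start services in dependency order, then poll each one with retries) can be sketched in Python. This is only an illustration, not the repository's Ansible logic: the dependency map, the function names start_order and wait_until_healthy, and the retry parameters are all assumptions. The overall timeout here is simply retries multiplied by delay.

```python
import time
from graphlib import TopologicalSorter
from typing import Callable, Iterable, Mapping


def start_order(deps: Mapping[str, Iterable[str]]) -> list[str]:
    """Return services ordered so that each one comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def wait_until_healthy(
    check: Callable[[], bool],
    retries: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `retries` times, pausing `delay` seconds between attempts."""
    for attempt in range(1, retries + 1):
        try:
            if check():
                return True
        except OSError:
            # Treat connection errors as "not healthy yet" and retry.
            pass
        if attempt < retries:
            sleep(delay)
    return False


# Hypothetical dependency map: each service lists what must be up before it.
DEPS = {
    "clickhouse": [],
    "otel-collector": ["clickhouse"],
    "grafana": ["clickhouse"],
}

if __name__ == "__main__":
    for service in start_order(DEPS):
        print("would start and health-check:", service)
```

Passing the sleep function as a parameter keeps the retry loop easy to test without real waiting.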

Robust Deployment:

# Reliable deployment commands for both stacks
make deploy            # Deploy with health checks
make validate          # Quick validation of all services
make restart-services  # Restart services (obs stack only)

Troubleshooting Support:

# Common debugging commands
make ping             # Test connectivity
make logs             # View service logs
make status-services  # Check service status
make ssh              # Direct server access

See individual README files for detailed troubleshooting guides.

Architecture Overview

The lab consists of two main components:

GPU VM: Runs vLLM inference server with automatic GPU monitoring
Observability VM: Collects metrics via OpenTelemetry, stores in ClickHouse, visualizes in Grafana

Metrics are shipped off the GPU VM and stored and visualized on the Observability VM, so storage and dashboard queries do not compete with inference for resources.

Key Features

Infrastructure & Deployment

Automated Deployment: Complete infrastructure setup with make deploy-all
Production Ready: vLLM service with systemd management and auto-restart
Model Ready: Pre-configured Mistral-7B-Instruct (first startup takes 10-15 minutes for model download and loading)
Comprehensive Monitoring: GPU utilization, memory usage, API metrics
Interactive Dashboards: Real-time Grafana visualizations
Secure by Default: VPC isolation, restricted access, encrypted secrets, API authentication
Cost Optimized: Easy VM start/stop, resource right-sizing

Python Development

Modular Architecture: Reusable monitoring modules in src/
Local Development: Full IDE support with type hints and autocompletion
Testing Tools: CLI script for local metrics testing and debugging
Code Quality: Ruff integration for consistent formatting and linting
Infrastructure Integration: Python modules deployed seamlessly via Ansible

Getting Started

Important: Deploy in this order, because the GPU stack depends on the OBS network:

Deploy OBS stack first: See obs/README.md for observability stack setup
Update GPU network config: Get network IDs from OBS and update GPU terraform.tfvars
Deploy GPU stack: See gpu/README.md for GPU infrastructure setup
Wait for model loading: First vLLM startup takes 10-15 minutes for model download and loading
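
Copying the network IDs from the OBS stack into the GPU terraform.tfvars can be scripted. The sketch below assumes the JSON shape printed by terraform output -json (an object whose entries each carry a "value" key) and the variable names obs_network_id and obs_subnet_id used in this README. The helper name tfvars_lines and the sample IDs are made up for illustration.

```python
import json


def tfvars_lines(terraform_output_json: str, names: list[str]) -> list[str]:
    """Turn `terraform output -json` text into `name = "value"` lines."""
    outputs = json.loads(terraform_output_json)
    lines = []
    for name in names:
        # Each output is an object with a "value" key (plus "type" and "sensitive").
        value = outputs[name]["value"]
        # json.dumps quotes plain strings in a way that tfvars files also accept.
        lines.append(f"{name} = {json.dumps(value)}")
    return lines


if __name__ == "__main__":
    # Sample text in the shape of `terraform output -json`; the IDs are made up.
    sample = """
    {
      "obs_network_id": {"sensitive": false, "type": "string", "value": "net-example"},
      "obs_subnet_id": {"sensitive": false, "type": "string", "value": "subnet-example"}
    }
    """
    for line in tfvars_lines(sample, ["obs_network_id", "obs_subnet_id"]):
        print(line)
```

The printed lines can then be pasted into the GPU stack's terraform.tfvars.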

Repository Layout

llm-serving-lab/
├── src/                   # Python modules for development
│   ├── monitoring/        # Metrics collection modules
│   │   ├── metrics_exporter.py   # Main exporter class
│   │   ├── gpu_metrics.py        # GPU metrics via NVIDIA ML
│   │   ├── system_metrics.py     # System metrics via psutil
│   │   └── vllm_metrics.py       # vLLM API metrics
│   ├── deployment/        # Deployment utilities (planned)
│   └── utils/             # Common utilities (planned)
├── gpu/                   # GPU infrastructure (Terraform + Ansible)
│   ├── ansible/           # Ansible automation for VM configuration
│   ├── terraform/         # Infrastructure provisioning
│   ├── Makefile           # GPU management commands
│   └── README.md          # GPU setup instructions
├── obs/                   # Observability stack (ClickHouse, Grafana)
│   ├── dashboards/        # Grafana dashboards (JSON format)
│   ├── sql/               # ClickHouse SQL scripts and schema
│   ├── ansible/           # Ansible automation for deployment
│   ├── terraform/         # Infrastructure provisioning
│   ├── Makefile           # OBS management commands
│   └── README.md          # Observability setup instructions
├── benchmarks/            # Performance benchmarks and analysis (planned)
├── notes/                 # Weekly deliverables and research notes
├── metrics-cli.py         # CLI tool for local metrics testing
├── pyproject.toml         # Project dependencies and tool configuration
└── README.md
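
The layout of src/monitoring/ suggests one main exporter fed by per-source modules (GPU, system, vLLM). The following sketch shows one way such pieces could fit together. It is not the code in metrics_exporter.py: the MetricsExporter interface shown here, the Collector type, and the fake collectors are assumptions, and the real modules read from NVIDIA ML and psutil rather than returning fixed values.

```python
import time
from typing import Callable, Dict

# A collector is any zero-argument callable returning metric name -> value.
Collector = Callable[[], Dict[str, float]]


class MetricsExporter:
    """Hypothetical exporter that merges readings from several collectors."""

    def __init__(self, collectors: Dict[str, Collector]) -> None:
        self.collectors = collectors

    def collect_once(self) -> Dict[str, float]:
        """Run every collector and prefix its metrics with the collector name."""
        sample: Dict[str, float] = {"timestamp": time.time()}
        for prefix, collect in self.collectors.items():
            try:
                readings = collect()
            except Exception:
                # A failing source should not block the others; skip it this round.
                continue
            for name, value in readings.items():
                sample[f"{prefix}.{name}"] = value
        return sample


def fake_gpu() -> Dict[str, float]:
    # Stand-in for a real GPU collector; the values are made up.
    return {"utilization_pct": 42.0, "memory_used_mb": 1024.0}


def fake_system() -> Dict[str, float]:
    # Stand-in for a real system collector; the values are made up.
    return {"cpu_pct": 12.5}


if __name__ == "__main__":
    exporter = MetricsExporter({"gpu": fake_gpu, "system": fake_system})
    print(exporter.collect_once())
```

Treating each source as a plain callable keeps the exporter testable on a machine without a GPU.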

Environment Setup

Prerequisites

Python 3.13+ (managed with uv)
Terraform
Ansible
Docker (for local development)

Python Environment with uv

Initialize the project environment:

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project dependencies
uv sync

# Activate environment (optional, uv run handles this automatically)
source .venv/bin/activate

Terraform Configuration for Russia

For users in Russia: Copy the Terraform configuration template to enable local provider mirrors:

# Copy template to home directory
cp config/terraformrc ~/.terraformrc

# Create local provider mirror directory
mkdir -p ~/.local/terraform-providers

This configuration uses a filesystem mirror for the Yandex Cloud provider, which helps with connectivity issues from Russia.

Development Tools

# Install all dependencies
uv sync --extra dev --extra monitoring

# Infrastructure validation
# Run from the repository root; subshells keep the working directory unchanged
(cd gpu/ && uv run ansible-lint ansible/)    # Lint GPU Ansible
(cd obs/ && uv run ansible-lint ansible/)    # Lint OBS Ansible
uv run yamllint gpu/ansible/ obs/ansible/    # Lint YAML files

# Terraform validation
(cd gpu/terraform/ && terraform fmt -check && terraform validate)
(cd obs/terraform/ && terraform fmt -check && terraform validate)

# Python development - see src/README.md for details
python metrics-cli.py --dry-run --log-level DEBUG  # Test metrics locally
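
For readers unfamiliar with this style of CLI, here is a minimal sketch of how a script could accept the --dry-run and --log-level flags shown above. It is not the actual metrics-cli.py: the meaning given to --dry-run (collect but do not export), the logger name, and the sample reading are assumptions.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Local metrics test (illustrative sketch)")
    parser.add_argument("--dry-run", action="store_true",
                        help="collect metrics but do not send them anywhere")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                        help="verbosity of log output")
    return parser


def main(argv: list[str]) -> int:
    args = build_parser().parse_args(argv)
    logging.basicConfig(level=getattr(logging, args.log_level))
    log = logging.getLogger("metrics-cli-sketch")
    sample = {"cpu_pct": 12.5}  # made-up reading standing in for real collectors
    if args.dry_run:
        log.debug("dry run, not exporting: %s", sample)
    else:
        log.info("would export: %s", sample)
    return 0


if __name__ == "__main__":
    main(["--dry-run", "--log-level", "DEBUG"])
```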

Python Module Development

For detailed information about Python modules and development workflow, see src/README.md.

Note: Some ansible-lint errors related to vault files are expected when vault passwords are not available.

Usage

Quick Start

Deploy Observability Stack:

cd obs/
# Copy and configure example files
make copy-examples
# Edit terraform.tfvars and inventory.ini with your values
make deploy-all

Get OBS network configuration:

cd obs/terraform/
# terraform output accepts one output name per call
terraform output obs_network_id
terraform output obs_subnet_id

Deploy GPU Infrastructure:

cd gpu/
# Copy and configure example files
make copy-examples
# Edit terraform.tfvars with your values AND set obs_network_id/obs_subnet_id to the values from the previous step
make deploy-all

Monitor Model Loading and Manage:

# Wait for model loading (first startup takes 10-15 minutes)
cd gpu/ && make logs  # Monitor loading progress

# Check when API is ready (look for "Supported_tasks: ['generate']" in logs)
# Test API with: make ssh then curl with proper auth (see GPU README)

# Check GPU status
make gpu-info

# View Grafana dashboards
# Access http://<obs-vm-ip>:3000

See individual README files for detailed instructions.
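
Watching the logs for the readiness marker can also be automated. The sketch below checks a stream of log lines for the marker quoted above. The function name api_ready and the sample log lines are made up, and the exact log text may differ between vLLM versions.

```python
from typing import Iterable

# Marker quoted in the Quick Start step; the exact log format may vary by vLLM version.
READY_MARKER = "Supported_tasks: ['generate']"


def api_ready(log_lines: Iterable[str], marker: str = READY_MARKER) -> bool:
    """Return True as soon as any log line contains the readiness marker."""
    return any(marker in line for line in log_lines)


if __name__ == "__main__":
    # Made-up log lines for demonstration only.
    sample_logs = [
        "INFO loading model weights",
        "INFO Supported_tasks: ['generate']",
    ]
    print("ready" if api_ready(sample_logs) else "not ready yet")
```

Because any() stops at the first match, the same function can be fed a live, line-by-line log stream and will return once the marker appears.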

Observability Stack

All metrics and logs are shipped from the GPU VM to a dedicated Observability VM, which runs ClickHouse, Grafana, and an OpenTelemetry Collector (gateway).

See obs/README.md for Terraform + Ansible setup instructions.

Research Notes

Experimental findings and weekly deliverables are tracked in the notes/ directory.

Notes

Week-specific deliverables (e.g. “Week 1: Repo, Baseline, Metrics plumbing”) live under notes/.
Performance notes are tracked in markdown alongside raw JSON run logs.

This is an experimental research repo. Expect rapid iteration.
