A long-term lab for LLM serving experiments: from baselines to multi-GPU scaling with comprehensive monitoring.
This repository provides production-ready infrastructure for LLM serving experiments with comprehensive monitoring:
- Automated Infrastructure: Terraform + Ansible for GPU VMs and observability stack
- Pre-configured vLLM: Ready-to-use inference server with Mistral-7B-Instruct model
- Full Observability: GPU metrics, system monitoring, and API analytics in Grafana
- Scalable Foundation: From single-node baselines to multi-GPU experiments
- Research-Ready: Structured environment for AI infrastructure engineering experiments
Both infrastructure stacks include reliability features:
Automated Health Checks:
- Services are started in dependency order with health checks
- Built-in retry logic and timeout handling
- Automatic validation after deployment
Robust Deployment:
# Reliable deployment commands for both stacks
make deploy # Deploy with health checks
make validate # Quick validation of all services
# (obs only: make restart-services)
Troubleshooting Support:
# Common debugging commands
make ping # Test connectivity
make logs # View service logs
make status-services # Check service status
make ssh # Direct server access
See individual README files for detailed troubleshooting guides.
The lab consists of two main components:
- GPU VM: Runs vLLM inference server with automatic GPU monitoring
- Observability VM: Collects metrics via OpenTelemetry, stores in ClickHouse, visualizes in Grafana
All metrics are collected off the GPU VM to keep inference performance unaffected by monitoring overhead.
- Automated Deployment: Complete infrastructure setup with
make deploy-all
- Production Ready: vLLM service with systemd management and auto-restart
- Model Ready: Pre-configured Mistral-7B-Instruct (first startup takes 10-15 minutes for model loading)
- Comprehensive Monitoring: GPU utilization, memory usage, API metrics
- Interactive Dashboards: Real-time Grafana visualizations
- Secure by Default: VPC isolation, restricted access, encrypted secrets, API authentication
- Cost Optimized: Easy VM start/stop, resource right-sizing
- Modular Architecture: Reusable monitoring modules in
src/
- Local Development: Full IDE support with type hints and autocompletion
- Testing Tools: CLI script for local metrics testing and debugging
- Code Quality: Ruff integration for consistent formatting and linting
- Infrastructure Integration: Python modules deployed seamlessly via Ansible
Important: Deploy in this order as GPU stack depends on OBS network:
- Deploy OBS stack first: See obs/README.md for observability stack setup
- Update GPU network config: Get network IDs from OBS and update GPU terraform.tfvars
- Deploy GPU stack: See gpu/README.md for GPU infrastructure setup
- Wait for model loading: First vLLM startup takes 10-15 minutes for model download and loading
llm-serving-lab/
├── src/ # Python modules for development
│ ├── monitoring/ # Metrics collection modules
│ │ ├── metrics_exporter.py # Main exporter class
│ │ ├── gpu_metrics.py # GPU metrics via NVIDIA ML
│ │ ├── system_metrics.py # System metrics via psutil
│ │ └── vllm_metrics.py # vLLM API metrics
│ ├── deployment/ # Deployment utilities (planned)
│ └── utils/ # Common utilities (planned)
├── gpu/ # GPU infrastructure (Terraform + Ansible)
│ ├── ansible/ # Ansible automation for VM configuration
│ ├── terraform/ # Infrastructure provisioning
│ ├── Makefile # GPU management commands
│ └── README.md # GPU setup instructions
├── obs/ # Observability stack (ClickHouse, Grafana)
│ ├── dashboards/ # Grafana dashboards (JSON format)
│ ├── sql/ # ClickHouse SQL scripts and schema
│ ├── ansible/ # Ansible automation for deployment
│ ├── terraform/ # Infrastructure provisioning
│ ├── Makefile # OBS management commands
│ └── README.md # Observability setup instructions
├── benchmarks/ # Performance benchmarks and analysis (planned)
├── notes/ # Weekly deliverables and research notes
├── metrics-cli.py # CLI tool for local metrics testing
├── pyproject.toml # Project dependencies and tool configuration
└── README.md
- Python 3.13+ (managed with
uv
) - Terraform
- Ansible
- Docker (for local development)
Initialize the project environment:
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync
# Activate environment (optional, uv run handles this automatically)
source .venv/bin/activate
For users in Russia: Copy the Terraform configuration template to enable local provider mirrors:
# Copy template to home directory
cp config/terraformrc ~/.terraformrc
# Create local provider mirror directory
mkdir -p ~/.local/terraform-providers
This configuration uses a filesystem mirror for the Yandex Cloud provider, which helps with connectivity issues from Russia.
# Install all dependencies
uv sync --extra dev --extra monitoring
# Infrastructure validation
cd gpu/ && uv run ansible-lint ansible/ # Lint GPU Ansible
cd obs/ && uv run ansible-lint ansible/ # Lint OBS Ansible
uv run yamllint gpu/ansible/ obs/ansible/ # Lint YAML files
# Terraform validation
cd gpu/terraform/ && terraform fmt -check && terraform validate
cd obs/terraform/ && terraform fmt -check && terraform validate
# Python development - see src/README.md for details
python metrics-cli.py --dry-run --log-level DEBUG # Test metrics locally
For detailed information about Python modules and development workflow, see src/README.md.
Note: Some ansible-lint errors related to vault files are expected when vault passwords are not available.
-
Deploy Observability Stack:
cd obs/ # Copy and configure example files make copy-examples # Edit terraform.tfvars and inventory.ini with your values make deploy-all
-
Get OBS network configuration:
cd obs/terraform/ terraform output obs_network_id obs_subnet_id
-
Deploy GPU Infrastructure:
cd gpu/ # Copy and configure example files make copy-examples # Edit terraform.tfvars with your values AND update obs_network_id/obs_subnet_id from step 2 make deploy-all
-
Monitor Model Loading and Manage:
# Wait for model loading (first startup takes 10-15 minutes) cd gpu/ && make logs # Monitor loading progress # Check when API is ready (look for "Supported_tasks: ['generate']" in logs) # Test API with: make ssh then curl with proper auth (see GPU README) # Check GPU status make gpu-info # View Grafana dashboards # Access http://<obs-vm-ip>:3000
See individual README files for detailed instructions.
All metrics/logs are collected off the GPU VM into a dedicated Observability VM. This VM runs ClickHouse, Grafana, and an OpenTelemetry Collector (gateway).
See obs/README.md for Terraform + Ansible setup instructions.
Experimental findings and weekly deliverables are tracked in the notes/
directory.
- Week-specific deliverables (e.g. “Week 1: Repo, Baseline, Metrics plumbing”) live under
notes/
. - Performance notes are tracked in markdown alongside raw JSON run logs.
This is an experimental research repo. Expect rapid iteration.