A command-line interface tool for serving Large Language Models using vLLM. Provides both interactive and command-line modes with features for configuration profiles, model management, and server monitoring.
Interactive terminal interface with GPU status and system overview
Tip: You can customize the GPU stats bar in settings
- Interactive Mode - Rich terminal interface with menu-driven navigation
- Command-Line Mode - Direct CLI commands for automation and scripting
- Model Management - Automatic discovery of local models with HuggingFace and Ollama support
- Configuration Profiles - Pre-configured and custom server profiles for different use cases
- Server Monitoring - Real-time monitoring of active vLLM servers
- System Information - GPU, memory, and CUDA compatibility checking
- Advanced Configuration - Full control over vLLM parameters with validation
Quick Links: Docs | Quick Start | Screenshots | Usage Guide | Troubleshooting | Roadmap
The Multi-Model Proxy is a new experimental feature that enables serving multiple LLMs through a single unified API endpoint. This feature is currently under active development and available for testing.
What It Does:
- Single Endpoint - All your models accessible through one API
- Live Management - Add or remove models without stopping the service
- Dynamic GPU Management - Efficient GPU resource distribution through vLLM's sleep/wake functionality
- Interactive Setup - User-friendly wizard guides you through configuration
Note: This is an experimental feature under active development. Your feedback helps us improve! Please share your experience through GitHub Issues.
For complete documentation, see the Multi-Model Proxy Guide.
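As a rough sketch of what "single endpoint" means in practice (assuming the proxy exposes vLLM's usual OpenAI-compatible API and listens on the default port 8000; the host, port, and model name below are placeholders for your own setup), switching between served models is just a matter of changing the `model` field in the request:

```bash
# Hypothetical request against the proxy's OpenAI-compatible endpoint.
# Adjust the port and model name to match your configuration.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```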
New built-in profiles specifically optimized for serving GPT-OSS models on different GPU architectures:
- `gpt_oss_ampere` - Optimized for NVIDIA A100 GPUs
- `gpt_oss_hopper` - Optimized for NVIDIA H100/H200 GPUs
- `gpt_oss_blackwell` - Optimized for NVIDIA Blackwell GPUs
Based on official vLLM GPT recipes for maximum performance.
Save and quickly launch your favorite model + profile combinations:
```bash
vllm-cli serve --shortcut my-gpt-server
```
- Automatic discovery of Ollama models
- GGUF format support (experimental)
- System and user directory scanning
- Environment Variables - Universal and profile-specific environment variable management
- GPU Selection - Choose specific GPUs for model serving with `--device 0,1` (see the example after this list)
- Enhanced System Info - vLLM feature detection with attention backend availability
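For example, to pin a server to the first two GPUs (the model name is just a placeholder; use any model discovered on your system):

```bash
# Serve a model on GPUs 0 and 1 only, using the --device flag shown above
vllm-cli serve --model openai/gpt-oss-20b --device 0,1
```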
See CHANGELOG.md for detailed release notes.
vLLM CLI does not install vLLM or PyTorch by default.
```bash
# Install vLLM -- skip this step if you already have vLLM installed in your environment
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
# Or specify a backend: uv pip install vllm --torch-backend=cu128

# Install vLLM CLI
uv pip install --upgrade vllm-cli
uv run vllm-cli
```
```bash
# If you are using conda:
# Activate the environment that has vLLM installed
pip install vllm-cli
vllm-cli
```
```bash
# Install vLLM CLI + vLLM
pip install "vllm-cli[vllm]"
vllm-cli
```
```bash
git clone https://github.com/Chen-zexi/vllm-cli.git
cd vllm-cli
pip install -e .
```
```bash
# If you do not want to use a virtual environment and want to install vLLM along with vLLM CLI
pipx install "vllm-cli[vllm]"
# If you want to install a pre-release version
pipx install --pip-args="--pre" "vllm-cli[vllm]"
```
- Python 3.9+
- CUDA-compatible GPU (recommended)
- vLLM package installed
- For dependency issues, see Troubleshooting Guide
```bash
# Interactive mode - menu-driven interface
vllm-cli

# Serve a model
vllm-cli serve --model openai/gpt-oss-20b

# Use a shortcut
vllm-cli serve --shortcut my-model
```
For detailed usage instructions, see the Usage Guide and Multi-Model Proxy Guide.
vLLM CLI includes 7 optimized profiles for different use cases:
General Purpose:
- `standard` - Minimal configuration with smart defaults
- `high_throughput` - Maximum performance configuration
- `low_memory` - Memory-constrained environments
- `moe_optimized` - Optimized for Mixture of Experts models
Hardware-Specific (GPT-OSS):
- `gpt_oss_ampere` - NVIDIA A100 GPUs
- `gpt_oss_hopper` - NVIDIA H100/H200 GPUs
- `gpt_oss_blackwell` - NVIDIA Blackwell GPUs
See the Profiles Guide for detailed information.
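As an illustration (assuming profiles are selected with a `--profile` flag, which this README does not spell out; check `vllm-cli serve --help` for the exact option name), serving a model with a built-in profile might look like:

```bash
# Hypothetical: launch a model with the high_throughput built-in profile
vllm-cli serve --model openai/gpt-oss-20b --profile high_throughput
```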
- Main Config: `~/.config/vllm-cli/config.yaml`
- User Profiles: `~/.config/vllm-cli/user_profiles.json`
- Shortcuts: `~/.config/vllm-cli/shortcuts.json`
- Usage Guide - Complete usage instructions
- Multi-Model Proxy - Serve multiple models simultaneously
- Profiles Guide - Built-in profiles details
- Troubleshooting - Common issues and solutions
- Screenshots - Visual feature overview
- Model Discovery - Model management guide
- Ollama Integration - Using Ollama models
- Custom Models - Serving custom models
- Roadmap - Future development plans
vLLM CLI uses hf-model-tool for model discovery:
- Comprehensive model scanning
- Ollama model support
- Shared configuration
```
src/vllm_cli/
├── cli/      # CLI command handling
├── config/   # Configuration management
├── models/   # Model management
├── server/   # Server lifecycle
├── ui/       # Terminal interface
└── schemas/  # JSON schemas
```
Contributions are welcome! Please feel free to open an issue or submit a pull request.
MIT License - see LICENSE file for details.