diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md new file mode 100644 index 0000000..1def4bd --- /dev/null +++ b/customizations/LLM Router/README.md @@ -0,0 +1,1190 @@ +# LLM Router with NVIDIA Dynamo Cloud Platform +## Kubernetes Deployment Guide + +
+ +[![NVIDIA](https://img.shields.io/badge/NVIDIA-76B900?style=for-the-badge&logo=nvidia&logoColor=white)](https://nvidia.com) +[![Kubernetes](https://img.shields.io/badge/kubernetes-%23326ce5.svg?style=for-the-badge&logo=kubernetes&logoColor=white)](https://kubernetes.io) +[![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge&logo=docker&logoColor=white)](https://docker.com) +[![Helm](https://img.shields.io/badge/Helm-0F1689?style=for-the-badge&logo=Helm&labelColor=0F1689)](https://helm.sh) + +**Intelligent LLM Request Routing with Distributed Inference Serving** + +
+ +--- + +This comprehensive guide provides step-by-step instructions for deploying the [**NVIDIA LLM Router**](https://github.com/NVIDIA-AI-Blueprints/llm-router) with the official [**NVIDIA Dynamo Cloud Platform**](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) on Kubernetes. + +## NVIDIA LLM Router and Dynamo Integration + +### Overview + +This integration combines two powerful NVIDIA technologies to create an intelligent, scalable LLM serving platform: + +### NVIDIA Dynamo +- **Distributed inference serving framework** +- **Disaggregated serving capabilities** +- **Multi-model deployment support** +- **Kubernetes-native scaling** + +### NVIDIA LLM Router +- **Intelligent request routing** +- **Task classification (12 categories)** +- **Complexity analysis (7 categories)** +- **Rust-based performance** + +> **Result**: A complete solution for deploying multiple LLMs with automatic routing based on request characteristics, maximizing both **performance** and **cost efficiency**. + +### Kubernetes Architecture Overview + +
+ +```mermaid +graph TB + subgraph "Kubernetes Cluster" + subgraph "Ingress Layer" + LB[Load Balancer/Ingress] + end + + subgraph "LLM Router (Helm)" + RC[Router Controller] + RS[Router Server + GPU] + end + + subgraph "Dynamo Platform - Shared Frontend Architecture" + FE[Shared Frontend Service] + PR[Processor] + + subgraph "Model 1 Workers" + VW1[VllmDecodeWorker-8B + GPU] + PW1[VllmPrefillWorker-8B + GPU] + end + + subgraph "Model 2 Workers" + VW2[VllmDecodeWorker-70B + GPU] + PW2[VllmPrefillWorker-70B + GPU] + end + + subgraph "Model 3 Workers" + VW3[VllmDecodeWorker-Mixtral + GPU] + PW3[VllmPrefillWorker-Mixtral + GPU] + end + end + end + + LB --> RC + RC --> RS + RS --> FE + FE --> PR + PR --> VW1 + PR --> VW2 + PR --> VW3 + PR --> PW1 + PR --> PW2 + PR --> PW3 + + style LB fill:#e1f5fe + style RC fill:#f3e5f5 + style RS fill:#f3e5f5 + style FE fill:#e8f5e8 + style PR fill:#e8f5e8 + style VW1 fill:#fff3e0 + style VW2 fill:#fff3e0 + style VW3 fill:#fff3e0 + style PW1 fill:#ffecb3 + style PW2 fill:#ffecb3 + style PW3 fill:#ffecb3 +``` + +
+ +### Key Benefits + +
| **Feature** | **Benefit** | **Impact** |
|:---:|:---:|:---:|
| **Intelligent Routing** | Auto-routes by task/complexity | **Optimal Model Selection** |
| **Cost Optimization** | Small models for simple tasks | **Reduced Infrastructure Costs** |
| **High Performance** | Rust-based, minimal latency | **Sub-millisecond Routing** |
| **Scalability** | Disaggregated multi-model serving | **Enterprise-Grade Throughput** |
| **OpenAI Compatible** | Drop-in API replacement | **Zero Code Changes** |
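To make the "drop-in replacement" point concrete: clients keep sending standard OpenAI chat-completions requests and only change the base URL to the router endpoint; the optional `nim-llm-router` block (used throughout this guide) selects the routing policy. A minimal sketch, assuming the example ingress host `llm-router.local` configured later in this guide (for local testing you can port-forward the router controller instead):

```bash
# Standard OpenAI-style request sent to the router instead of a specific model endpoint
curl -s http://llm-router.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "",
    "messages": [{"role": "user", "content": "Hello!"}],
    "nim-llm-router": {"policy": "task_router", "routing_strategy": "triton", "model": ""}
  }' | jq
```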
+ +### Integration Components + +
#### 1. NVIDIA Dynamo Cloud Platform

- **Purpose**: Distributed LLM inference serving
- **Features**: Disaggregated serving, KV cache management, multi-model support
- **Deployment**: Kubernetes-native with custom resources
- **Models Supported**: Multiple LLMs (Llama, Mixtral, Phi, Nemotron, etc.)
+ +
#### 2. NVIDIA LLM Router

- **Purpose**: Intelligent request routing and model selection
- **Features**: OpenAI API compliant, flexible policy system, configurable backends
- **Architecture**: Rust-based controller + Triton inference server
- **Routing Policies**: Task classification (12 categories), complexity analysis (7 categories)
- **Customization**: Fine-tune models for domain-specific routing (e.g., banking intent classification)
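Because the router server is a standard Triton Inference Server instance, its readiness can be probed with Triton's stock health endpoint once it is deployed. A minimal sketch, assuming the Helm release exposes the server as `llm-router-router-server` on port 8000 (the service name is an assumption; confirm it with `kubectl get svc -n llm-router`):

```bash
# Probe the Triton KServe v2 readiness endpoint of the router server
kubectl port-forward svc/llm-router-router-server 8000:8000 -n llm-router &
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready   # expect 200
```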
+ +
#### 3. Integration Configuration

- **Router Policies**: Define routing rules for different task types
- **Model Mapping**: Map router decisions to Dynamo-served models
- **Service Discovery**: Kubernetes-native service communication
- **Security**: API key management via Kubernetes secrets
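In practice, the model mapping amounts to keeping the `model:` entries in the router policy file aligned with the model names the shared Dynamo frontend actually serves. A quick way to compare the two after deployment (service and namespace names follow the defaults used later in this guide):

```bash
# List the models the Dynamo frontend advertises
kubectl port-forward svc/frontend-service 8000:8000 -n dynamo-kubernetes &
curl -s localhost:8000/v1/models | jq -r '.data[].id'

# List the models referenced by the router policies
grep 'model:' router-config-dynamo.yaml | sort -u
```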
+ +### Routing Strategies + +
+ +#### Task-Based Routing +*Routes requests based on the type of task being performed* + +
+ +
| **Task Type** | **Target Model** | **Use Case** |
|:---|:---|:---|
| Code Generation | `llama-3.1-70b-instruct` | Programming tasks |
| Brainstorming | `llama-3.1-70b-instruct` | Creative ideation |
| Chatbot | `mixtral-8x22b-instruct-v0.1` | Conversational AI |
| Summarization | `llama-3.1-8b-instruct` | Text summarization |
| Open QA | `llama-3.1-70b-instruct` | Complex questions |
| Closed QA | `llama-3.1-8b-instruct` | Simple Q&A |
| Classification | `llama-3.1-8b-instruct` | Text classification |
| Extraction | `llama-3.1-8b-instruct` | Information extraction |
| Rewrite | `llama-3.1-8b-instruct` | Text rewriting |
| Text Generation | `mixtral-8x22b-instruct-v0.1` | General text generation |
| Other | `mixtral-8x22b-instruct-v0.1` | Miscellaneous tasks |
| Unknown | `llama-3.1-8b-instruct` | Unclassified tasks |
+ +--- + +
+ +#### Complexity-Based Routing +*Routes requests based on the complexity of the task* + +
+ +
| **Complexity Level** | **Target Model** | **Use Case** |
|:---|:---|:---|
| Creativity | `llama-3.1-70b-instruct` | Creative tasks |
| Reasoning | `llama-3.1-70b-instruct` | Complex reasoning |
| Contextual-Knowledge | `llama-3.1-8b-instruct` | Context-dependent tasks |
| Few-Shot | `llama-3.1-70b-instruct` | Tasks with examples |
| Domain-Knowledge | `mixtral-8x22b-instruct-v0.1` | Specialized knowledge |
| No-Label-Reason | `llama-3.1-8b-instruct` | Unclassified complexity |
| Constraint | `llama-3.1-8b-instruct` | Tasks with constraints |
+ +### Performance Benefits + +
| **Metric** | **Improvement** | **How It Works** |
|:---:|:---:|:---|
| **Latency** | `↓ 40-60%` | Smaller models for simple tasks |
| **Cost** | `↓ 30-50%` | Large models only when needed |
| **Throughput** | `↑ 2-3x` | Better resource utilization |
| **Scalability** | `↑ 10x` | Independent component scaling |
+ +### API Usage Examples + +
+ +#### Task-Based Routing + +
```bash
# Code generation task → routed to llama-3.1-70b-instruct (per the task routing table above)
curl -X POST http://llm-router.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "",
    "messages": [{"role": "user", "content": "Write a Python function to sort a list"}],
    "max_tokens": 512,
    "nim-llm-router": {
      "policy": "task_router",
      "routing_strategy": "triton",
      "model": ""
    }
  }'
```
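To confirm which backend model the router actually selected for a request like the one above, tail the router controller logs (deployment and namespace names match the Helm deployment performed later in this guide):

```bash
# Routing decisions are logged by the router controller
kubectl logs deployment/llm-router-router-controller -n llm-router --tail=20
```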
+ +#### Complexity-Based Routing + +
```bash
# Complex reasoning task → routed to llama-3.1-70b-instruct (per the complexity routing table above)
curl -X POST http://llm-router.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "",
    "messages": [{"role": "user", "content": "Explain quantum entanglement"}],
    "max_tokens": 512,
    "nim-llm-router": {
      "policy": "complexity_router",
      "routing_strategy": "triton",
      "model": ""
    }
  }'
```

### How Dynamo Model Routing Works

The key insight is that Dynamo provides a **single gateway endpoint** that routes to different models based on the `model` parameter in the OpenAI-compatible API request:

1. **Single Endpoint**: `http://frontend-service.dynamo-kubernetes.svc.cluster.local:8000/v1` (the shared frontend service deployed in this guide)
2. **Model-Based Routing**: Dynamo routes internally based on the `model` field in requests
3. **OpenAI Compatibility**: Standard OpenAI API format with model selection

Example request:
```json
{
  "model": "llama-3.1-70b-instruct",
  "messages": [...],
  "temperature": 0.7
}
```
The `model` field is what Dynamo uses to select the target model.

Dynamo's internal architecture handles:
- Model registry and discovery
- Request parsing and routing
- Load balancing across replicas
- KV cache management
- Disaggregated serving coordination

## Kubernetes Integration Deployment

This integration demonstrates how to deploy the official NVIDIA Dynamo Cloud Platform for distributed LLM inference on Kubernetes and route requests intelligently using the NVIDIA LLM Router. The Kubernetes deployment includes:

1. **NVIDIA Dynamo Cloud Platform**: Distributed inference serving with Kubernetes operators and custom resources
2. **LLM Router**: Helm-deployed intelligent request routing with GPU-accelerated routing models
3. **Multiple LLM Models**: Containerized models deployed via DynamoGraphDeployment CRs

### Key Components

#### Shared Frontend Architecture

The deployment uses a **shared frontend architecture** that splits the original `agg.yaml` into separate components for better resource utilization and model sharing:

- **frontend.yaml**: Shared OpenAI-compatible API frontend service
  - Single frontend instance serves all models
  - Handles request routing and load balancing
  - Reduces resource overhead compared to per-model frontends
  - Uses the official NGC Dynamo vLLM Runtime container referenced by the `DYNAMO_IMAGE` variable

- **agg.yaml / disagg.yaml**: Templates for model-specific workers
  - **agg.yaml**: Aggregated worker configuration with VllmDecodeWorker (1 GPU per model)
  - **disagg.yaml**: Disaggregated worker configuration with separate VllmDecodeWorker and VllmPrefillWorker (1 GPU each)
  - Common: Shared configuration (model, block-size, KV connector)
  - Deployed once per model; when several models run side by side, give each deployment a unique `metadata.name` in addition to setting `MODEL_NAME`

#### Configuration Files

- **router-config-dynamo.yaml**: Router policies for Dynamo integration (uses the `${DYNAMO_API_BASE}` variable)
- **llm-router-values-override.yaml**: Helm values for LLM Router with Dynamo integration (defines the `dynamoConfig.api_base` value)

### Shared Frontend Benefits
| **Benefit** | **Shared Frontend** | **Per-Model Frontend** | **Improvement** |
|:---:|:---:|:---:|:---:|
| **Resource Usage** | 1 Frontend + N Workers | N Frontends + N Workers | **↓ 30-50% CPU/Memory** |
| **Network Complexity** | Single Endpoint | Multiple Endpoints | **Simplified Routing** |
| **Maintenance** | Single Service | Multiple Services | **↓ 70% Ops Overhead** |
| **Load Balancing** | Built-in across models | Per-model only | **Better Utilization** |
| **API Consistency** | Single OpenAI API | Multiple APIs | **Unified Interface** |
**Key Advantages:**
- **Resource Efficiency**: Single frontend serves all models, reducing CPU and memory overhead
- **Simplified Operations**: One service to monitor, scale, and maintain instead of multiple frontends
- **Better Load Distribution**: Intelligent request routing across all available model workers
- **Cost Optimization**: Fewer running services means lower infrastructure costs
- **Unified API Gateway**: Single endpoint for all models with a consistent OpenAI API interface

### Disaggregated Serving Configuration

The deployment uses the official disaggregated serving architecture based on [Dynamo's vLLM backend deployment reference](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/vllm/deploy):

**Key Features**:
- **Multi-Model Support**: Deploy multiple models (Llama-3.1-8B, Llama-3.1-70B, Mixtral-8x22B) using environment variables
- **KV Transfer**: Uses `DynamoNixlConnector` for high-performance KV cache transfer
- **Conditional Disaggregation**: Automatically switches between prefill and decode workers
- **Remote Prefill**: Offloads prefill operations to dedicated VllmPrefillWorker instances
- **Prefix Caching**: Enables intelligent caching for improved performance
- **Block Size**: 64 tokens for optimal memory utilization
- **Max Model Length**: 16,384+ token context window (varies by model)
- **Shared Frontend**: Single frontend serves all deployed models
- **Intelligent Routing**: LLM Router selects the optimal model based on task complexity

### Environment Variables

Set the required environment variables for deployment:

| Variable | Description | Example | Required | Used In |
|----------|-------------|---------|----------|---------|
| `NAMESPACE` | Kubernetes namespace for deployment | `dynamo-kubernetes` | Yes | All deployments |
| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.1` | Yes | Platform install |
| `MODEL_NAME` | Hugging Face model to deploy | `meta-llama/Llama-3.1-8B-Instruct` | Yes | Model deployment |
| `DYNAMO_IMAGE` | Full Dynamo runtime image path | `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` | Yes | Model deployment |
| `HF_TOKEN` | Hugging Face access token | `your_hf_token` | Yes | Model access |
| `NGC_API_KEY` | NVIDIA NGC API key | `your-ngc-api-key` | No | Private images |
| `DYNAMO_API_BASE` | Dynamo service endpoint URL | `http://frontend-service.dynamo-kubernetes.svc.cluster.local:8000` | Yes | LLM Router |
| `DYNAMO_API_KEY` | Dynamo API authentication key | `your-dynamo-api-key-here` | No | LLM Router auth |

### Model Size Recommendations

For the best deployment experience, weigh model size against available GPU resources and download time:

| Model Size | GPU Memory | Download Time | Recommended For |
|------------|------------|---------------|-----------------|
| **Small (1-2B)** | ~3-4GB | 2-5 minutes | Development, testing |
| **Medium (7-8B)** | ~8-12GB | 10-20 minutes | Production, single GPU |
| **Large (70B+)** | ~40GB+ | 30+ minutes | Multi-GPU setups |

**Recommended Models:**
- `meta-llama/Llama-3.1-8B-Instruct` - Balanced performance, used in router config (15GB)
- `meta-llama/Llama-3.1-70B-Instruct` - High performance, used in router config (40GB+)
- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative tasks, used in router config (90GB+)
- `Qwen/Qwen2.5-1.5B-Instruct` - Fast testing model (3GB)
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Ultra-fast testing (2GB)

> **💡 Health Check Configuration**: The `frontend.yaml` and `disagg.yaml` include extended health check timeouts (30 minutes) to allow sufficient time for model download and loading. Health checks must be configured at the service level, not in `extraPodSpec`, for the Dynamo operator to respect them. The shared frontend architecture reduces the number of health checks needed compared to per-model frontends.

**NGC Setup Instructions**:
1. **Choose Dynamo Version**: Visit [NGC Dynamo vLLM Runtime Tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime/tags) to see available versions
2. **Set Version**: Export your chosen version: `export DYNAMO_VERSION=0.4.1` (or latest available)
3. **Optional - NGC API Key**: Visit [https://ngc.nvidia.com/setup/api-key](https://ngc.nvidia.com/setup/api-key) if you need private image access
4. **Prebuilt Images**: NGC provides prebuilt CUDA and ML framework images, eliminating the need for local builds

**Available NGC Dynamo Images**:
- **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` (recommended)
- **SGLang Runtime**: `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1`
- **TensorRT-LLM Runtime**: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1`
- **Dynamo Kubernetes Operator**: `nvcr.io/nvidia/ai-dynamo/dynamo-operator:latest`
- **Dynamo Deployment API**: `nvcr.io/nvidia/ai-dynamo/dynamo-api-store:latest`

### Configuration Variables

The deployment uses a configurable `api_base` value for flexible endpoint management:

| Variable | File | Description | Default Value |
|----------|------|-------------|---------------|
| `dynamoConfig.api_base` | `llm-router-values-override.yaml` | Dynamo LLM endpoint URL | `http://frontend-service.${NAMESPACE}.svc.cluster.local:8000` |
| `${DYNAMO_API_BASE}` | `router-config-dynamo.yaml` | Template variable substituted during deployment | Derived from `dynamoConfig.api_base` |

This approach allows you to:
- **Switch environments** by changing only the `dynamoConfig.api_base` value
- **Override during deployment** with `--set dynamoConfig.api_base=http://new-endpoint:8000`
- **Use different values files** for different environments (dev/staging/prod)

### Resource Requirements

**Kubernetes Production Deployment**:

**Minimum Requirements**:
- **Kubernetes cluster** with 4+ GPU nodes for disaggregated serving
- **Each node**: 16+ CPU cores, 64GB+ RAM, 2-4 GPUs
- **Storage**: 500GB+ for model storage (SSD recommended)
- **Network**: High-bandwidth interconnect for multi-node setups

**Component Resource Allocation**:
- **Frontend**: 1-2 CPU cores, 2-4GB RAM (handles HTTP requests)
- **Processor**: 2-4 CPU cores, 4-8GB RAM (request processing)
- **VllmDecodeWorker**: 1+ GPU (more for 70B-class models), 8+ CPU cores, 16GB+ RAM (model inference)
- **VllmPrefillWorker**: 1+ GPU, 4+ CPU cores, 8GB+ RAM (prefill operations)
- **Router**: 1-2 CPU cores, 2-4GB RAM (KV-aware routing)
- **LLM Router**: 1 GPU, 2 CPU cores, 4GB RAM (routing model inference)

**Scaling Considerations**:
- **Disaggregated Serving**: Separate prefill and decode for better throughput
- **Horizontal Scaling**: Multiple VllmDecodeWorker and VllmPrefillWorker replicas
- **GPU Memory**: Adjust based on model size (70B models need 40GB+ VRAM per GPU)

## Prerequisites
+ +[![Prerequisites](https://img.shields.io/badge/Prerequisites-Check%20List-blue?style=for-the-badge&logo=checkmk)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#prerequisites) + +*Ensure your environment meets all requirements before deployment* + +
+ +### Required Tools + +
+ +**Verify you have the required tools installed:** + +
```bash
# Required tools verification
kubectl version --client
helm version
docker version
```
| **Tool** | **Requirement** | **Status** |
|:---:|:---:|:---:|
| **kubectl** | `v1.24+` | Check with `kubectl version --client` |
| **Helm** | `v3.0+` | Check with `helm version` |
| **Docker** | Running daemon | Check with `docker version` |
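The table above covers the CLI tooling; the GPU requirement listed under "Additional Requirements" below can be sanity-checked as well. A small sketch, assuming the GPU Operator exposes the standard `nvidia.com/gpu` resource:

```bash
# Show how many GPUs each node advertises to the Kubernetes scheduler
kubectl get nodes -o json \
  | jq -r '.items[] | "\(.metadata.name): \(.status.allocatable["nvidia.com/gpu"] // "0") allocatable GPU(s)"'
```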
+ +**Additional Requirements:** +- **NVIDIA GPU nodes** with GPU Operator installed (for LLM inference) +- **Container registry access** (Docker Hub, NVIDIA NGC, etc.) +- **Git** for cloning repositories + +### Inference Runtime Images + +Set your inference runtime image from the available NGC options: + +```bash +# Set your inference runtime image +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 +``` + +**Available Runtime Images**: +- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` - vLLM backend (recommended) +- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1` - SGLang backend +- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1` - TensorRT-LLM backend + +### Hugging Face Token + +For accessing models from Hugging Face Hub, you'll need a Hugging Face token: + +```bash +# Set your Hugging Face token for model access +export HF_TOKEN=your_hf_token +``` + +Get your token from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) + +### Kubernetes Cluster Requirements + +#### PVC Support with Default Storage Class +Dynamo Cloud requires Persistent Volume Claim (PVC) support with a default storage class. Verify your cluster configuration: + +```bash +# Check if default storage class exists +kubectl get storageclass + +# Expected output should show at least one storage class marked as (default) +# Example: +# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE +# standard (default) kubernetes.io/gce-pd Delete Immediate true 1d +``` + +### Optional Requirements + +#### Service Mesh (Optional) +For advanced networking and security features, you may want to install: +- **Istio service mesh**: For advanced traffic management and security + +```bash +# Check if Istio is installed +kubectl get pods -n istio-system + +# Expected output should show running Istio pods +# istiod-* pods should be in Running state +``` + +If Istio is not installed, follow the [official Istio installation guide](https://istio.io/latest/docs/setup/getting-started/). + +## Pre-Deployment Validation + +
+ +[![Validation](https://img.shields.io/badge/Pre--Deployment-Validation-yellow?style=for-the-badge&logo=checkmarx)](https://kubernetes.io) + +*Validate your environment before starting deployment* + +
+ +Before starting the deployment, validate that your environment meets all requirements: + +### Validate Kubernetes Cluster + +```bash +# Verify Kubernetes cluster access and version +kubectl version --client +kubectl cluster-info + +# Check node resources and GPU availability +kubectl get nodes -o wide +kubectl describe nodes | grep -A 5 "Capacity:" + +# Verify default storage class exists +kubectl get storageclass +``` + +### Validate Container Registry Access + +```bash +# Test NGC registry access (if using NGC images) +docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY + +# Verify you can pull the Dynamo runtime image +docker pull $DYNAMO_IMAGE +``` + +### Validate Configuration Files + +```bash +# Navigate to the customization directory +cd customizations/LLM\ Router + +# Check that required files exist +ls -la frontend.yaml agg.yaml disagg.yaml router-config-dynamo.yaml llm-router-values-override.yaml + +# Validate YAML syntax +python -c "import yaml; yaml.safe_load(open('frontend.yaml'))" && echo "frontend.yaml is valid" +python -c "import yaml; yaml.safe_load(open('agg.yaml'))" && echo "agg.yaml is valid" +python -c "import yaml; yaml.safe_load(open('disagg.yaml'))" && echo "disagg.yaml is valid" +python -c "import yaml; yaml.safe_load(open('router-config-dynamo.yaml'))" && echo "router-config-dynamo.yaml is valid" +python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" && echo "llm-router-values-override.yaml is valid" +``` + +### Environment Setup + +```bash +# Core deployment variables +export NAMESPACE=dynamo-kubernetes +export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + +# Model deployment variables +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model (see recommendations above) +export HF_TOKEN=your_hf_token + +# Optional variables +export NGC_API_KEY=your-ngc-api-key # Optional for public images + +# LLM Router variables (set during router deployment) +export DYNAMO_API_BASE="http://frontend-service.${NAMESPACE}.svc.cluster.local:8000" +export DYNAMO_API_KEY="your-dynamo-api-key-here" # Optional for local deployments +``` + +### Validate Environment Variables + +```bash +# Check required environment variables are set +echo "NAMESPACE: ${NAMESPACE:-'NOT SET'}" +echo "DYNAMO_VERSION: ${DYNAMO_VERSION:-'NOT SET'}" +echo "MODEL_NAME: ${MODEL_NAME:-'NOT SET'}" +echo "DYNAMO_IMAGE: ${DYNAMO_IMAGE:-'NOT SET'}" +echo "HF_TOKEN: ${HF_TOKEN:-'NOT SET'}" +echo "NGC_API_KEY: ${NGC_API_KEY:-'NOT SET (optional for public images)'}" +echo "DYNAMO_API_BASE: ${DYNAMO_API_BASE:-'NOT SET (set during router deployment)'}" +echo "DYNAMO_API_KEY: ${DYNAMO_API_KEY:-'NOT SET (optional for local deployments)'}" +``` + +## Deployment Guide + +
+ +[![Deployment](https://img.shields.io/badge/Deployment-Step%20by%20Step-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) + +**Complete walkthrough for deploying NVIDIA Dynamo and LLM Router** + +
+ +--- + + +### Deployment Overview + +
```mermaid
graph LR
    A[Prerequisites] --> B[Install Platform]
    B --> C[Deploy vLLM]
    C --> D[Setup Router]
    D --> E[Configure Access]
    E --> F[Test Integration]

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style F fill:#e0f2f1
```
+ +### Step 1: Install Dynamo Platform (Path A: Production Install) + +
+ +[![Step 1](https://img.shields.io/badge/Step%201-Install%20Platform-blue?style=for-the-badge&logo=kubernetes)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#path-a-production-install) + +*Deploy the Dynamo Cloud Platform using the official **Path A: Production Install*** + +
```bash
# 1. Install CRDs (use 'upgrade' instead of 'install' if already installed)
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${DYNAMO_VERSION}.tgz
helm install dynamo-crds dynamo-crds-${DYNAMO_VERSION}.tgz --namespace default

# 2. Install Platform (use 'upgrade' instead of 'install' if already installed)
kubectl create namespace ${NAMESPACE}
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${DYNAMO_VERSION}.tgz
helm install dynamo-platform dynamo-platform-${DYNAMO_VERSION}.tgz --namespace ${NAMESPACE}

# 3. Verify deployment
# Check CRDs
kubectl get crd | grep dynamo
# Check operator and platform pods
kubectl get pods -n ${NAMESPACE}
# Expected: dynamo-operator-* and etcd-* pods Running
kubectl get svc -n ${NAMESPACE}
```

### Step 2: Deploy Multiple vLLM Models
+ +[![Step 2](https://img.shields.io/badge/Step%202-Deploy%20Multiple%20Models-orange?style=for-the-badge&logo=nvidia)](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md) + +*Deploy multiple vLLM models for intelligent routing* + +
+ + + +Since our LLM Router routes to different models based on task complexity, we can deploy models using the environment variables already set in Step 1. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): + +```bash +# 1. Create Kubernetes secret for Hugging Face token (using variables from Step 1) +kubectl create secret generic hf-token-secret \ + --from-literal=HF_TOKEN=${HF_TOKEN} \ + -n ${NAMESPACE} + +# 2. Navigate to your LLM Router directory (where agg.yaml/disagg.yaml are located) +cd "customizations/LLM Router/" +``` + +#### Shared Frontend Deployment + +**Step 1: Deploy Shared Frontend** +```bash +# Deploy the shared frontend service (serves all models) +envsubst < frontend.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +**Step 2: Deploy Model Workers** + +Choose your worker deployment approach: + +**Option A: Using agg.yaml (aggregated workers)** +```bash +# Deploy model workers only (frontend extracted to frontend.yaml) +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +**Option B: Using disagg.yaml (disaggregated workers)** +```bash +# Deploy separate prefill and decode workers (frontend extracted to frontend.yaml) +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct +envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +### Adding More Models (Optional) + +**Current Setup**: We deploy 3 models that cover most use cases: +- **Llama-3.1-8B**: Fast model for simple tasks +- **Llama-3.1-70B**: Powerful model for complex tasks +- **Mixtral-8x22B**: Creative model for conversational tasks + +**To add more models**, follow this pattern: + +#### Example: Adding Phi-3-Mini Model + +```bash +# Simply set the model name and deploy using existing files +export MODEL_NAME=microsoft/Phi-3-mini-128k-instruct + +# Deploy using aggregated workers +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} + +# OR deploy using disaggregated workers +envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} +``` + +**Repeat this pattern** for any additional models you want to deploy. + +### Step 3: Verify Shared Frontend Deployment + +
+ +[![Step 3](https://img.shields.io/badge/Step%203-Verify%20Deployments-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) + +*Verify that the shared frontend and model workers have been deployed successfully* + +
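One caveat before verifying: `agg.yaml` and `disagg.yaml` ship with fixed `metadata.name` values (`vllm-agg` / `vllm-disagg`), so re-applying them with a different `MODEL_NAME` updates the existing DynamoGraphDeployment rather than creating a second one. When running several models side by side, give each model's deployment a unique name before applying. To see which DynamoGraphDeployments actually exist (the plural resource name below is an assumption; confirm it with `kubectl api-resources | grep -i dynamo`):

```bash
kubectl get dynamographdeployments -n ${NAMESPACE}
```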
+ +```bash +# Check deployment status for shared frontend and all model workers +kubectl get pods -n ${NAMESPACE} +kubectl get svc -n ${NAMESPACE} + +# Verify shared frontend is running +kubectl logs deployment/frontend -n ${NAMESPACE} --tail=10 + +# Look for all model worker pods +kubectl get pods -n ${NAMESPACE} | grep -E "(worker|decode|prefill)" + +# Verify the shared frontend service (single port for all models) +kubectl get svc -n ${NAMESPACE} | grep frontend +``` + +### Step 4: Test Shared Frontend Service + +
+ +[![Step 4](https://img.shields.io/badge/Step%204-Test%20Services-purple?style=for-the-badge&logo=checkmarx)](https://checkmarx.com) + +*Test the shared frontend service with different models* + +
+ +```bash +# Forward the shared frontend service port +kubectl port-forward svc/frontend-service 8000:8000 -n ${NAMESPACE} & + +# Test different models through the same endpoint by specifying the model name + +# Test Model 1 (e.g., Llama-3.1-8B) +curl localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-3.1-8B-Instruct", + "messages": [{"role": "user", "content": "Simple question: What is 2+2?"}], + "stream": false, + "max_tokens": 30 + }' | jq + +# Test Model 2 (e.g., different model if deployed) +curl localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "microsoft/Phi-3-mini-128k-instruct", + "messages": [{"role": "user", "content": "Explain quantum computing briefly"}], + "stream": false, + "max_tokens": 100 + }' | jq + +# Check health and available models +curl localhost:8000/health +curl localhost:8000/v1/models | jq +``` + +### Step 5: Set Up LLM Router API Keys + +
+ +[![Step 5](https://img.shields.io/badge/Step%205-Setup%20API%20Keys-red?style=for-the-badge&logo=keycdn)](https://github.com/NVIDIA-AI-Blueprints/llm-router) + +*Configure API keys for LLM Router integration* + +
+ +**IMPORTANT**: The router configuration uses Kubernetes secrets for API key management following the [official NVIDIA pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml). + +```bash +# 1. Create the LLM Router namespace +kubectl create namespace llm-router + +# 2. Create secret for Dynamo API key (if authentication is required) +# Note: For local Dynamo deployments, API keys may not be required +kubectl create secret generic dynamo-api-secret \ + --from-literal=DYNAMO_API_KEY="your-dynamo-api-key-here" \ + --namespace=llm-router + +# 3. (Optional) Create image pull secret for private registries (only if using private container registry) +kubectl create secret docker-registry nvcr-secret \ + --docker-server=nvcr.io \ + --docker-username='$oauthtoken' \ + --docker-password="your-ngc-api-key-here" \ + --namespace=llm-router + +# 4. Verify secrets were created +kubectl get secrets -n llm-router +``` + +### Step 6: Deploy LLM Router + +
+ +[![Step 6](https://img.shields.io/badge/Step%206-Deploy%20Router-indigo?style=for-the-badge&logo=nvidia)](https://github.com/NVIDIA-AI-Blueprints/llm-router) + +*Deploy the NVIDIA LLM Router using Helm* + +
**Note**: The NVIDIA LLM Router requires building images from source and using the official Helm charts from the GitHub repository.

```bash
# 1. Clone the NVIDIA LLM Router repository (required for Helm charts)
git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git
cd llm-router

# 2. Build and push LLM Router images to your registry
docker build -t <your-registry>/router-server:latest -f src/router-server/router-server.dockerfile .
docker build -t <your-registry>/router-controller:latest -f src/router-controller/router-controller.dockerfile .
docker build -t <your-registry>/llm-router-client:app -f demo/app/app.dockerfile .

# Push to your registry
docker push <your-registry>/router-server:latest
docker push <your-registry>/router-controller:latest
docker push <your-registry>/llm-router-client:app

# 3. Create router configuration ConfigMap with environment variable substitution
# Set DYNAMO_API_BASE for template substitution (DYNAMO_API_KEY comes from dynamo-api-secret)
export DYNAMO_API_BASE="http://frontend-service.${NAMESPACE}.svc.cluster.local:8000"

# Create ConfigMap with substituted values
envsubst < router-config-dynamo.yaml | \
kubectl create configmap router-config-dynamo \
  --from-file=config.yaml=/dev/stdin \
  --namespace=llm-router

# 4. Prepare router models (download from NGC)
# Download the NemoCurator Prompt Task and Complexity Classifier model from NGC:
# https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/prompt-task-and-complexity-classifier/version
# Follow the main project README to download models to local 'routers/' directory
# Then create a PVC and upload the models (example PVC below; adjust the name, size, and
# storage class to your cluster and to the volume the Helm chart expects):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: router-models-pvc
  namespace: llm-router
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF
# Copy the downloaded 'routers/' directory into the PVC, e.g. via a helper pod and kubectl cp

# 5. Deploy the LLM Router with the official Helm chart from the cloned repository
# (the values file path assumes the repo was cloned inside "customizations/LLM Router/")
helm install llm-router deploy/helm/llm-router \
  --namespace llm-router \
  -f ../llm-router-values-override.yaml \
  --set routerController.image.repository=<your-registry>/router-controller \
  --set routerServer.image.repository=<your-registry>/router-server \
  --set imagePullSecrets[0].name=nvcr-secret \
  --wait --timeout=10m

# 6. Verify LLM Router deployment
kubectl get pods -n llm-router
kubectl get svc -n llm-router
```

### Step 7: Configure External Access
+ +[![Step 7](https://img.shields.io/badge/Step%207-Configure%20Access-teal?style=for-the-badge&logo=nginx)](https://kubernetes.io) + +*Configure external access to the LLM Router* + +
+ +```bash +# For development/testing, use port forwarding to access LLM Router +kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router + +# Test the LLM Router API +curl http://localhost:8084/health +``` + +## Configuration + +### Ingress Configuration + +The LLM Router is configured with ingress disabled by default to avoid service name conflicts. To enable external access: + +```yaml +ingress: + enabled: false # Disabled by default - enable after deployment is working + className: "nginx" # Adjust for your ingress controller + hosts: + - host: llm-router.local # Change to your domain + paths: + - path: / + pathType: Prefix +``` + +**Important**: Update the `host` field in `llm-router-values-override.yaml` to match your domain: + +```bash +# For production, replace llm-router.local with your actual domain +sed -i 's/llm-router.local/your-domain.com/g' llm-router-values-override.yaml +``` + +**For local testing**, add the ingress IP to your `/etc/hosts`: + +```bash +# Get the ingress IP and add to hosts file +INGRESS_IP=$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') +echo "$INGRESS_IP llm-router.local" | sudo tee -a /etc/hosts +``` + +### API Key Management + +The router configuration uses **environment variable substitution** for secure API key management, following the [official NVIDIA LLM Router pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml): + +```yaml +# In router-config-dynamo.yaml +llms: + - name: Brainstorming + api_base: http://frontend-service.dynamo-kubernetes.svc.cluster.local:8000/v1 + api_key: "${DYNAMO_API_KEY}" # Resolved from Kubernetes secret + model: meta-llama/Llama-3.1-70B-Instruct +``` + +The LLM Router controller: +1. Reads `DYNAMO_API_KEY` from the `dynamo-api-secret` Kubernetes secret +2. Replaces `${DYNAMO_API_KEY}` placeholders in the configuration +3. Uses the actual API key value for authentication with Dynamo services + +**Security Note**: Never use empty strings (`""`) for API keys. Always use proper Kubernetes secrets with environment variable references. + +### Router Configuration + +The `router-config-dynamo.yaml` configures routing policies to our deployed models. 
+ +**Current Setup**: The configuration routes to different models based on task complexity and type: +- `meta-llama/Llama-3.1-8B-Instruct` - Fast model for simple tasks +- `meta-llama/Llama-3.1-70B-Instruct` - Powerful model for complex tasks +- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative model for conversational tasks + +**Note**: All routing goes through the shared frontend service which handles model selection: + +| **Task Router** | **Model** | **Shared Frontend** | **Use Case** | +|-----------------|-----------|--------------|--------------| +| Brainstorming | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative ideation | +| Chatbot | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Conversational AI | +| Code Generation | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Programming tasks | +| Summarization | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text summarization | +| Text Generation | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | General text creation | +| Open QA | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex questions | +| Closed QA | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple Q&A | +| Classification | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text classification | +| Extraction | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Information extraction | +| Rewrite | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text rewriting | + +| **Complexity Router** | **Model** | **Shared Frontend** | **Use Case** | +|----------------------|-----------|--------------|--------------| +| Creativity | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative tasks | +| Reasoning | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex reasoning | +| Contextual-Knowledge | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Knowledge-intensive | +| Few-Shot | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Few-shot learning | +| Domain-Knowledge | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Specialized domains | +| No-Label-Reason | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple reasoning | +| Constraint | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Constrained tasks | + +**Intelligent Routing Strategy**: +- **Simple tasks** → `meta-llama/Llama-3.1-8B-Instruct` (fast, efficient) +- **Complex tasks** → `meta-llama/Llama-3.1-70B-Instruct` (powerful, detailed) +- **Creative/Conversational** → `mistralai/Mixtral-8x22B-Instruct-v0.1` (diverse, creative) +- **Extensible**: Add more models by deploying additional workers and updating router configuration + +## Testing the Integration + +Once both Dynamo and LLM Router are deployed, test the complete integration: + +```bash +# Test LLM Router with task-based routing +curl -X POST http://localhost:8084/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { + "role": "user", + "content": "Write a Python 
function to calculate fibonacci numbers"
      }
    ],
    "model": "",
    "nim-llm-router": {
      "policy": "task_router",
      "routing_strategy": "triton",
      "model": ""
    }
  }' | jq

# Test with complexity-based routing
curl -X POST http://localhost:8084/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Explain quantum computing in simple terms"
      }
    ],
    "model": "",
    "nim-llm-router": {
      "policy": "complexity_router",
      "routing_strategy": "triton",
      "model": ""
    }
  }' | jq

# Monitor routing decisions in LLM Router logs
kubectl logs -f deployment/llm-router-router-controller -n llm-router

# Monitor Dynamo inference logs (shared frontend deployed earlier in this guide)
kubectl logs -f deployment/frontend -n ${NAMESPACE}
```

## Troubleshooting

If you encounter issues, the most common causes are:

1. **Missing Prerequisites**: Ensure all environment variables are set correctly
2. **Insufficient Resources**: Verify your cluster has enough GPU and memory resources
3. **Network Issues**: Check that services can communicate across namespaces

### Quick Health Check

```bash
# Verify all components are running
kubectl get pods -n ${NAMESPACE}
kubectl get pods -n llm-router

# If something isn't working, check the logs
kubectl logs -f <pod-name> -n <namespace>
```

For detailed debugging, refer to the Kubernetes documentation or the specific component's logs.

## Cleanup

```bash
# Remove LLM Router
helm uninstall llm-router -n llm-router
kubectl delete namespace llm-router

# Remove all model deployments (use the same files you deployed with)
# If you used agg.yaml:
# kubectl delete -f agg.yaml -n ${NAMESPACE}
# If you used disagg.yaml:
# kubectl delete -f disagg.yaml -n ${NAMESPACE}
# Remove shared frontend
kubectl delete -f frontend.yaml -n ${NAMESPACE}

# Remove Hugging Face token secret
kubectl delete secret hf-token-secret -n ${NAMESPACE}

# Remove Dynamo Cloud Platform (if desired)
helm uninstall dynamo-platform -n ${NAMESPACE}
helm uninstall dynamo-crds -n default
kubectl delete namespace ${NAMESPACE}

# Stop supporting services (if used)
docker compose -f deploy/metrics/docker-compose.yml down
```

## Files in This Directory

- **`README.md`** - This comprehensive deployment guide
- **`frontend.yaml`** - Shared OpenAI-compatible API frontend service configuration
- **`agg.yaml`** - Aggregated worker configuration (frontend extracted to frontend.yaml)
- **`disagg.yaml`** - Disaggregated worker configuration with separate prefill/decode workers (frontend extracted to frontend.yaml)
- **`router-config-dynamo.yaml`** - Router configuration for Dynamo integration
- **`llm-router-values-override.yaml`** - Helm values override for LLM Router with Dynamo integration

## Resources

- [NVIDIA Dynamo Cloud Platform Documentation](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html)
- [NVIDIA Dynamo Kubernetes Operator](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_operator.html)
- [NVIDIA Dynamo GitHub Repository](https://github.com/ai-dynamo/dynamo)
- [LLM Router GitHub Repository](https://github.com/NVIDIA-AI-Blueprints/llm-router)
- [LLM Router Helm Chart](https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router)
- [Kubernetes Documentation](https://kubernetes.io/docs/)
- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html)
\ No newline at end of file
diff --git
a/customizations/LLM Router/agg.yaml b/customizations/LLM Router/agg.yaml new file mode 100644 index 0000000..1a6f775 --- /dev/null +++ b/customizations/LLM Router/agg.yaml @@ -0,0 +1,26 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-agg +spec: + services: + VllmDecodeWorker: + envFromSecret: hf-token-secret + dynamoNamespace: vllm-agg + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - python3 -m dynamo.vllm --model ${MODEL_NAME} diff --git a/customizations/LLM Router/disagg.yaml b/customizations/LLM Router/disagg.yaml new file mode 100644 index 0000000..0f7d0e9 --- /dev/null +++ b/customizations/LLM Router/disagg.yaml @@ -0,0 +1,43 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-disagg +spec: + services: + VllmDecodeWorker: + dynamoNamespace: vllm-disagg + envFromSecret: hf-token-secret + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - "python3 -m dynamo.vllm --model ${MODEL_NAME}" + VllmPrefillWorker: + dynamoNamespace: vllm-disagg + envFromSecret: hf-token-secret + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - "python3 -m dynamo.vllm --model ${MODEL_NAME} --is-prefill-worker" \ No newline at end of file diff --git a/customizations/LLM Router/frontend.yaml b/customizations/LLM Router/frontend.yaml new file mode 100644 index 0000000..5829c5c --- /dev/null +++ b/customizations/LLM Router/frontend.yaml @@ -0,0 +1,16 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-agg +spec: + services: + Frontend: + dynamoNamespace: vllm-agg + componentType: frontend + replicas: 1 + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml new file mode 100644 index 0000000..eaec97f --- /dev/null +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -0,0 +1,171 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# LLM Router Helm Values for NVIDIA Dynamo Cloud Platform Integration +# This configuration integrates the LLM Router with the official NVIDIA Dynamo deployment +# Based on: https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router +# +# Pure Kubernetes-Native Approach: Uses ConfigMaps for configuration +# +# IMPORTANT: Update the api_base URL to match your Dynamo deployment +# This should point to your actual Dynamo service endpoint + +# Dynamo Configuration - will be created as ConfigMap +dynamoConfig: + api_base: "http://frontend-service.${NAMESPACE}.svc.cluster.local:8000" + namespace: "${NAMESPACE}" + # For external Dynamo deployments, use: + # api_base: "https://your-dynamo-endpoint.com" + +# Router Controller Configuration +routerController: + replicaCount: 1 + image: + repository: /router-controller # Override with --set during deployment + tag: latest + pullPolicy: IfNotPresent + + service: + type: ClusterIP + port: 8084 + + # Configure environment variables from ConfigMap and Secrets + env: + - name: LOG_LEVEL + value: "INFO" + - name: ENABLE_METRICS + value: "true" + - name: DYNAMO_ENDPOINT + value: "{{ .Values.dynamoConfig.api_base }}" + - name: DYNAMO_NAMESPACE + value: "{{ .Values.dynamoConfig.namespace }}" + - name: DYNAMO_API_BASE + value: "{{ .Values.dynamoConfig.api_base }}" + - name: DYNAMO_API_KEY + valueFrom: + secretKeyRef: + name: dynamo-api-secret + key: DYNAMO_API_KEY + + resources: + requests: + cpu: 500m + memory: 1Gi + limits: + cpu: 1000m + memory: 2Gi + +# Router Server Configuration +routerServer: + replicaCount: 1 + image: + repository: /router-server # Override with --set during deployment + tag: latest + pullPolicy: IfNotPresent + + service: + type: ClusterIP + port: 8000 + + resources: + requests: + cpu: 1000m + memory: 2Gi + nvidia.com/gpu: 1 + limits: + cpu: 2000m + memory: 4Gi + nvidia.com/gpu: 1 + + nodeSelector: + nvidia.com/gpu.present: "true" + + tolerations: + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + +# Configuration Map for Router Policies +config: + # The router configuration will be mounted from router-config-dynamo.yaml + mountPath: /app/config + configMap: + name: router-config-dynamo + +# Ingress Configuration (disabled to avoid service name conflicts) +ingress: + enabled: false + className: "nginx" # Use your cluster's ingress class + annotations: + nginx.ingress.kubernetes.io/rewrite-target: / + nginx.ingress.kubernetes.io/ssl-redirect: "false" + nginx.ingress.kubernetes.io/proxy-body-size: "10m" + nginx.ingress.kubernetes.io/proxy-read-timeout: "300" + nginx.ingress.kubernetes.io/proxy-send-timeout: "300" + hosts: + - host: llm-router.local # CHANGE THIS: Replace with your actual domain + paths: + - path: / + pathType: Prefix + backend: + service: + name: llm-router-router-controller # Match actual service name created by Helm + port: + number: 8084 # Router controller service port + tls: [] + # - secretName: llm-router-tls + # hosts: + # - llm-router.your-domain.com + +# Monitoring +monitoring: + prometheus: + enabled: true + port: 9090 + grafana: + enabled: false # Use existing 
monitoring stack if available + +# Security +serviceAccount: + create: true + annotations: {} + name: "" + +podSecurityContext: + fsGroup: 2000 + +securityContext: + capabilities: + drop: + - ALL + readOnlyRootFilesystem: false # Router may need to write temporary files + runAsNonRoot: true + runAsUser: 1000 + +# Image Pull Secrets (if needed for private registries) +imagePullSecrets: + - name: nvcr-secret + +# Cross-namespace service access +rbac: + create: true + rules: + - apiGroups: [""] + resources: ["services", "endpoints"] + verbs: ["get", "list", "watch"] + - apiGroups: [""] + resources: ["services"] + resourceNames: ["dynamo-llm-service"] + verbs: ["get"] \ No newline at end of file diff --git a/customizations/LLM Router/router-config-dynamo.yaml b/customizations/LLM Router/router-config-dynamo.yaml new file mode 100644 index 0000000..fa50641 --- /dev/null +++ b/customizations/LLM Router/router-config-dynamo.yaml @@ -0,0 +1,139 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# LLM Router Configuration for NVIDIA Dynamo Integration +# This configuration routes requests to the official NVIDIA Dynamo Cloud Platform +# deployment using the proper service endpoints +# +# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html +# API Key pattern follows: https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml +# +# IMPORTANT: This config only references the 3 models we actually deploy: +# - meta-llama/Llama-3.1-8B-Instruct (Fast model for simple tasks) +# - meta-llama/Llama-3.1-70B-Instruct (Powerful model for complex tasks) +# - mistralai/Mixtral-8x22B-Instruct-v0.1 (Creative model for conversational tasks) +# +# To add more models: +# 1. Deploy the model using the pattern in Step 2 of README.md +# 2. 
Add router entries below following the same format +# +# NOTE: Environment variables are resolved at runtime: +# - ${DYNAMO_API_BASE}: Points to the Dynamo service endpoint +# - ${DYNAMO_API_KEY}: API key for authenticating with Dynamo services +# +# These variables are populated from: +# - ConfigMap: DYNAMO_API_BASE (defined in llm-router-values-override.yaml) +# - Secret: DYNAMO_API_KEY (created during deployment setup) + +policies: + - name: "task_router" + url: http://router-server:8000/v2/models/task_router_ensemble/infer + llms: + # === DEPLOYED MODELS ONLY === + # We only use the 3 models we actually deploy + + # Simple tasks → Fast 8B model + - name: "Closed QA" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Classification + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Extraction + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Rewrite + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Summarization + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Unknown + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + + # Complex tasks → Powerful 70B model + - name: Brainstorming + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Code Generation" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Open QA" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-70B-Instruct + - name: Other + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + + # Creative/Conversational tasks → Mixtral model + - name: Chatbot + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + - name: "Text Generation" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + + - name: "complexity_router" + url: http://router-server:8000/v2/models/complexity_router_ensemble/infer + llms: + # === DEPLOYED MODELS ONLY === + # We only use the 3 models we actually deploy + + # Simple complexity → Fast 8B model + - name: "Contextual-Knowledge" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + - name: "No-Label-Reason" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Constraint + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-8B-Instruct + + # High complexity → Powerful 70B model + - name: Creativity + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-70B-Instruct + - name: Reasoning + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Few-Shot" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: meta-llama/Llama-3.1-70B-Instruct + + # Creative/Domain complexity → Mixtral model + - name: "Domain-Knowledge" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" + model: 
mistralai/Mixtral-8x22B-Instruct-v0.1 \ No newline at end of file diff --git a/customizations/README.md b/customizations/README.md index cec3047..d4e5989 100644 --- a/customizations/README.md +++ b/customizations/README.md @@ -1 +1,10 @@ -# Dynamo Customizations +# NVIDIA Dynamo Customizations + +This directory contains configuration files and deployment guides for integrating NVIDIA technologies. + +## Available Customizations + +### LLM Router +Integration with NVIDIA LLM Router for intelligent request routing and model selection. + +**Location**: [`LLM Router/`](LLM%20Router/)