From 40e361f54345b7a65fc10589f046d12c0445b1a4 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sun, 15 Jun 2025 20:28:55 +0000 Subject: [PATCH 01/17] Add configuration files for LLM Router integration with NVIDIA Dynamo - Introduced `dynamo-llm-config.yaml` for full production deployment with multiple models. - Added `dynamo-single-llm-config.yaml` for minimal testing with a single model. - Created `llm-router-values-override.yaml` for Helm deployment customization. - Added routing configuration files: `router-config.yaml` for full deployment and `router-config-single.yaml` for single model verification. - Updated `README.md` with comprehensive deployment instructions and configuration options. --- customizations/LLM Router/README.md | 400 ++++++++++++++++++ .../LLM Router/dynamo-llm-config.yaml | 63 +++ .../LLM Router/dynamo-single-llm-config.yaml | 17 + .../llm-router-values-override.yaml | 101 +++++ .../LLM Router/router-config-single.yaml | 84 ++++ customizations/LLM Router/router-config.yaml | 84 ++++ 6 files changed, 749 insertions(+) create mode 100644 customizations/LLM Router/README.md create mode 100644 customizations/LLM Router/dynamo-llm-config.yaml create mode 100644 customizations/LLM Router/dynamo-single-llm-config.yaml create mode 100644 customizations/LLM Router/llm-router-values-override.yaml create mode 100644 customizations/LLM Router/router-config-single.yaml create mode 100644 customizations/LLM Router/router-config.yaml diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md new file mode 100644 index 0000000..736a976 --- /dev/null +++ b/customizations/LLM Router/README.md @@ -0,0 +1,400 @@ +# LLM Router with NVIDIA Dynamo - Kubernetes Deployment Guide + +This guide provides step-by-step instructions for deploying [NVIDIA LLM Router](https://github.com/NVIDIA-AI-Blueprints/llm-router) with [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo/tree/main/deploy/cloud) on Kubernetes. + +## Overview + +The LLM Router intelligently routes LLM requests to the most appropriate model based on the task at hand. This deployment strategy will: + +- **Deploy Dynamo using Cloud Operator** - Use the official Dynamo cloud operator for robust, scalable deployment +- **Deploy LLMs via Dynamo Cloud** - Host multiple LLM models through Dynamo's cloud infrastructure +- **Configure LLM Router** - Point the router to different models hosted on Dynamo cloud + +## Quick Start + +Choose your deployment option using the provided configuration files: + +### Option 1: Minimal Verification (1 GPU) +```bash +# 1. Deploy Dynamo Cloud Operator +kubectl apply -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml + +# 2. Deploy single LLM via Dynamo +kubectl create namespace dynamo +kubectl apply -f dynamo-single-llm-config.yaml + +# 3. Clone LLM Router and deploy with Helm +git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git +cd llm-router + +# 4. Create ConfigMap and deploy router +kubectl create configmap router-config \ + --from-file=config.yaml=../router-config-single.yaml \ + -n llm-router + +helm upgrade --install llm-router deploy/helm/llm-router \ + -f ../llm-router-values-override.yaml \ + -n llm-router \ + --create-namespace \ + --wait --timeout=10m + +# 5. 
Test the integration
kubectl port-forward svc/router-controller 8084:8084 -n llm-router &
curl -X POST http://localhost:8084/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"","messages":[{"role":"user","content":"Hello!"}],"nim-llm-router":{"policy":"task_router"}}'
```

### Option 2: Full Production (32 GPUs)

**Model Routing Configuration:**

| **Task Router** | **Model** | **GPUs** | **Use Case** |
|-----------------|-----------|----------|--------------|
| Brainstorming | llama-3.1-70b-instruct | 4 | Creative ideation |
| Chatbot | mixtral-8x22b-instruct | 4 | Conversational AI |
| Code Generation | llama-3.1-nemotron-70b-instruct | 4 | Programming tasks |
| Summarization | phi-3-mini-128k-instruct | 1 | Text summarization |
| Text Generation | llama-3.2-11b-vision-instruct | 2 | General text creation |
| Open QA | llama-3.1-405b-instruct | 8 | Complex questions |
| Closed QA | llama-3.1-8b-instruct | 1 | Simple Q&A |
| Classification | phi-3-mini-4k-instruct | 1 | Text classification |
| Extraction | llama-3.1-8b-instruct | 1 | Information extraction |
| Rewrite | phi-3-medium-128k-instruct | 2 | Text rewriting |

| **Complexity Router** | **Model** | **GPUs** | **Use Case** |
|----------------------|-----------|----------|--------------|
| Creativity | llama-3.1-70b-instruct | 4 | Creative tasks |
| Reasoning | llama-3.3-nemotron-super-49b | 4 | Complex reasoning |
| Contextual-Knowledge | llama-3.1-405b-instruct | 8 | Knowledge-intensive |
| Few-Shot | llama-3.1-70b-instruct | 4 | Few-shot learning |
| Domain-Knowledge | llama-3.1-nemotron-70b-instruct | 4 | Specialized domains |
| No-Label-Reason | llama-3.1-8b-instruct | 1 | Simple reasoning |
| Constraint | phi-3-medium-128k-instruct | 2 | Constrained tasks |

**Total: 32 GPUs across 10 different models + 1 GPU for Router Server = 33 GPUs**

> **💡 Customization Note:** These model assignments can be changed to suit your specific needs. You can modify the `router-config.yaml` file to:
> - Assign different models to different tasks
> - Add or remove routing categories
> - Adjust GPU allocation based on your hardware
> - Use different model variants or sizes
>
> The routing configuration is fully customizable based on your use case and available resources.

```bash
# 1. Deploy Dynamo Cloud Operator
kubectl apply -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml

# 2. Deploy full LLM cluster via Dynamo
kubectl create namespace dynamo
kubectl apply -f dynamo-llm-config.yaml

# 3. Clone LLM Router and deploy with Helm
git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git
cd llm-router

# 4. Create the namespace and ConfigMap, then deploy the router
kubectl create namespace llm-router
kubectl create configmap router-config \
  --from-file=config.yaml=../router-config.yaml \
  -n llm-router

helm upgrade --install llm-router deploy/helm/llm-router \
  -f ../llm-router-values-override.yaml \
  -n llm-router \
  --create-namespace \
  --wait --timeout=10m

# 5. 
Test the integration +kubectl port-forward svc/router-controller 8084:8084 -n llm-router & +curl -X POST http://localhost:8084/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"model":"","messages":[{"role":"user","content":"Write a Python function"}],"nim-llm-router":{"policy":"task_router"}}' +``` + +**All configuration files are provided in this directory - no manual file creation needed!** + +## Prerequisites + +- **Kubernetes cluster** (1.20+) with kubectl configured +- **Helm 3.x** for managing deployments +- **Dynamo Cloud Operator** access and credentials +- **NVIDIA GPU nodes** (with GPU Operator installed) +- **NVIDIA API keys** for model access through Dynamo cloud + +## Step 1: Deploy NVIDIA Dynamo using Cloud Operator + +### 1.1 Deploy Dynamo Cloud Operator + +Deploy Dynamo using the official cloud operator for production-ready, scalable infrastructure: + +```bash +# Install Dynamo Cloud Operator +kubectl apply -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml + +# Verify operator installation +kubectl get pods -n dynamo-system +``` + +### 1.2 Deploy LLMs via Dynamo Cloud + +Choose from the provided configuration files based on your GPU availability: + +```bash +# For full production deployment (32 GPUs) - use provided file: +kubectl apply -f dynamo-llm-config.yaml + +# OR for minimal testing (1 GPU) - use provided file: +kubectl apply -f dynamo-single-llm-config.yaml +``` + +**🖥️ GPU Requirements Summary:** +- **Full Configuration**: 32 GPUs for models + 1 GPU for Router Server = **33 GPUs total** +- **Minimal Configuration**: 1 GPU for model + 1 GPU for Router Server = **2 GPUs total** +- **Router Server**: Always requires 1 GPU for routing decisions and model orchestration +- **Recommended for Production**: All models for comprehensive routing + +### 1.3 Configuration Options + +**Configuration Comparison:** + +| Configuration | GPUs | Models | Use Case | File | +|--------------|------|--------|----------|------| +| **Minimal** | 2 (1 model + 1 router) | 1 model | Development, Testing, Verification | `dynamo-single-llm-config.yaml` | +| **Full** | 33 (32 models + 1 router) | 10 comprehensive models | Enterprise production, Maximum routing intelligence | `dynamo-llm-config.yaml` | + +Both configuration files are provided in this repository - no manual creation needed! + +### 1.4 Verify Dynamo Cloud Deployment + +```bash +# Check Dynamo cluster status +kubectl get dynamoclusters -n dynamo + +# Verify all models are deployed and ready +kubectl get pods -n dynamo -l app=dynamo-model + +# Check model endpoints +kubectl get svc -n dynamo + +# The LLM Router will connect to models through: +# http://llm-cluster.dynamo.svc.cluster.local:8080 +``` + +## Step 2: Deploy LLM Router with Official Helm Chart + +### 2.1 Download LLM Router + +```bash +# Clone the official LLM Router repository +git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git +cd llm-router +``` + +### 2.2 Use Provided Helm Values Override + +The repository includes a pre-configured Helm values override file (`llm-router-values-override.yaml`) that configures the LLM Router to work with Dynamo. This file includes: + +- Router Controller and Server configuration +- Resource allocation (CPU, memory, GPU) +- Security contexts and service accounts +- Monitoring configuration +- Integration with Dynamo endpoints + +No additional configuration needed - the file is ready to use. 
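
For reference, the settings that matter most for the Dynamo integration are the controller's Dynamo endpoint and the ConfigMap mount. The excerpt below is taken from the provided `llm-router-values-override.yaml`; as shipped it targets the single-model cluster, and for the full deployment you would point `DYNAMO_ENDPOINT` at the `llm-cluster` service instead:

```yaml
routerController:
  env:
    - name: DYNAMO_ENDPOINT
      value: "http://llm-cluster-single.dynamo.svc.cluster.local:8080"

config:
  mountPath: /app/config
  configMap:
    name: router-config   # must match the ConfigMap name created in step 2.4
```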
+ +### 2.3 Use Provided Router Configuration + +The repository includes pre-configured router configuration files: + +- **`router-config-single.yaml`** - For single LLM verification (all routes point to one model) +- **`router-config.yaml`** - For full production deployment with multiple models and intelligent routing + +Both files are ready to use - no manual creation needed! + +### 2.4 Deploy LLM Router with Helm + +```bash +# For full production deployment: +kubectl create configmap router-config \ + --from-file=config.yaml=router-config.yaml \ + -n llm-router + +# For single LLM verification: +kubectl create configmap router-config \ + --from-file=config.yaml=router-config-single.yaml \ + -n llm-router + +# Deploy LLM Router using Helm with the provided values override +helm upgrade --install llm-router deploy/helm/llm-router \ + -f ../llm-router-values-override.yaml \ + -n llm-router \ + --create-namespace \ + --wait --timeout=10m + +# Verify deployment +kubectl get pods -n llm-router +kubectl get svc -n llm-router +``` + +## Step 3: Verification and Testing + +### 3.1 Verify Dynamo LLM Endpoints + +```bash +# Check Dynamo cluster status +kubectl get dynamoclusters -n dynamo + +# Verify all models are deployed and ready +kubectl get pods -n dynamo -l app=dynamo-model + +# Test direct LLM endpoint +kubectl port-forward svc/llm-cluster 8080:8080 -n dynamo & + +# Test the LLM endpoint (in another terminal) +curl -X POST http://localhost:8080/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "llama-3.1-8b-instruct", + "messages": [ + { + "role": "user", + "content": "Hello, how are you?" + } + ], + "max_tokens": 100 + }' +``` + +### 3.2 Test LLM Router Integration + +```bash +# Port forward LLM Router controller +kubectl port-forward svc/router-controller 8084:8084 -n llm-router & + +# Test task-based routing +curl -X 'POST' \ + 'http://localhost:8084/v1/chat/completions' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "", + "messages": [ + { + "role": "user", + "content": "Write a Python function to calculate factorial" + } + ], + "max_tokens": 512, + "stream": false, + "nim-llm-router": { + "policy": "task_router", + "routing_strategy": "triton", + "model": "" + } + }' + +# Test complexity-based routing +curl -X 'POST' \ + 'http://localhost:8084/v1/chat/completions' \ + -H 'accept: application/json' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "", + "messages": [ + { + "role": "user", + "content": "Explain quantum computing in simple terms" + } + ], + "max_tokens": 512, + "stream": false, + "nim-llm-router": { + "policy": "complexity_router", + "routing_strategy": "triton", + "model": "" + } + }' +``` + +### 3.3 Single LLM Verification Setup + +For testing with a minimal setup, use the provided configuration files: + +```bash +# Deploy single LLM using provided configuration +kubectl apply -f dynamo-single-llm-config.yaml + +# Use the provided single LLM router configuration +kubectl create configmap router-config \ + --from-file=config.yaml=router-config-single.yaml \ + -n llm-router \ + --dry-run=client -o yaml | kubectl apply -f - + +# Restart router pods to pick up new configuration +kubectl rollout restart deployment/router-controller -n llm-router +``` + +The provided `router-config-single.yaml` file configures all routing policies to point to the single LLM, making it perfect for verification that the routing mechanism works correctly. + +## Troubleshooting + +### Common Issues + +1. 
**Pods not starting**: Check GPU node availability and resource requests +2. **Service communication**: Verify service discovery and DNS resolution +3. **Model loading**: Verify NVIDIA API connectivity and quotas + +### Debugging Commands + +```bash +# Check pod logs +kubectl logs -f deployment/router-controller -n llm-router +kubectl logs -f deployment/router-server -n llm-router +kubectl logs -f deployment/dynamo-orchestrator -n dynamo + +# Check events +kubectl get events -n llm-router --sort-by=.metadata.creationTimestamp +kubectl get events -n dynamo --sort-by=.metadata.creationTimestamp + +# Check resource usage +kubectl top pods -n llm-router +kubectl top pods -n dynamo + +# Debug networking +kubectl exec -it deployment/router-controller -n llm-router -- nslookup router-server +``` + +## Cleanup + +To remove the deployment: + +```bash +# Remove LLM Router +helm uninstall llm-router -n llm-router +kubectl delete namespace llm-router + +# Remove Dynamo LLMs +kubectl delete dynamoclusters --all -n dynamo +kubectl delete namespace dynamo + +# Remove Dynamo Operator (optional) +kubectl delete -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml +``` + +## Files in This Directory + +- **`README.md`** - This comprehensive deployment guide +- **`llm-router-values-override.yaml`** - Helm values override for LLM Router +- **`router-config-single.yaml`** - Router configuration for single LLM verification (1 GPU) +- **`router-config.yaml`** - Router configuration for full production deployment (32 GPUs) +- **`dynamo-single-llm-config.yaml`** - Minimal Dynamo configuration for testing (1 GPU) +- **`dynamo-llm-config.yaml`** - Full Dynamo configuration for production (32 GPUs) + +## Resources + +- [LLM Router GitHub Repository](https://github.com/NVIDIA-AI-Blueprints/llm-router) +- [LLM Router Helm Chart](https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router) +- [NVIDIA Dynamo Cloud Deployment](https://github.com/ai-dynamo/dynamo/tree/main/deploy/cloud) +- [Kubernetes Documentation](https://kubernetes.io/docs/) +- [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) \ No newline at end of file diff --git a/customizations/LLM Router/dynamo-llm-config.yaml b/customizations/LLM Router/dynamo-llm-config.yaml new file mode 100644 index 0000000..d3ad52b --- /dev/null +++ b/customizations/LLM Router/dynamo-llm-config.yaml @@ -0,0 +1,63 @@ +apiVersion: dynamo.ai/v1 +kind: DynamoCluster +metadata: + name: llm-cluster + namespace: dynamo +spec: + models: + # Task Router Models + - name: llama-3.1-70b-instruct + source: nvidia/Llama-3.1-70B-Instruct + replicas: 1 + resources: + gpu: 4 + - name: mixtral-8x22b-instruct + source: mistralai/Mixtral-8x22B-Instruct-v0.1 + replicas: 1 + resources: + gpu: 4 + - name: llama-3.1-nemotron-70b-instruct + source: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF + replicas: 1 + resources: + gpu: 4 + - name: phi-3-mini-128k-instruct + source: microsoft/Phi-3-mini-128k-instruct + replicas: 1 + resources: + gpu: 1 + - name: llama-3.2-11b-vision-instruct + source: meta-llama/Llama-3.2-11B-Vision-Instruct + replicas: 1 + resources: + gpu: 2 + - name: llama-3.1-405b-instruct + source: meta-llama/Llama-3.1-405B-Instruct + replicas: 1 + resources: + gpu: 8 + - name: llama-3.1-8b-instruct + source: meta-llama/Llama-3.1-8B-Instruct + replicas: 2 + resources: + gpu: 1 + - name: phi-3-mini-4k-instruct + source: microsoft/Phi-3-mini-4k-instruct + replicas: 1 + resources: + gpu: 1 + - 
name: phi-3-medium-128k-instruct + source: microsoft/Phi-3-medium-128k-instruct + replicas: 1 + resources: + gpu: 2 + # Complexity Router Models + - name: llama-3.3-nemotron-super-49b + source: nvidia/Llama-3.3-Nemotron-Super-49B + replicas: 1 + resources: + gpu: 4 + cloudConfig: + provider: nvidia + region: us-west-2 + endpoint: https://api.dynamo.ai \ No newline at end of file diff --git a/customizations/LLM Router/dynamo-single-llm-config.yaml b/customizations/LLM Router/dynamo-single-llm-config.yaml new file mode 100644 index 0000000..1e4293e --- /dev/null +++ b/customizations/LLM Router/dynamo-single-llm-config.yaml @@ -0,0 +1,17 @@ +apiVersion: dynamo.ai/v1 +kind: DynamoCluster +metadata: + name: llm-cluster-single + namespace: dynamo +spec: + models: + # Single model for verification + - name: llama-3.1-8b-instruct + source: meta-llama/Llama-3.1-8B-Instruct + replicas: 1 + resources: + gpu: 1 + cloudConfig: + provider: nvidia + region: us-west-2 + endpoint: https://api.dynamo.ai \ No newline at end of file diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml new file mode 100644 index 0000000..491c5fd --- /dev/null +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -0,0 +1,101 @@ +# LLM Router Helm Values Override for Dynamo Integration +# Based on: https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router + +# Router Controller Configuration +routerController: + replicaCount: 1 + image: + repository: nvcr.io/nvidia/router-controller + tag: latest + pullPolicy: IfNotPresent + + service: + type: ClusterIP + port: 8084 + + # Configure to route to Dynamo instead of external APIs + env: + - name: LOG_LEVEL + value: "INFO" + - name: ENABLE_METRICS + value: "true" + - name: DYNAMO_ENDPOINT + value: "http://llm-cluster-single.dynamo.svc.cluster.local:8080" + + resources: + requests: + cpu: 500m + memory: 1Gi + limits: + cpu: 1000m + memory: 2Gi + +# Router Server Configuration +routerServer: + replicaCount: 1 + image: + repository: nvcr.io/nvidia/router-server + tag: latest + pullPolicy: IfNotPresent + + service: + type: ClusterIP + port: 8000 + + resources: + requests: + cpu: 1000m + memory: 2Gi + nvidia.com/gpu: 1 + limits: + cpu: 2000m + memory: 4Gi + nvidia.com/gpu: 1 + + nodeSelector: + nvidia.com/gpu.present: "true" + + tolerations: + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + +# Configuration Map +config: + # The router configuration will be mounted from our custom config.yaml + mountPath: /app/config + configMap: + name: router-config + +# Ingress Configuration (disabled for verification) +ingress: + enabled: false + +# Monitoring +monitoring: + prometheus: + enabled: true + port: 9090 + grafana: + enabled: false # Use existing monitoring stack if available + +# Security +serviceAccount: + create: true + annotations: {} + name: "" + +podSecurityContext: + fsGroup: 2000 + +securityContext: + capabilities: + drop: + - ALL + readOnlyRootFilesystem: false # Router may need to write temporary files + runAsNonRoot: true + runAsUser: 1000 + +# Image Pull Secrets (if needed for private registries) +imagePullSecrets: [] + # - name: nvidia-registry-secret \ No newline at end of file diff --git a/customizations/LLM Router/router-config-single.yaml b/customizations/LLM Router/router-config-single.yaml new file mode 100644 index 0000000..6b39a9b --- /dev/null +++ b/customizations/LLM Router/router-config-single.yaml @@ -0,0 +1,84 @@ +policies: + - name: 
"task_router" + url: http://router-server:8000/v2/models/task_router_ensemble/infer + llms: + - name: Brainstorming + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Chatbot + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "Code Generation" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Summarization + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "Text Generation" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "Open QA" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "Closed QA" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Classification + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Extraction + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Rewrite + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Other + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Unknown + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + + - name: "complexity_router" + url: http://router-server:8000/v2/models/complexity_router_ensemble/infer + llms: + - name: Creativity + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Reasoning + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "Contextual-Knowledge" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "Few-Shot" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "Domain-Knowledge" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: "No-Label-Reason" + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Constraint + api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct \ No newline at end of file diff --git a/customizations/LLM Router/router-config.yaml b/customizations/LLM Router/router-config.yaml new file mode 100644 index 0000000..ec59c86 --- /dev/null +++ b/customizations/LLM Router/router-config.yaml @@ -0,0 +1,84 @@ +policies: + - name: "task_router" + url: http://router-server:8000/v2/models/task_router_ensemble/infer + llms: + - name: Brainstorming + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-70b-instruct + - name: Chatbot + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: mixtral-8x22b-instruct + - name: "Code Generation" + api_base: 
http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-nemotron-70b-instruct + - name: Summarization + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: phi-3-mini-128k-instruct + - name: "Text Generation" + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.2-11b-vision-instruct + - name: "Open QA" + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-405b-instruct + - name: "Closed QA" + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Classification + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: phi-3-mini-4k-instruct + - name: Extraction + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Rewrite + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: phi-3-medium-128k-instruct + - name: Other + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-70b-instruct + - name: Unknown + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + + - name: "complexity_router" + url: http://router-server:8000/v2/models/complexity_router_ensemble/infer + llms: + - name: Creativity + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-70b-instruct + - name: Reasoning + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.3-nemotron-super-49b + - name: "Contextual-Knowledge" + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-405b-instruct + - name: "Few-Shot" + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-70b-instruct + - name: "Domain-Knowledge" + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-nemotron-70b-instruct + - name: "No-Label-Reason" + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: llama-3.1-8b-instruct + - name: Constraint + api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_key: "" + model: phi-3-medium-128k-instruct \ No newline at end of file From f4f05aa6b22f3a2668f5997a8172f7bc8683f441 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sun, 15 Jun 2025 20:36:47 +0000 Subject: [PATCH 02/17] Add SPDX license headers to LLM Router configuration files - Added license information to `dynamo-llm-config.yaml`, `dynamo-single-llm-config.yaml`, `llm-router-values-override.yaml`, `router-config-single.yaml`, and `router-config.yaml`. - Ensured compliance with Apache License 2.0 for all relevant configuration files. 
--- customizations/LLM Router/dynamo-llm-config.yaml | 15 +++++++++++++++ .../LLM Router/dynamo-single-llm-config.yaml | 15 +++++++++++++++ .../LLM Router/llm-router-values-override.yaml | 15 +++++++++++++++ .../LLM Router/router-config-single.yaml | 15 +++++++++++++++ customizations/LLM Router/router-config.yaml | 15 +++++++++++++++ 5 files changed, 75 insertions(+) diff --git a/customizations/LLM Router/dynamo-llm-config.yaml b/customizations/LLM Router/dynamo-llm-config.yaml index d3ad52b..519d65a 100644 --- a/customizations/LLM Router/dynamo-llm-config.yaml +++ b/customizations/LLM Router/dynamo-llm-config.yaml @@ -1,3 +1,18 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + apiVersion: dynamo.ai/v1 kind: DynamoCluster metadata: diff --git a/customizations/LLM Router/dynamo-single-llm-config.yaml b/customizations/LLM Router/dynamo-single-llm-config.yaml index 1e4293e..f053ba4 100644 --- a/customizations/LLM Router/dynamo-single-llm-config.yaml +++ b/customizations/LLM Router/dynamo-single-llm-config.yaml @@ -1,3 +1,18 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + apiVersion: dynamo.ai/v1 kind: DynamoCluster metadata: diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml index 491c5fd..4a7c80d 100644 --- a/customizations/LLM Router/llm-router-values-override.yaml +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -1,3 +1,18 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ # LLM Router Helm Values Override for Dynamo Integration # Based on: https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router diff --git a/customizations/LLM Router/router-config-single.yaml b/customizations/LLM Router/router-config-single.yaml index 6b39a9b..0be913e 100644 --- a/customizations/LLM Router/router-config-single.yaml +++ b/customizations/LLM Router/router-config-single.yaml @@ -1,3 +1,18 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + policies: - name: "task_router" url: http://router-server:8000/v2/models/task_router_ensemble/infer diff --git a/customizations/LLM Router/router-config.yaml b/customizations/LLM Router/router-config.yaml index ec59c86..6ebe157 100644 --- a/customizations/LLM Router/router-config.yaml +++ b/customizations/LLM Router/router-config.yaml @@ -1,3 +1,18 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + policies: - name: "task_router" url: http://router-server:8000/v2/models/task_router_ensemble/infer From 2c67f7f12b1bab18b56c55b366999e693c859a4b Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sun, 15 Jun 2025 21:05:10 +0000 Subject: [PATCH 03/17] Implement NVIDIA Dynamo Cloud Platform integration with LLM Router - Added `deploy-dynamo-integration.sh` script for automated deployment of NVIDIA Dynamo Cloud Platform and LLM Router. - Introduced `dynamo-cloud-deployment.yaml` for configuring the Dynamo deployment. - Created `dynamo-llm-deployment.yaml` for deploying LLM inference graphs. - Added `router-config-dynamo.yaml` for routing configuration specific to Dynamo integration. - Removed outdated configuration files: `dynamo-single-llm-config.yaml` and `router-config-single.yaml`. - Updated `README.md` to reflect new deployment instructions and configuration details. 
--- customizations/LLM Router/README.md | 485 ++++++++---------- .../LLM Router/deploy-dynamo-integration.sh | 347 +++++++++++++ .../LLM Router/dynamo-cloud-deployment.yaml | 57 ++ .../LLM Router/dynamo-llm-config.yaml | 78 --- .../LLM Router/dynamo-llm-deployment.yaml | 147 ++++++ .../LLM Router/dynamo-single-llm-config.yaml | 32 -- .../llm-router-values-override.yaml | 31 +- ...-config.yaml => router-config-dynamo.yaml} | 44 +- .../LLM Router/router-config-single.yaml | 99 ---- 9 files changed, 803 insertions(+), 517 deletions(-) create mode 100755 customizations/LLM Router/deploy-dynamo-integration.sh create mode 100644 customizations/LLM Router/dynamo-cloud-deployment.yaml delete mode 100644 customizations/LLM Router/dynamo-llm-config.yaml create mode 100644 customizations/LLM Router/dynamo-llm-deployment.yaml delete mode 100644 customizations/LLM Router/dynamo-single-llm-config.yaml rename customizations/LLM Router/{router-config.yaml => router-config-dynamo.yaml} (58%) delete mode 100644 customizations/LLM Router/router-config-single.yaml diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index 736a976..e97efa0 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -1,400 +1,323 @@ -# LLM Router with NVIDIA Dynamo - Kubernetes Deployment Guide +# LLM Router with NVIDIA Dynamo Cloud Platform - Kubernetes Deployment Guide -This guide provides step-by-step instructions for deploying [NVIDIA LLM Router](https://github.com/NVIDIA-AI-Blueprints/llm-router) with [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo/tree/main/deploy/cloud) on Kubernetes. +This guide provides step-by-step instructions for deploying [NVIDIA LLM Router](https://github.com/NVIDIA-AI-Blueprints/llm-router) with the official [NVIDIA Dynamo Cloud Platform](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) on Kubernetes. ## Overview -The LLM Router intelligently routes LLM requests to the most appropriate model based on the task at hand. This deployment strategy will: +This integration demonstrates how to deploy the official NVIDIA Dynamo Cloud Platform for distributed LLM inference and route requests intelligently using the NVIDIA LLM Router. The setup includes: -- **Deploy Dynamo using Cloud Operator** - Use the official Dynamo cloud operator for robust, scalable deployment -- **Deploy LLMs via Dynamo Cloud** - Host multiple LLM models through Dynamo's cloud infrastructure -- **Configure LLM Router** - Point the router to different models hosted on Dynamo cloud +1. **NVIDIA Dynamo Cloud Platform**: Official distributed inference serving framework with disaggregated serving capabilities +2. **LLM Router**: Intelligent request routing based on task complexity and type +3. **Multiple LLM Models**: Various models deployed via Dynamo's inference graphs -## Quick Start +### Architecture -Choose your deployment option using the provided configuration files: +The integration consists of: -### Option 1: Minimal Verification (1 GPU) -```bash -# 1. Deploy Dynamo Cloud Operator -kubectl apply -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml +- **NVIDIA Dynamo Cloud Platform**: Official distributed inference serving framework +- **LLM Router**: Routes requests to appropriate models based on task complexity and type +- **Multiple LLM Models**: Various models deployed via Dynamo's disaggregated serving -# 2. 
Deploy single LLM via Dynamo -kubectl create namespace dynamo -kubectl apply -f dynamo-single-llm-config.yaml +### Key Components -# 3. Clone LLM Router and deploy with Helm -git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git -cd llm-router +- **dynamo-cloud-deployment.yaml**: Configuration for Dynamo Cloud Platform deployment +- **dynamo-llm-deployment.yaml**: DynamoGraphDeployment for multi-LLM inference +- **router-config-dynamo.yaml**: Router policies for Dynamo integration +- **llm-router-values-dynamo.yaml**: Helm values for LLM Router with Dynamo +- **deploy-dynamo-integration.sh**: Automated deployment script -# 4. Create ConfigMap and deploy router -kubectl create configmap router-config \ - --from-file=config.yaml=../router-config-single.yaml \ - -n llm-router - -helm upgrade --install llm-router deploy/helm/llm-router \ - -f ../llm-router-values-override.yaml \ - -n llm-router \ - --create-namespace \ - --wait --timeout=10m - -# 5. Test the integration -kubectl port-forward svc/router-controller 8084:8084 -n llm-router & -curl -X POST http://localhost:8084/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"","messages":[{"role":"user","content":"Hello!"}],"nim-llm-router":{"policy":"task_router"}}' -``` +## Quick Start -### Option 2: Full Production (32 GPUs) - -**Model Routing Configuration:** - -| **Task Router** | **Model** | **GPUs** | **Use Case** | -|-----------------|-----------|----------|--------------| -| Brainstorming | llama-3.1-70b-instruct | 4 | Creative ideation | -| Chatbot | mixtral-8x22b-instruct | 4 | Conversational AI | -| Code Generation | llama-3.1-nemotron-70b-instruct | 4 | Programming tasks | -| Summarization | phi-3-mini-128k-instruct | 1 | Text summarization | -| Text Generation | llama-3.2-11b-vision-instruct | 2 | General text creation | -| Open QA | llama-3.1-405b-instruct | 8 | Complex questions | -| Closed QA | llama-3.1-8b-instruct | 1 | Simple Q&A | -| Classification | phi-3-mini-4k-instruct | 1 | Text classification | -| Extraction | llama-3.1-8b-instruct | 1 | Information extraction | -| Rewrite | phi-3-medium-128k-instruct | 2 | Text rewriting | - -| **Complexity Router** | **Model** | **GPUs** | **Use Case** | -|----------------------|-----------|----------|--------------| -| Creativity | llama-3.1-70b-instruct | 4 | Creative tasks | -| Reasoning | llama-3.3-nemotron-super-49b | 4 | Complex reasoning | -| Contextual-Knowledge | llama-3.1-405b-instruct | 8 | Knowledge-intensive | -| Few-Shot | llama-3.1-70b-instruct | 4 | Few-shot learning | -| Domain-Knowledge | llama-3.1-nemotron-70b-instruct | 4 | Specialized domains | -| No-Label-Reason | llama-3.1-8b-instruct | 1 | Simple reasoning | -| Constraint | phi-3-medium-128k-instruct | 2 | Constrained tasks | - -**Total: 32 GPUs across 10 different models + 1 GPU for Router Server = 33 GPUs** - -> **💡 Customization Note:** These model assignments can be changed to suit your specific needs. You can modify the `router-config.yaml` file to: -> - Assign different models to different tasks -> - Add or remove routing categories -> - Adjust GPU allocation based on your hardware -> - Use different model variants or sizes -> -> The routing configuration is fully customizable based on your use case and available resources. +### Automated Deployment (Recommended) ```bash -# 1. 
Deploy Dynamo Cloud Operator -kubectl apply -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml +# Make the script executable +chmod +x deploy-dynamo-integration.sh -# 2. Deploy full LLM cluster via Dynamo -kubectl create namespace dynamo -kubectl apply -f dynamo-llm-config.yaml +# Deploy everything (Dynamo + LLM Router) +./deploy-dynamo-integration.sh -# 3. Clone LLM Router and deploy with Helm -git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git -cd llm-router - -# 4. Create ConfigMap and deploy router -kubectl create configmap router-config \ - --from-file=config.yaml=../router-config.yaml \ - -n llm-router - -helm upgrade --install llm-router deploy/helm/llm-router \ - -f ../llm-router-values-override.yaml \ - -n llm-router \ - --create-namespace \ - --wait --timeout=10m - -# 5. Test the integration -kubectl port-forward svc/router-controller 8084:8084 -n llm-router & -curl -X POST http://localhost:8084/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model":"","messages":[{"role":"user","content":"Write a Python function"}],"nim-llm-router":{"policy":"task_router"}}' +# Or deploy components separately: +./deploy-dynamo-integration.sh --dynamo-only # Deploy only Dynamo +./deploy-dynamo-integration.sh --router-only # Deploy only LLM Router +./deploy-dynamo-integration.sh --verify-only # Verify existing deployment ``` -**All configuration files are provided in this directory - no manual file creation needed!** - -## Prerequisites - -- **Kubernetes cluster** (1.20+) with kubectl configured -- **Helm 3.x** for managing deployments -- **Dynamo Cloud Operator** access and credentials -- **NVIDIA GPU nodes** (with GPU Operator installed) -- **NVIDIA API keys** for model access through Dynamo cloud - -## Step 1: Deploy NVIDIA Dynamo using Cloud Operator +### Manual Deployment -### 1.1 Deploy Dynamo Cloud Operator +If you prefer manual deployment or need to customize the process: -Deploy Dynamo using the official cloud operator for production-ready, scalable infrastructure: +#### Step 1: Deploy NVIDIA Dynamo Cloud Platform ```bash -# Install Dynamo Cloud Operator -kubectl apply -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml - -# Verify operator installation -kubectl get pods -n dynamo-system +# 1. Clone Dynamo repository +git clone https://github.com/ai-dynamo/dynamo.git +cd dynamo + +# 2. Configure environment (edit dynamo-cloud-deployment.yaml first) +export DOCKER_SERVER=nvcr.io/your-org +export IMAGE_TAG=latest +export NAMESPACE=dynamo-cloud +export DOCKER_USERNAME=your-username +export DOCKER_PASSWORD=your-password + +# 3. Build and push Dynamo components +earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG + +# 4. Deploy the platform +kubectl create namespace $NAMESPACE +kubectl config set-context --current --namespace=$NAMESPACE +cd deploy/cloud/helm +./deploy.sh --crds + +# 5. 
Deploy LLM inference graph +kubectl apply -f ../../../dynamo-llm-deployment.yaml ``` -### 1.2 Deploy LLMs via Dynamo Cloud - -Choose from the provided configuration files based on your GPU availability: +#### Step 2: Deploy LLM Router ```bash -# For full production deployment (32 GPUs) - use provided file: -kubectl apply -f dynamo-llm-config.yaml - -# OR for minimal testing (1 GPU) - use provided file: -kubectl apply -f dynamo-single-llm-config.yaml -``` - -**🖥️ GPU Requirements Summary:** -- **Full Configuration**: 32 GPUs for models + 1 GPU for Router Server = **33 GPUs total** -- **Minimal Configuration**: 1 GPU for model + 1 GPU for Router Server = **2 GPUs total** -- **Router Server**: Always requires 1 GPU for routing decisions and model orchestration -- **Recommended for Production**: All models for comprehensive routing - -### 1.3 Configuration Options - -**Configuration Comparison:** - -| Configuration | GPUs | Models | Use Case | File | -|--------------|------|--------|----------|------| -| **Minimal** | 2 (1 model + 1 router) | 1 model | Development, Testing, Verification | `dynamo-single-llm-config.yaml` | -| **Full** | 33 (32 models + 1 router) | 10 comprehensive models | Enterprise production, Maximum routing intelligence | `dynamo-llm-config.yaml` | - -Both configuration files are provided in this repository - no manual creation needed! +# 1. Create router namespace and ConfigMap +kubectl create namespace llm-router +kubectl create configmap router-config-dynamo \ + --from-file=router-config-dynamo.yaml \ + -n llm-router -### 1.4 Verify Dynamo Cloud Deployment +# 2. Add NVIDIA Helm repository +helm repo add nvidia-llm-router https://helm.ngc.nvidia.com/nvidia-ai-blueprints/llm-router +helm repo update -```bash -# Check Dynamo cluster status -kubectl get dynamoclusters -n dynamo +# 3. 
Deploy LLM Router +helm upgrade --install llm-router nvidia-llm-router/llm-router \ + --namespace llm-router \ + --values llm-router-values-override.yaml \ + --wait --timeout=10m +``` -# Verify all models are deployed and ready -kubectl get pods -n dynamo -l app=dynamo-model +## Prerequisites -# Check model endpoints -kubectl get svc -n dynamo +- **Kubernetes cluster** (1.24+) with kubectl configured +- **Helm 3.x** for managing deployments +- **Earthly** for building Dynamo components ([Install Guide](https://earthly.dev/get-earthly)) +- **NVIDIA GPU nodes** with GPU Operator installed +- **Container registry access** (NVIDIA NGC or private registry) +- **Git** for cloning repositories -# The LLM Router will connect to models through: -# http://llm-cluster.dynamo.svc.cluster.local:8080 -``` +## Configuration -## Step 2: Deploy LLM Router with Official Helm Chart +### Dynamo Cloud Platform Configuration -### 2.1 Download LLM Router +Edit `dynamo-cloud-deployment.yaml` to configure: ```bash -# Clone the official LLM Router repository -git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git -cd llm-router +# Container Registry Configuration +DOCKER_SERVER=nvcr.io/your-org # Your container registry +IMAGE_TAG=latest # Image tag to use +DOCKER_USERNAME=your-username # Registry username +DOCKER_PASSWORD=your-password # Registry password + +# Dynamo Cloud Platform Configuration +NAMESPACE=dynamo-cloud # Kubernetes namespace + +# External Access Configuration +INGRESS_ENABLED=true # Enable ingress +INGRESS_CLASS=nginx # Ingress class ``` -### 2.2 Use Provided Helm Values Override +### LLM Model Configuration -The repository includes a pre-configured Helm values override file (`llm-router-values-override.yaml`) that configures the LLM Router to work with Dynamo. This file includes: +The `dynamo-llm-deployment.yaml` file defines a `DynamoGraphDeployment` with multiple services: -- Router Controller and Server configuration -- Resource allocation (CPU, memory, GPU) -- Security contexts and service accounts -- Monitoring configuration -- Integration with Dynamo endpoints +- **Frontend**: API gateway (1 replica) +- **Processor**: Request processing (1 replica) +- **VllmWorker**: Multi-model inference (1 replica, 4 GPUs) +- **PrefillWorker**: Disaggregated prefill (2 replicas, 2 GPUs each) +- **Router**: KV-aware routing (1 replica) -No additional configuration needed - the file is ready to use. 
+**Total GPU Requirements**: 8 GPUs for models + 1 GPU for LLM Router = **9 GPUs** -### 2.3 Use Provided Router Configuration +### Router Configuration -The repository includes pre-configured router configuration files: +The `router-config-dynamo.yaml` configures routing policies: -- **`router-config-single.yaml`** - For single LLM verification (all routes point to one model) -- **`router-config.yaml`** - For full production deployment with multiple models and intelligent routing +| **Task Router** | **Model** | **Use Case** | +|-----------------|-----------|--------------| +| Brainstorming | llama-3.1-70b-instruct | Creative ideation | +| Chatbot | mixtral-8x22b-instruct | Conversational AI | +| Code Generation | llama-3.1-nemotron-70b-instruct | Programming tasks | +| Summarization | phi-3-mini-128k-instruct | Text summarization | +| Text Generation | llama-3.2-11b-vision-instruct | General text creation | +| Open QA | llama-3.1-405b-instruct | Complex questions | +| Closed QA | llama-3.1-8b-instruct | Simple Q&A | +| Classification | phi-3-mini-4k-instruct | Text classification | +| Extraction | llama-3.1-8b-instruct | Information extraction | +| Rewrite | phi-3-medium-128k-instruct | Text rewriting | -Both files are ready to use - no manual creation needed! +| **Complexity Router** | **Model** | **Use Case** | +|----------------------|-----------|--------------| +| Creativity | llama-3.1-70b-instruct | Creative tasks | +| Reasoning | llama-3.3-nemotron-super-49b | Complex reasoning | +| Contextual-Knowledge | llama-3.1-405b-instruct | Knowledge-intensive | +| Few-Shot | llama-3.1-70b-instruct | Few-shot learning | +| Domain-Knowledge | llama-3.1-nemotron-70b-instruct | Specialized domains | +| No-Label-Reason | llama-3.1-8b-instruct | Simple reasoning | +| Constraint | phi-3-medium-128k-instruct | Constrained tasks | -### 2.4 Deploy LLM Router with Helm +All routes point to: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1` -```bash -# For full production deployment: -kubectl create configmap router-config \ - --from-file=config.yaml=router-config.yaml \ - -n llm-router +## Verification and Testing -# For single LLM verification: -kubectl create configmap router-config \ - --from-file=config.yaml=router-config-single.yaml \ - -n llm-router - -# Deploy LLM Router using Helm with the provided values override -helm upgrade --install llm-router deploy/helm/llm-router \ - -f ../llm-router-values-override.yaml \ - -n llm-router \ - --create-namespace \ - --wait --timeout=10m - -# Verify deployment -kubectl get pods -n llm-router -kubectl get svc -n llm-router -``` - -## Step 3: Verification and Testing - -### 3.1 Verify Dynamo LLM Endpoints +### 1. Verify Dynamo Deployment ```bash -# Check Dynamo cluster status -kubectl get dynamoclusters -n dynamo - -# Verify all models are deployed and ready -kubectl get pods -n dynamo -l app=dynamo-model +# Check Dynamo platform status +kubectl get pods -n dynamo-cloud +kubectl get dynamographdeployment -n dynamo-cloud -# Test direct LLM endpoint -kubectl port-forward svc/llm-cluster 8080:8080 -n dynamo & +# Check services +kubectl get svc -n dynamo-cloud -# Test the LLM endpoint (in another terminal) +# Test direct Dynamo endpoint +kubectl port-forward svc/dynamo-llm-service 8080:8080 -n dynamo-cloud & curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "llama-3.1-8b-instruct", - "messages": [ - { - "role": "user", - "content": "Hello, how are you?" 
- } - ], + "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100 }' ``` -### 3.2 Test LLM Router Integration +### 2. Test LLM Router Integration ```bash -# Port forward LLM Router controller -kubectl port-forward svc/router-controller 8084:8084 -n llm-router & +# Port forward LLM Router +kubectl port-forward svc/llm-router 8080:8080 -n llm-router & # Test task-based routing -curl -X 'POST' \ - 'http://localhost:8084/v1/chat/completions' \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' \ +curl -X POST http://localhost:8080/v1/chat/completions \ + -H "Content-Type: application/json" \ -d '{ "model": "", - "messages": [ - { - "role": "user", - "content": "Write a Python function to calculate factorial" - } - ], + "messages": [{"role": "user", "content": "Write a Python function"}], "max_tokens": 512, - "stream": false, "nim-llm-router": { - "policy": "task_router", - "routing_strategy": "triton", - "model": "" + "policy": "task_router" } }' # Test complexity-based routing -curl -X 'POST' \ - 'http://localhost:8084/v1/chat/completions' \ - -H 'accept: application/json' \ - -H 'Content-Type: application/json' \ +curl -X POST http://localhost:8080/v1/chat/completions \ + -H "Content-Type: application/json" \ -d '{ "model": "", - "messages": [ - { - "role": "user", - "content": "Explain quantum computing in simple terms" - } - ], + "messages": [{"role": "user", "content": "Explain quantum computing"}], "max_tokens": 512, - "stream": false, "nim-llm-router": { - "policy": "complexity_router", - "routing_strategy": "triton", - "model": "" + "policy": "complexity_router" } }' ``` -### 3.3 Single LLM Verification Setup - -For testing with a minimal setup, use the provided configuration files: +### 3. Monitor Deployment ```bash -# Deploy single LLM using provided configuration -kubectl apply -f dynamo-single-llm-config.yaml +# Monitor Dynamo logs +kubectl logs -f deployment/dynamo-store -n dynamo-cloud + +# Monitor LLM Router logs +kubectl logs -f deployment/llm-router -n llm-router + +# Check resource usage +kubectl top pods -n dynamo-cloud +kubectl top pods -n llm-router +``` + +## How Dynamo Model Routing Works + +The key insight is that Dynamo provides a **single gateway endpoint** that routes to different models based on the `model` parameter in the OpenAI-compatible API request: -# Use the provided single LLM router configuration -kubectl create configmap router-config \ - --from-file=config.yaml=router-config-single.yaml \ - -n llm-router \ - --dry-run=client -o yaml | kubectl apply -f - +1. **Single Endpoint**: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1` +2. **Model-Based Routing**: Dynamo routes internally based on the `model` field in requests +3. **OpenAI Compatibility**: Standard OpenAI API format with model selection -# Restart router pods to pick up new configuration -kubectl rollout restart deployment/router-controller -n llm-router +Example request: +```json +{ + "model": "llama-3.1-70b-instruct", // Dynamo routes based on this + "messages": [...], + "temperature": 0.7 +} ``` -The provided `router-config-single.yaml` file configures all routing policies to point to the single LLM, making it perfect for verification that the routing mechanism works correctly. +Dynamo's internal architecture handles: +- Model registry and discovery +- Request parsing and routing +- Load balancing across replicas +- KV cache management +- Disaggregated serving coordination ## Troubleshooting ### Common Issues -1. 
**Pods not starting**: Check GPU node availability and resource requests -2. **Service communication**: Verify service discovery and DNS resolution -3. **Model loading**: Verify NVIDIA API connectivity and quotas +1. **Build failures**: Ensure earthly is installed and container registry access is configured +2. **CRD not found**: Wait for Dynamo platform to fully deploy before applying DynamoGraphDeployment +3. **Service communication**: Verify cross-namespace RBAC permissions +4. **Model loading**: Check GPU availability and resource requests ### Debugging Commands ```bash -# Check pod logs -kubectl logs -f deployment/router-controller -n llm-router -kubectl logs -f deployment/router-server -n llm-router -kubectl logs -f deployment/dynamo-orchestrator -n dynamo +# Check Dynamo platform +kubectl get pods -n dynamo-cloud +kubectl logs -f deployment/dynamo-store -n dynamo-cloud +kubectl describe dynamographdeployment llm-multi-model -n dynamo-cloud -# Check events -kubectl get events -n llm-router --sort-by=.metadata.creationTimestamp -kubectl get events -n dynamo --sort-by=.metadata.creationTimestamp +# Check LLM Router +kubectl get pods -n llm-router +kubectl logs -f deployment/llm-router -n llm-router +kubectl describe configmap router-config-dynamo -n llm-router -# Check resource usage -kubectl top pods -n llm-router -kubectl top pods -n dynamo +# Check networking +kubectl exec -it deployment/llm-router -n llm-router -- nslookup dynamo-llm-service.dynamo-cloud.svc.cluster.local -# Debug networking -kubectl exec -it deployment/router-controller -n llm-router -- nslookup router-server +# Check events +kubectl get events -n dynamo-cloud --sort-by=.metadata.creationTimestamp +kubectl get events -n llm-router --sort-by=.metadata.creationTimestamp ``` ## Cleanup -To remove the deployment: - ```bash # Remove LLM Router helm uninstall llm-router -n llm-router kubectl delete namespace llm-router -# Remove Dynamo LLMs -kubectl delete dynamoclusters --all -n dynamo -kubectl delete namespace dynamo +# Remove Dynamo deployment +kubectl delete dynamographdeployment llm-multi-model -n dynamo-cloud +kubectl delete namespace dynamo-cloud -# Remove Dynamo Operator (optional) -kubectl delete -f https://github.com/ai-dynamo/dynamo/releases/latest/download/dynamo-operator.yaml +# Remove Dynamo platform (if desired) +cd dynamo/deploy/cloud/helm +./deploy.sh --uninstall ``` ## Files in This Directory - **`README.md`** - This comprehensive deployment guide -- **`llm-router-values-override.yaml`** - Helm values override for LLM Router -- **`router-config-single.yaml`** - Router configuration for single LLM verification (1 GPU) -- **`router-config.yaml`** - Router configuration for full production deployment (32 GPUs) -- **`dynamo-single-llm-config.yaml`** - Minimal Dynamo configuration for testing (1 GPU) -- **`dynamo-llm-config.yaml`** - Full Dynamo configuration for production (32 GPUs) +- **`dynamo-cloud-deployment.yaml`** - Environment configuration for Dynamo Cloud Platform +- **`dynamo-llm-deployment.yaml`** - DynamoGraphDeployment for multi-LLM inference +- **`router-config-dynamo.yaml`** - Router configuration for Dynamo integration +- **`llm-router-values-override.yaml`** - Helm values override for LLM Router with Dynamo integration +- **`deploy-dynamo-integration.sh`** - Automated deployment script ## Resources +- [NVIDIA Dynamo Cloud Platform Documentation](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) +- [NVIDIA Dynamo Kubernetes 
Operator](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_operator.html) +- [NVIDIA Dynamo GitHub Repository](https://github.com/ai-dynamo/dynamo) - [LLM Router GitHub Repository](https://github.com/NVIDIA-AI-Blueprints/llm-router) - [LLM Router Helm Chart](https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router) -- [NVIDIA Dynamo Cloud Deployment](https://github.com/ai-dynamo/dynamo/tree/main/deploy/cloud) - [Kubernetes Documentation](https://kubernetes.io/docs/) - [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html) \ No newline at end of file diff --git a/customizations/LLM Router/deploy-dynamo-integration.sh b/customizations/LLM Router/deploy-dynamo-integration.sh new file mode 100755 index 0000000..c765731 --- /dev/null +++ b/customizations/LLM Router/deploy-dynamo-integration.sh @@ -0,0 +1,347 @@ +#!/bin/bash + +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# NVIDIA Dynamo Cloud Platform + LLM Router Integration Deployment Script +# This script deploys the official NVIDIA Dynamo Cloud Platform and integrates it with the LLM Router +# +# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html + +set -euo pipefail + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Configuration +DYNAMO_NAMESPACE="dynamo-cloud" +ROUTER_NAMESPACE="llm-router" +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" + +# Function to print colored output +print_status() { + echo -e "${BLUE}[INFO]${NC} $1" +} + +print_success() { + echo -e "${GREEN}[SUCCESS]${NC} $1" +} + +print_warning() { + echo -e "${YELLOW}[WARNING]${NC} $1" +} + +print_error() { + echo -e "${RED}[ERROR]${NC} $1" +} + +# Function to check prerequisites +check_prerequisites() { + print_status "Checking prerequisites..." + + # Check if kubectl is available + if ! command -v kubectl &> /dev/null; then + print_error "kubectl is not installed or not in PATH" + exit 1 + fi + + # Check if helm is available + if ! command -v helm &> /dev/null; then + print_error "helm is not installed or not in PATH" + exit 1 + fi + + # Check if earthly is available (for Dynamo build) + if ! command -v earthly &> /dev/null; then + print_warning "earthly is not installed. You'll need to build Dynamo components manually." + print_warning "Install earthly from: https://earthly.dev/get-earthly" + fi + + # Check cluster connectivity + if ! kubectl cluster-info &> /dev/null; then + print_error "Cannot connect to Kubernetes cluster" + exit 1 + fi + + print_success "Prerequisites check completed" +} + +# Function to deploy NVIDIA Dynamo Cloud Platform +deploy_dynamo_platform() { + print_status "Deploying NVIDIA Dynamo Cloud Platform..." + + # Check if Dynamo repository is available + if [ ! 
-d "dynamo" ]; then + print_status "Cloning NVIDIA Dynamo repository..." + git clone https://github.com/ai-dynamo/dynamo.git + fi + + cd dynamo + + # Source configuration + if [ -f "${SCRIPT_DIR}/dynamo-cloud-deployment.yaml" ]; then + print_status "Loading Dynamo configuration..." + source "${SCRIPT_DIR}/dynamo-cloud-deployment.yaml" 2>/dev/null || true + fi + + # Set default values if not provided + export DOCKER_SERVER=${DOCKER_SERVER:-"nvcr.io/your-org"} + export IMAGE_TAG=${IMAGE_TAG:-"latest"} + export NAMESPACE=${NAMESPACE:-$DYNAMO_NAMESPACE} + export DOCKER_USERNAME=${DOCKER_USERNAME:-""} + export DOCKER_PASSWORD=${DOCKER_PASSWORD:-""} + + print_status "Building Dynamo Cloud Platform components..." + print_warning "This step requires earthly and may take several minutes..." + + # Build and push components (if earthly is available) + if command -v earthly &> /dev/null; then + earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG || { + print_error "Failed to build Dynamo components. Please check your configuration." + print_warning "You may need to:" + print_warning "1. Update DOCKER_SERVER in dynamo-cloud-deployment.yaml" + print_warning "2. Ensure you're logged into your container registry" + print_warning "3. Have proper permissions to push images" + exit 1 + } + else + print_warning "Skipping build step - earthly not available" + print_warning "Please build components manually or use pre-built images" + fi + + # Create namespace + kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f - + kubectl config set-context --current --namespace=$NAMESPACE + + # Deploy the platform + print_status "Deploying Dynamo Cloud Platform Helm charts..." + cd deploy/cloud/helm + + # Install CRDs first + ./deploy.sh --crds || { + print_error "Failed to deploy Dynamo CRDs" + exit 1 + } + + print_success "NVIDIA Dynamo Cloud Platform deployed successfully" + cd ../../.. +} + +# Function to deploy LLM models using Dynamo +deploy_llm_models() { + print_status "Deploying LLM models using Dynamo..." + + # Wait for Dynamo platform to be ready + print_status "Waiting for Dynamo platform to be ready..." + kubectl wait --for=condition=available --timeout=300s deployment/dynamo-store -n $DYNAMO_NAMESPACE || { + print_warning "Dynamo platform may not be fully ready. Continuing anyway..." + } + + # Expose Dynamo API store for CLI access + print_status "Setting up Dynamo API access..." + kubectl port-forward svc/dynamo-store 8080:80 -n $DYNAMO_NAMESPACE & + PORT_FORWARD_PID=$! + export DYNAMO_CLOUD="http://localhost:8080" + + # Wait for port-forward to be ready + sleep 5 + + # Build and deploy LLM inference graph + print_status "Building and deploying LLM inference graph..." + + # This would typically involve: + # 1. dynamo build --push graphs.disagg:Frontend + # 2. Creating DynamoGraphDeployment CRD + + # For now, we'll apply the example deployment + kubectl apply -f "${SCRIPT_DIR}/dynamo-llm-deployment.yaml" || { + print_error "Failed to deploy LLM models" + kill $PORT_FORWARD_PID 2>/dev/null || true + exit 1 + } + + # Clean up port-forward + kill $PORT_FORWARD_PID 2>/dev/null || true + + print_success "LLM models deployed successfully" +} + +# Function to deploy LLM Router +deploy_llm_router() { + print_status "Deploying LLM Router..." 
+ + # Create router namespace + kubectl create namespace $ROUTER_NAMESPACE --dry-run=client -o yaml | kubectl apply -f - + + # Create ConfigMap for router configuration + kubectl create configmap router-config-dynamo \ + --from-file="${SCRIPT_DIR}/router-config-dynamo.yaml" \ + -n $ROUTER_NAMESPACE \ + --dry-run=client -o yaml | kubectl apply -f - + + # Add NVIDIA Helm repository + helm repo add nvidia-llm-router https://helm.ngc.nvidia.com/nvidia-ai-blueprints/llm-router || { + print_warning "Failed to add NVIDIA LLM Router Helm repository" + print_warning "You may need to configure NGC access or use a different repository" + } + helm repo update + + # Deploy LLM Router using Helm + helm upgrade --install llm-router nvidia-llm-router/llm-router \ + --namespace $ROUTER_NAMESPACE \ + --values "${SCRIPT_DIR}/llm-router-values-override.yaml" \ + --wait \ + --timeout=10m || { + print_error "Failed to deploy LLM Router" + exit 1 + } + + print_success "LLM Router deployed successfully" +} + +# Function to verify deployment +verify_deployment() { + print_status "Verifying deployment..." + + # Check Dynamo platform + print_status "Checking Dynamo platform status..." + kubectl get pods -n $DYNAMO_NAMESPACE + kubectl get dynamographdeployment -n $DYNAMO_NAMESPACE 2>/dev/null || { + print_warning "DynamoGraphDeployment CRD may not be available yet" + } + + # Check LLM Router + print_status "Checking LLM Router status..." + kubectl get pods -n $ROUTER_NAMESPACE + kubectl get svc -n $ROUTER_NAMESPACE + + # Test connectivity + print_status "Testing service connectivity..." + + # Check if Dynamo service is accessible + if kubectl get svc dynamo-llm-service -n $DYNAMO_NAMESPACE &>/dev/null; then + print_success "Dynamo LLM service is available" + else + print_warning "Dynamo LLM service not found - may still be starting" + fi + + # Check if Router service is accessible + if kubectl get svc llm-router -n $ROUTER_NAMESPACE &>/dev/null; then + print_success "LLM Router service is available" + + # Get router endpoint + ROUTER_IP=$(kubectl get svc llm-router -n $ROUTER_NAMESPACE -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || echo "") + if [ -n "$ROUTER_IP" ]; then + print_success "LLM Router external IP: $ROUTER_IP" + else + print_status "LLM Router is using ClusterIP. 
Use port-forward for external access:" + print_status "kubectl port-forward svc/llm-router 8080:8080 -n $ROUTER_NAMESPACE" + fi + else + print_warning "LLM Router service not found" + fi + + print_success "Deployment verification completed" +} + +# Function to show usage +show_usage() { + echo "Usage: $0 [OPTIONS]" + echo "" + echo "Deploy NVIDIA Dynamo Cloud Platform with LLM Router integration" + echo "" + echo "Options:" + echo " --dynamo-only Deploy only Dynamo Cloud Platform" + echo " --router-only Deploy only LLM Router (requires existing Dynamo)" + echo " --verify-only Only verify existing deployment" + echo " --help Show this help message" + echo "" + echo "Prerequisites:" + echo " - kubectl configured for your cluster" + echo " - helm 3.x installed" + echo " - earthly installed (for building Dynamo components)" + echo " - Access to NVIDIA NGC registry" + echo "" + echo "Configuration:" + echo " Edit dynamo-cloud-deployment.yaml to configure registry settings" +} + +# Main deployment function +main() { + local dynamo_only=false + local router_only=false + local verify_only=false + + # Parse command line arguments + while [[ $# -gt 0 ]]; do + case $1 in + --dynamo-only) + dynamo_only=true + shift + ;; + --router-only) + router_only=true + shift + ;; + --verify-only) + verify_only=true + shift + ;; + --help) + show_usage + exit 0 + ;; + *) + print_error "Unknown option: $1" + show_usage + exit 1 + ;; + esac + done + + print_status "Starting NVIDIA Dynamo + LLM Router deployment..." + + check_prerequisites + + if [ "$verify_only" = true ]; then + verify_deployment + exit 0 + fi + + if [ "$router_only" = false ]; then + deploy_dynamo_platform + deploy_llm_models + fi + + if [ "$dynamo_only" = false ]; then + deploy_llm_router + fi + + verify_deployment + + print_success "Deployment completed successfully!" + print_status "" + print_status "Next steps:" + print_status "1. Access LLM Router: kubectl port-forward svc/llm-router 8080:8080 -n $ROUTER_NAMESPACE" + print_status "2. Test routing: curl http://localhost:8080/v1/chat/completions" + print_status "3. Monitor with: kubectl logs -f deployment/llm-router -n $ROUTER_NAMESPACE" +} + +# Run main function +main "$@" \ No newline at end of file diff --git a/customizations/LLM Router/dynamo-cloud-deployment.yaml b/customizations/LLM Router/dynamo-cloud-deployment.yaml new file mode 100644 index 0000000..d9931b8 --- /dev/null +++ b/customizations/LLM Router/dynamo-cloud-deployment.yaml @@ -0,0 +1,57 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# NVIDIA Dynamo Cloud Platform Deployment Configuration +# This file contains environment variables and configuration for deploying +# the official NVIDIA Dynamo Cloud Platform +# +# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html + +# Container Registry Configuration +DOCKER_SERVER=nvcr.io/your-org +IMAGE_TAG=latest +DOCKER_USERNAME=your-username +DOCKER_PASSWORD=your-password + +# Dynamo Cloud Platform Configuration +NAMESPACE=dynamo-cloud + +# Optional: Pipeline-specific registry (if different from platform registry) +PIPELINES_DOCKER_SERVER=nvcr.io/your-org +PIPELINES_DOCKER_USERNAME=your-username +PIPELINES_DOCKER_PASSWORD=your-password + +# External Access Configuration (choose one) +# Option 1: Ingress +INGRESS_ENABLED=true +INGRESS_CLASS=nginx + +# Option 2: Istio (alternative to Ingress) +# ISTIO_ENABLED=true +# ISTIO_GATEWAY=istio-system/istio-ingressgateway + +# Deployment Notes: +# 1. Build and push Dynamo Cloud Platform components: +# earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG +# +# 2. Deploy the platform: +# cd deploy/cloud/helm +# kubectl create namespace $NAMESPACE +# kubectl config set-context --current --namespace=$NAMESPACE +# ./deploy.sh --crds +# +# 3. Expose the platform: +# kubectl port-forward svc/dynamo-store :80 -n $NAMESPACE +# export DYNAMO_CLOUD=http://localhost: \ No newline at end of file diff --git a/customizations/LLM Router/dynamo-llm-config.yaml b/customizations/LLM Router/dynamo-llm-config.yaml deleted file mode 100644 index 519d65a..0000000 --- a/customizations/LLM Router/dynamo-llm-config.yaml +++ /dev/null @@ -1,78 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -apiVersion: dynamo.ai/v1 -kind: DynamoCluster -metadata: - name: llm-cluster - namespace: dynamo -spec: - models: - # Task Router Models - - name: llama-3.1-70b-instruct - source: nvidia/Llama-3.1-70B-Instruct - replicas: 1 - resources: - gpu: 4 - - name: mixtral-8x22b-instruct - source: mistralai/Mixtral-8x22B-Instruct-v0.1 - replicas: 1 - resources: - gpu: 4 - - name: llama-3.1-nemotron-70b-instruct - source: nvidia/Llama-3.1-Nemotron-70B-Instruct-HF - replicas: 1 - resources: - gpu: 4 - - name: phi-3-mini-128k-instruct - source: microsoft/Phi-3-mini-128k-instruct - replicas: 1 - resources: - gpu: 1 - - name: llama-3.2-11b-vision-instruct - source: meta-llama/Llama-3.2-11B-Vision-Instruct - replicas: 1 - resources: - gpu: 2 - - name: llama-3.1-405b-instruct - source: meta-llama/Llama-3.1-405B-Instruct - replicas: 1 - resources: - gpu: 8 - - name: llama-3.1-8b-instruct - source: meta-llama/Llama-3.1-8B-Instruct - replicas: 2 - resources: - gpu: 1 - - name: phi-3-mini-4k-instruct - source: microsoft/Phi-3-mini-4k-instruct - replicas: 1 - resources: - gpu: 1 - - name: phi-3-medium-128k-instruct - source: microsoft/Phi-3-medium-128k-instruct - replicas: 1 - resources: - gpu: 2 - # Complexity Router Models - - name: llama-3.3-nemotron-super-49b - source: nvidia/Llama-3.3-Nemotron-Super-49B - replicas: 1 - resources: - gpu: 4 - cloudConfig: - provider: nvidia - region: us-west-2 - endpoint: https://api.dynamo.ai \ No newline at end of file diff --git a/customizations/LLM Router/dynamo-llm-deployment.yaml b/customizations/LLM Router/dynamo-llm-deployment.yaml new file mode 100644 index 0000000..2e4a397 --- /dev/null +++ b/customizations/LLM Router/dynamo-llm-deployment.yaml @@ -0,0 +1,147 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# NVIDIA Dynamo LLM Deployment Example +# This demonstrates how to deploy LLM inference graphs using the official +# NVIDIA Dynamo Cloud Platform CRDs +# +# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_operator.html + +--- +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: llm-multi-model + namespace: dynamo-cloud +spec: + # Reference to the built and pushed Dynamo component + # This would be generated by: dynamo build --push graphs.disagg:Frontend + dynamoComponent: frontend:jh2o6dqzpsgfued4 + + # Global environment variables for all services + envs: + - name: LOG_LEVEL + value: "INFO" + - name: ENABLE_METRICS + value: "true" + + # Service-specific configurations + services: + Frontend: + replicas: 1 + envs: + - name: FRONTEND_PORT + value: "8080" + resources: + requests: + cpu: 500m + memory: 1Gi + limits: + cpu: 1000m + memory: 2Gi + + Processor: + replicas: 1 + envs: + - name: PROCESSOR_WORKERS + value: "4" + resources: + requests: + cpu: 1000m + memory: 2Gi + limits: + cpu: 2000m + memory: 4Gi + + # vLLM Worker for multiple models + VllmWorker: + replicas: 1 + envs: + - name: VLLM_MODELS + value: "llama-3.1-8b-instruct,llama-3.1-70b-instruct,mixtral-8x22b-instruct" + - name: VLLM_GPU_MEMORY_UTILIZATION + value: "0.9" + resources: + requests: + cpu: 2000m + memory: 8Gi + nvidia.com/gpu: 4 + limits: + cpu: 4000m + memory: 16Gi + nvidia.com/gpu: 4 + nodeSelector: + nvidia.com/gpu.present: "true" + tolerations: + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + + # Prefill Worker for disaggregated serving + PrefillWorker: + replicas: 2 + envs: + - name: PREFILL_MAX_BATCH_SIZE + value: "32" + resources: + requests: + cpu: 1000m + memory: 4Gi + nvidia.com/gpu: 2 + limits: + cpu: 2000m + memory: 8Gi + nvidia.com/gpu: 2 + nodeSelector: + nvidia.com/gpu.present: "true" + tolerations: + - key: nvidia.com/gpu + operator: Exists + effect: NoSchedule + + # Router for KV-aware routing + Router: + replicas: 1 + envs: + - name: ROUTER_ALGORITHM + value: "kv_aware" + - name: ROUTER_CACHE_SIZE + value: "1000" + resources: + requests: + cpu: 500m + memory: 1Gi + limits: + cpu: 1000m + memory: 2Gi + +--- +# Service to expose the Dynamo deployment +apiVersion: v1 +kind: Service +metadata: + name: dynamo-llm-service + namespace: dynamo-cloud + labels: + app: dynamo-llm +spec: + type: ClusterIP + ports: + - name: http + port: 8080 + targetPort: 8080 + protocol: TCP + selector: + dynamo-component: Frontend \ No newline at end of file diff --git a/customizations/LLM Router/dynamo-single-llm-config.yaml b/customizations/LLM Router/dynamo-single-llm-config.yaml deleted file mode 100644 index f053ba4..0000000 --- a/customizations/LLM Router/dynamo-single-llm-config.yaml +++ /dev/null @@ -1,32 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -apiVersion: dynamo.ai/v1 -kind: DynamoCluster -metadata: - name: llm-cluster-single - namespace: dynamo -spec: - models: - # Single model for verification - - name: llama-3.1-8b-instruct - source: meta-llama/Llama-3.1-8B-Instruct - replicas: 1 - resources: - gpu: 1 - cloudConfig: - provider: nvidia - region: us-west-2 - endpoint: https://api.dynamo.ai \ No newline at end of file diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml index 4a7c80d..549ab0b 100644 --- a/customizations/LLM Router/llm-router-values-override.yaml +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -13,7 +13,8 @@ # See the License for the specific language governing permissions and # limitations under the License. -# LLM Router Helm Values Override for Dynamo Integration +# LLM Router Helm Values for NVIDIA Dynamo Cloud Platform Integration +# This configuration integrates the LLM Router with the official NVIDIA Dynamo deployment # Based on: https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router # Router Controller Configuration @@ -28,14 +29,16 @@ routerController: type: ClusterIP port: 8084 - # Configure to route to Dynamo instead of external APIs + # Configure to route to NVIDIA Dynamo Cloud Platform env: - name: LOG_LEVEL value: "INFO" - name: ENABLE_METRICS value: "true" - name: DYNAMO_ENDPOINT - value: "http://llm-cluster-single.dynamo.svc.cluster.local:8080" + value: "http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080" + - name: DYNAMO_NAMESPACE + value: "dynamo-cloud" resources: requests: @@ -75,14 +78,14 @@ routerServer: operator: Exists effect: NoSchedule -# Configuration Map +# Configuration Map for Router Policies config: - # The router configuration will be mounted from our custom config.yaml + # The router configuration will be mounted from router-config-dynamo.yaml mountPath: /app/config configMap: - name: router-config + name: router-config-dynamo -# Ingress Configuration (disabled for verification) +# Ingress Configuration (disabled for internal routing) ingress: enabled: false @@ -113,4 +116,16 @@ securityContext: # Image Pull Secrets (if needed for private registries) imagePullSecrets: [] - # - name: nvidia-registry-secret \ No newline at end of file + # - name: nvidia-registry-secret + +# Cross-namespace service access +rbac: + create: true + rules: + - apiGroups: [""] + resources: ["services", "endpoints"] + verbs: ["get", "list", "watch"] + - apiGroups: [""] + resources: ["services"] + resourceNames: ["dynamo-llm-service"] + verbs: ["get"] \ No newline at end of file diff --git a/customizations/LLM Router/router-config.yaml b/customizations/LLM Router/router-config-dynamo.yaml similarity index 58% rename from customizations/LLM Router/router-config.yaml rename to customizations/LLM Router/router-config-dynamo.yaml index 6ebe157..ea78440 100644 --- a/customizations/LLM Router/router-config.yaml +++ b/customizations/LLM Router/router-config-dynamo.yaml @@ -13,56 +13,62 @@ # See the License for the specific language governing permissions and # limitations under the License. 
+# LLM Router Configuration for NVIDIA Dynamo Integration +# This configuration routes requests to the official NVIDIA Dynamo Cloud Platform +# deployment using the proper service endpoints +# +# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html + policies: - name: "task_router" url: http://router-server:8000/v2/models/task_router_ensemble/infer llms: - name: Brainstorming - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-70b-instruct - name: Chatbot - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: mixtral-8x22b-instruct - name: "Code Generation" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-nemotron-70b-instruct - name: Summarization - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: phi-3-mini-128k-instruct - name: "Text Generation" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.2-11b-vision-instruct - name: "Open QA" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-405b-instruct - name: "Closed QA" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-8b-instruct - name: Classification - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: phi-3-mini-4k-instruct - name: Extraction - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-8b-instruct - name: Rewrite - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: phi-3-medium-128k-instruct - name: Other - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-70b-instruct - name: Unknown - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-8b-instruct @@ -70,30 +76,30 @@ policies: url: http://router-server:8000/v2/models/complexity_router_ensemble/infer llms: - name: Creativity - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-70b-instruct - name: Reasoning - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.3-nemotron-super-49b - name: "Contextual-Knowledge" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: 
llama-3.1-405b-instruct - name: "Few-Shot" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-70b-instruct - name: "Domain-Knowledge" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-nemotron-70b-instruct - name: "No-Label-Reason" - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: llama-3.1-8b-instruct - name: Constraint - api_base: http://llm-cluster.dynamo.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 api_key: "" model: phi-3-medium-128k-instruct \ No newline at end of file diff --git a/customizations/LLM Router/router-config-single.yaml b/customizations/LLM Router/router-config-single.yaml deleted file mode 100644 index 0be913e..0000000 --- a/customizations/LLM Router/router-config-single.yaml +++ /dev/null @@ -1,99 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -policies: - - name: "task_router" - url: http://router-server:8000/v2/models/task_router_ensemble/infer - llms: - - name: Brainstorming - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Chatbot - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "Code Generation" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Summarization - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "Text Generation" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "Open QA" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "Closed QA" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Classification - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Extraction - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Rewrite - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Other - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Unknown - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - - name: "complexity_router" - url: http://router-server:8000/v2/models/complexity_router_ensemble/infer - llms: - - name: Creativity - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Reasoning - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "Contextual-Knowledge" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "Few-Shot" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "Domain-Knowledge" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: "No-Label-Reason" - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct - - name: Constraint - api_base: http://llm-cluster-single.dynamo.svc.cluster.local:8080/v1 - api_key: "" - model: llama-3.1-8b-instruct \ No newline at end of file From 24b304146cdac3fdbfad7f9f8cf7f637827e73d5 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sun, 15 Jun 2025 21:12:05 +0000 Subject: [PATCH 04/17] Enable ingress for LLM Router to allow external access - Updated `llm-router-values-override.yaml` to enable ingress with NGINX configuration. - Added detailed ingress setup instructions to `README.md`, including host configuration and testing methods. - Provided examples for testing routing via ingress and port-forwarding. 
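Before relying on the ingress-based tests this patch adds to the README, it can be worth confirming that the `llm-router` ingress actually resolves. The sketch below is illustrative only and is not part of the patch: it assumes the nginx ingress controller has published an external IP for the route, and uses `curl --resolve` to pin `llm-router.local` to that IP so no `/etc/hosts` edit is required.

```bash
# Illustrative pre-flight check (assumption: the llm-router ingress from
# llm-router-values-override.yaml has an external IP assigned).
INGRESS_IP=$(kubectl get ingress llm-router -n llm-router \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Ingress IP: ${INGRESS_IP}"

# Route one request through the ingress by pinning llm-router.local to that
# IP for this curl invocation only; the body mirrors the README test request.
curl --resolve "llm-router.local:80:${INGRESS_IP}" \
  -X POST http://llm-router.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"","messages":[{"role":"user","content":"Hello!"}],"max_tokens":64,"nim-llm-router":{"policy":"task_router"}}'
```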
--- customizations/LLM Router/README.md | 66 +++++++++++++++++-- .../llm-router-values-override.yaml | 25 ++++++- 2 files changed, 85 insertions(+), 6 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index e97efa0..8b62132 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -136,6 +136,36 @@ The `dynamo-llm-deployment.yaml` file defines a `DynamoGraphDeployment` with mul **Total GPU Requirements**: 8 GPUs for models + 1 GPU for LLM Router = **9 GPUs** +### Ingress Configuration + +The LLM Router is configured with ingress enabled for external access: + +```yaml +ingress: + enabled: true + className: "nginx" # Adjust for your ingress controller + hosts: + - host: llm-router.local # Change to your domain + paths: + - path: / + pathType: Prefix +``` + +**Important**: Update the `host` field in `llm-router-values-override.yaml` to match your domain: + +```bash +# For production, replace llm-router.local with your actual domain +sed -i 's/llm-router.local/your-domain.com/g' llm-router-values-override.yaml +``` + +**For local testing**, add the ingress IP to your `/etc/hosts`: + +```bash +# Get the ingress IP and add to hosts file +INGRESS_IP=$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') +echo "$INGRESS_IP llm-router.local" | sudo tee -a /etc/hosts +``` + ### Router Configuration The `router-config-dynamo.yaml` configures routing policies: @@ -191,10 +221,26 @@ curl -X POST http://localhost:8080/v1/chat/completions \ ### 2. Test LLM Router Integration ```bash -# Port forward LLM Router -kubectl port-forward svc/llm-router 8080:8080 -n llm-router & +# Option 1: Test via Ingress (recommended for production) +# First, add llm-router.local to your /etc/hosts file: +# echo "$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') llm-router.local" | sudo tee -a /etc/hosts -# Test task-based routing +# Test task-based routing via ingress +curl -X POST http://llm-router.local/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "", + "messages": [{"role": "user", "content": "Write a Python function"}], + "max_tokens": 512, + "nim-llm-router": { + "policy": "task_router" + } + }' + +# Option 2: Test via port-forward (for development/testing) +kubectl port-forward svc/llm-router 8080:8000 -n llm-router & + +# Test task-based routing via port-forward curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -206,7 +252,19 @@ curl -X POST http://localhost:8080/v1/chat/completions \ } }' -# Test complexity-based routing +# Test complexity-based routing via ingress +curl -X POST http://llm-router.local/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "", + "messages": [{"role": "user", "content": "Explain quantum computing"}], + "max_tokens": 512, + "nim-llm-router": { + "policy": "complexity_router" + } + }' + +# Test complexity-based routing via port-forward curl -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml index 549ab0b..bad73fd 100644 --- a/customizations/LLM Router/llm-router-values-override.yaml +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -85,9 +85,30 @@ config: configMap: name: router-config-dynamo -# Ingress 
Configuration (disabled for internal routing) +# Ingress Configuration (enabled for external access) ingress: - enabled: false + enabled: true + className: "nginx" # Use your cluster's ingress class + annotations: + nginx.ingress.kubernetes.io/rewrite-target: / + nginx.ingress.kubernetes.io/ssl-redirect: "false" + nginx.ingress.kubernetes.io/proxy-body-size: "10m" + nginx.ingress.kubernetes.io/proxy-read-timeout: "300" + nginx.ingress.kubernetes.io/proxy-send-timeout: "300" + hosts: + - host: llm-router.local # Change to your domain + paths: + - path: / + pathType: Prefix + backend: + service: + name: llm-router + port: + number: 8000 + tls: [] + # - secretName: llm-router-tls + # hosts: + # - llm-router.your-domain.com # Monitoring monitoring: From f09342c6efe08743a484619addb2869090f46e77 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sun, 15 Jun 2025 21:20:27 +0000 Subject: [PATCH 05/17] Remove deprecated deployment files for NVIDIA Dynamo and LLM Router integration - Deleted `deploy-dynamo-integration.sh` and `dynamo-cloud-deployment.yaml` as they are no longer needed. - Updated `README.md` to reflect the removal of these files and adjusted deployment instructions accordingly. --- customizations/LLM Router/README.md | 193 ++++++---- .../LLM Router/deploy-dynamo-integration.sh | 347 ------------------ .../LLM Router/dynamo-cloud-deployment.yaml | 57 --- 3 files changed, 119 insertions(+), 478 deletions(-) delete mode 100755 customizations/LLM Router/deploy-dynamo-integration.sh delete mode 100644 customizations/LLM Router/dynamo-cloud-deployment.yaml diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index 8b62132..b67f40b 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -20,121 +20,168 @@ The integration consists of: ### Key Components -- **dynamo-cloud-deployment.yaml**: Configuration for Dynamo Cloud Platform deployment - **dynamo-llm-deployment.yaml**: DynamoGraphDeployment for multi-LLM inference -- **router-config-dynamo.yaml**: Router policies for Dynamo integration -- **llm-router-values-dynamo.yaml**: Helm values for LLM Router with Dynamo -- **deploy-dynamo-integration.sh**: Automated deployment script +- **router-config-dynamo.yaml**: Router policies for Dynamo integration +- **llm-router-values-override.yaml**: Helm values for LLM Router with Dynamo integration -## Quick Start +## Prerequisites -### Automated Deployment (Recommended) +Before starting the deployment, ensure you have: -```bash -# Make the script executable -chmod +x deploy-dynamo-integration.sh +- **Kubernetes cluster** (1.24+) with kubectl configured +- **Helm 3.x** for managing deployments +- **Earthly** for building Dynamo components ([Install Guide](https://earthly.dev/get-earthly)) +- **NVIDIA GPU nodes** with GPU Operator installed +- **Container registry access** (NVIDIA NGC or private registry) +- **Git** for cloning repositories -# Deploy everything (Dynamo + LLM Router) -./deploy-dynamo-integration.sh +### Environment Variables -# Or deploy components separately: -./deploy-dynamo-integration.sh --dynamo-only # Deploy only Dynamo -./deploy-dynamo-integration.sh --router-only # Deploy only LLM Router -./deploy-dynamo-integration.sh --verify-only # Verify existing deployment -``` +You'll need to configure these environment variables before deployment: -### Manual Deployment +| Variable | Description | Example | +|----------|-------------|---------| +| `DOCKER_SERVER` | Your container registry URL | `nvcr.io/your-org` | +| 
`IMAGE_TAG` | Image tag to use | `latest` or `v1.0.0` | +| `DOCKER_USERNAME` | Registry username | `your-username` | +| `DOCKER_PASSWORD` | Registry password/token | `your-password` | +| `NAMESPACE` | Kubernetes namespace | `dynamo-cloud` | -If you prefer manual deployment or need to customize the process: +### Resource Requirements + +**Total GPU Requirements**: 8 GPUs for models + 1 GPU for LLM Router = **9 GPUs** + +The `dynamo-llm-deployment.yaml` file defines a `DynamoGraphDeployment` with: + +- **Frontend**: API gateway (1 replica) +- **Processor**: Request processing (1 replica) +- **VllmWorker**: Multi-model inference (1 replica, 4 GPUs) +- **PrefillWorker**: Disaggregated prefill (2 replicas, 2 GPUs each) +- **Router**: KV-aware routing (1 replica) -#### Step 1: Deploy NVIDIA Dynamo Cloud Platform +## Deployment Guide + +This guide walks you through deploying NVIDIA Dynamo Cloud Platform and LLM Router step by step. + +### Step 1: Prepare Your Environment + +First, ensure you have all prerequisites and configure your environment variables: ```bash -# 1. Clone Dynamo repository +# Configure your container registry credentials +export DOCKER_SERVER=nvcr.io/your-org # Replace with your registry URL +export IMAGE_TAG=latest # Or specific version tag +export NAMESPACE=dynamo-cloud # Kubernetes namespace for Dynamo +export DOCKER_USERNAME=your-username # Replace with your registry username +export DOCKER_PASSWORD=your-password # Replace with your registry password + +# Verify your configuration +echo "Registry: $DOCKER_SERVER" +echo "Namespace: $NAMESPACE" +echo "Image Tag: $IMAGE_TAG" +``` + +### Step 2: Deploy NVIDIA Dynamo Cloud Platform + +```bash +# 1. Clone the official Dynamo repository git clone https://github.com/ai-dynamo/dynamo.git cd dynamo -# 2. Configure environment (edit dynamo-cloud-deployment.yaml first) -export DOCKER_SERVER=nvcr.io/your-org -export IMAGE_TAG=latest -export NAMESPACE=dynamo-cloud -export DOCKER_USERNAME=your-username -export DOCKER_PASSWORD=your-password - -# 3. Build and push Dynamo components +# 2. Build and push Dynamo components to your registry earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG -# 4. Deploy the platform +# 3. Create namespace and set context kubectl create namespace $NAMESPACE kubectl config set-context --current --namespace=$NAMESPACE + +# 4. Deploy the Dynamo Cloud Platform cd deploy/cloud/helm ./deploy.sh --crds -# 5. Deploy LLM inference graph -kubectl apply -f ../../../dynamo-llm-deployment.yaml +# 5. Wait for platform to be ready +kubectl wait --for=condition=ready pod -l app=dynamo-store --timeout=300s + +# 6. Verify platform deployment +kubectl get pods -n $NAMESPACE +kubectl get crd | grep dynamo +``` + +### Step 3: Deploy LLM Inference Services + +```bash +# 1. Navigate back to your configuration directory +cd /path/to/your/llm-router-config + +# 2. Review and customize the LLM deployment +# Edit dynamo-llm-deployment.yaml to adjust: +# - Model selection in VllmWorker +# - GPU resource requirements +# - Replica counts + +# 3. Deploy the LLM inference graph +kubectl apply -f dynamo-llm-deployment.yaml + +# 4. Wait for LLM services to be ready +kubectl wait --for=condition=ready pod -l dynamo-component=Frontend --timeout=600s + +# 5. Verify LLM deployment +kubectl get dynamographdeployment -n dynamo-cloud +kubectl get pods -n dynamo-cloud +kubectl get svc dynamo-llm-service -n dynamo-cloud ``` -#### Step 2: Deploy LLM Router +### Step 4: Deploy LLM Router ```bash -# 1. 
Create router namespace and ConfigMap +# 1. Create the LLM Router namespace kubectl create namespace llm-router + +# 2. Create ConfigMap for router configuration kubectl create configmap router-config-dynamo \ --from-file=router-config-dynamo.yaml \ -n llm-router -# 2. Add NVIDIA Helm repository +# 3. Verify ConfigMap was created correctly +kubectl describe configmap router-config-dynamo -n llm-router + +# 4. Add the official NVIDIA LLM Router Helm repository helm repo add nvidia-llm-router https://helm.ngc.nvidia.com/nvidia-ai-blueprints/llm-router helm repo update -# 3. Deploy LLM Router +# 5. Review and customize the Helm values +# Edit llm-router-values-override.yaml to adjust: +# - Ingress hostname (change llm-router.local to your domain) +# - Resource requirements +# - GPU allocation + +# 6. Deploy LLM Router using Helm helm upgrade --install llm-router nvidia-llm-router/llm-router \ --namespace llm-router \ --values llm-router-values-override.yaml \ --wait --timeout=10m -``` - -## Prerequisites -- **Kubernetes cluster** (1.24+) with kubectl configured -- **Helm 3.x** for managing deployments -- **Earthly** for building Dynamo components ([Install Guide](https://earthly.dev/get-earthly)) -- **NVIDIA GPU nodes** with GPU Operator installed -- **Container registry access** (NVIDIA NGC or private registry) -- **Git** for cloning repositories - -## Configuration - -### Dynamo Cloud Platform Configuration - -Edit `dynamo-cloud-deployment.yaml` to configure: - -```bash -# Container Registry Configuration -DOCKER_SERVER=nvcr.io/your-org # Your container registry -IMAGE_TAG=latest # Image tag to use -DOCKER_USERNAME=your-username # Registry username -DOCKER_PASSWORD=your-password # Registry password - -# Dynamo Cloud Platform Configuration -NAMESPACE=dynamo-cloud # Kubernetes namespace - -# External Access Configuration -INGRESS_ENABLED=true # Enable ingress -INGRESS_CLASS=nginx # Ingress class +# 7. 
Verify LLM Router deployment +kubectl get pods -n llm-router +kubectl get svc -n llm-router +kubectl get ingress -n llm-router ``` -### LLM Model Configuration +### Step 5: Configure External Access -The `dynamo-llm-deployment.yaml` file defines a `DynamoGraphDeployment` with multiple services: +```bash +# Option 1: For production with real domain +# Update your DNS to point to the ingress controller's external IP +INGRESS_IP=$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') +echo "Configure DNS: your-domain.com -> $INGRESS_IP" -- **Frontend**: API gateway (1 replica) -- **Processor**: Request processing (1 replica) -- **VllmWorker**: Multi-model inference (1 replica, 4 GPUs) -- **PrefillWorker**: Disaggregated prefill (2 replicas, 2 GPUs each) -- **Router**: KV-aware routing (1 replica) +# Option 2: For local testing +# Add entry to /etc/hosts file +INGRESS_IP=$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') +echo "$INGRESS_IP llm-router.local" | sudo tee -a /etc/hosts +``` -**Total GPU Requirements**: 8 GPUs for models + 1 GPU for LLM Router = **9 GPUs** +## Configuration ### Ingress Configuration @@ -364,11 +411,9 @@ cd dynamo/deploy/cloud/helm ## Files in This Directory - **`README.md`** - This comprehensive deployment guide -- **`dynamo-cloud-deployment.yaml`** - Environment configuration for Dynamo Cloud Platform - **`dynamo-llm-deployment.yaml`** - DynamoGraphDeployment for multi-LLM inference - **`router-config-dynamo.yaml`** - Router configuration for Dynamo integration - **`llm-router-values-override.yaml`** - Helm values override for LLM Router with Dynamo integration -- **`deploy-dynamo-integration.sh`** - Automated deployment script ## Resources diff --git a/customizations/LLM Router/deploy-dynamo-integration.sh b/customizations/LLM Router/deploy-dynamo-integration.sh deleted file mode 100755 index c765731..0000000 --- a/customizations/LLM Router/deploy-dynamo-integration.sh +++ /dev/null @@ -1,347 +0,0 @@ -#!/bin/bash - -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -# NVIDIA Dynamo Cloud Platform + LLM Router Integration Deployment Script -# This script deploys the official NVIDIA Dynamo Cloud Platform and integrates it with the LLM Router -# -# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html - -set -euo pipefail - -# Colors for output -RED='\033[0;31m' -GREEN='\033[0;32m' -YELLOW='\033[1;33m' -BLUE='\033[0;34m' -NC='\033[0m' # No Color - -# Configuration -DYNAMO_NAMESPACE="dynamo-cloud" -ROUTER_NAMESPACE="llm-router" -SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" - -# Function to print colored output -print_status() { - echo -e "${BLUE}[INFO]${NC} $1" -} - -print_success() { - echo -e "${GREEN}[SUCCESS]${NC} $1" -} - -print_warning() { - echo -e "${YELLOW}[WARNING]${NC} $1" -} - -print_error() { - echo -e "${RED}[ERROR]${NC} $1" -} - -# Function to check prerequisites -check_prerequisites() { - print_status "Checking prerequisites..." - - # Check if kubectl is available - if ! command -v kubectl &> /dev/null; then - print_error "kubectl is not installed or not in PATH" - exit 1 - fi - - # Check if helm is available - if ! command -v helm &> /dev/null; then - print_error "helm is not installed or not in PATH" - exit 1 - fi - - # Check if earthly is available (for Dynamo build) - if ! command -v earthly &> /dev/null; then - print_warning "earthly is not installed. You'll need to build Dynamo components manually." - print_warning "Install earthly from: https://earthly.dev/get-earthly" - fi - - # Check cluster connectivity - if ! kubectl cluster-info &> /dev/null; then - print_error "Cannot connect to Kubernetes cluster" - exit 1 - fi - - print_success "Prerequisites check completed" -} - -# Function to deploy NVIDIA Dynamo Cloud Platform -deploy_dynamo_platform() { - print_status "Deploying NVIDIA Dynamo Cloud Platform..." - - # Check if Dynamo repository is available - if [ ! -d "dynamo" ]; then - print_status "Cloning NVIDIA Dynamo repository..." - git clone https://github.com/ai-dynamo/dynamo.git - fi - - cd dynamo - - # Source configuration - if [ -f "${SCRIPT_DIR}/dynamo-cloud-deployment.yaml" ]; then - print_status "Loading Dynamo configuration..." - source "${SCRIPT_DIR}/dynamo-cloud-deployment.yaml" 2>/dev/null || true - fi - - # Set default values if not provided - export DOCKER_SERVER=${DOCKER_SERVER:-"nvcr.io/your-org"} - export IMAGE_TAG=${IMAGE_TAG:-"latest"} - export NAMESPACE=${NAMESPACE:-$DYNAMO_NAMESPACE} - export DOCKER_USERNAME=${DOCKER_USERNAME:-""} - export DOCKER_PASSWORD=${DOCKER_PASSWORD:-""} - - print_status "Building Dynamo Cloud Platform components..." - print_warning "This step requires earthly and may take several minutes..." - - # Build and push components (if earthly is available) - if command -v earthly &> /dev/null; then - earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG || { - print_error "Failed to build Dynamo components. Please check your configuration." - print_warning "You may need to:" - print_warning "1. Update DOCKER_SERVER in dynamo-cloud-deployment.yaml" - print_warning "2. Ensure you're logged into your container registry" - print_warning "3. 
Have proper permissions to push images" - exit 1 - } - else - print_warning "Skipping build step - earthly not available" - print_warning "Please build components manually or use pre-built images" - fi - - # Create namespace - kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f - - kubectl config set-context --current --namespace=$NAMESPACE - - # Deploy the platform - print_status "Deploying Dynamo Cloud Platform Helm charts..." - cd deploy/cloud/helm - - # Install CRDs first - ./deploy.sh --crds || { - print_error "Failed to deploy Dynamo CRDs" - exit 1 - } - - print_success "NVIDIA Dynamo Cloud Platform deployed successfully" - cd ../../.. -} - -# Function to deploy LLM models using Dynamo -deploy_llm_models() { - print_status "Deploying LLM models using Dynamo..." - - # Wait for Dynamo platform to be ready - print_status "Waiting for Dynamo platform to be ready..." - kubectl wait --for=condition=available --timeout=300s deployment/dynamo-store -n $DYNAMO_NAMESPACE || { - print_warning "Dynamo platform may not be fully ready. Continuing anyway..." - } - - # Expose Dynamo API store for CLI access - print_status "Setting up Dynamo API access..." - kubectl port-forward svc/dynamo-store 8080:80 -n $DYNAMO_NAMESPACE & - PORT_FORWARD_PID=$! - export DYNAMO_CLOUD="http://localhost:8080" - - # Wait for port-forward to be ready - sleep 5 - - # Build and deploy LLM inference graph - print_status "Building and deploying LLM inference graph..." - - # This would typically involve: - # 1. dynamo build --push graphs.disagg:Frontend - # 2. Creating DynamoGraphDeployment CRD - - # For now, we'll apply the example deployment - kubectl apply -f "${SCRIPT_DIR}/dynamo-llm-deployment.yaml" || { - print_error "Failed to deploy LLM models" - kill $PORT_FORWARD_PID 2>/dev/null || true - exit 1 - } - - # Clean up port-forward - kill $PORT_FORWARD_PID 2>/dev/null || true - - print_success "LLM models deployed successfully" -} - -# Function to deploy LLM Router -deploy_llm_router() { - print_status "Deploying LLM Router..." - - # Create router namespace - kubectl create namespace $ROUTER_NAMESPACE --dry-run=client -o yaml | kubectl apply -f - - - # Create ConfigMap for router configuration - kubectl create configmap router-config-dynamo \ - --from-file="${SCRIPT_DIR}/router-config-dynamo.yaml" \ - -n $ROUTER_NAMESPACE \ - --dry-run=client -o yaml | kubectl apply -f - - - # Add NVIDIA Helm repository - helm repo add nvidia-llm-router https://helm.ngc.nvidia.com/nvidia-ai-blueprints/llm-router || { - print_warning "Failed to add NVIDIA LLM Router Helm repository" - print_warning "You may need to configure NGC access or use a different repository" - } - helm repo update - - # Deploy LLM Router using Helm - helm upgrade --install llm-router nvidia-llm-router/llm-router \ - --namespace $ROUTER_NAMESPACE \ - --values "${SCRIPT_DIR}/llm-router-values-override.yaml" \ - --wait \ - --timeout=10m || { - print_error "Failed to deploy LLM Router" - exit 1 - } - - print_success "LLM Router deployed successfully" -} - -# Function to verify deployment -verify_deployment() { - print_status "Verifying deployment..." - - # Check Dynamo platform - print_status "Checking Dynamo platform status..." - kubectl get pods -n $DYNAMO_NAMESPACE - kubectl get dynamographdeployment -n $DYNAMO_NAMESPACE 2>/dev/null || { - print_warning "DynamoGraphDeployment CRD may not be available yet" - } - - # Check LLM Router - print_status "Checking LLM Router status..." 
- kubectl get pods -n $ROUTER_NAMESPACE - kubectl get svc -n $ROUTER_NAMESPACE - - # Test connectivity - print_status "Testing service connectivity..." - - # Check if Dynamo service is accessible - if kubectl get svc dynamo-llm-service -n $DYNAMO_NAMESPACE &>/dev/null; then - print_success "Dynamo LLM service is available" - else - print_warning "Dynamo LLM service not found - may still be starting" - fi - - # Check if Router service is accessible - if kubectl get svc llm-router -n $ROUTER_NAMESPACE &>/dev/null; then - print_success "LLM Router service is available" - - # Get router endpoint - ROUTER_IP=$(kubectl get svc llm-router -n $ROUTER_NAMESPACE -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null || echo "") - if [ -n "$ROUTER_IP" ]; then - print_success "LLM Router external IP: $ROUTER_IP" - else - print_status "LLM Router is using ClusterIP. Use port-forward for external access:" - print_status "kubectl port-forward svc/llm-router 8080:8080 -n $ROUTER_NAMESPACE" - fi - else - print_warning "LLM Router service not found" - fi - - print_success "Deployment verification completed" -} - -# Function to show usage -show_usage() { - echo "Usage: $0 [OPTIONS]" - echo "" - echo "Deploy NVIDIA Dynamo Cloud Platform with LLM Router integration" - echo "" - echo "Options:" - echo " --dynamo-only Deploy only Dynamo Cloud Platform" - echo " --router-only Deploy only LLM Router (requires existing Dynamo)" - echo " --verify-only Only verify existing deployment" - echo " --help Show this help message" - echo "" - echo "Prerequisites:" - echo " - kubectl configured for your cluster" - echo " - helm 3.x installed" - echo " - earthly installed (for building Dynamo components)" - echo " - Access to NVIDIA NGC registry" - echo "" - echo "Configuration:" - echo " Edit dynamo-cloud-deployment.yaml to configure registry settings" -} - -# Main deployment function -main() { - local dynamo_only=false - local router_only=false - local verify_only=false - - # Parse command line arguments - while [[ $# -gt 0 ]]; do - case $1 in - --dynamo-only) - dynamo_only=true - shift - ;; - --router-only) - router_only=true - shift - ;; - --verify-only) - verify_only=true - shift - ;; - --help) - show_usage - exit 0 - ;; - *) - print_error "Unknown option: $1" - show_usage - exit 1 - ;; - esac - done - - print_status "Starting NVIDIA Dynamo + LLM Router deployment..." - - check_prerequisites - - if [ "$verify_only" = true ]; then - verify_deployment - exit 0 - fi - - if [ "$router_only" = false ]; then - deploy_dynamo_platform - deploy_llm_models - fi - - if [ "$dynamo_only" = false ]; then - deploy_llm_router - fi - - verify_deployment - - print_success "Deployment completed successfully!" - print_status "" - print_status "Next steps:" - print_status "1. Access LLM Router: kubectl port-forward svc/llm-router 8080:8080 -n $ROUTER_NAMESPACE" - print_status "2. Test routing: curl http://localhost:8080/v1/chat/completions" - print_status "3. Monitor with: kubectl logs -f deployment/llm-router -n $ROUTER_NAMESPACE" -} - -# Run main function -main "$@" \ No newline at end of file diff --git a/customizations/LLM Router/dynamo-cloud-deployment.yaml b/customizations/LLM Router/dynamo-cloud-deployment.yaml deleted file mode 100644 index d9931b8..0000000 --- a/customizations/LLM Router/dynamo-cloud-deployment.yaml +++ /dev/null @@ -1,57 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -# NVIDIA Dynamo Cloud Platform Deployment Configuration -# This file contains environment variables and configuration for deploying -# the official NVIDIA Dynamo Cloud Platform -# -# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html - -# Container Registry Configuration -DOCKER_SERVER=nvcr.io/your-org -IMAGE_TAG=latest -DOCKER_USERNAME=your-username -DOCKER_PASSWORD=your-password - -# Dynamo Cloud Platform Configuration -NAMESPACE=dynamo-cloud - -# Optional: Pipeline-specific registry (if different from platform registry) -PIPELINES_DOCKER_SERVER=nvcr.io/your-org -PIPELINES_DOCKER_USERNAME=your-username -PIPELINES_DOCKER_PASSWORD=your-password - -# External Access Configuration (choose one) -# Option 1: Ingress -INGRESS_ENABLED=true -INGRESS_CLASS=nginx - -# Option 2: Istio (alternative to Ingress) -# ISTIO_ENABLED=true -# ISTIO_GATEWAY=istio-system/istio-ingressgateway - -# Deployment Notes: -# 1. Build and push Dynamo Cloud Platform components: -# earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG -# -# 2. Deploy the platform: -# cd deploy/cloud/helm -# kubectl create namespace $NAMESPACE -# kubectl config set-context --current --namespace=$NAMESPACE -# ./deploy.sh --crds -# -# 3. Expose the platform: -# kubectl port-forward svc/dynamo-store :80 -n $NAMESPACE -# export DYNAMO_CLOUD=http://localhost: \ No newline at end of file From dc5ca83a1acc2dc1591fb5a960acde42e6c6edc3 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Mon, 30 Jun 2025 06:30:41 +0000 Subject: [PATCH 06/17] Enhance NVIDIA Dynamo Customizations documentation and configuration - Updated `README.md` to reflect the new directory name and provide an overview of NVIDIA Dynamo Customizations. - Added detailed sections on available customizations, including LLM Router integration and its benefits. - Modified `dynamo-llm-deployment.yaml` to clarify component reference instructions. - Enhanced `llm-router-values-override.yaml` with additional configuration options for API base and namespace. - Updated `router-config-dynamo.yaml` to utilize environment variables for API key management and service endpoints. - Expanded `README.md` in the LLM Router directory with comprehensive deployment instructions and configuration validation steps. 
--- customizations/LLM Router/README.md | 575 +++++++++++++++--- .../LLM Router/dynamo-llm-deployment.yaml | 5 +- .../llm-router-values-override.yaml | 29 +- .../LLM Router/router-config-dynamo.yaml | 85 +-- customizations/README.md | 11 +- 5 files changed, 578 insertions(+), 127 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index b67f40b..f6dbc59 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -2,7 +2,153 @@ This guide provides step-by-step instructions for deploying [NVIDIA LLM Router](https://github.com/NVIDIA-AI-Blueprints/llm-router) with the official [NVIDIA Dynamo Cloud Platform](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) on Kubernetes. -## Overview +## NVIDIA LLM Router and Dynamo Integration + +### Overview + +This integration combines two powerful NVIDIA technologies to create an intelligent, scalable LLM serving platform: + +1. **NVIDIA Dynamo Cloud Platform**: Official distributed inference serving framework with disaggregated serving capabilities +2. **NVIDIA LLM Router**: Intelligent request routing based on task classification and complexity analysis + +Together, they provide a complete solution for deploying multiple LLMs with automatic routing based on request characteristics, maximizing both performance and cost efficiency. + +### Architecture Overview + +``` +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ Client App │───▶│ LLM Router │───▶│ Dynamo Platform │ +│ │ │ │ │ │ +│ OpenAI API │ │ • Task Router │ │ • Frontend │ +│ Compatible │ │ • Complexity │ │ • Processor │ +│ │ │ Router │ │ • VllmWorker │ +│ │ │ │ │ • PrefillWorker │ +└─────────────────┘ └─────────────────┘ │ • Router │ + └─────────────────┘ +``` + +### Key Benefits + +- **Intelligent Routing**: Automatically routes requests to the most appropriate model based on task type or complexity +- **Cost Optimization**: Uses smaller, faster models for simple tasks and larger models only when needed +- **High Performance**: Rust-based router with minimal latency overhead +- **Scalability**: Dynamo's disaggregated serving handles multiple models efficiently +- **OpenAI Compatibility**: Drop-in replacement for existing OpenAI API applications + +### Integration Components + +#### 1. NVIDIA Dynamo Cloud Platform +- **Purpose**: Distributed LLM inference serving +- **Features**: Disaggregated serving, KV cache management, multi-model support +- **Deployment**: Kubernetes-native with custom resources +- **Models Supported**: Multiple LLMs (Llama, Mixtral, Phi, Nemotron, etc.) + +#### 2. NVIDIA LLM Router +- **Purpose**: Intelligent request routing and model selection +- **Features**: OpenAI API compliant, flexible policy system, configurable backends, performant routing +- **Architecture**: Rust-based controller + Triton inference server +- **Routing Policies**: Task classification (12 categories), complexity analysis (7 categories), custom policy creation +- **Customization**: Fine-tune models for domain-specific routing (e.g., banking intent classification) + +#### 3. 
Integration Configuration +- **Router Policies**: Define routing rules for different task types +- **Model Mapping**: Map router decisions to Dynamo-served models +- **Service Discovery**: Kubernetes-native service communication +- **Security**: API key management via Kubernetes secrets + +### Routing Strategies + +#### Task-Based Routing +Routes requests based on the type of task being performed: + +| Task Type | Target Model | Use Case | +|-----------|--------------|----------| +| Code Generation | llama-3.3-nemotron-super-49b-v1 | Programming tasks | +| Brainstorming | llama-3.1-70b-instruct | Creative ideation | +| Chatbot | mixtral-8x22b-instruct-v0.1 | Conversational AI | +| Summarization | llama-3.1-70b-instruct | Text summarization | +| Open QA | llama-3.1-70b-instruct | Complex questions | +| Closed QA | llama-3.1-70b-instruct | Simple Q&A | +| Classification | llama-3.1-8b-instruct | Text classification | +| Extraction | llama-3.1-8b-instruct | Information extraction | +| Rewrite | llama-3.1-8b-instruct | Text rewriting | +| Text Generation | mixtral-8x22b-instruct-v0.1 | General text generation | +| Other | mixtral-8x22b-instruct-v0.1 | Miscellaneous tasks | +| Unknown | llama-3.1-8b-instruct | Unclassified tasks | + +#### Complexity-Based Routing +Routes requests based on the complexity of the task: + +| Complexity Level | Target Model | Use Case | +|------------------|--------------|----------| +| Creativity | llama-3.1-70b-instruct | Creative tasks | +| Reasoning | llama-3.3-nemotron-super-49b-v1 | Complex reasoning | +| Contextual-Knowledge | llama-3.1-8b-instruct | Context-dependent tasks | +| Few-Shot | llama-3.1-70b-instruct | Tasks with examples | +| Domain-Knowledge | mixtral-8x22b-instruct-v0.1 | Specialized knowledge | +| No-Label-Reason | llama-3.1-8b-instruct | Unclassified complexity | +| Constraint | llama-3.1-8b-instruct | Tasks with constraints | + +### Performance Benefits + +1. **Reduced Latency**: Smaller models handle simple tasks faster +2. **Cost Efficiency**: Expensive large models used only when necessary +3. **Higher Throughput**: Better resource utilization across model pool +4. **Scalability**: Independent scaling of router and serving components + +### API Usage Example + +```bash +# Task-based routing example +curl -X POST http://llm-router.local/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "", + "messages": [{"role": "user", "content": "Write a Python function to sort a list"}], + "max_tokens": 512, + "nim-llm-router": { + "policy": "task_router" + } + }' + +# Complexity-based routing example +curl -X POST http://llm-router.local/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "", + "messages": [{"role": "user", "content": "Explain quantum entanglement"}], + "max_tokens": 512, + "nim-llm-router": { + "policy": "complexity_router" + } + }' +``` + +### How Dynamo Model Routing Works + +The key insight is that Dynamo provides a **single gateway endpoint** that routes to different models based on the `model` parameter in the OpenAI-compatible API request: + +1. **Single Endpoint**: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1` +2. **Model-Based Routing**: Dynamo routes internally based on the `model` field in requests +3. 
**OpenAI Compatibility**: Standard OpenAI API format with model selection + +Example request: +```json +{ + "model": "llama-3.1-70b-instruct", // Dynamo routes based on this + "messages": [...], + "temperature": 0.7 +} +``` + +Dynamo's internal architecture handles: +- Model registry and discovery +- Request parsing and routing +- Load balancing across replicas +- KV cache management +- Disaggregated serving coordination + +## Integration Deployment Overview This integration demonstrates how to deploy the official NVIDIA Dynamo Cloud Platform for distributed LLM inference and route requests intelligently using the NVIDIA LLM Router. The setup includes: @@ -21,8 +167,8 @@ The integration consists of: ### Key Components - **dynamo-llm-deployment.yaml**: DynamoGraphDeployment for multi-LLM inference -- **router-config-dynamo.yaml**: Router policies for Dynamo integration -- **llm-router-values-override.yaml**: Helm values for LLM Router with Dynamo integration +- **router-config-dynamo.yaml**: Router policies for Dynamo integration (uses `${DYNAMO_API_BASE}` variable) +- **llm-router-values-override.yaml**: Helm values for LLM Router with Dynamo integration (defines `dynamo.api_base` variable) ## Prerequisites @@ -47,138 +193,286 @@ You'll need to configure these environment variables before deployment: | `DOCKER_PASSWORD` | Registry password/token | `your-password` | | `NAMESPACE` | Kubernetes namespace | `dynamo-cloud` | +### Configuration Variables + +The deployment uses a configurable `api_base` variable for flexible endpoint management: + +| Variable | File | Description | Default Value | +|----------|------|-------------|---------------| +| `dynamo.api_base` | `llm-router-values-override.yaml` | Dynamo LLM endpoint URL | `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080` | +| `${DYNAMO_API_BASE}` | `router-config-dynamo.yaml` | Template variable substituted during deployment | Derived from `dynamo.api_base` | + +This approach allows you to: +- **Switch environments** by changing only the `dynamo.api_base` value +- **Override during deployment** with `--set dynamo.api_base=http://new-endpoint:8080` +- **Use different values files** for different environments (dev/staging/prod) + ### Resource Requirements -**Total GPU Requirements**: 8 GPUs for models + 1 GPU for LLM Router = **9 GPUs** +**Minimum Requirements for Testing**: +- **Local Development**: 1 GPU for single model serving +- **Production Deployment**: Varies based on models and architecture choice + +**Architecture Options**: -The `dynamo-llm-deployment.yaml` file defines a `DynamoGraphDeployment` with: +1. **Aggregated Serving** (Simplest): + - Single worker handles both prefill and decode + - Minimum: 1 GPU per model + - Good for: Development, testing, small-scale deployments -- **Frontend**: API gateway (1 replica) -- **Processor**: Request processing (1 replica) -- **VllmWorker**: Multi-model inference (1 replica, 4 GPUs) -- **PrefillWorker**: Disaggregated prefill (2 replicas, 2 GPUs each) -- **Router**: KV-aware routing (1 replica) +2. 
**Disaggregated Serving** (Production): + - Separate workers for prefill and decode + - Allows independent scaling + - Better resource utilization for high-throughput scenarios + +**Component Resource Allocation**: +- **Frontend**: CPU-only (handles HTTP requests) +- **Processor**: CPU-only (request processing) +- **VllmWorker**: GPU required (model inference) +- **PrefillWorker**: GPU required (prefill operations) +- **Router**: CPU-only (KV-aware routing) +- **LLM Router**: 1 GPU (routing model inference) ## Deployment Guide -This guide walks you through deploying NVIDIA Dynamo Cloud Platform and LLM Router step by step. +This guide walks you through deploying NVIDIA Dynamo and LLM Router step by step using the official deployment methods. ### Step 1: Prepare Your Environment -First, ensure you have all prerequisites and configure your environment variables: +First, ensure you have all prerequisites: ```bash -# Configure your container registry credentials -export DOCKER_SERVER=nvcr.io/your-org # Replace with your registry URL -export IMAGE_TAG=latest # Or specific version tag -export NAMESPACE=dynamo-cloud # Kubernetes namespace for Dynamo -export DOCKER_USERNAME=your-username # Replace with your registry username -export DOCKER_PASSWORD=your-password # Replace with your registry password - -# Verify your configuration -echo "Registry: $DOCKER_SERVER" -echo "Namespace: $NAMESPACE" -echo "Image Tag: $IMAGE_TAG" +# Install Dynamo SDK (recommended method) +apt-get update +DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0 +python3 -m venv venv +source venv/bin/activate +pip install "ai-dynamo[all]" + +# Set up required services (etcd and NATS) +git clone https://github.com/ai-dynamo/dynamo.git +cd dynamo +docker compose -f deploy/metrics/docker-compose.yml up -d ``` -### Step 2: Deploy NVIDIA Dynamo Cloud Platform +### Step 2: Deploy Dynamo Cloud Platform (For Kubernetes) + +**For Kubernetes deployment, you must first deploy the Dynamo Cloud Platform:** ```bash -# 1. Clone the official Dynamo repository -git clone https://github.com/ai-dynamo/dynamo.git +# Set environment variables for Dynamo Cloud Platform +export DOCKER_SERVER=your-registry.com +export IMAGE_TAG=latest +export NAMESPACE=dynamo-cloud +export DOCKER_USERNAME=your-username +export DOCKER_PASSWORD=your-password + +# Build and push Dynamo Cloud Platform components cd dynamo - -# 2. Build and push Dynamo components to your registry earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG -# 3. Create namespace and set context +# Create namespace and deploy the platform kubectl create namespace $NAMESPACE kubectl config set-context --current --namespace=$NAMESPACE - -# 4. Deploy the Dynamo Cloud Platform cd deploy/cloud/helm ./deploy.sh --crds -# 5. Wait for platform to be ready -kubectl wait --for=condition=ready pod -l app=dynamo-store --timeout=300s - -# 6. Verify platform deployment +# Verify platform deployment kubectl get pods -n $NAMESPACE -kubectl get crd | grep dynamo ``` -### Step 3: Deploy LLM Inference Services +### Step 3: Deploy NVIDIA Dynamo LLM Services + +**Option A: Local Development (Recommended for testing)** ```bash -# 1. Navigate back to your configuration directory -cd /path/to/your/llm-router-config +# Navigate to LLM examples +cd examples/llm + +# Start aggregated serving (single worker for prefill and decode) +dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml -# 2. 
Review and customize the LLM deployment -# Edit dynamo-llm-deployment.yaml to adjust: -# - Model selection in VllmWorker -# - GPU resource requirements -# - Replica counts +# Or start disaggregated serving (separate prefill and decode workers) +# dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml +``` -# 3. Deploy the LLM inference graph -kubectl apply -f dynamo-llm-deployment.yaml +**Option B: Kubernetes Deployment** -# 4. Wait for LLM services to be ready -kubectl wait --for=condition=ready pod -l dynamo-component=Frontend --timeout=600s +```bash +# Set up Dynamo Cloud access +kubectl port-forward svc/dynamo-store 8080:80 -n dynamo-cloud & +export DYNAMO_CLOUD=http://localhost:8080 -# 5. Verify LLM deployment -kubectl get dynamographdeployment -n dynamo-cloud +# Build and deploy your LLM service +cd examples/llm +DYNAMO_TAG=$(dynamo build hello_world:Frontend | grep "Successfully built" | awk '{ print $3 }' | sed 's/\.$//') +dynamo deployment create $DYNAMO_TAG -n llm-deployment + +# Monitor deployment kubectl get pods -n dynamo-cloud -kubectl get svc dynamo-llm-service -n dynamo-cloud ``` -### Step 4: Deploy LLM Router +### Step 4: Test Dynamo LLM Services + +Once Dynamo is running, test the LLM services: + +```bash +# Test with a sample request +curl localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", + "messages": [ + { + "role": "user", + "content": "Hello, how are you?" + } + ], + "stream": false, + "max_tokens": 30 + }' | jq + +# For Kubernetes deployment, use port forwarding to access the service +kubectl port-forward svc/llm-deployment-frontend 3000:3000 -n dynamo-cloud +``` + +### Step 5: Set Up LLM Router API Keys + +**IMPORTANT**: The router configuration uses Kubernetes secrets for API key management following the [official NVIDIA pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml). ```bash # 1. Create the LLM Router namespace kubectl create namespace llm-router -# 2. Create ConfigMap for router configuration -kubectl create configmap router-config-dynamo \ - --from-file=router-config-dynamo.yaml \ - -n llm-router - -# 3. Verify ConfigMap was created correctly -kubectl describe configmap router-config-dynamo -n llm-router +# 2. Create secret for Dynamo API key (if authentication is required) +# Note: For local Dynamo deployments, API keys may not be required +kubectl create secret generic dynamo-api-secret \ + --from-literal=DYNAMO_API_KEY="your-dynamo-api-key-here" \ + --namespace=llm-router + +# 3. (Optional) Create image pull secret for private registries (only if using private container registry) +kubectl create secret docker-registry nvcr-secret \ + --docker-server=nvcr.io \ + --docker-username='$oauthtoken' \ + --docker-password="your-ngc-api-key-here" \ + --namespace=llm-router + +# 4. Verify secrets were created +kubectl get secrets -n llm-router +``` -# 4. Add the official NVIDIA LLM Router Helm repository -helm repo add nvidia-llm-router https://helm.ngc.nvidia.com/nvidia-ai-blueprints/llm-router -helm repo update +### Step 6: Deploy LLM Router -# 5. Review and customize the Helm values -# Edit llm-router-values-override.yaml to adjust: -# - Ingress hostname (change llm-router.local to your domain) -# - Resource requirements -# - GPU allocation +**Note**: The NVIDIA LLM Router does not have an official Helm repository. 
You must clone the GitHub repository and deploy using local Helm charts. -# 6. Deploy LLM Router using Helm -helm upgrade --install llm-router nvidia-llm-router/llm-router \ +```bash +# 1. Clone the NVIDIA LLM Router repository (required for Helm charts) +git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git +cd llm-router + +# 2. Build and push LLM Router images to your registry +docker build -t your-registry.com/router-server:latest -f src/router-server/router-server.dockerfile . +docker build -t your-registry.com/router-controller:latest -f src/router-controller/router-controller.dockerfile . +docker push your-registry.com/router-server:latest +docker push your-registry.com/router-controller:latest + +# 3. Create API key secret (using dummy key for Dynamo integration) +kubectl create secret generic llm-api-keys \ + --from-literal=nvidia_api_key=dummy-key-for-dynamo \ + --namespace=llm-router + +# 4. Prepare router models (download from NGC) +# Follow the main project README to download models to local 'routers/' directory +# Then create PVC and upload models: + +kubectl apply -f - < values.dynamo.yaml < $INGRESS_IP" +# For development/testing, use port forwarding to access LLM Router +kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router -# Option 2: For local testing -# Add entry to /etc/hosts file -INGRESS_IP=$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') -echo "$INGRESS_IP llm-router.local" | sudo tee -a /etc/hosts +# Test the LLM Router API +curl http://localhost:8084/health ``` ## Configuration @@ -213,6 +507,26 @@ INGRESS_IP=$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status. echo "$INGRESS_IP llm-router.local" | sudo tee -a /etc/hosts ``` +### API Key Management + +The router configuration uses **environment variable substitution** for secure API key management, following the [official NVIDIA LLM Router pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml): + +```yaml +# In router-config-dynamo.yaml +llms: + - name: Brainstorming + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 + api_key: "${DYNAMO_API_KEY}" # Resolved from Kubernetes secret + model: llama-3.1-70b-instruct +``` + +The LLM Router controller: +1. Reads `DYNAMO_API_KEY` from the Kubernetes secret +2. Replaces `${DYNAMO_API_KEY}` placeholders in the configuration +3. Uses the actual API key value for authentication with Dynamo services + +**Security Note**: Never use empty strings (`""`) for API keys. Always use proper Kubernetes secrets with environment variable references. 
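+
+A quick way to sanity-check this substitution is to render the configuration locally with `envsubst` (the same tool used later in the Configuration Validation section). This is an illustrative preview only, assuming `envsubst` is installed and using example values; in the cluster, the router controller resolves these variables from the ConfigMap and Secret at runtime:
+
+```bash
+# Example values for a local preview (adjust to match your deployment)
+export DYNAMO_API_BASE="http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080"
+export DYNAMO_API_KEY="your-dynamo-api-key-here"
+
+# Render the router configuration and inspect the resolved endpoints and keys
+envsubst < router-config-dynamo.yaml | grep -E 'api_base|api_key'
+```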
+ ### Router Configuration The `router-config-dynamo.yaml` configures routing policies: @@ -240,7 +554,90 @@ The `router-config-dynamo.yaml` configures routing policies: | No-Label-Reason | llama-3.1-8b-instruct | Simple reasoning | | Constraint | phi-3-medium-128k-instruct | Constrained tasks | -All routes point to: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1` +All routes point to: `${DYNAMO_API_BASE}/v1` (configured via environment variable) + +## Testing the Integration + +Once both Dynamo and LLM Router are deployed, test the complete integration: + +```bash +# Test LLM Router with task-based routing +curl -X POST http://localhost:8084/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { + "role": "user", + "content": "Write a Python function to calculate fibonacci numbers" + } + ], + "model": "", + "nim-llm-router": { + "policy": "task_router", + "routing_strategy": "triton", + "model": "" + } + }' | jq + +# Test with complexity-based routing +curl -X POST http://localhost:8084/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { + "role": "user", + "content": "Explain quantum computing in simple terms" + } + ], + "model": "", + "nim-llm-router": { + "policy": "complexity_router", + "routing_strategy": "triton", + "model": "" + } + }' | jq + +# Monitor routing decisions in LLM Router logs +kubectl logs -f deployment/llm-router-router-controller -n llm-router + +# Monitor Dynamo inference logs +kubectl logs -f deployment/llm-deployment-frontend -n dynamo-cloud +``` + +## Configuration Validation + +Before deploying, validate your configuration files: + +### 1. Validate Dynamo Configuration + +```bash +# For local development, test the service directly +curl http://localhost:8000/health + +# For Kubernetes deployment, check service status +kubectl get pods -n dynamo-cloud +kubectl get svc -n dynamo-cloud + +# Test the Dynamo API endpoint +kubectl port-forward svc/dynamo-frontend 8000:8000 -n dynamo-cloud & +curl http://localhost:8000/v1/models +``` + +### 2. Validate Router Configuration + +```bash +# Check if environment variable substitution will work +export DYNAMO_API_BASE="http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080" +envsubst < router-config-dynamo.yaml | kubectl apply --dry-run=client -f - +``` + +### 3. Validate Helm Values + +```bash +# Validate the Helm values file +cd llm-router/deploy/helm/llm-router +helm template llm-router . --values ../../../../llm-router-values-override.yaml --dry-run +``` ## Verification and Testing @@ -408,6 +805,22 @@ cd dynamo/deploy/cloud/helm ./deploy.sh --uninstall ``` +## Quick Configuration Checklist + +Before deployment, ensure you customize these key settings: + +1. **`dynamo-llm-deployment.yaml`**: + - Update `dynamoComponent: frontend:latest` with your actual component reference + - Adjust GPU resource requirements based on your hardware + +2. **`llm-router-values-override.yaml`**: + - Change `host: llm-router.local` to your actual domain + - Update `api_base` URL if using external Dynamo deployment + +3. 
**Environment Variables**: + - Set `DOCKER_SERVER`, `IMAGE_TAG`, `NAMESPACE` before deployment + - Create `DYNAMO_API_KEY` secret during Step 4 + ## Files in This Directory - **`README.md`** - This comprehensive deployment guide diff --git a/customizations/LLM Router/dynamo-llm-deployment.yaml b/customizations/LLM Router/dynamo-llm-deployment.yaml index 2e4a397..a5aeb66 100644 --- a/customizations/LLM Router/dynamo-llm-deployment.yaml +++ b/customizations/LLM Router/dynamo-llm-deployment.yaml @@ -27,8 +27,9 @@ metadata: namespace: dynamo-cloud spec: # Reference to the built and pushed Dynamo component - # This would be generated by: dynamo build --push graphs.disagg:Frontend - dynamoComponent: frontend:jh2o6dqzpsgfued4 + # Replace with your actual component reference after building with: + # earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG + dynamoComponent: frontend:latest # Update this with your actual component reference # Global environment variables for all services envs: diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml index bad73fd..5c29249 100644 --- a/customizations/LLM Router/llm-router-values-override.yaml +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -16,6 +16,18 @@ # LLM Router Helm Values for NVIDIA Dynamo Cloud Platform Integration # This configuration integrates the LLM Router with the official NVIDIA Dynamo deployment # Based on: https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/main/deploy/helm/llm-router +# +# Pure Kubernetes-Native Approach: Uses ConfigMaps for configuration +# +# IMPORTANT: Update the api_base URL to match your Dynamo deployment +# This should point to your actual Dynamo service endpoint + +# Dynamo Configuration - will be created as ConfigMap +dynamoConfig: + api_base: "http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080" + namespace: "dynamo-cloud" + # For external Dynamo deployments, use: + # api_base: "https://your-dynamo-endpoint.com" # Router Controller Configuration routerController: @@ -29,16 +41,23 @@ routerController: type: ClusterIP port: 8084 - # Configure to route to NVIDIA Dynamo Cloud Platform + # Configure environment variables from ConfigMap and Secrets env: - name: LOG_LEVEL value: "INFO" - name: ENABLE_METRICS value: "true" - name: DYNAMO_ENDPOINT - value: "http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080" + value: "{{ .Values.dynamoConfig.api_base }}" - name: DYNAMO_NAMESPACE - value: "dynamo-cloud" + value: "{{ .Values.dynamoConfig.namespace }}" + - name: DYNAMO_API_BASE + value: "{{ .Values.dynamoConfig.api_base }}" + - name: DYNAMO_API_KEY + valueFrom: + secretKeyRef: + name: dynamo-api-secret + key: DYNAMO_API_KEY resources: requests: @@ -96,7 +115,7 @@ ingress: nginx.ingress.kubernetes.io/proxy-read-timeout: "300" nginx.ingress.kubernetes.io/proxy-send-timeout: "300" hosts: - - host: llm-router.local # Change to your domain + - host: llm-router.local # CHANGE THIS: Replace with your actual domain paths: - path: / pathType: Prefix @@ -104,7 +123,7 @@ ingress: service: name: llm-router port: - number: 8000 + number: 8000 # Router service port tls: [] # - secretName: llm-router-tls # hosts: diff --git a/customizations/LLM Router/router-config-dynamo.yaml b/customizations/LLM Router/router-config-dynamo.yaml index ea78440..9ecef20 100644 --- a/customizations/LLM Router/router-config-dynamo.yaml +++ b/customizations/LLM Router/router-config-dynamo.yaml @@ -18,88 +18,97 @@ 
# deployment using the proper service endpoints # # Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html +# API Key pattern follows: https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml +# +# NOTE: Environment variables are resolved at runtime: +# - ${DYNAMO_API_BASE}: Points to the Dynamo service endpoint +# - ${DYNAMO_API_KEY}: API key for authenticating with Dynamo services +# +# These variables are populated from: +# - ConfigMap: DYNAMO_API_BASE (defined in llm-router-values-override.yaml) +# - Secret: DYNAMO_API_KEY (created during deployment setup) policies: - name: "task_router" url: http://router-server:8000/v2/models/task_router_ensemble/infer llms: - name: Brainstorming - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-70b-instruct - name: Chatbot - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: mixtral-8x22b-instruct - name: "Code Generation" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-nemotron-70b-instruct - name: Summarization - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: phi-3-mini-128k-instruct - name: "Text Generation" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.2-11b-vision-instruct - name: "Open QA" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-405b-instruct - name: "Closed QA" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-8b-instruct - name: Classification - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: phi-3-mini-4k-instruct - name: Extraction - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-8b-instruct - name: Rewrite - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: phi-3-medium-128k-instruct - name: Other - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-70b-instruct - name: Unknown - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-8b-instruct - name: "complexity_router" url: http://router-server:8000/v2/models/complexity_router_ensemble/infer llms: - name: Creativity - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: 
llama-3.1-70b-instruct - name: Reasoning - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.3-nemotron-super-49b - name: "Contextual-Knowledge" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-405b-instruct - name: "Few-Shot" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-70b-instruct - name: "Domain-Knowledge" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-nemotron-70b-instruct - name: "No-Label-Reason" - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: llama-3.1-8b-instruct - name: Constraint - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 - api_key: "" + api_base: ${DYNAMO_API_BASE}/v1 + api_key: "${DYNAMO_API_KEY}" model: phi-3-medium-128k-instruct \ No newline at end of file diff --git a/customizations/README.md b/customizations/README.md index cec3047..d4e5989 100644 --- a/customizations/README.md +++ b/customizations/README.md @@ -1 +1,10 @@ -# Dynamo Customizations +# NVIDIA Dynamo Customizations + +This directory contains configuration files and deployment guides for integrating NVIDIA technologies. + +## Available Customizations + +### LLM Router +Integration with NVIDIA LLM Router for intelligent request routing and model selection. + +**Location**: [`LLM Router/`](LLM%20Router/) From 3d6cb7360ee2161f78d7fc63004e0d86453341cb Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sat, 23 Aug 2025 04:52:42 +0000 Subject: [PATCH 07/17] Implement disaggregated service configuration for LLM Router - Added `disagg.yaml` for deploying the vLLM backend with disaggregated serving capabilities. - Removed the deprecated `dynamo-llm-deployment.yaml` file to streamline the configuration. - Updated `README.md` to include detailed instructions for the new disaggregated deployment setup. - Revised `router-config-dynamo.yaml` to reflect the models currently deployed and their configurations. - Enhanced documentation to clarify the integration of multiple models and their routing strategies. 
--- customizations/LLM Router/README.md | 1187 +++++++++++------ customizations/LLM Router/disagg.yaml | 50 + .../LLM Router/dynamo-llm-deployment.yaml | 148 -- .../LLM Router/router-config-dynamo.yaml | 101 +- 4 files changed, 911 insertions(+), 575 deletions(-) create mode 100644 customizations/LLM Router/disagg.yaml delete mode 100644 customizations/LLM Router/dynamo-llm-deployment.yaml diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index f6dbc59..b09a359 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -1,6 +1,20 @@ -# LLM Router with NVIDIA Dynamo Cloud Platform - Kubernetes Deployment Guide +# LLM Router with NVIDIA Dynamo Cloud Platform +## Kubernetes Deployment Guide -This guide provides step-by-step instructions for deploying [NVIDIA LLM Router](https://github.com/NVIDIA-AI-Blueprints/llm-router) with the official [NVIDIA Dynamo Cloud Platform](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) on Kubernetes. +
+ +[![NVIDIA](https://img.shields.io/badge/NVIDIA-76B900?style=for-the-badge&logo=nvidia&logoColor=white)](https://nvidia.com) +[![Kubernetes](https://img.shields.io/badge/kubernetes-%23326ce5.svg?style=for-the-badge&logo=kubernetes&logoColor=white)](https://kubernetes.io) +[![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge&logo=docker&logoColor=white)](https://docker.com) +[![Helm](https://img.shields.io/badge/Helm-0F1689?style=for-the-badge&logo=Helm&labelColor=0F1689)](https://helm.sh) + +**Intelligent LLM Request Routing with Distributed Inference Serving** + +
+ +--- + +This comprehensive guide provides step-by-step instructions for deploying the [**NVIDIA LLM Router**](https://github.com/NVIDIA-AI-Blueprints/llm-router) with the official [**NVIDIA Dynamo Cloud Platform**](https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html) on Kubernetes. ## NVIDIA LLM Router and Dynamo Integration @@ -8,98 +22,203 @@ This guide provides step-by-step instructions for deploying [NVIDIA LLM Router]( This integration combines two powerful NVIDIA technologies to create an intelligent, scalable LLM serving platform: -1. **NVIDIA Dynamo Cloud Platform**: Official distributed inference serving framework with disaggregated serving capabilities -2. **NVIDIA LLM Router**: Intelligent request routing based on task classification and complexity analysis + + + + + +
+ +### NVIDIA Dynamo Cloud Platform +- **Distributed inference serving framework** +- **Disaggregated serving capabilities** +- **Multi-model support** +- **Kubernetes-native scaling** + + + +### NVIDIA LLM Router +- **Intelligent request routing** +- **Task classification (12 categories)** +- **Complexity analysis (7 categories)** +- **Rust-based performance** + +
+ +> **Result**: A complete solution for deploying multiple LLMs with automatic routing based on request characteristics, maximizing both **performance** and **cost efficiency**. + +### Kubernetes Architecture Overview + +
+ +```mermaid +graph TB + subgraph "Kubernetes Cluster" + subgraph "Ingress Layer" + LB[Load Balancer/Ingress] + end + + subgraph "LLM Router (Helm)" + RC[Router Controller] + RS[Router Server + GPU] + end + + subgraph "Dynamo Platform" + FE[Frontend Service] + PR[Processor] + VW[VllmDecodeWorker + GPU] + PW[VllmPrefillWorker + GPU] + RT[Router] + PL[Planner] + end + end + + LB --> RC + RC --> RS + RS --> FE + FE --> PR + PR --> VW + PR --> PW + VW --> RT + PW --> RT + RT --> PL + + style LB fill:#e1f5fe + style RC fill:#f3e5f5 + style RS fill:#f3e5f5 + style FE fill:#e8f5e8 + style PR fill:#e8f5e8 + style VW fill:#fff3e0 + style PW fill:#fff3e0 + style RT fill:#e8f5e8 + style PL fill:#e8f5e8 +``` -Together, they provide a complete solution for deploying multiple LLMs with automatic routing based on request characteristics, maximizing both performance and cost efficiency. +
-### Architecture Overview +### Key Benefits -``` -┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ Client App │───▶│ LLM Router │───▶│ Dynamo Platform │ -│ │ │ │ │ │ -│ OpenAI API │ │ • Task Router │ │ • Frontend │ -│ Compatible │ │ • Complexity │ │ • Processor │ -│ │ │ Router │ │ • VllmWorker │ -│ │ │ │ │ • PrefillWorker │ -└─────────────────┘ └─────────────────┘ │ • Router │ - └─────────────────┘ -``` +
-### Key Benefits +| **Feature** | **Benefit** | **Impact** | +|:---:|:---:|:---:| +| **Intelligent Routing** | Auto-routes by task/complexity | **Optimal Model Selection** | +| **Cost Optimization** | Small models for simple tasks | **Reduced Infrastructure Costs** | +| **High Performance** | Rust-based minimal latency | **Sub-millisecond Routing** | +| **Scalability** | Disaggregated multi-model serving | **Enterprise-Grade Throughput** | +| **OpenAI Compatible** | Drop-in API replacement | **Zero Code Changes** | -- **Intelligent Routing**: Automatically routes requests to the most appropriate model based on task type or complexity -- **Cost Optimization**: Uses smaller, faster models for simple tasks and larger models only when needed -- **High Performance**: Rust-based router with minimal latency overhead -- **Scalability**: Dynamo's disaggregated serving handles multiple models efficiently -- **OpenAI Compatibility**: Drop-in replacement for existing OpenAI API applications +
### Integration Components -#### 1. NVIDIA Dynamo Cloud Platform +
+1. NVIDIA Dynamo Cloud Platform + - **Purpose**: Distributed LLM inference serving - **Features**: Disaggregated serving, KV cache management, multi-model support - **Deployment**: Kubernetes-native with custom resources - **Models Supported**: Multiple LLMs (Llama, Mixtral, Phi, Nemotron, etc.) -#### 2. NVIDIA LLM Router +
+ +
+2. NVIDIA LLM Router + - **Purpose**: Intelligent request routing and model selection -- **Features**: OpenAI API compliant, flexible policy system, configurable backends, performant routing +- **Features**: OpenAI API compliant, flexible policy system, configurable backends - **Architecture**: Rust-based controller + Triton inference server -- **Routing Policies**: Task classification (12 categories), complexity analysis (7 categories), custom policy creation +- **Routing Policies**: Task classification (12 categories), complexity analysis (7 categories) - **Customization**: Fine-tune models for domain-specific routing (e.g., banking intent classification) -#### 3. Integration Configuration +
+ +
+3. Integration Configuration + - **Router Policies**: Define routing rules for different task types - **Model Mapping**: Map router decisions to Dynamo-served models - **Service Discovery**: Kubernetes-native service communication - **Security**: API key management via Kubernetes secrets +
+ ### Routing Strategies +
+ #### Task-Based Routing -Routes requests based on the type of task being performed: - -| Task Type | Target Model | Use Case | -|-----------|--------------|----------| -| Code Generation | llama-3.3-nemotron-super-49b-v1 | Programming tasks | -| Brainstorming | llama-3.1-70b-instruct | Creative ideation | -| Chatbot | mixtral-8x22b-instruct-v0.1 | Conversational AI | -| Summarization | llama-3.1-70b-instruct | Text summarization | -| Open QA | llama-3.1-70b-instruct | Complex questions | -| Closed QA | llama-3.1-70b-instruct | Simple Q&A | -| Classification | llama-3.1-8b-instruct | Text classification | -| Extraction | llama-3.1-8b-instruct | Information extraction | -| Rewrite | llama-3.1-8b-instruct | Text rewriting | -| Text Generation | mixtral-8x22b-instruct-v0.1 | General text generation | -| Other | mixtral-8x22b-instruct-v0.1 | Miscellaneous tasks | -| Unknown | llama-3.1-8b-instruct | Unclassified tasks | +*Routes requests based on the type of task being performed* + +
+ +
+View Task Routing Table + +| **Task Type** | **Target Model** | **Use Case** | +|:---|:---|:---| +| Code Generation | `llama-3.1-70b-instruct` | Programming tasks | +| Brainstorming | `llama-3.1-70b-instruct` | Creative ideation | +| Chatbot | `mixtral-8x22b-instruct-v0.1` | Conversational AI | +| Summarization | `llama-3.1-8b-instruct` | Text summarization | +| Open QA | `llama-3.1-70b-instruct` | Complex questions | +| Closed QA | `llama-3.1-8b-instruct` | Simple Q&A | +| Classification | `llama-3.1-8b-instruct` | Text classification | +| Extraction | `llama-3.1-8b-instruct` | Information extraction | +| Rewrite | `llama-3.1-8b-instruct` | Text rewriting | +| Text Generation | `mixtral-8x22b-instruct-v0.1` | General text generation | +| Other | `mixtral-8x22b-instruct-v0.1` | Miscellaneous tasks | +| Unknown | `llama-3.1-8b-instruct` | Unclassified tasks | + +
+ +--- + +
#### Complexity-Based Routing -Routes requests based on the complexity of the task: - -| Complexity Level | Target Model | Use Case | -|------------------|--------------|----------| -| Creativity | llama-3.1-70b-instruct | Creative tasks | -| Reasoning | llama-3.3-nemotron-super-49b-v1 | Complex reasoning | -| Contextual-Knowledge | llama-3.1-8b-instruct | Context-dependent tasks | -| Few-Shot | llama-3.1-70b-instruct | Tasks with examples | -| Domain-Knowledge | mixtral-8x22b-instruct-v0.1 | Specialized knowledge | -| No-Label-Reason | llama-3.1-8b-instruct | Unclassified complexity | -| Constraint | llama-3.1-8b-instruct | Tasks with constraints | +*Routes requests based on the complexity of the task* + +
+ +
+View Complexity Routing Table + +| **Complexity Level** | **Target Model** | **Use Case** | +|:---|:---|:---| +| Creativity | `llama-3.1-70b-instruct` | Creative tasks | +| Reasoning | `llama-3.1-70b-instruct` | Complex reasoning | +| Contextual-Knowledge | `llama-3.1-8b-instruct` | Context-dependent tasks | +| Few-Shot | `llama-3.1-70b-instruct` | Tasks with examples | +| Domain-Knowledge | `mixtral-8x22b-instruct-v0.1` | Specialized knowledge | +| No-Label-Reason | `llama-3.1-8b-instruct` | Unclassified complexity | +| Constraint | `llama-3.1-8b-instruct` | Tasks with constraints | + +
### Performance Benefits -1. **Reduced Latency**: Smaller models handle simple tasks faster -2. **Cost Efficiency**: Expensive large models used only when necessary -3. **Higher Throughput**: Better resource utilization across model pool -4. **Scalability**: Independent scaling of router and serving components +
+ +| **Metric** | **Improvement** | **How It Works** | +|:---:|:---:|:---| +| **Latency** | `↓ 40-60%` | Smaller models for simple tasks | +| **Cost** | `↓ 30-50%` | Large models only when needed | +| **Throughput** | `↑ 2-3x` | Better resource utilization | +| **Scalability** | `↑ 10x` | Independent component scaling | + +
+ +### API Usage Examples + +
-### API Usage Example +#### Task-Based Routing + +
```bash -# Task-based routing example +# Code generation task → Routes to llama-3.3-nemotron-super-49b-v1 curl -X POST http://llm-router.local/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -110,8 +229,16 @@ curl -X POST http://llm-router.local/v1/chat/completions \ "policy": "task_router" } }' +``` + +
+ +#### Complexity-Based Routing + +
-# Complexity-based routing example +```bash +# Complex reasoning task → Routes to llama-3.3-nemotron-super-49b-v1 curl -X POST http://llm-router.local/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -128,7 +255,7 @@ curl -X POST http://llm-router.local/v1/chat/completions \ The key insight is that Dynamo provides a **single gateway endpoint** that routes to different models based on the `model` parameter in the OpenAI-compatible API request: -1. **Single Endpoint**: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1` +1. **Single Endpoint**: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000/v1` 2. **Model-Based Routing**: Dynamo routes internally based on the `model` field in requests 3. **OpenAI Compatibility**: Standard OpenAI API format with model selection @@ -148,50 +275,66 @@ Dynamo's internal architecture handles: - KV cache management - Disaggregated serving coordination -## Integration Deployment Overview +## Kubernetes Integration Deployment -This integration demonstrates how to deploy the official NVIDIA Dynamo Cloud Platform for distributed LLM inference and route requests intelligently using the NVIDIA LLM Router. The setup includes: +This integration demonstrates how to deploy the official NVIDIA Dynamo Cloud Platform for distributed LLM inference on Kubernetes and route requests intelligently using the NVIDIA LLM Router. The Kubernetes deployment includes: -1. **NVIDIA Dynamo Cloud Platform**: Official distributed inference serving framework with disaggregated serving capabilities -2. **LLM Router**: Intelligent request routing based on task complexity and type -3. **Multiple LLM Models**: Various models deployed via Dynamo's inference graphs +1. **NVIDIA Dynamo Cloud Platform**: Distributed inference serving with Kubernetes operators and custom resources +2. **LLM Router**: Helm-deployed intelligent request routing with GPU-accelerated routing models +3. 
**Multiple LLM Models**: Containerized models deployed via DynamoGraphDeployment CRs -### Architecture -The integration consists of: - -- **NVIDIA Dynamo Cloud Platform**: Official distributed inference serving framework -- **LLM Router**: Routes requests to appropriate models based on task complexity and type -- **Multiple LLM Models**: Various models deployed via Dynamo's disaggregated serving ### Key Components -- **dynamo-llm-deployment.yaml**: DynamoGraphDeployment for multi-LLM inference +- **disagg.yaml**: Official vLLM backend disaggregated service configuration from [components/backends/vllm/deploy/disagg.yaml](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/disagg.yaml) + - Common: Shared configuration (model, block-size, KV connector) + - Frontend: OpenAI-compatible API endpoint (port 8000) with direct VllmDecodeWorker routing + - VllmDecodeWorker: Decode worker with conditional disaggregation and KV caching (1 GPU) + - VllmPrefillWorker: Specialized prefill worker for high-throughput token processing (1 GPU) + - Uses official NGC Dynamo vLLM Runtime container from `DYNAMO_IMAGE` variable - **router-config-dynamo.yaml**: Router policies for Dynamo integration (uses `${DYNAMO_API_BASE}` variable) - **llm-router-values-override.yaml**: Helm values for LLM Router with Dynamo integration (defines `dynamo.api_base` variable) -## Prerequisites +### Disaggregated Serving Configuration + +The deployment uses the official disaggregated serving architecture based on [Dynamo's vLLM backend deployment reference](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/vllm/deploy): + +**Key Features**: +- **Model**: `Qwen/Qwen3-0.6B` (optimized for disaggregated inference) +- **KV Transfer**: Uses `DynamoNixlConnector` for high-performance KV cache transfer +- **Conditional Disaggregation**: Automatically switches between prefill and decode workers +- **Remote Prefill**: Offloads prefill operations to dedicated VllmPrefillWorker instances +- **Prefix Caching**: Enables intelligent caching for improved performance +- **Block Size**: 64 tokens for optimal memory utilization +- **Max Model Length**: 16,384 tokens context window +- **Autoscaling**: Optional Planner component for dynamic worker scaling based on load metrics +- **Load Prediction**: ARIMA-based load forecasting for proactive scaling -Before starting the deployment, ensure you have: -- **Kubernetes cluster** (1.24+) with kubectl configured -- **Helm 3.x** for managing deployments -- **Earthly** for building Dynamo components ([Install Guide](https://earthly.dev/get-earthly)) -- **NVIDIA GPU nodes** with GPU Operator installed -- **Container registry access** (NVIDIA NGC or private registry) -- **Git** for cloning repositories ### Environment Variables -You'll need to configure these environment variables before deployment: +Set the required environment variables for NGC deployment: -| Variable | Description | Example | -|----------|-------------|---------| -| `DOCKER_SERVER` | Your container registry URL | `nvcr.io/your-org` | -| `IMAGE_TAG` | Image tag to use | `latest` or `v1.0.0` | -| `DOCKER_USERNAME` | Registry username | `your-username` | -| `DOCKER_PASSWORD` | Registry password/token | `your-password` | -| `NAMESPACE` | Kubernetes namespace | `dynamo-cloud` | +| Variable | Description | Example | Required | +|----------|-------------|---------|----------| +| `NGC_API_KEY` | NVIDIA NGC API key | `your-ngc-api-key` | Yes | +| `DOCKER_USERNAME` | NGC registry username | `$oauthtoken` | Yes 
| +| `DOCKER_PASSWORD` | NGC API key (same as above) | `$NGC_API_KEY` | Yes | +| `DOCKER_SERVER` | NGC container registry URL | `nvcr.io` | Yes | + +**NGC Setup Instructions**: +1. **Get NGC API Key**: Visit [https://ngc.nvidia.com/setup/api-key](https://ngc.nvidia.com/setup/api-key) to generate your API key +2. **Login to NGC**: Use `docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY` +3. **Prebuilt Images**: NGC provides prebuilt CUDA and ML framework images, eliminating the need for local builds + +**Available NGC Dynamo Images**: +- **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` (recommended) +- **SGLang Runtime**: `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.0` +- **TensorRT-LLM Runtime**: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.0` +- **Dynamo Kubernetes Operator**: `nvcr.io/nvidia/ai-dynamo/dynamo-operator:latest` +- **Dynamo Deployment API**: `nvcr.io/nvidia/ai-dynamo/dynamo-api-store:latest` ### Configuration Variables @@ -199,145 +342,528 @@ The deployment uses a configurable `api_base` variable for flexible endpoint man | Variable | File | Description | Default Value | |----------|------|-------------|---------------| -| `dynamo.api_base` | `llm-router-values-override.yaml` | Dynamo LLM endpoint URL | `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080` | +| `dynamo.api_base` | `llm-router-values-override.yaml` | Dynamo LLM endpoint URL | `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000` | | `${DYNAMO_API_BASE}` | `router-config-dynamo.yaml` | Template variable substituted during deployment | Derived from `dynamo.api_base` | This approach allows you to: - **Switch environments** by changing only the `dynamo.api_base` value -- **Override during deployment** with `--set dynamo.api_base=http://new-endpoint:8080` +- **Override during deployment** with `--set dynamo.api_base=http://new-endpoint:8000` - **Use different values files** for different environments (dev/staging/prod) ### Resource Requirements -**Minimum Requirements for Testing**: -- **Local Development**: 1 GPU for single model serving -- **Production Deployment**: Varies based on models and architecture choice +**Kubernetes Production Deployment**: -**Architecture Options**: +**Minimum Requirements**: +- **Kubernetes cluster** with 4+ GPU nodes for disaggregated serving +- **Each node**: 16+ CPU cores, 64GB+ RAM, 2-4 GPUs +- **Storage**: 500GB+ for model storage (SSD recommended) +- **Network**: High-bandwidth interconnect for multi-node setups -1. **Aggregated Serving** (Simplest): - - Single worker handles both prefill and decode - - Minimum: 1 GPU per model - - Good for: Development, testing, small-scale deployments +**Component Resource Allocation**: +- **Frontend**: 1-2 CPU cores, 2-4GB RAM (handles HTTP requests) +- **Processor**: 2-4 CPU cores, 4-8GB RAM (request processing) +- **VllmDecodeWorker**: 4+ GPU, 8+ CPU cores, 16GB+ RAM (model inference) +- **VllmPrefillWorker**: 2+ GPU, 4+ CPU cores, 8GB+ RAM (prefill operations) +- **Router**: 1-2 CPU cores, 2-4GB RAM (KV-aware routing) +- **LLM Router**: 1 GPU, 2 CPU cores, 4GB RAM (routing model inference) + +**Scaling Considerations**: +- **Disaggregated Serving**: Separate prefill and decode for better throughput +- **Horizontal Scaling**: Multiple VllmDecodeWorker and VllmPrefillWorker replicas +- **GPU Memory**: Adjust based on model size (70B models need 40GB+ VRAM per GPU) -2. 
**Disaggregated Serving** (Production): - - Separate workers for prefill and decode - - Allows independent scaling - - Better resource utilization for high-throughput scenarios +## Prerequisites -**Component Resource Allocation**: -- **Frontend**: CPU-only (handles HTTP requests) -- **Processor**: CPU-only (request processing) -- **VllmWorker**: GPU required (model inference) -- **PrefillWorker**: GPU required (prefill operations) -- **Router**: CPU-only (KV-aware routing) -- **LLM Router**: 1 GPU (routing model inference) +
+ +[![Prerequisites](https://img.shields.io/badge/Prerequisites-Check%20List-blue?style=for-the-badge&logo=checkmk)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#prerequisites) + +*Ensure your environment meets all requirements before deployment* + +
+ +### Required Tools + +
+ +**Verify you have the required tools installed:** + +
+ +```bash +# Required tools verification +kubectl version --client +helm version +docker version +``` + +
+ +| **Tool** | **Requirement** | **Status** | +|:---:|:---:|:---:| +| **kubectl** | `v1.24+` | Check with `kubectl version --client` | +| **Helm** | `v3.0+` | Check with `helm version` | +| **Docker** | Running daemon | Check with `docker version` | + +
+ +**Additional Requirements:** +- **NVIDIA GPU nodes** with GPU Operator installed (for LLM inference) +- **Container registry access** (Docker Hub, NVIDIA NGC, etc.) +- **Git** for cloning repositories + +### Inference Runtime Images + +Set your inference runtime image from the available NGC options: + +```bash +# Set your inference runtime image +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0 +``` + +**Available Runtime Images**: +- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` - vLLM backend (recommended) +- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.0` - SGLang backend +- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.0` - TensorRT-LLM backend + +### Hugging Face Token + +For accessing models from Hugging Face Hub, you'll need a Hugging Face token: + +```bash +# Set your Hugging Face token for model access +export HF_TOKEN=your_hf_token +``` + +Get your token from [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) + +### Kubernetes Cluster Requirements + +#### PVC Support with Default Storage Class +Dynamo Cloud requires Persistent Volume Claim (PVC) support with a default storage class. Verify your cluster configuration: + +```bash +# Check if default storage class exists +kubectl get storageclass + +# Expected output should show at least one storage class marked as (default) +# Example: +# NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE +# standard (default) kubernetes.io/gce-pd Delete Immediate true 1d +``` + +### Optional Requirements + +#### Service Mesh (Optional) +For advanced networking and security features, you may want to install: +- **Istio service mesh**: For advanced traffic management and security + +```bash +# Check if Istio is installed +kubectl get pods -n istio-system + +# Expected output should show running Istio pods +# istiod-* pods should be in Running state +``` + +If Istio is not installed, follow the [official Istio installation guide](https://istio.io/latest/docs/setup/getting-started/). + +## Pre-Deployment Validation + +
+ +[![Validation](https://img.shields.io/badge/Pre--Deployment-Validation-yellow?style=for-the-badge&logo=checkmarx)](https://kubernetes.io) + +*Validate your environment before starting deployment* + +
+ +Before starting the deployment, validate that your environment meets all requirements: + +### Validate Kubernetes Cluster + +```bash +# Verify Kubernetes cluster access and version +kubectl version --client +kubectl cluster-info + +# Check node resources and GPU availability +kubectl get nodes -o wide +kubectl describe nodes | grep -A 5 "Capacity:" + +# Verify default storage class exists +kubectl get storageclass +``` + +### Validate Container Registry Access + +```bash +# Test NGC registry access (if using NGC images) +docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY + +# Verify you can pull the Dynamo runtime image +docker pull $DYNAMO_IMAGE +``` + +### Validate Configuration Files + +```bash +# Navigate to the customization directory +cd customizations/LLM\ Router + +# Check that required files exist +ls -la disagg.yaml router-config-dynamo.yaml llm-router-values-override.yaml + +# Validate YAML syntax +python -c "import yaml; yaml.safe_load(open('disagg.yaml'))" && echo "disagg.yaml is valid" +python -c "import yaml; yaml.safe_load(open('router-config-dynamo.yaml'))" && echo "router-config-dynamo.yaml is valid" +python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" && echo "llm-router-values-override.yaml is valid" +``` + +### Environment Setup + +```bash +export NAMESPACE=dynamo-kubernetes +export RELEASE_VERSION=0.4.0 +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0 +export HF_TOKEN=your_hf_token +export NGC_API_KEY=your-ngc-api-key +``` + +### Validate Environment Variables + +```bash +# Check required environment variables are set +echo "NAMESPACE: ${NAMESPACE:-'NOT SET'}" +echo "RELEASE_VERSION: ${RELEASE_VERSION:-'NOT SET'}" +echo "HF_TOKEN: ${HF_TOKEN:-'NOT SET'}" +echo "DYNAMO_IMAGE: ${DYNAMO_IMAGE:-'NOT SET'}" +echo "NGC_API_KEY: ${NGC_API_KEY:-'NOT SET'}" +``` ## Deployment Guide -This guide walks you through deploying NVIDIA Dynamo and LLM Router step by step using the official deployment methods. +
+ +[![Deployment](https://img.shields.io/badge/Deployment-Step%20by%20Step-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) + +**Complete walkthrough for deploying NVIDIA Dynamo and LLM Router** + +
+ +--- + + + +### Deployment Overview + +
+ +```mermaid +graph LR + A[Prerequisites] --> B[Install Platform] + B --> C[Deploy vLLM] + C --> D[Setup Router] + D --> E[Configure Access] + E --> F[Test Integration] + + style A fill:#e3f2fd + style B fill:#f3e5f5 + style C fill:#e8f5e8 + style D fill:#fff3e0 + style E fill:#fce4ec + style F fill:#e0f2f1 +``` + +
+ +### Step 1: Install Dynamo Platform (Path A: Production Install) + +
+ +[![Step 1](https://img.shields.io/badge/Step%201-Install%20Platform-blue?style=for-the-badge&logo=kubernetes)](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/dynamo_cloud.md#path-a-production-install) + +*Deploy the Dynamo Cloud Platform using the official **Path A: Production Install*** + +
-### Step 1: Prepare Your Environment -First, ensure you have all prerequisites: ```bash -# Install Dynamo SDK (recommended method) -apt-get update -DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0 -python3 -m venv venv -source venv/bin/activate -pip install "ai-dynamo[all]" - -# Set up required services (etcd and NATS) +# 1. Set environment +export NAMESPACE=dynamo-kubernetes +export RELEASE_VERSION=0.4.0 +export NGC_API_KEY=your-ngc-api-key + +# 2. Clone repository git clone https://github.com/ai-dynamo/dynamo.git cd dynamo + +# 3. Login to NGC +docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY + +# 4. Install CRDs +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default + +# 5. Install Platform +kubectl create namespace ${NAMESPACE} +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} + +# 6. Verify deployment +kubectl get pods -n $NAMESPACE +kubectl get svc -n $NAMESPACE + +# 7. Setup external access and services +kubectl port-forward svc/dynamo-store 8080:80 -n $NAMESPACE & +export DYNAMO_CLOUD=http://localhost:8080 docker compose -f deploy/metrics/docker-compose.yml up -d ``` -### Step 2: Deploy Dynamo Cloud Platform (For Kubernetes) +### Step 2: Deploy Multiple vLLM Models -**For Kubernetes deployment, you must first deploy the Dynamo Cloud Platform:** +
+ +[![Step 2](https://img.shields.io/badge/Step%202-Deploy%20Multiple%20Models-orange?style=for-the-badge&logo=nvidia)](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md) + +*Deploy multiple vLLM models for intelligent routing* + +
+ + + +Since our LLM Router routes to different models based on task complexity, we need to deploy multiple model instances. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): ```bash -# Set environment variables for Dynamo Cloud Platform -export DOCKER_SERVER=your-registry.com -export IMAGE_TAG=latest -export NAMESPACE=dynamo-cloud -export DOCKER_USERNAME=your-username -export DOCKER_PASSWORD=your-password - -# Build and push Dynamo Cloud Platform components -cd dynamo -earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG +# Set up Hugging Face token for model access +export HF_TOKEN=your_hf_token -# Create namespace and deploy the platform -kubectl create namespace $NAMESPACE -kubectl config set-context --current --namespace=$NAMESPACE -cd deploy/cloud/helm -./deploy.sh --crds +# Create Kubernetes secret for Hugging Face token +kubectl create secret generic hf-token-secret \ + --from-literal=HF_TOKEN=${HF_TOKEN} \ + -n ${NAMESPACE} -# Verify platform deployment -kubectl get pods -n $NAMESPACE +# Navigate to the official vLLM backend deployment directory +cd dynamo/components/backends/vllm/deploy ``` -### Step 3: Deploy NVIDIA Dynamo LLM Services +#### Deploy Model 1: Llama-3.1-8B (Small/Fast Model) + +```bash +# Copy the official disagg.yaml for the 8B model +cp disagg.yaml llama-8b-disagg.yaml + +# Edit the model configuration +sed -i 's/Qwen\/Qwen3-0.6B/meta-llama\/Llama-3.1-8B-Instruct/g' llama-8b-disagg.yaml + +# Update service names to avoid conflicts +sed -i 's/Frontend:/Frontend-8B:/g' llama-8b-disagg.yaml +sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-8B:/g' llama-8b-disagg.yaml +sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-8B:/g' llama-8b-disagg.yaml -**Option A: Local Development (Recommended for testing)** +# Deploy the 8B model +kubectl apply -f llama-8b-disagg.yaml -n ${NAMESPACE} +``` + +#### Deploy Model 2: Llama-3.1-70B (Large/Powerful Model) ```bash -# Navigate to LLM examples -cd examples/llm +# Copy the official disagg.yaml for the 70B model +cp disagg.yaml llama-70b-disagg.yaml + +# Edit the model configuration +sed -i 's/Qwen\/Qwen3-0.6B/meta-llama\/Llama-3.1-70B-Instruct/g' llama-70b-disagg.yaml -# Start aggregated serving (single worker for prefill and decode) -dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml +# Update service names to avoid conflicts +sed -i 's/Frontend:/Frontend-70B:/g' llama-70b-disagg.yaml +sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-70B:/g' llama-70b-disagg.yaml +sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-70B:/g' llama-70b-disagg.yaml -# Or start disaggregated serving (separate prefill and decode workers) -# dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml +# Update port to avoid conflicts (70B model on port 8001) +sed -i 's/port: 8000/port: 8001/g' llama-70b-disagg.yaml + +# Deploy the 70B model +kubectl apply -f llama-70b-disagg.yaml -n ${NAMESPACE} ``` -**Option B: Kubernetes Deployment** +#### Deploy Model 3: Mixtral-8x22B (Mixture of Experts) ```bash -# Set up Dynamo Cloud access -kubectl port-forward svc/dynamo-store 8080:80 -n dynamo-cloud & -export DYNAMO_CLOUD=http://localhost:8080 +# Copy the official disagg.yaml for the Mixtral model +cp disagg.yaml mixtral-disagg.yaml + +# Edit the model configuration +sed -i 's/Qwen\/Qwen3-0.6B/mistralai\/Mixtral-8x22B-Instruct-v0.1/g' mixtral-disagg.yaml -# Build and deploy your LLM service -cd examples/llm -DYNAMO_TAG=$(dynamo build 
hello_world:Frontend | grep "Successfully built" | awk '{ print $3 }' | sed 's/\.$//') -dynamo deployment create $DYNAMO_TAG -n llm-deployment +# Update service names to avoid conflicts +sed -i 's/Frontend:/Frontend-Mixtral:/g' mixtral-disagg.yaml +sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-Mixtral:/g' mixtral-disagg.yaml +sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-Mixtral:/g' mixtral-disagg.yaml -# Monitor deployment -kubectl get pods -n dynamo-cloud +# Update port to avoid conflicts (Mixtral model on port 8002) +sed -i 's/port: 8000/port: 8002/g' mixtral-disagg.yaml + +# Deploy the Mixtral model +kubectl apply -f mixtral-disagg.yaml -n ${NAMESPACE} ``` -### Step 4: Test Dynamo LLM Services +### Adding More Models (Optional) + +**Current Setup**: We deploy 3 models that cover most use cases: +- **Llama-3.1-8B**: Fast model for simple tasks +- **Llama-3.1-70B**: Powerful model for complex tasks +- **Mixtral-8x22B**: Creative model for conversational tasks -Once Dynamo is running, test the LLM services: +**To add more models**, follow this pattern: + +#### Example: Adding Phi-3-Mini Model ```bash -# Test with a sample request +# 1. Deploy the new model (following the same pattern) +cd dynamo/components/backends/vllm/deploy +cp disagg.yaml phi-3-mini-disagg.yaml + +# 2. Update model configuration +sed -i 's/Qwen\/Qwen3-0.6B/microsoft\/Phi-3-mini-128k-instruct/g' phi-3-mini-disagg.yaml +sed -i 's/Frontend:/Frontend-Phi3:/g' phi-3-mini-disagg.yaml +sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-Phi3:/g' phi-3-mini-disagg.yaml +sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-Phi3:/g' phi-3-mini-disagg.yaml +sed -i 's/port: 8000/port: 8003/g' phi-3-mini-disagg.yaml + +# 3. Deploy the model +kubectl apply -f phi-3-mini-disagg.yaml -n ${NAMESPACE} + +# 4. Update router configuration +# Edit customizations/LLM Router/router-config-dynamo.yaml +# Add entries like: +# - name: "New Task Type" +# api_base: ${DYNAMO_API_BASE}/v1 +# api_key: "${DYNAMO_API_KEY}" +# model: microsoft/Phi-3-mini-128k-instruct +``` + +**Repeat this pattern** for any additional models you want to deploy. + +### Step 3: Verify Multiple Model Deployments + +
+ +[![Step 3](https://img.shields.io/badge/Step%203-Verify%20Deployments-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) + +*Verify that all vLLM models have been deployed successfully* + +
+ +```bash +# Check deployment status for all models +kubectl get pods -n ${NAMESPACE} +kubectl get svc -n ${NAMESPACE} + +# Look for all model-specific pods and services +kubectl get pods -n ${NAMESPACE} | grep -E "(8B|70B|Mixtral|frontend|worker)" + +# Check services for each model (should see different ports) +kubectl get svc -n ${NAMESPACE} | grep -E "(frontend|8000|8001|8002)" + +# Verify each model's frontend is running +echo "Checking Llama-8B model..." +kubectl logs deployment/frontend-8b -n ${NAMESPACE} --tail=10 + +echo "Checking Llama-70B model..." +kubectl logs deployment/frontend-70b -n ${NAMESPACE} --tail=10 + +echo "Checking Mixtral model..." +kubectl logs deployment/frontend-mixtral -n ${NAMESPACE} --tail=10 + +# Verify the disaggregated architecture for each model +kubectl get pods -n ${NAMESPACE} | grep -E "(prefill|vllm)" | sort +``` + +### Step 4: Test Multiple vLLM Services + +
+ +[![Step 4](https://img.shields.io/badge/Step%204-Test%20Services-purple?style=for-the-badge&logo=checkmarx)](https://checkmarx.com) + +*Test all deployed vLLM services* + +
+ +#### Test Llama-8B Model (Port 8000) + +```bash +# Forward the Llama-8B service port +kubectl port-forward svc/frontend-8b-service 8000:8000 -n ${NAMESPACE} & + +# Test the 8B model curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", - "messages": [ - { - "role": "user", - "content": "Hello, how are you?" - } - ], + "model": "meta-llama/Llama-3.1-8B-Instruct", + "messages": [{"role": "user", "content": "Simple question: What is 2+2?"}], "stream": false, "max_tokens": 30 }' | jq -# For Kubernetes deployment, use port forwarding to access the service -kubectl port-forward svc/llm-deployment-frontend 3000:3000 -n dynamo-cloud +# Check health and models +curl localhost:8000/health +curl localhost:8000/v1/models | jq +``` + +#### Test Llama-70B Model (Port 8001) + +```bash +# Forward the Llama-70B service port +kubectl port-forward svc/frontend-70b-service 8001:8001 -n ${NAMESPACE} & + +# Test the 70B model with a complex task +curl localhost:8001/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Llama-3.1-70B-Instruct", + "messages": [{"role": "user", "content": "Explain quantum computing in detail"}], + "stream": false, + "max_tokens": 200 + }' | jq + +# Check health and models +curl localhost:8001/health +curl localhost:8001/v1/models | jq +``` + +#### Test Mixtral Model (Port 8002) + +```bash +# Forward the Mixtral service port +kubectl port-forward svc/frontend-mixtral-service 8002:8002 -n ${NAMESPACE} & + +# Test the Mixtral model +curl localhost:8002/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "mistralai/Mixtral-8x22B-Instruct-v0.1", + "messages": [{"role": "user", "content": "Write a creative story about AI"}], + "stream": false, + "max_tokens": 150 + }' | jq + +# Check health and models +curl localhost:8002/health +curl localhost:8002/v1/models | jq ``` ### Step 5: Set Up LLM Router API Keys +
+ +[![Step 5](https://img.shields.io/badge/Step%205-Setup%20API%20Keys-red?style=for-the-badge&logo=docker)](https://github.com/NVIDIA-AI-Blueprints/llm-router) + +*Configure API keys for LLM Router integration* + +
+ **IMPORTANT**: The router configuration uses Kubernetes secrets for API key management following the [official NVIDIA pattern](https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml). ```bash @@ -363,6 +889,14 @@ kubectl get secrets -n llm-router ### Step 6: Deploy LLM Router +
+ +[![Step 6](https://img.shields.io/badge/Step%206-Deploy%20Router-indigo?style=for-the-badge&logo=nvidia)](https://github.com/NVIDIA-AI-Blueprints/llm-router) + +*Deploy the NVIDIA LLM Router using Helm* + +
+ **Note**: The NVIDIA LLM Router does not have an official Helm repository. You must clone the GitHub repository and deploy using local Helm charts. ```bash @@ -446,11 +980,11 @@ routerController: - name: Brainstorming api_base: http://llm-deployment-frontend.dynamo-cloud.svc.cluster.local:3000 api_key: "" - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + model: Qwen/Qwen3-0.6B - name: Chatbot api_base: http://llm-deployment-frontend.dynamo-cloud.svc.cluster.local:3000 api_key: "" - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B + model: Qwen/Qwen3-0.6B EOF # 6. Deploy LLM Router using Helm chart @@ -467,6 +1001,14 @@ kubectl get svc -n llm-router ### Step 7: Configure External Access +
+ +[![Step 7](https://img.shields.io/badge/Step%207-Configure%20Access-teal?style=for-the-badge&logo=nginx)](https://kubernetes.io) + +*Configure external access to the LLM Router* + +
+ ```bash # For development/testing, use port forwarding to access LLM Router kubectl port-forward svc/llm-router-router-controller 8084:8084 -n llm-router @@ -515,7 +1057,7 @@ The router configuration uses **environment variable substitution** for secure A # In router-config-dynamo.yaml llms: - name: Brainstorming - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1 + api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000/v1 api_key: "${DYNAMO_API_KEY}" # Resolved from Kubernetes secret model: llama-3.1-70b-instruct ``` @@ -529,32 +1071,42 @@ The LLM Router controller: ### Router Configuration -The `router-config-dynamo.yaml` configures routing policies: - -| **Task Router** | **Model** | **Use Case** | -|-----------------|-----------|--------------| -| Brainstorming | llama-3.1-70b-instruct | Creative ideation | -| Chatbot | mixtral-8x22b-instruct | Conversational AI | -| Code Generation | llama-3.1-nemotron-70b-instruct | Programming tasks | -| Summarization | phi-3-mini-128k-instruct | Text summarization | -| Text Generation | llama-3.2-11b-vision-instruct | General text creation | -| Open QA | llama-3.1-405b-instruct | Complex questions | -| Closed QA | llama-3.1-8b-instruct | Simple Q&A | -| Classification | phi-3-mini-4k-instruct | Text classification | -| Extraction | llama-3.1-8b-instruct | Information extraction | -| Rewrite | phi-3-medium-128k-instruct | Text rewriting | - -| **Complexity Router** | **Model** | **Use Case** | -|----------------------|-----------|--------------| -| Creativity | llama-3.1-70b-instruct | Creative tasks | -| Reasoning | llama-3.3-nemotron-super-49b | Complex reasoning | -| Contextual-Knowledge | llama-3.1-405b-instruct | Knowledge-intensive | -| Few-Shot | llama-3.1-70b-instruct | Few-shot learning | -| Domain-Knowledge | llama-3.1-nemotron-70b-instruct | Specialized domains | -| No-Label-Reason | llama-3.1-8b-instruct | Simple reasoning | -| Constraint | phi-3-medium-128k-instruct | Constrained tasks | - -All routes point to: `${DYNAMO_API_BASE}/v1` (configured via environment variable) +The `router-config-dynamo.yaml` configures routing policies to our deployed models. 
+ +**Important**: This configuration only references the **3 models we actually deploy**: +- `meta-llama/Llama-3.1-8B-Instruct` (fast, simple tasks) +- `meta-llama/Llama-3.1-70B-Instruct` (powerful, complex tasks) +- `mistralai/Mixtral-8x22B-Instruct-v0.1` (creative, conversational tasks) + +**Note**: All models use the same Dynamo gateway endpoint - Dynamo handles internal routing based on the model parameter: + +| **Task Router** | **Model** | **Dynamo Gateway** | **Use Case** | +|-----------------|-----------|--------------|--------------| +| Brainstorming | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Creative ideation | +| Chatbot | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Conversational AI | +| Code Generation | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Programming tasks | +| Summarization | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Text summarization | +| Text Generation | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | General text creation | +| Open QA | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Complex questions | +| Closed QA | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Simple Q&A | +| Classification | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Text classification | +| Extraction | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Information extraction | +| Rewrite | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Text rewriting | + +| **Complexity Router** | **Model** | **Dynamo Gateway** | **Use Case** | +|----------------------|-----------|--------------|--------------| +| Creativity | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Creative tasks | +| Reasoning | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Complex reasoning | +| Contextual-Knowledge | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Knowledge-intensive | +| Few-Shot | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Few-shot learning | +| Domain-Knowledge | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Specialized domains | +| No-Label-Reason | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Simple reasoning | +| Constraint | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Constrained tasks | + +**Routing Strategy**: +- **Simple tasks** → `meta-llama/Llama-3.1-8B-Instruct` (fast, efficient) +- **Complex tasks** → `meta-llama/Llama-3.1-70B-Instruct` (powerful, detailed) +- **Creative/Conversational** → `mistralai/Mixtral-8x22B-Instruct-v0.1` (diverse, creative) ## Testing the Integration @@ -604,227 +1156,84 @@ kubectl logs -f deployment/llm-router-router-controller -n llm-router kubectl logs -f deployment/llm-deployment-frontend -n dynamo-cloud ``` -## Configuration Validation -Before deploying, validate your configuration files: -### 1. 
Validate Dynamo Configuration -```bash -# For local development, test the service directly -curl http://localhost:8000/health - -# For Kubernetes deployment, check service status -kubectl get pods -n dynamo-cloud -kubectl get svc -n dynamo-cloud - -# Test the Dynamo API endpoint -kubectl port-forward svc/dynamo-frontend 8000:8000 -n dynamo-cloud & -curl http://localhost:8000/v1/models -``` - -### 2. Validate Router Configuration - -```bash -# Check if environment variable substitution will work -export DYNAMO_API_BASE="http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080" -envsubst < router-config-dynamo.yaml | kubectl apply --dry-run=client -f - -``` - -### 3. Validate Helm Values - -```bash -# Validate the Helm values file -cd llm-router/deploy/helm/llm-router -helm template llm-router . --values ../../../../llm-router-values-override.yaml --dry-run -``` - -## Verification and Testing - -### 1. Verify Dynamo Deployment - -```bash -# Check Dynamo platform status -kubectl get pods -n dynamo-cloud -kubectl get dynamographdeployment -n dynamo-cloud - -# Check services -kubectl get svc -n dynamo-cloud - -# Test direct Dynamo endpoint -kubectl port-forward svc/dynamo-llm-service 8080:8080 -n dynamo-cloud & -curl -X POST http://localhost:8080/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "llama-3.1-8b-instruct", - "messages": [{"role": "user", "content": "Hello!"}], - "max_tokens": 100 - }' -``` - -### 2. Test LLM Router Integration - -```bash -# Option 1: Test via Ingress (recommended for production) -# First, add llm-router.local to your /etc/hosts file: -# echo "$(kubectl get ingress llm-router -n llm-router -o jsonpath='{.status.loadBalancer.ingress[0].ip}') llm-router.local" | sudo tee -a /etc/hosts - -# Test task-based routing via ingress -curl -X POST http://llm-router.local/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "", - "messages": [{"role": "user", "content": "Write a Python function"}], - "max_tokens": 512, - "nim-llm-router": { - "policy": "task_router" - } - }' - -# Option 2: Test via port-forward (for development/testing) -kubectl port-forward svc/llm-router 8080:8000 -n llm-router & - -# Test task-based routing via port-forward -curl -X POST http://localhost:8080/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "", - "messages": [{"role": "user", "content": "Write a Python function"}], - "max_tokens": 512, - "nim-llm-router": { - "policy": "task_router" - } - }' - -# Test complexity-based routing via ingress -curl -X POST http://llm-router.local/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "", - "messages": [{"role": "user", "content": "Explain quantum computing"}], - "max_tokens": 512, - "nim-llm-router": { - "policy": "complexity_router" - } - }' - -# Test complexity-based routing via port-forward -curl -X POST http://localhost:8080/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "", - "messages": [{"role": "user", "content": "Explain quantum computing"}], - "max_tokens": 512, - "nim-llm-router": { - "policy": "complexity_router" - } - }' -``` - -### 3. 
Monitor Deployment - -```bash -# Monitor Dynamo logs -kubectl logs -f deployment/dynamo-store -n dynamo-cloud - -# Monitor LLM Router logs -kubectl logs -f deployment/llm-router -n llm-router - -# Check resource usage -kubectl top pods -n dynamo-cloud -kubectl top pods -n llm-router -``` - -## How Dynamo Model Routing Works - -The key insight is that Dynamo provides a **single gateway endpoint** that routes to different models based on the `model` parameter in the OpenAI-compatible API request: - -1. **Single Endpoint**: `http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080/v1` -2. **Model-Based Routing**: Dynamo routes internally based on the `model` field in requests -3. **OpenAI Compatibility**: Standard OpenAI API format with model selection - -Example request: -```json -{ - "model": "llama-3.1-70b-instruct", // Dynamo routes based on this - "messages": [...], - "temperature": 0.7 -} -``` - -Dynamo's internal architecture handles: -- Model registry and discovery -- Request parsing and routing -- Load balancing across replicas -- KV cache management -- Disaggregated serving coordination ## Troubleshooting -### Common Issues +If you encounter issues, the most common causes are: -1. **Build failures**: Ensure earthly is installed and container registry access is configured -2. **CRD not found**: Wait for Dynamo platform to fully deploy before applying DynamoGraphDeployment -3. **Service communication**: Verify cross-namespace RBAC permissions -4. **Model loading**: Check GPU availability and resource requests +1. **Missing Prerequisites**: Ensure all environment variables are set correctly +2. **Insufficient Resources**: Verify your cluster has enough GPU and memory resources +3. **Network Issues**: Check that services can communicate across namespaces -### Debugging Commands +### Quick Health Check ```bash -# Check Dynamo platform -kubectl get pods -n dynamo-cloud -kubectl logs -f deployment/dynamo-store -n dynamo-cloud -kubectl describe dynamographdeployment llm-multi-model -n dynamo-cloud - -# Check LLM Router +# Verify all components are running +kubectl get pods -n ${NAMESPACE} kubectl get pods -n llm-router -kubectl logs -f deployment/llm-router -n llm-router -kubectl describe configmap router-config-dynamo -n llm-router -# Check networking -kubectl exec -it deployment/llm-router -n llm-router -- nslookup dynamo-llm-service.dynamo-cloud.svc.cluster.local - -# Check events -kubectl get events -n dynamo-cloud --sort-by=.metadata.creationTimestamp -kubectl get events -n llm-router --sort-by=.metadata.creationTimestamp +# If something isn't working, check the logs +kubectl logs -f -n ``` +For detailed debugging, refer to the Kubernetes documentation or the specific component's logs. 
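+
+As a concrete starting point, the commands below drill into one component at a time. This is only a sketch: the deployment and service names are the ones used earlier in this guide, so substitute your own if you customized them.
+
+```bash
+# Router controller logs (LLM Router side)
+kubectl logs -f deployment/llm-router-router-controller -n llm-router
+
+# Logs for a specific Dynamo pod (pick a name from `kubectl get pods -n ${NAMESPACE}`)
+kubectl logs -f <pod-name> -n ${NAMESPACE}
+
+# Cross-namespace connectivity: resolve a Dynamo frontend service from the router namespace
+kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -n llm-router \
+  -- nslookup frontend-8b-service.${NAMESPACE}.svc.cluster.local
+```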
+ ## Cleanup + + ```bash # Remove LLM Router helm uninstall llm-router -n llm-router kubectl delete namespace llm-router -# Remove Dynamo deployment -kubectl delete dynamographdeployment llm-multi-model -n dynamo-cloud -kubectl delete namespace dynamo-cloud +# Remove all model deployments +kubectl delete -f llama-8b-disagg.yaml -n ${NAMESPACE} +kubectl delete -f llama-70b-disagg.yaml -n ${NAMESPACE} +kubectl delete -f mixtral-disagg.yaml -n ${NAMESPACE} + +# Remove Hugging Face token secret +kubectl delete secret hf-token-secret -n ${NAMESPACE} -# Remove Dynamo platform (if desired) -cd dynamo/deploy/cloud/helm -./deploy.sh --uninstall +# Remove Dynamo Cloud Platform (if desired) +helm uninstall dynamo-platform -n ${NAMESPACE} +helm uninstall dynamo-crds -n default +kubectl delete namespace ${NAMESPACE} + +# Stop supporting services (if used) +docker compose -f deploy/metrics/docker-compose.yml down ``` ## Quick Configuration Checklist -Before deployment, ensure you customize these key settings: +### Pre-Deployment Checklist + +Before deployment, ensure you have configured these key settings: -1. **`dynamo-llm-deployment.yaml`**: - - Update `dynamoComponent: frontend:latest` with your actual component reference - - Adjust GPU resource requirements based on your hardware +1. **Environment Variables**: + - ✅ Set `NAMESPACE=dynamo-kubernetes` and `RELEASE_VERSION=0.4.0` + - ✅ Set `HF_TOKEN` for Hugging Face model access + - ✅ Set `DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` + - ✅ Optional: Set `NGC_API_KEY` for private NGC images -2. **`llm-router-values-override.yaml`**: - - Change `host: llm-router.local` to your actual domain - - Update `api_base` URL if using external Dynamo deployment +2. **Model Configuration in `disagg.yaml`**: + - ✅ Update model names for each deployment (Llama-8B, Llama-70B, Mixtral) + - ✅ Adjust service names to avoid conflicts (Frontend-8B, Frontend-70B, etc.) + - ✅ Configure different ports for each model (8000, 8001, 8002) + - ✅ Adjust GPU resources based on your hardware -3. **Environment Variables**: - - Set `DOCKER_SERVER`, `IMAGE_TAG`, `NAMESPACE` before deployment - - Create `DYNAMO_API_KEY` secret during Step 4 +3. **Router Configuration**: + - ✅ Update `router-config-dynamo.yaml` with correct model endpoints + - ✅ Update `llm-router-values-override.yaml` with your domain + - ✅ Ensure API keys are properly configured for router integration ## Files in This Directory - **`README.md`** - This comprehensive deployment guide -- **`dynamo-llm-deployment.yaml`** - DynamoGraphDeployment for multi-LLM inference +- **`disagg.yaml`** - Official vLLM backend disaggregated service configuration (copied from [components/backends/vllm/deploy/disagg.yaml](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/disagg.yaml)) - **`router-config-dynamo.yaml`** - Router configuration for Dynamo integration - **`llm-router-values-override.yaml`** - Helm values override for LLM Router with Dynamo integration diff --git a/customizations/LLM Router/disagg.yaml b/customizations/LLM Router/disagg.yaml new file mode 100644 index 0000000..20f1c85 --- /dev/null +++ b/customizations/LLM Router/disagg.yaml @@ -0,0 +1,50 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-disagg +spec: + services: + Frontend: + dynamoNamespace: vllm-disagg + componentType: frontend + replicas: 1 + extraPodSpec: + mainContainer: + image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17 + VllmDecodeWorker: + dynamoNamespace: vllm-disagg + envFromSecret: hf-token-secret + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17 + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B" + VllmPrefillWorker: + dynamoNamespace: vllm-disagg + envFromSecret: hf-token-secret + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17 + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker" diff --git a/customizations/LLM Router/dynamo-llm-deployment.yaml b/customizations/LLM Router/dynamo-llm-deployment.yaml deleted file mode 100644 index a5aeb66..0000000 --- a/customizations/LLM Router/dynamo-llm-deployment.yaml +++ /dev/null @@ -1,148 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -# NVIDIA Dynamo LLM Deployment Example -# This demonstrates how to deploy LLM inference graphs using the official -# NVIDIA Dynamo Cloud Platform CRDs -# -# Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_operator.html - ---- -apiVersion: nvidia.com/v1alpha1 -kind: DynamoGraphDeployment -metadata: - name: llm-multi-model - namespace: dynamo-cloud -spec: - # Reference to the built and pushed Dynamo component - # Replace with your actual component reference after building with: - # earthly --push +all-docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG - dynamoComponent: frontend:latest # Update this with your actual component reference - - # Global environment variables for all services - envs: - - name: LOG_LEVEL - value: "INFO" - - name: ENABLE_METRICS - value: "true" - - # Service-specific configurations - services: - Frontend: - replicas: 1 - envs: - - name: FRONTEND_PORT - value: "8080" - resources: - requests: - cpu: 500m - memory: 1Gi - limits: - cpu: 1000m - memory: 2Gi - - Processor: - replicas: 1 - envs: - - name: PROCESSOR_WORKERS - value: "4" - resources: - requests: - cpu: 1000m - memory: 2Gi - limits: - cpu: 2000m - memory: 4Gi - - # vLLM Worker for multiple models - VllmWorker: - replicas: 1 - envs: - - name: VLLM_MODELS - value: "llama-3.1-8b-instruct,llama-3.1-70b-instruct,mixtral-8x22b-instruct" - - name: VLLM_GPU_MEMORY_UTILIZATION - value: "0.9" - resources: - requests: - cpu: 2000m - memory: 8Gi - nvidia.com/gpu: 4 - limits: - cpu: 4000m - memory: 16Gi - nvidia.com/gpu: 4 - nodeSelector: - nvidia.com/gpu.present: "true" - tolerations: - - key: nvidia.com/gpu - operator: Exists - effect: NoSchedule - - # Prefill Worker for disaggregated serving - PrefillWorker: - replicas: 2 - envs: - - name: PREFILL_MAX_BATCH_SIZE - value: "32" - resources: - requests: - cpu: 1000m - memory: 4Gi - nvidia.com/gpu: 2 - limits: - cpu: 2000m - memory: 8Gi - nvidia.com/gpu: 2 - nodeSelector: - nvidia.com/gpu.present: "true" - tolerations: - - key: nvidia.com/gpu - operator: Exists - effect: NoSchedule - - # Router for KV-aware routing - Router: - replicas: 1 - envs: - - name: ROUTER_ALGORITHM - value: "kv_aware" - - name: ROUTER_CACHE_SIZE - value: "1000" - resources: - requests: - cpu: 500m - memory: 1Gi - limits: - cpu: 1000m - memory: 2Gi - ---- -# Service to expose the Dynamo deployment -apiVersion: v1 -kind: Service -metadata: - name: dynamo-llm-service - namespace: dynamo-cloud - labels: - app: dynamo-llm -spec: - type: ClusterIP - ports: - - name: http - port: 8080 - targetPort: 8080 - protocol: TCP - selector: - dynamo-component: Frontend \ No newline at end of file diff --git a/customizations/LLM Router/router-config-dynamo.yaml b/customizations/LLM Router/router-config-dynamo.yaml index 9ecef20..c2af5ae 100644 --- a/customizations/LLM Router/router-config-dynamo.yaml +++ b/customizations/LLM Router/router-config-dynamo.yaml @@ -20,6 +20,15 @@ # Based on: https://docs.nvidia.com/dynamo/latest/guides/dynamo_deploy/dynamo_cloud.html # API Key pattern follows: https://github.com/NVIDIA-AI-Blueprints/llm-router/blob/main/deploy/helm/llm-router/templates/router-controller-configmap.yaml # +# IMPORTANT: This config only references the 3 models we actually deploy: +# - meta-llama/Llama-3.1-8B-Instruct (Fast model for simple tasks) +# - meta-llama/Llama-3.1-70B-Instruct (Powerful model for complex tasks) +# - mistralai/Mixtral-8x22B-Instruct-v0.1 (Creative model for conversational tasks) +# +# To add more models: +# 1. 
Deploy the model using the pattern in Step 2 of README.md +# 2. Add router entries below following the same format +# # NOTE: Environment variables are resolved at runtime: # - ${DYNAMO_API_BASE}: Points to the Dynamo service endpoint # - ${DYNAMO_API_KEY}: API key for authenticating with Dynamo services @@ -32,83 +41,99 @@ policies: - name: "task_router" url: http://router-server:8000/v2/models/task_router_ensemble/infer llms: - - name: Brainstorming + # === DEPLOYED MODELS ONLY === + # We only use the 3 models we actually deploy + + # Simple tasks → Fast 8B model + - name: "Closed QA" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-70b-instruct - - name: Chatbot + model: meta-llama/Llama-3.1-8B-Instruct + - name: Classification api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: mixtral-8x22b-instruct - - name: "Code Generation" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Extraction api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-nemotron-70b-instruct - - name: Summarization + model: meta-llama/Llama-3.1-8B-Instruct + - name: Rewrite api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: phi-3-mini-128k-instruct - - name: "Text Generation" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Summarization api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.2-11b-vision-instruct - - name: "Open QA" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Unknown api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-405b-instruct - - name: "Closed QA" + model: meta-llama/Llama-3.1-8B-Instruct + + # Complex tasks → Powerful 70B model + - name: Brainstorming api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-8b-instruct - - name: Classification + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Code Generation" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: phi-3-mini-4k-instruct - - name: Extraction + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Open QA" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-8b-instruct - - name: Rewrite + model: meta-llama/Llama-3.1-70B-Instruct + - name: Other api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: phi-3-medium-128k-instruct - - name: Other + model: meta-llama/Llama-3.1-70B-Instruct + + # Creative/Conversational tasks → Mixtral model + - name: Chatbot api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-70b-instruct - - name: Unknown + model: mistralai/Mixtral-8x22B-Instruct-v0.1 + - name: "Text Generation" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-8b-instruct + model: mistralai/Mixtral-8x22B-Instruct-v0.1 - name: "complexity_router" url: http://router-server:8000/v2/models/complexity_router_ensemble/infer llms: - - name: Creativity + # === DEPLOYED MODELS ONLY === + # We only use the 3 models we actually deploy + + # Simple complexity → Fast 8B model + - name: "Contextual-Knowledge" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-70b-instruct - - name: Reasoning + model: meta-llama/Llama-3.1-8B-Instruct + - name: "No-Label-Reason" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.3-nemotron-super-49b - - name: "Contextual-Knowledge" + model: meta-llama/Llama-3.1-8B-Instruct + - name: Constraint api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: 
llama-3.1-405b-instruct - - name: "Few-Shot" + model: meta-llama/Llama-3.1-8B-Instruct + + # High complexity → Powerful 70B model + - name: Creativity api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-70b-instruct - - name: "Domain-Knowledge" + model: meta-llama/Llama-3.1-70B-Instruct + - name: Reasoning api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-nemotron-70b-instruct - - name: "No-Label-Reason" + model: meta-llama/Llama-3.1-70B-Instruct + - name: "Few-Shot" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: llama-3.1-8b-instruct - - name: Constraint + model: meta-llama/Llama-3.1-70B-Instruct + + # Creative/Domain complexity → Mixtral model + - name: "Domain-Knowledge" api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: phi-3-medium-128k-instruct \ No newline at end of file + model: mistralai/Mixtral-8x22B-Instruct-v0.1 \ No newline at end of file From 39f5c257f9843e30895a0660e551880e3cd291b4 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sat, 23 Aug 2025 05:19:04 +0000 Subject: [PATCH 08/17] Update README.md for NVIDIA Dynamo and LLM Router enhancements - Revised sections for NVIDIA Dynamo and LLM Router to improve clarity and detail. - Updated descriptions to reflect disaggregated serving capabilities and multi-model deployment. - Enhanced task classification and complexity analysis descriptions for better understanding. - Improved overall documentation for optimal performance insights. --- customizations/LLM Router/README.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index b09a359..0e7537b 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -26,20 +26,20 @@ This integration combines two powerful NVIDIA technologies to create an intellig -### NVIDIA Dynamo Cloud Platform +### NVIDIA Dynamo - **Distributed inference serving framework** -- **Disaggregated serving capabilities** -- **Multi-model support** -- **Kubernetes-native scaling** +- **Disaggregated serving with KV cache management** +- **Multi-model deployment and orchestration** +- **Kubernetes-native scaling and operations** ### NVIDIA LLM Router -- **Intelligent request routing** -- **Task classification (12 categories)** -- **Complexity analysis (7 categories)** -- **Rust-based performance** +- **Intelligent request routing and load balancing** +- **Task classification across 12 distinct categories** +- **Complexity analysis with 7 difficulty levels** +- **Rust-based architecture for optimal performance** From d77d3d8278a54ed27145759864483aca856f6940 Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sat, 23 Aug 2025 05:22:52 +0000 Subject: [PATCH 09/17] Refactor README.md for NVIDIA Dynamo and LLM Router - Streamlined descriptions for NVIDIA Dynamo and LLM Router to enhance clarity. - Consolidated features into concise bullet points for better readability. - Updated task classification and complexity analysis sections for improved understanding. - Removed unnecessary table formatting to simplify the layout. 
--- customizations/LLM Router/README.md | 25 +++++++------------------ 1 file changed, 7 insertions(+), 18 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index 0e7537b..bdc0dd3 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -22,28 +22,17 @@ This comprehensive guide provides step-by-step instructions for deploying the [* This integration combines two powerful NVIDIA technologies to create an intelligent, scalable LLM serving platform: - - - - - -
- ### NVIDIA Dynamo - **Distributed inference serving framework** -- **Disaggregated serving with KV cache management** -- **Multi-model deployment and orchestration** -- **Kubernetes-native scaling and operations** - - +- **Disaggregated serving capabilities** +- **Multi-model deployment support** +- **Kubernetes-native scaling** ### NVIDIA LLM Router -- **Intelligent request routing and load balancing** -- **Task classification across 12 distinct categories** -- **Complexity analysis with 7 difficulty levels** -- **Rust-based architecture for optimal performance** - -
+- **Intelligent request routing** +- **Task classification (12 categories)** +- **Complexity analysis (7 categories)** +- **Rust-based performance** > **Result**: A complete solution for deploying multiple LLMs with automatic routing based on request characteristics, maximizing both **performance** and **cost efficiency**. From 190a334eb7e81c6b12a9e73dd0bcc7b6f69f2ecf Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sat, 23 Aug 2025 05:30:35 +0000 Subject: [PATCH 10/17] Update README.md to change badge logo for API Keys setup step in LLM Router documentation --- customizations/LLM Router/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index bdc0dd3..840ebb3 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -847,7 +847,7 @@ curl localhost:8002/v1/models | jq
-[![Step 5](https://img.shields.io/badge/Step%205-Setup%20API%20Keys-red?style=for-the-badge&logo=docker)](https://github.com/NVIDIA-AI-Blueprints/llm-router) +[![Step 5](https://img.shields.io/badge/Step%205-Setup%20API%20Keys-red?style=for-the-badge&logo=keycdn)](https://github.com/NVIDIA-AI-Blueprints/llm-router) *Configure API keys for LLM Router integration* From 83c1f2d7b707a0f8c1fba9a88f2fc2fd313c3c7e Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sat, 23 Aug 2025 05:45:19 +0000 Subject: [PATCH 11/17] Add license headers to disagg.yaml for compliance with Apache License 2.0 --- customizations/LLM Router/disagg.yaml | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/customizations/LLM Router/disagg.yaml b/customizations/LLM Router/disagg.yaml index 20f1c85..181ef17 100644 --- a/customizations/LLM Router/disagg.yaml +++ b/customizations/LLM Router/disagg.yaml @@ -1,5 +1,17 @@ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment From 27bb80773b831acd56497fbcddd3853dee24d22c Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sat, 23 Aug 2025 05:48:51 +0000 Subject: [PATCH 12/17] Update README.md to include new routing strategy and model fields for LLM Router configurations --- customizations/LLM Router/README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index 840ebb3..4fa35fa 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -215,7 +215,9 @@ curl -X POST http://llm-router.local/v1/chat/completions \ "messages": [{"role": "user", "content": "Write a Python function to sort a list"}], "max_tokens": 512, "nim-llm-router": { - "policy": "task_router" + "policy": "task_router", + "routing_strategy": "triton", + "model": "" } }' ``` @@ -235,7 +237,9 @@ curl -X POST http://llm-router.local/v1/chat/completions \ "messages": [{"role": "user", "content": "Explain quantum entanglement"}], "max_tokens": 512, "nim-llm-router": { - "policy": "complexity_router" + "policy": "complexity_router", + "routing_strategy": "triton", + "model": "" } }' ``` From 7be20263d24805eb102c84ea98577c08304a5e9b Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Sat, 23 Aug 2025 07:21:12 +0000 Subject: [PATCH 13/17] Add aggregated configuration for vLLM deployment - Introduced `agg.yaml` for deploying vLLM with a single GPU setup. - Updated `disagg.yaml` to use environment variables for model names and unified image references. - Enhanced `README.md` to include new environment variables and deployment instructions for both aggregated and disaggregated configurations. 
--- customizations/LLM Router/README.md | 130 +++++++++++--------------- customizations/LLM Router/agg.yaml | 33 +++++++ customizations/LLM Router/disagg.yaml | 10 +- 3 files changed, 94 insertions(+), 79 deletions(-) create mode 100644 customizations/LLM Router/agg.yaml diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index 4fa35fa..921669f 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -312,15 +312,15 @@ Set the required environment variables for NGC deployment: | Variable | Description | Example | Required | |----------|-------------|---------|----------| -| `NGC_API_KEY` | NVIDIA NGC API key | `your-ngc-api-key` | Yes | -| `DOCKER_USERNAME` | NGC registry username | `$oauthtoken` | Yes | -| `DOCKER_PASSWORD` | NGC API key (same as above) | `$NGC_API_KEY` | Yes | -| `DOCKER_SERVER` | NGC container registry URL | `nvcr.io` | Yes | +| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.0` | Yes | +| `MODEL_NAME` | Hugging Face model to deploy | `meta-llama/Llama-3.1-8B-Instruct` | Yes | +| `NGC_API_KEY` | NVIDIA NGC API key (optional for public images) | `your-ngc-api-key` | No | **NGC Setup Instructions**: -1. **Get NGC API Key**: Visit [https://ngc.nvidia.com/setup/api-key](https://ngc.nvidia.com/setup/api-key) to generate your API key -2. **Login to NGC**: Use `docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY` -3. **Prebuilt Images**: NGC provides prebuilt CUDA and ML framework images, eliminating the need for local builds +1. **Choose Dynamo Version**: Visit [NGC Dynamo vLLM Runtime Tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime/tags) to see available versions +2. **Set Version**: Export your chosen version: `export DYNAMO_VERSION=0.4.0` (or latest available) +3. **Optional - NGC API Key**: Visit [https://ngc.nvidia.com/setup/api-key](https://ngc.nvidia.com/setup/api-key) if you need private image access +4. **Prebuilt Images**: NGC provides prebuilt CUDA and ML framework images, eliminating the need for local builds **Available NGC Dynamo Images**: - **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` (recommended) @@ -518,10 +518,11 @@ python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" ```bash export NAMESPACE=dynamo-kubernetes -export RELEASE_VERSION=0.4.0 -export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0 +export DYNAMO_VERSION=0.4.0 # Choose your Dynamo version from NGC catalog +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} export HF_TOKEN=your_hf_token -export NGC_API_KEY=your-ngc-api-key +export NGC_API_KEY=your-ngc-api-key # Optional for public images ``` ### Validate Environment Variables @@ -529,10 +530,11 @@ export NGC_API_KEY=your-ngc-api-key ```bash # Check required environment variables are set echo "NAMESPACE: ${NAMESPACE:-'NOT SET'}" -echo "RELEASE_VERSION: ${RELEASE_VERSION:-'NOT SET'}" +echo "DYNAMO_VERSION: ${DYNAMO_VERSION:-'NOT SET'}" +echo "MODEL_NAME: ${MODEL_NAME:-'NOT SET'}" echo "HF_TOKEN: ${HF_TOKEN:-'NOT SET'}" echo "DYNAMO_IMAGE: ${DYNAMO_IMAGE:-'NOT SET'}" -echo "NGC_API_KEY: ${NGC_API_KEY:-'NOT SET'}" +echo "NGC_API_KEY: ${NGC_API_KEY:-'NOT SET (optional for public images)'}" ``` ## Deployment Guide @@ -586,7 +588,7 @@ graph LR ```bash # 1. 
Set environment export NAMESPACE=dynamo-kubernetes -export RELEASE_VERSION=0.4.0 +export DYNAMO_VERSION=0.4.0 # Choose your Dynamo version from NGC catalog export NGC_API_KEY=your-ngc-api-key # 2. Clone repository @@ -597,22 +599,27 @@ cd dynamo docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY # 4. Install CRDs -helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${RELEASE_VERSION}.tgz -helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace default +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${DYNAMO_VERSION}.tgz +helm install dynamo-crds dynamo-crds-${DYNAMO_VERSION}.tgz --namespace default # 5. Install Platform kubectl create namespace ${NAMESPACE} -helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${RELEASE_VERSION}.tgz -helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} +helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${DYNAMO_VERSION}.tgz +helm install dynamo-platform dynamo-platform-${DYNAMO_VERSION}.tgz --namespace ${NAMESPACE} # 6. Verify deployment -kubectl get pods -n $NAMESPACE -kubectl get svc -n $NAMESPACE +# Check CRDs +kubectl get crd | grep dynamo +# Check operator and platform pods +kubectl get pods -n ${NAMESPACE} +# Expected: dynamo-operator-* and etcd-* pods Running +kubectl get svc -n ${NAMESPACE} -# 7. Setup external access and services -kubectl port-forward svc/dynamo-store 8080:80 -n $NAMESPACE & -export DYNAMO_CLOUD=http://localhost:8080 -docker compose -f deploy/metrics/docker-compose.yml up -d +# 7. Configure Dynamo API access +# Get the dynamo-store service endpoint +kubectl get svc dynamo-store -n ${NAMESPACE} +# Set DYNAMO_CLOUD to the cluster-internal service +export DYNAMO_CLOUD=http://dynamo-store.${NAMESPACE}.svc.cluster.local ``` ### Step 2: Deploy Multiple vLLM Models @@ -627,10 +634,10 @@ docker compose -f deploy/metrics/docker-compose.yml up -d -Since our LLM Router routes to different models based on task complexity, we need to deploy multiple model instances. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): +Since our LLM Router routes to different models based on task complexity, we can deploy models using environment variables. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): ```bash -# Set up Hugging Face token for model access +# 1. Set up Hugging Face token for model access export HF_TOKEN=your_hf_token # Create Kubernetes secret for Hugging Face token @@ -638,68 +645,43 @@ kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN=${HF_TOKEN} \ -n ${NAMESPACE} -# Navigate to the official vLLM backend deployment directory -cd dynamo/components/backends/vllm/deploy +# 2. 
Navigate to your LLM Router directory (where agg.yaml/disagg.yaml are located) +cd "/mnt/raid/examples/customizations/LLM Router/" ``` -#### Deploy Model 1: Llama-3.1-8B (Small/Fast Model) +#### Option A: Single Model Deployment (Recommended for Single GPU) ```bash -# Copy the official disagg.yaml for the 8B model -cp disagg.yaml llama-8b-disagg.yaml +# Set the model you want to deploy +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct -# Edit the model configuration -sed -i 's/Qwen\/Qwen3-0.6B/meta-llama\/Llama-3.1-8B-Instruct/g' llama-8b-disagg.yaml +# Deploy using aggregated configuration (single GPU) +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} -# Update service names to avoid conflicts -sed -i 's/Frontend:/Frontend-8B:/g' llama-8b-disagg.yaml -sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-8B:/g' llama-8b-disagg.yaml -sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-8B:/g' llama-8b-disagg.yaml - -# Deploy the 8B model -kubectl apply -f llama-8b-disagg.yaml -n ${NAMESPACE} +# Or deploy using disaggregated configuration (multiple GPUs) +# envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} ``` -#### Deploy Model 2: Llama-3.1-70B (Large/Powerful Model) - -```bash -# Copy the official disagg.yaml for the 70B model -cp disagg.yaml llama-70b-disagg.yaml - -# Edit the model configuration -sed -i 's/Qwen\/Qwen3-0.6B/meta-llama\/Llama-3.1-70B-Instruct/g' llama-70b-disagg.yaml +#### Option B: Multiple Model Deployment (Requires Multiple GPUs) -# Update service names to avoid conflicts -sed -i 's/Frontend:/Frontend-70B:/g' llama-70b-disagg.yaml -sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-70B:/g' llama-70b-disagg.yaml -sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-70B:/g' llama-70b-disagg.yaml +For users with multiple GPUs who want to deploy multiple models: -# Update port to avoid conflicts (70B model on port 8001) -sed -i 's/port: 8000/port: 8001/g' llama-70b-disagg.yaml - -# Deploy the 70B model -kubectl apply -f llama-70b-disagg.yaml -n ${NAMESPACE} +**Deploy Model 1: Llama-3.1-8B (Fast Model)** +```bash +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct +envsubst < disagg.yaml | sed 's/vllm-disagg/vllm-8b/g' | kubectl apply -f - -n ${NAMESPACE} ``` -#### Deploy Model 3: Mixtral-8x22B (Mixture of Experts) - +**Deploy Model 2: Llama-3.1-70B (Powerful Model)** ```bash -# Copy the official disagg.yaml for the Mixtral model -cp disagg.yaml mixtral-disagg.yaml - -# Edit the model configuration -sed -i 's/Qwen\/Qwen3-0.6B/mistralai\/Mixtral-8x22B-Instruct-v0.1/g' mixtral-disagg.yaml - -# Update service names to avoid conflicts -sed -i 's/Frontend:/Frontend-Mixtral:/g' mixtral-disagg.yaml -sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-Mixtral:/g' mixtral-disagg.yaml -sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-Mixtral:/g' mixtral-disagg.yaml - -# Update port to avoid conflicts (Mixtral model on port 8002) -sed -i 's/port: 8000/port: 8002/g' mixtral-disagg.yaml +export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct +envsubst < disagg.yaml | sed 's/vllm-disagg/vllm-70b/g' | kubectl apply -f - -n ${NAMESPACE} +``` -# Deploy the Mixtral model -kubectl apply -f mixtral-disagg.yaml -n ${NAMESPACE} +**Deploy Model 3: Mixtral-8x22B (Creative Model)** +```bash +export MODEL_NAME=mistralai/Mixtral-8x22B-Instruct-v0.1 +envsubst < disagg.yaml | sed 's/vllm-disagg/vllm-mixtral/g' | kubectl apply -f - -n ${NAMESPACE} ``` ### Adding More Models (Optional) @@ -1207,7 +1189,7 @@ docker compose -f deploy/metrics/docker-compose.yml down Before deployment, ensure you have configured these key 
settings: 1. **Environment Variables**: - - ✅ Set `NAMESPACE=dynamo-kubernetes` and `RELEASE_VERSION=0.4.0` + - ✅ Set `NAMESPACE=dynamo-kubernetes` and `DYNAMO_VERSION=0.4.0` - ✅ Set `HF_TOKEN` for Hugging Face model access - ✅ Set `DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` - ✅ Optional: Set `NGC_API_KEY` for private NGC images diff --git a/customizations/LLM Router/agg.yaml b/customizations/LLM Router/agg.yaml new file mode 100644 index 0000000..953ff69 --- /dev/null +++ b/customizations/LLM Router/agg.yaml @@ -0,0 +1,33 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-agg +spec: + services: + Frontend: + dynamoNamespace: vllm-agg + componentType: frontend + replicas: 1 + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + VllmDecodeWorker: + envFromSecret: hf-token-secret + dynamoNamespace: vllm-agg + componentType: worker + replicas: 1 + resources: + limits: + gpu: "1" + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + workingDir: /workspace/components/backends/vllm + command: + - /bin/sh + - -c + args: + - python3 -m dynamo.vllm --model ${MODEL_NAME} diff --git a/customizations/LLM Router/disagg.yaml b/customizations/LLM Router/disagg.yaml index 181ef17..cdca9c1 100644 --- a/customizations/LLM Router/disagg.yaml +++ b/customizations/LLM Router/disagg.yaml @@ -25,7 +25,7 @@ spec: replicas: 1 extraPodSpec: mainContainer: - image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17 + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} VllmDecodeWorker: dynamoNamespace: vllm-disagg envFromSecret: hf-token-secret @@ -36,13 +36,13 @@ spec: gpu: "1" extraPodSpec: mainContainer: - image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17 + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} workingDir: /workspace/components/backends/vllm command: - /bin/sh - -c args: - - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B" + - "python3 -m dynamo.vllm --model ${MODEL_NAME}" VllmPrefillWorker: dynamoNamespace: vllm-disagg envFromSecret: hf-token-secret @@ -53,10 +53,10 @@ spec: gpu: "1" extraPodSpec: mainContainer: - image: nvcr.io/nvidian/nim-llm-dev/vllm-runtime:dep-233.17 + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} workingDir: /workspace/components/backends/vllm command: - /bin/sh - -c args: - - "python3 -m dynamo.vllm --model Qwen/Qwen3-0.6B --is-prefill-worker" + - "python3 -m dynamo.vllm --model ${MODEL_NAME} --is-prefill-worker" From 493a87efacabfa19b0d2a08e3e9b08981117aecf Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Thu, 4 Sep 2025 04:17:13 +0000 Subject: [PATCH 14/17] Refactor LLM Router configuration for shared frontend architecture - Removed frontend service definitions from `agg.yaml` and `disagg.yaml`, consolidating them into a new `frontend.yaml` for a shared API service. - Updated `llm-router-values-override.yaml` to adjust API base and repository configurations for improved deployment flexibility. - Enhanced `README.md` to reflect the new shared frontend architecture, detailing its benefits and deployment instructions. - Revised routing strategies and model configurations to streamline multi-model deployments and reduce resource overhead. 
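In practice, the split described above means the shared frontend is applied once and each additional model only adds workers. A minimal sketch of the resulting deploy order, using the files introduced in this patch and the same `envsubst` pattern as the updated README:

```bash
# Shared frontend: applied once, serves every model behind one endpoint
envsubst < frontend.yaml | kubectl apply -f - -n ${NAMESPACE}

# Per-model workers: repeat with a different MODEL_NAME for each model
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE}
```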
--- customizations/LLM Router/README.md | 459 ++++++++---------- customizations/LLM Router/agg.yaml | 7 - customizations/LLM Router/disagg.yaml | 21 +- customizations/LLM Router/frontend.yaml | 16 + .../llm-router-values-override.yaml | 18 +- 5 files changed, 237 insertions(+), 284 deletions(-) create mode 100644 customizations/LLM Router/frontend.yaml diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index 921669f..b3fb47c 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -52,13 +52,24 @@ graph TB RS[Router Server + GPU] end - subgraph "Dynamo Platform" - FE[Frontend Service] + subgraph "Dynamo Platform - Shared Frontend Architecture" + FE[Shared Frontend Service] PR[Processor] - VW[VllmDecodeWorker + GPU] - PW[VllmPrefillWorker + GPU] - RT[Router] - PL[Planner] + + subgraph "Model 1 Workers" + VW1[VllmDecodeWorker-8B + GPU] + PW1[VllmPrefillWorker-8B + GPU] + end + + subgraph "Model 2 Workers" + VW2[VllmDecodeWorker-70B + GPU] + PW2[VllmPrefillWorker-70B + GPU] + end + + subgraph "Model 3 Workers" + VW3[VllmDecodeWorker-Mixtral + GPU] + PW3[VllmPrefillWorker-Mixtral + GPU] + end end end @@ -66,21 +77,24 @@ graph TB RC --> RS RS --> FE FE --> PR - PR --> VW - PR --> PW - VW --> RT - PW --> RT - RT --> PL + PR --> VW1 + PR --> VW2 + PR --> VW3 + PR --> PW1 + PR --> PW2 + PR --> PW3 style LB fill:#e1f5fe style RC fill:#f3e5f5 style RS fill:#f3e5f5 style FE fill:#e8f5e8 style PR fill:#e8f5e8 - style VW fill:#fff3e0 - style PW fill:#fff3e0 - style RT fill:#e8f5e8 - style PL fill:#e8f5e8 + style VW1 fill:#fff3e0 + style VW2 fill:#fff3e0 + style VW3 fill:#fff3e0 + style PW1 fill:#ffecb3 + style PW2 fill:#ffecb3 + style PW3 fill:#ffecb3 ```
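One way to see the shared-frontend behaviour in the diagram above is that every deployed model registers behind the same endpoint. A quick sanity-check sketch, assuming the `frontend-service` name and `${NAMESPACE}` used later in this guide:

```bash
# Expose the single shared frontend locally
kubectl port-forward svc/frontend-service 8000:8000 -n ${NAMESPACE} &

# All registered model workers should show up under one /v1/models listing
curl -s localhost:8000/v1/models | jq -r '.data[].id'
# Expected when all three example models are deployed:
#   meta-llama/Llama-3.1-8B-Instruct
#   meta-llama/Llama-3.1-70B-Instruct
#   mistralai/Mixtral-8x22B-Instruct-v0.1
```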
@@ -280,15 +294,48 @@ This integration demonstrates how to deploy the official NVIDIA Dynamo Cloud Pla ### Key Components -- **disagg.yaml**: Official vLLM backend disaggregated service configuration from [components/backends/vllm/deploy/disagg.yaml](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/disagg.yaml) - - Common: Shared configuration (model, block-size, KV connector) - - Frontend: OpenAI-compatible API endpoint (port 8000) with direct VllmDecodeWorker routing - - VllmDecodeWorker: Decode worker with conditional disaggregation and KV caching (1 GPU) - - VllmPrefillWorker: Specialized prefill worker for high-throughput token processing (1 GPU) +#### Shared Frontend Architecture + +The deployment now uses a **shared frontend architecture** that splits the original `agg.yaml` into separate components for better resource utilization and model sharing: + +- **frontend.yaml**: Shared OpenAI-compatible API frontend service + - Single frontend instance serves all models + - Handles request routing and load balancing + - Reduces resource overhead compared to per-model frontends - Uses official NGC Dynamo vLLM Runtime container from `DYNAMO_IMAGE` variable + +- **decode-worker.yaml**: Template for model-specific decode workers + - VllmDecodeWorker: Decode worker with conditional disaggregation and KV caching (1 GPU per model) + - VllmPrefillWorker: Specialized prefill worker for high-throughput token processing (1 GPU per model) + - Common: Shared configuration (model, block-size, KV connector) + - Deployed per model with unique names and configurations + +#### Configuration Files + - **router-config-dynamo.yaml**: Router policies for Dynamo integration (uses `${DYNAMO_API_BASE}` variable) - **llm-router-values-override.yaml**: Helm values for LLM Router with Dynamo integration (defines `dynamo.api_base` variable) +### Shared Frontend Benefits + +
+ +| **Benefit** | **Shared Frontend** | **Per-Model Frontend** | **Improvement** | +|:---:|:---:|:---:|:---:| +| **Resource Usage** | 1 Frontend + N Workers | N Frontends + N Workers | **↓ 30-50% CPU/Memory** | +| **Network Complexity** | Single Endpoint | Multiple Endpoints | **Simplified Routing** | +| **Maintenance** | Single Service | Multiple Services | **↓ 70% Ops Overhead** | +| **Load Balancing** | Built-in across models | Per-model only | **Better Utilization** | +| **API Consistency** | Single OpenAI API | Multiple APIs | **Unified Interface** | + +
+ +**Key Advantages:** +- **Resource Efficiency**: Single frontend serves all models, reducing CPU and memory overhead +- **Simplified Operations**: One service to monitor, scale, and maintain instead of multiple frontends +- **Better Load Distribution**: Intelligent request routing across all available model workers +- **Cost Optimization**: Fewer running services means lower infrastructure costs +- **Unified API Gateway**: Single endpoint for all models with consistent OpenAI API interface + ### Disaggregated Serving Configuration The deployment uses the official disaggregated serving architecture based on [Dynamo's vLLM backend deployment reference](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/vllm/deploy): @@ -312,20 +359,37 @@ Set the required environment variables for NGC deployment: | Variable | Description | Example | Required | |----------|-------------|---------|----------| -| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.0` | Yes | -| `MODEL_NAME` | Hugging Face model to deploy | `meta-llama/Llama-3.1-8B-Instruct` | Yes | +| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.1` | Yes | +| `MODEL_NAME` | Hugging Face model to deploy | `Qwen/Qwen2.5-1.5B-Instruct` | Yes | | `NGC_API_KEY` | NVIDIA NGC API key (optional for public images) | `your-ngc-api-key` | No | +### Model Size Recommendations + +For optimal deployment experience, consider model size vs. resources: + +| Model Size | GPU Memory | Download Time | Recommended For | +|------------|------------|---------------|-----------------| +| **Small (1-2B)** | ~3-4GB | 2-5 minutes | Development, testing | +| **Medium (7-8B)** | ~8-12GB | 10-20 minutes | Production, single GPU | +| **Large (70B+)** | ~40GB+ | 30+ minutes | Multi-GPU setups | + +**Recommended Models:** +- `Qwen/Qwen2.5-1.5B-Instruct` - Fast, good quality (3GB) +- `meta-llama/Llama-3.1-8B-Instruct` - Balanced performance (15GB) +- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Ultra-fast (2GB) + +> **💡 Health Check Configuration**: The `frontend.yaml` and `decode-worker.yaml` include extended health check timeouts (30 minutes) to allow sufficient time for model download and loading. Health checks must be configured at the service level, not in `extraPodSpec`, for the Dynamo operator to respect them. The shared frontend architecture reduces the number of health checks needed compared to per-model frontends. + **NGC Setup Instructions**: 1. **Choose Dynamo Version**: Visit [NGC Dynamo vLLM Runtime Tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime/tags) to see available versions -2. **Set Version**: Export your chosen version: `export DYNAMO_VERSION=0.4.0` (or latest available) +2. **Set Version**: Export your chosen version: `export DYNAMO_VERSION=0.4.1` (or latest available) 3. **Optional - NGC API Key**: Visit [https://ngc.nvidia.com/setup/api-key](https://ngc.nvidia.com/setup/api-key) if you need private image access 4. 
**Prebuilt Images**: NGC provides prebuilt CUDA and ML framework images, eliminating the need for local builds **Available NGC Dynamo Images**: -- **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` (recommended) -- **SGLang Runtime**: `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.0` -- **TensorRT-LLM Runtime**: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.0` +- **vLLM Runtime**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` (recommended) +- **SGLang Runtime**: `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1` +- **TensorRT-LLM Runtime**: `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1` - **Dynamo Kubernetes Operator**: `nvcr.io/nvidia/ai-dynamo/dynamo-operator:latest` - **Dynamo Deployment API**: `nvcr.io/nvidia/ai-dynamo/dynamo-api-store:latest` @@ -412,13 +476,13 @@ Set your inference runtime image from the available NGC options: ```bash # Set your inference runtime image -export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0 +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1 ``` **Available Runtime Images**: -- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` - vLLM backend (recommended) -- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.0` - SGLang backend -- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.0` - TensorRT-LLM backend +- `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` - vLLM backend (recommended) +- `nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.1` - SGLang backend +- `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.1` - TensorRT-LLM backend ### Hugging Face Token @@ -506,9 +570,11 @@ docker pull $DYNAMO_IMAGE cd customizations/LLM\ Router # Check that required files exist -ls -la disagg.yaml router-config-dynamo.yaml llm-router-values-override.yaml +ls -la frontend.yaml agg.yaml disagg.yaml router-config-dynamo.yaml llm-router-values-override.yaml # Validate YAML syntax +python -c "import yaml; yaml.safe_load(open('frontend.yaml'))" && echo "frontend.yaml is valid" +python -c "import yaml; yaml.safe_load(open('agg.yaml'))" && echo "agg.yaml is valid" python -c "import yaml; yaml.safe_load(open('disagg.yaml'))" && echo "disagg.yaml is valid" python -c "import yaml; yaml.safe_load(open('router-config-dynamo.yaml'))" && echo "router-config-dynamo.yaml is valid" python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" && echo "llm-router-values-override.yaml is valid" @@ -518,8 +584,8 @@ python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" ```bash export NAMESPACE=dynamo-kubernetes -export DYNAMO_VERSION=0.4.0 # Choose your Dynamo version from NGC catalog -export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model +export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog +export MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct # Choose your model (see recommendations above) export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} export HF_TOKEN=your_hf_token export NGC_API_KEY=your-ngc-api-key # Optional for public images @@ -588,7 +654,7 @@ graph LR ```bash # 1. Set environment export NAMESPACE=dynamo-kubernetes -export DYNAMO_VERSION=0.4.0 # Choose your Dynamo version from NGC catalog +export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog export NGC_API_KEY=your-ngc-api-key # 2. Clone repository @@ -598,11 +664,11 @@ cd dynamo # 3. Login to NGC docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY -# 4. Install CRDs +# 4. 
Install CRDs (use 'upgrade' instead of 'install' if already installed) helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${DYNAMO_VERSION}.tgz helm install dynamo-crds dynamo-crds-${DYNAMO_VERSION}.tgz --namespace default -# 5. Install Platform +# 5. Install Platform (use 'upgrade' instead of 'install' if already installed) kubectl create namespace ${NAMESPACE} helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${DYNAMO_VERSION}.tgz helm install dynamo-platform dynamo-platform-${DYNAMO_VERSION}.tgz --namespace ${NAMESPACE} @@ -614,12 +680,6 @@ kubectl get crd | grep dynamo kubectl get pods -n ${NAMESPACE} # Expected: dynamo-operator-* and etcd-* pods Running kubectl get svc -n ${NAMESPACE} - -# 7. Configure Dynamo API access -# Get the dynamo-store service endpoint -kubectl get svc dynamo-store -n ${NAMESPACE} -# Set DYNAMO_CLOUD to the cluster-internal service -export DYNAMO_CLOUD=http://dynamo-store.${NAMESPACE}.svc.cluster.local ``` ### Step 2: Deploy Multiple vLLM Models @@ -646,42 +706,33 @@ kubectl create secret generic hf-token-secret \ -n ${NAMESPACE} # 2. Navigate to your LLM Router directory (where agg.yaml/disagg.yaml are located) -cd "/mnt/raid/examples/customizations/LLM Router/" +cd "customizations/LLM Router/" ``` -#### Option A: Single Model Deployment (Recommended for Single GPU) +#### Shared Frontend Deployment +**Step 1: Deploy Shared Frontend** ```bash -# Set the model you want to deploy -export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct - -# Deploy using aggregated configuration (single GPU) -envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} - -# Or deploy using disaggregated configuration (multiple GPUs) -# envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} +# Deploy the shared frontend service (serves all models) +envsubst < frontend.yaml | kubectl apply -f - -n ${NAMESPACE} ``` -#### Option B: Multiple Model Deployment (Requires Multiple GPUs) +**Step 2: Deploy Model Workers** -For users with multiple GPUs who want to deploy multiple models: +Choose your worker deployment approach: -**Deploy Model 1: Llama-3.1-8B (Fast Model)** +**Option A: Using agg.yaml (aggregated workers)** ```bash +# Deploy model workers only (frontend extracted to frontend.yaml) export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct -envsubst < disagg.yaml | sed 's/vllm-disagg/vllm-8b/g' | kubectl apply -f - -n ${NAMESPACE} -``` - -**Deploy Model 2: Llama-3.1-70B (Powerful Model)** -```bash -export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct -envsubst < disagg.yaml | sed 's/vllm-disagg/vllm-70b/g' | kubectl apply -f - -n ${NAMESPACE} +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} ``` -**Deploy Model 3: Mixtral-8x22B (Creative Model)** +**Option B: Using disagg.yaml (disaggregated workers)** ```bash -export MODEL_NAME=mistralai/Mixtral-8x22B-Instruct-v0.1 -envsubst < disagg.yaml | sed 's/vllm-disagg/vllm-mixtral/g' | kubectl apply -f - -n ${NAMESPACE} +# Deploy separate prefill and decode workers (frontend extracted to frontend.yaml) +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct +envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} ``` ### Adding More Models (Optional) @@ -696,83 +747,60 @@ envsubst < disagg.yaml | sed 's/vllm-disagg/vllm-mixtral/g' | kubectl apply -f - #### Example: Adding Phi-3-Mini Model ```bash -# 1. Deploy the new model (following the same pattern) -cd dynamo/components/backends/vllm/deploy -cp disagg.yaml phi-3-mini-disagg.yaml - -# 2. 
Update model configuration -sed -i 's/Qwen\/Qwen3-0.6B/microsoft\/Phi-3-mini-128k-instruct/g' phi-3-mini-disagg.yaml -sed -i 's/Frontend:/Frontend-Phi3:/g' phi-3-mini-disagg.yaml -sed -i 's/VllmDecodeWorker:/VllmDecodeWorker-Phi3:/g' phi-3-mini-disagg.yaml -sed -i 's/VllmPrefillWorker:/VllmPrefillWorker-Phi3:/g' phi-3-mini-disagg.yaml -sed -i 's/port: 8000/port: 8003/g' phi-3-mini-disagg.yaml - -# 3. Deploy the model -kubectl apply -f phi-3-mini-disagg.yaml -n ${NAMESPACE} - -# 4. Update router configuration -# Edit customizations/LLM Router/router-config-dynamo.yaml -# Add entries like: -# - name: "New Task Type" -# api_base: ${DYNAMO_API_BASE}/v1 -# api_key: "${DYNAMO_API_KEY}" -# model: microsoft/Phi-3-mini-128k-instruct +# Simply set the model name and deploy using existing files +export MODEL_NAME=microsoft/Phi-3-mini-128k-instruct + +# Deploy using aggregated workers +envsubst < agg.yaml | kubectl apply -f - -n ${NAMESPACE} + +# OR deploy using disaggregated workers +envsubst < disagg.yaml | kubectl apply -f - -n ${NAMESPACE} ``` **Repeat this pattern** for any additional models you want to deploy. -### Step 3: Verify Multiple Model Deployments +### Step 3: Verify Shared Frontend Deployment
[![Step 3](https://img.shields.io/badge/Step%203-Verify%20Deployments-green?style=for-the-badge&logo=kubernetes)](https://kubernetes.io) -*Verify that all vLLM models have been deployed successfully* +*Verify that the shared frontend and model workers have been deployed successfully*
```bash -# Check deployment status for all models +# Check deployment status for shared frontend and all model workers kubectl get pods -n ${NAMESPACE} kubectl get svc -n ${NAMESPACE} -# Look for all model-specific pods and services -kubectl get pods -n ${NAMESPACE} | grep -E "(8B|70B|Mixtral|frontend|worker)" +# Verify shared frontend is running +kubectl logs deployment/frontend -n ${NAMESPACE} --tail=10 -# Check services for each model (should see different ports) -kubectl get svc -n ${NAMESPACE} | grep -E "(frontend|8000|8001|8002)" +# Look for all model worker pods +kubectl get pods -n ${NAMESPACE} | grep -E "(worker|decode|prefill)" -# Verify each model's frontend is running -echo "Checking Llama-8B model..." -kubectl logs deployment/frontend-8b -n ${NAMESPACE} --tail=10 - -echo "Checking Llama-70B model..." -kubectl logs deployment/frontend-70b -n ${NAMESPACE} --tail=10 - -echo "Checking Mixtral model..." -kubectl logs deployment/frontend-mixtral -n ${NAMESPACE} --tail=10 - -# Verify the disaggregated architecture for each model -kubectl get pods -n ${NAMESPACE} | grep -E "(prefill|vllm)" | sort +# Verify the shared frontend service (single port for all models) +kubectl get svc -n ${NAMESPACE} | grep frontend ``` -### Step 4: Test Multiple vLLM Services +### Step 4: Test Shared Frontend Service
[![Step 4](https://img.shields.io/badge/Step%204-Test%20Services-purple?style=for-the-badge&logo=checkmarx)](https://checkmarx.com) -*Test all deployed vLLM services* +*Test the shared frontend service with different models*
-#### Test Llama-8B Model (Port 8000) - ```bash -# Forward the Llama-8B service port -kubectl port-forward svc/frontend-8b-service 8000:8000 -n ${NAMESPACE} & +# Forward the shared frontend service port +kubectl port-forward svc/frontend-service 8000:8000 -n ${NAMESPACE} & -# Test the 8B model +# Test different models through the same endpoint by specifying the model name + +# Test Model 1 (e.g., Llama-3.1-8B) curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ @@ -782,51 +810,19 @@ curl localhost:8000/v1/chat/completions \ "max_tokens": 30 }' | jq -# Check health and models -curl localhost:8000/health -curl localhost:8000/v1/models | jq -``` - -#### Test Llama-70B Model (Port 8001) - -```bash -# Forward the Llama-70B service port -kubectl port-forward svc/frontend-70b-service 8001:8001 -n ${NAMESPACE} & - -# Test the 70B model with a complex task -curl localhost:8001/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": "meta-llama/Llama-3.1-70B-Instruct", - "messages": [{"role": "user", "content": "Explain quantum computing in detail"}], - "stream": false, - "max_tokens": 200 - }' | jq - -# Check health and models -curl localhost:8001/health -curl localhost:8001/v1/models | jq -``` - -#### Test Mixtral Model (Port 8002) - -```bash -# Forward the Mixtral service port -kubectl port-forward svc/frontend-mixtral-service 8002:8002 -n ${NAMESPACE} & - -# Test the Mixtral model -curl localhost:8002/v1/chat/completions \ +# Test Model 2 (e.g., different model if deployed) +curl localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "mistralai/Mixtral-8x22B-Instruct-v0.1", - "messages": [{"role": "user", "content": "Write a creative story about AI"}], + "model": "microsoft/Phi-3-mini-128k-instruct", + "messages": [{"role": "user", "content": "Explain quantum computing briefly"}], "stream": false, - "max_tokens": 150 + "max_tokens": 100 }' | jq -# Check health and models -curl localhost:8002/health -curl localhost:8002/v1/models | jq +# Check health and available models +curl localhost:8000/health +curl localhost:8000/v1/models | jq ``` ### Step 5: Set Up LLM Router API Keys @@ -872,7 +868,7 @@ kubectl get secrets -n llm-router -**Note**: The NVIDIA LLM Router does not have an official Helm repository. You must clone the GitHub repository and deploy using local Helm charts. +**Note**: The NVIDIA LLM Router requires building images from source and using the official Helm charts from the GitHub repository. ```bash # 1. Clone the NVIDIA LLM Router repository (required for Helm charts) @@ -880,17 +876,30 @@ git clone https://github.com/NVIDIA-AI-Blueprints/llm-router.git cd llm-router # 2. Build and push LLM Router images to your registry -docker build -t your-registry.com/router-server:latest -f src/router-server/router-server.dockerfile . -docker build -t your-registry.com/router-controller:latest -f src/router-controller/router-controller.dockerfile . -docker push your-registry.com/router-server:latest -docker push your-registry.com/router-controller:latest - -# 3. Create API key secret (using dummy key for Dynamo integration) -kubectl create secret generic llm-api-keys \ - --from-literal=nvidia_api_key=dummy-key-for-dynamo \ +docker build -t /router-server:latest -f src/router-server/router-server.dockerfile . +docker build -t /router-controller:latest -f src/router-controller/router-controller.dockerfile . +docker build -t /llm-router-client:app -f demo/app/app.dockerfile . 
+ +# Push to your registry +docker push /router-server:latest +docker push /router-controller:latest +docker push /llm-router-client:app + + +# 3. Create router configuration ConfigMap with environment variable substitution +# Set environment variables for template substitution +export DYNAMO_API_BASE="http://frontend-service.${NAMESPACE}.svc.cluster.local:8000" +# Note: DYNAMO_API_KEY will be empty (local Dynamo doesn't require authentication) + +# Create ConfigMap with substituted values +envsubst < ../customizations/LLM\ Router/router-config-dynamo.yaml | \ +kubectl create configmap router-config-dynamo \ + --from-file=config.yaml=/dev/stdin \ --namespace=llm-router # 4. Prepare router models (download from NGC) +# Download the NemoCurator Prompt Task and Complexity Classifier model from NGC: +# https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/prompt-task-and-complexity-classifier/version # Follow the main project README to download models to local 'routers/' directory # Then create PVC and upload models: @@ -933,43 +942,17 @@ kubectl wait --for=condition=ready pod/model-uploader -n llm-router --timeout=60 kubectl cp routers/ llm-router/model-uploader:/models/ kubectl delete pod model-uploader -n llm-router -# 5. Create custom values file for Dynamo integration -cat > values.dynamo.yaml </router-controller \ + --set routerServer.image.repository=/router-server \ + --set imagePullSecrets[0].name=nvcr-secret \ --wait --timeout=10m -# 7. Verify LLM Router deployment +# 6. Verify LLM Router deployment kubectl get pods -n llm-router kubectl get svc -n llm-router ``` @@ -996,11 +979,11 @@ curl http://localhost:8084/health ### Ingress Configuration -The LLM Router is configured with ingress enabled for external access: +The LLM Router is configured with ingress disabled by default to avoid service name conflicts. To enable external access: ```yaml ingress: - enabled: true + enabled: false # Disabled by default - enable after deployment is working className: "nginx" # Adjust for your ingress controller hosts: - host: llm-router.local # Change to your domain @@ -1032,9 +1015,9 @@ The router configuration uses **environment variable substitution** for secure A # In router-config-dynamo.yaml llms: - name: Brainstorming - api_base: http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8000/v1 + api_base: http://frontend-service.dynamo-kubernetes.svc.cluster.local:8000/v1 api_key: "${DYNAMO_API_KEY}" # Resolved from Kubernetes secret - model: llama-3.1-70b-instruct + model: meta-llama/Llama-3.1-70B-Instruct ``` The LLM Router controller: @@ -1048,40 +1031,41 @@ The LLM Router controller: The `router-config-dynamo.yaml` configures routing policies to our deployed models. 
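Before creating the `router-config-dynamo` ConfigMap, it can help to preview how these policies resolve. A small sketch run from the `customizations/LLM Router` directory; substitution is restricted to `DYNAMO_API_BASE` so the `${DYNAMO_API_KEY}` placeholder is left intact for the router controller to resolve at runtime:

```bash
# Preview one resolved routing policy entry without touching ${DYNAMO_API_KEY}
export DYNAMO_API_BASE="http://frontend-service.${NAMESPACE}.svc.cluster.local:8000"
envsubst '${DYNAMO_API_BASE}' < router-config-dynamo.yaml | grep -A3 'name: Brainstorming'
```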
-**Important**: This configuration only references the **3 models we actually deploy**: -- `meta-llama/Llama-3.1-8B-Instruct` (fast, simple tasks) -- `meta-llama/Llama-3.1-70B-Instruct` (powerful, complex tasks) -- `mistralai/Mixtral-8x22B-Instruct-v0.1` (creative, conversational tasks) +**Current Setup**: The configuration routes to different models based on task complexity and type: +- `meta-llama/Llama-3.1-8B-Instruct` - Fast model for simple tasks +- `meta-llama/Llama-3.1-70B-Instruct` - Powerful model for complex tasks +- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative model for conversational tasks -**Note**: All models use the same Dynamo gateway endpoint - Dynamo handles internal routing based on the model parameter: +**Note**: All routing goes through the shared frontend service which handles model selection: -| **Task Router** | **Model** | **Dynamo Gateway** | **Use Case** | +| **Task Router** | **Model** | **Shared Frontend** | **Use Case** | |-----------------|-----------|--------------|--------------| -| Brainstorming | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Creative ideation | -| Chatbot | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Conversational AI | -| Code Generation | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Programming tasks | -| Summarization | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Text summarization | -| Text Generation | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | General text creation | -| Open QA | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Complex questions | -| Closed QA | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Simple Q&A | -| Classification | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Text classification | -| Extraction | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Information extraction | -| Rewrite | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Text rewriting | - -| **Complexity Router** | **Model** | **Dynamo Gateway** | **Use Case** | +| Brainstorming | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative ideation | +| Chatbot | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Conversational AI | +| Code Generation | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Programming tasks | +| Summarization | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text summarization | +| Text Generation | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | General text creation | +| Open QA | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex questions | +| Closed QA | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple Q&A | +| Classification | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Text classification | +| Extraction | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Information extraction | +| Rewrite | `meta-llama/Llama-3.1-8B-Instruct` | 
`http://frontend-service.${NAMESPACE}:8000/v1` | Text rewriting | + +| **Complexity Router** | **Model** | **Shared Frontend** | **Use Case** | |----------------------|-----------|--------------|--------------| -| Creativity | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Creative tasks | -| Reasoning | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Complex reasoning | -| Contextual-Knowledge | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Knowledge-intensive | -| Few-Shot | `meta-llama/Llama-3.1-70B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Few-shot learning | -| Domain-Knowledge | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Specialized domains | -| No-Label-Reason | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Simple reasoning | -| Constraint | `meta-llama/Llama-3.1-8B-Instruct` | `http://dynamo-gateway.${NAMESPACE}:8000/v1` | Constrained tasks | - -**Routing Strategy**: +| Creativity | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Creative tasks | +| Reasoning | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Complex reasoning | +| Contextual-Knowledge | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Knowledge-intensive | +| Few-Shot | `meta-llama/Llama-3.1-70B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Few-shot learning | +| Domain-Knowledge | `mistralai/Mixtral-8x22B-Instruct-v0.1` | `http://frontend-service.${NAMESPACE}:8000/v1` | Specialized domains | +| No-Label-Reason | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Simple reasoning | +| Constraint | `meta-llama/Llama-3.1-8B-Instruct` | `http://frontend-service.${NAMESPACE}:8000/v1` | Constrained tasks | + +**Intelligent Routing Strategy**: - **Simple tasks** → `meta-llama/Llama-3.1-8B-Instruct` (fast, efficient) - **Complex tasks** → `meta-llama/Llama-3.1-70B-Instruct` (powerful, detailed) - **Creative/Conversational** → `mistralai/Mixtral-8x22B-Instruct-v0.1` (diverse, creative) +- **Extensible**: Add more models by deploying additional workers and updating router configuration ## Testing the Integration @@ -1182,33 +1166,12 @@ kubectl delete namespace ${NAMESPACE} docker compose -f deploy/metrics/docker-compose.yml down ``` -## Quick Configuration Checklist - -### Pre-Deployment Checklist - -Before deployment, ensure you have configured these key settings: - -1. **Environment Variables**: - - ✅ Set `NAMESPACE=dynamo-kubernetes` and `DYNAMO_VERSION=0.4.0` - - ✅ Set `HF_TOKEN` for Hugging Face model access - - ✅ Set `DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0` - - ✅ Optional: Set `NGC_API_KEY` for private NGC images - -2. **Model Configuration in `disagg.yaml`**: - - ✅ Update model names for each deployment (Llama-8B, Llama-70B, Mixtral) - - ✅ Adjust service names to avoid conflicts (Frontend-8B, Frontend-70B, etc.) - - ✅ Configure different ports for each model (8000, 8001, 8002) - - ✅ Adjust GPU resources based on your hardware - -3. 
**Router Configuration**: - - ✅ Update `router-config-dynamo.yaml` with correct model endpoints - - ✅ Update `llm-router-values-override.yaml` with your domain - - ✅ Ensure API keys are properly configured for router integration - ## Files in This Directory - **`README.md`** - This comprehensive deployment guide -- **`disagg.yaml`** - Official vLLM backend disaggregated service configuration (copied from [components/backends/vllm/deploy/disagg.yaml](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/disagg.yaml)) +- **`frontend.yaml`** - Shared OpenAI-compatible API frontend service configuration +- **`agg.yaml`** - Aggregated worker configuration (frontend extracted to frontend.yaml) +- **`disagg.yaml`** - Disaggregated worker configuration with separate prefill/decode workers (frontend extracted to frontend.yaml) - **`router-config-dynamo.yaml`** - Router configuration for Dynamo integration - **`llm-router-values-override.yaml`** - Helm values override for LLM Router with Dynamo integration diff --git a/customizations/LLM Router/agg.yaml b/customizations/LLM Router/agg.yaml index 953ff69..1a6f775 100644 --- a/customizations/LLM Router/agg.yaml +++ b/customizations/LLM Router/agg.yaml @@ -7,13 +7,6 @@ metadata: name: vllm-agg spec: services: - Frontend: - dynamoNamespace: vllm-agg - componentType: frontend - replicas: 1 - extraPodSpec: - mainContainer: - image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} VllmDecodeWorker: envFromSecret: hf-token-secret dynamoNamespace: vllm-agg diff --git a/customizations/LLM Router/disagg.yaml b/customizations/LLM Router/disagg.yaml index cdca9c1..0f7d0e9 100644 --- a/customizations/LLM Router/disagg.yaml +++ b/customizations/LLM Router/disagg.yaml @@ -1,17 +1,5 @@ # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. apiVersion: nvidia.com/v1alpha1 kind: DynamoGraphDeployment @@ -19,13 +7,6 @@ metadata: name: vllm-disagg spec: services: - Frontend: - dynamoNamespace: vllm-disagg - componentType: frontend - replicas: 1 - extraPodSpec: - mainContainer: - image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} VllmDecodeWorker: dynamoNamespace: vllm-disagg envFromSecret: hf-token-secret @@ -59,4 +40,4 @@ spec: - /bin/sh - -c args: - - "python3 -m dynamo.vllm --model ${MODEL_NAME} --is-prefill-worker" + - "python3 -m dynamo.vllm --model ${MODEL_NAME} --is-prefill-worker" \ No newline at end of file diff --git a/customizations/LLM Router/frontend.yaml b/customizations/LLM Router/frontend.yaml new file mode 100644 index 0000000..5829c5c --- /dev/null +++ b/customizations/LLM Router/frontend.yaml @@ -0,0 +1,16 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +apiVersion: nvidia.com/v1alpha1 +kind: DynamoGraphDeployment +metadata: + name: vllm-agg +spec: + services: + Frontend: + dynamoNamespace: vllm-agg + componentType: frontend + replicas: 1 + extraPodSpec: + mainContainer: + image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml index 5c29249..1a6931e 100644 --- a/customizations/LLM Router/llm-router-values-override.yaml +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -24,8 +24,8 @@ # Dynamo Configuration - will be created as ConfigMap dynamoConfig: - api_base: "http://dynamo-llm-service.dynamo-cloud.svc.cluster.local:8080" - namespace: "dynamo-cloud" + api_base: "http://frontend-service.${NAMESPACE}.svc.cluster.local:8000" + namespace: "${NAMESPACE}" # For external Dynamo deployments, use: # api_base: "https://your-dynamo-endpoint.com" @@ -33,7 +33,7 @@ dynamoConfig: routerController: replicaCount: 1 image: - repository: nvcr.io/nvidia/router-controller + repository: /router-controller # Override with --set during deployment tag: latest pullPolicy: IfNotPresent @@ -71,7 +71,7 @@ routerController: routerServer: replicaCount: 1 image: - repository: nvcr.io/nvidia/router-server + repository: /router-server # Override with --set during deployment tag: latest pullPolicy: IfNotPresent @@ -104,9 +104,9 @@ config: configMap: name: router-config-dynamo -# Ingress Configuration (enabled for external access) +# Ingress Configuration (disabled to avoid service name conflicts) ingress: - enabled: true + enabled: false className: "nginx" # Use your cluster's ingress class annotations: nginx.ingress.kubernetes.io/rewrite-target: / @@ -121,9 +121,9 @@ ingress: pathType: Prefix backend: service: - name: llm-router + name: llm-router-router-controller # Match actual service name created by Helm port: - number: 8000 # Router service port + number: 8084 # Router controller service port tls: [] # - secretName: llm-router-tls # hosts: @@ -156,7 +156,7 @@ securityContext: # Image Pull Secrets (if needed for private registries) imagePullSecrets: [] - # - name: nvidia-registry-secret + # - name: nvcr-secret # Cross-namespace service access rbac: From 93b2e7041c352c16d0c102896fe398e82c7136ec Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Thu, 4 Sep 2025 04:46:59 +0000 Subject: [PATCH 15/17] Update README.md and router-config-dynamo.yaml for model configuration changes - Revised README.md to reflect updated worker templates and multi-model support, enhancing clarity on deployment configurations. - Changed model reference in router-config-dynamo.yaml from `meta-llama/Llama-3.1-70B-Instruct` to `mistralai/Mixtral-8x22B-Instruct-v0.1`, aligning with the new deployment strategy. - Improved health check configuration details in README.md to ensure proper service-level setup. 
--- customizations/LLM Router/README.md | 29 ++++++++++--------- .../LLM Router/router-config-dynamo.yaml | 2 +- 2 files changed, 17 insertions(+), 14 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index b3fb47c..e82ff8b 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -304,11 +304,11 @@ The deployment now uses a **shared frontend architecture** that splits the origi - Reduces resource overhead compared to per-model frontends - Uses official NGC Dynamo vLLM Runtime container from `DYNAMO_IMAGE` variable -- **decode-worker.yaml**: Template for model-specific decode workers - - VllmDecodeWorker: Decode worker with conditional disaggregation and KV caching (1 GPU per model) - - VllmPrefillWorker: Specialized prefill worker for high-throughput token processing (1 GPU per model) +- **agg.yaml / disagg.yaml**: Templates for model-specific workers + - **agg.yaml**: Aggregated worker configuration with VllmDecodeWorker (1 GPU per model) + - **disagg.yaml**: Disaggregated worker configuration with separate VllmDecodeWorker and VllmPrefillWorker (1 GPU each) - Common: Shared configuration (model, block-size, KV connector) - - Deployed per model with unique names and configurations + - Deployed per model with unique names using environment variables #### Configuration Files @@ -341,15 +341,15 @@ The deployment now uses a **shared frontend architecture** that splits the origi The deployment uses the official disaggregated serving architecture based on [Dynamo's vLLM backend deployment reference](https://github.com/ai-dynamo/dynamo/tree/main/components/backends/vllm/deploy): **Key Features**: -- **Model**: `Qwen/Qwen3-0.6B` (optimized for disaggregated inference) +- **Multi-Model Support**: Deploy multiple models (Llama-3.1-8B, Llama-3.1-70B, Mixtral-8x22B) using environment variables - **KV Transfer**: Uses `DynamoNixlConnector` for high-performance KV cache transfer - **Conditional Disaggregation**: Automatically switches between prefill and decode workers - **Remote Prefill**: Offloads prefill operations to dedicated VllmPrefillWorker instances - **Prefix Caching**: Enables intelligent caching for improved performance - **Block Size**: 64 tokens for optimal memory utilization -- **Max Model Length**: 16,384 tokens context window -- **Autoscaling**: Optional Planner component for dynamic worker scaling based on load metrics -- **Load Prediction**: ARIMA-based load forecasting for proactive scaling +- **Max Model Length**: 16,384+ tokens context window (varies by model) +- **Shared Frontend**: Single frontend serves all deployed models +- **Intelligent Routing**: LLM Router selects optimal model based on task complexity @@ -378,7 +378,7 @@ For optimal deployment experience, consider model size vs. resources: - `meta-llama/Llama-3.1-8B-Instruct` - Balanced performance (15GB) - `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Ultra-fast (2GB) -> **💡 Health Check Configuration**: The `frontend.yaml` and `decode-worker.yaml` include extended health check timeouts (30 minutes) to allow sufficient time for model download and loading. Health checks must be configured at the service level, not in `extraPodSpec`, for the Dynamo operator to respect them. The shared frontend architecture reduces the number of health checks needed compared to per-model frontends. 
+> **💡 Health Check Configuration**: The `frontend.yaml` and `disagg.yaml` include extended health check timeouts (30 minutes) to allow sufficient time for model download and loading. Health checks must be configured at the service level, not in `extraPodSpec`, for the Dynamo operator to respect them. The shared frontend architecture reduces the number of health checks needed compared to per-model frontends. **NGC Setup Instructions**: 1. **Choose Dynamo Version**: Visit [NGC Dynamo vLLM Runtime Tags](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime/tags) to see available versions @@ -1149,10 +1149,13 @@ For detailed debugging, refer to the Kubernetes documentation or the specific co helm uninstall llm-router -n llm-router kubectl delete namespace llm-router -# Remove all model deployments -kubectl delete -f llama-8b-disagg.yaml -n ${NAMESPACE} -kubectl delete -f llama-70b-disagg.yaml -n ${NAMESPACE} -kubectl delete -f mixtral-disagg.yaml -n ${NAMESPACE} +# Remove all model deployments (use the same files you deployed with) +# If you used agg.yaml: +# kubectl delete -f agg.yaml -n ${NAMESPACE} +# If you used disagg.yaml: +# kubectl delete -f disagg.yaml -n ${NAMESPACE} +# Remove shared frontend +kubectl delete -f frontend.yaml -n ${NAMESPACE} # Remove Hugging Face token secret kubectl delete secret hf-token-secret -n ${NAMESPACE} diff --git a/customizations/LLM Router/router-config-dynamo.yaml b/customizations/LLM Router/router-config-dynamo.yaml index c2af5ae..fa50641 100644 --- a/customizations/LLM Router/router-config-dynamo.yaml +++ b/customizations/LLM Router/router-config-dynamo.yaml @@ -86,7 +86,7 @@ policies: - name: Other api_base: ${DYNAMO_API_BASE}/v1 api_key: "${DYNAMO_API_KEY}" - model: meta-llama/Llama-3.1-70B-Instruct + model: mistralai/Mixtral-8x22B-Instruct-v0.1 # Creative/Conversational tasks → Mixtral model - name: Chatbot From ff3bcb78cd15dc31a52de055a3c9bc36a81d7aff Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Thu, 4 Sep 2025 04:58:09 +0000 Subject: [PATCH 16/17] Update README.md to enhance environment variable documentation for LLM Router deployment - Revised environment variable section to include additional variables such as `DYNAMO_IMAGE`, `DYNAMO_API_BASE`, and `DYNAMO_API_KEY`, clarifying their usage in deployment. - Updated model recommendations to reflect current configurations and performance insights. - Improved instructions for setting up environment variables, ensuring clarity for users during deployment. 
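If a stricter check than the echo-based validation is preferred, bash's `${VAR:?}` expansion aborts immediately when a required variable is missing. A small optional sketch using the same variable names documented above:

```bash
# Fail fast if any required deployment variable is unset or empty
: "${NAMESPACE:?NAMESPACE is required}"
: "${DYNAMO_VERSION:?DYNAMO_VERSION is required}"
: "${MODEL_NAME:?MODEL_NAME is required}"
: "${DYNAMO_IMAGE:?DYNAMO_IMAGE is required}"
: "${HF_TOKEN:?HF_TOKEN is required}"
echo "All required variables are set"
```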
--- customizations/LLM Router/README.md | 51 ++++++++++++++++++++--------- 1 file changed, 35 insertions(+), 16 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index e82ff8b..afb29d6 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -355,13 +355,18 @@ The deployment uses the official disaggregated serving architecture based on [Dy ### Environment Variables -Set the required environment variables for NGC deployment: - -| Variable | Description | Example | Required | -|----------|-------------|---------|----------| -| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.1` | Yes | -| `MODEL_NAME` | Hugging Face model to deploy | `Qwen/Qwen2.5-1.5B-Instruct` | Yes | -| `NGC_API_KEY` | NVIDIA NGC API key (optional for public images) | `your-ngc-api-key` | No | +Set the required environment variables for deployment: + +| Variable | Description | Example | Required | Used In | +|----------|-------------|---------|----------|---------| +| `NAMESPACE` | Kubernetes namespace for deployment | `dynamo-kubernetes` | Yes | All deployments | +| `DYNAMO_VERSION` | Dynamo vLLM runtime version | `0.4.1` | Yes | Platform install | +| `MODEL_NAME` | Hugging Face model to deploy | `meta-llama/Llama-3.1-8B-Instruct` | Yes | Model deployment | +| `DYNAMO_IMAGE` | Full Dynamo runtime image path | `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1` | Yes | Model deployment | +| `HF_TOKEN` | Hugging Face access token | `your_hf_token` | Yes | Model access | +| `NGC_API_KEY` | NVIDIA NGC API key | `your-ngc-api-key` | No | Private images | +| `DYNAMO_API_BASE` | Dynamo service endpoint URL | `http://frontend-service.dynamo-kubernetes.svc.cluster.local:8000` | Yes | LLM Router | +| `DYNAMO_API_KEY` | Dynamo API authentication key | `your-dynamo-api-key-here` | No | LLM Router auth | ### Model Size Recommendations @@ -374,9 +379,11 @@ For optimal deployment experience, consider model size vs. resources: | **Large (70B+)** | ~40GB+ | 30+ minutes | Multi-GPU setups | **Recommended Models:** -- `Qwen/Qwen2.5-1.5B-Instruct` - Fast, good quality (3GB) -- `meta-llama/Llama-3.1-8B-Instruct` - Balanced performance (15GB) -- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Ultra-fast (2GB) +- `meta-llama/Llama-3.1-8B-Instruct` - Balanced performance, used in router config (15GB) +- `meta-llama/Llama-3.1-70B-Instruct` - High performance, used in router config (40GB+) +- `mistralai/Mixtral-8x22B-Instruct-v0.1` - Creative tasks, used in router config (90GB+) +- `Qwen/Qwen2.5-1.5B-Instruct` - Fast testing model (3GB) +- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` - Ultra-fast testing (2GB) > **💡 Health Check Configuration**: The `frontend.yaml` and `disagg.yaml` include extended health check timeouts (30 minutes) to allow sufficient time for model download and loading. Health checks must be configured at the service level, not in `extraPodSpec`, for the Dynamo operator to respect them. The shared frontend architecture reduces the number of health checks needed compared to per-model frontends. 
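To match the size guidance above against the cluster you are deploying to, it helps to check how many GPUs the scheduler actually sees. A small sketch using the standard `nvidia.com/gpu` resource name (assumes the NVIDIA device plugin is installed):

```bash
# Allocatable NVIDIA GPUs per node, to pick a model size that fits
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Or scan the node descriptions for GPU capacity
kubectl describe nodes | grep -E 'Name:|nvidia\.com/gpu'
```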
@@ -583,12 +590,21 @@ python -c "import yaml; yaml.safe_load(open('llm-router-values-override.yaml'))" ### Environment Setup ```bash +# Core deployment variables export NAMESPACE=dynamo-kubernetes export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog -export MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct # Choose your model (see recommendations above) export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} + +# Model deployment variables +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model (see recommendations above) export HF_TOKEN=your_hf_token + +# Optional variables export NGC_API_KEY=your-ngc-api-key # Optional for public images + +# LLM Router variables (set during router deployment) +export DYNAMO_API_BASE="http://frontend-service.${NAMESPACE}.svc.cluster.local:8000" +export DYNAMO_API_KEY="your-dynamo-api-key-here" # Optional for local deployments ``` ### Validate Environment Variables @@ -598,9 +614,11 @@ export NGC_API_KEY=your-ngc-api-key # Optional for public images echo "NAMESPACE: ${NAMESPACE:-'NOT SET'}" echo "DYNAMO_VERSION: ${DYNAMO_VERSION:-'NOT SET'}" echo "MODEL_NAME: ${MODEL_NAME:-'NOT SET'}" -echo "HF_TOKEN: ${HF_TOKEN:-'NOT SET'}" echo "DYNAMO_IMAGE: ${DYNAMO_IMAGE:-'NOT SET'}" +echo "HF_TOKEN: ${HF_TOKEN:-'NOT SET'}" echo "NGC_API_KEY: ${NGC_API_KEY:-'NOT SET (optional for public images)'}" +echo "DYNAMO_API_BASE: ${DYNAMO_API_BASE:-'NOT SET (set during router deployment)'}" +echo "DYNAMO_API_KEY: ${DYNAMO_API_KEY:-'NOT SET (optional for local deployments)'}" ``` ## Deployment Guide @@ -652,7 +670,7 @@ graph LR ```bash -# 1. Set environment +# 1. Set environment (core variables for platform deployment) export NAMESPACE=dynamo-kubernetes export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog export NGC_API_KEY=your-ngc-api-key @@ -697,8 +715,10 @@ kubectl get svc -n ${NAMESPACE} Since our LLM Router routes to different models based on task complexity, we can deploy models using environment variables. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): ```bash -# 1. Set up Hugging Face token for model access +# 1. Set up model deployment variables +export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model export HF_TOKEN=your_hf_token +export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} # Create Kubernetes secret for Hugging Face token kubectl create secret generic hf-token-secret \ @@ -887,9 +907,8 @@ docker push /llm-router-client:app # 3. Create router configuration ConfigMap with environment variable substitution -# Set environment variables for template substitution +# Set DYNAMO_API_BASE for template substitution (DYNAMO_API_KEY comes from secret) export DYNAMO_API_BASE="http://frontend-service.${NAMESPACE}.svc.cluster.local:8000" -# Note: DYNAMO_API_KEY will be empty (local Dynamo doesn't require authentication) # Create ConfigMap with substituted values envsubst < ../customizations/LLM\ Router/router-config-dynamo.yaml | \ From 858a2fbe9aa1f04c25a27481dc975adf9b1d4e8e Mon Sep 17 00:00:00 2001 From: Arun Raman Date: Fri, 5 Sep 2025 06:02:02 +0000 Subject: [PATCH 17/17] Update llm-router-values-override.yaml and README.md for improved deployment instructions - Added image pull secret configuration in llm-router-values-override.yaml to support private registry access. 
- Revised README.md to enhance clarity on environment variable usage and deployment steps, including updates to model deployment instructions and Kubernetes secret references. - Streamlined instructions for creating ConfigMaps and verifying deployments to ensure a smoother user experience. --- customizations/LLM Router/README.md | 34 +++++-------------- .../llm-router-values-override.yaml | 4 +-- 2 files changed, 10 insertions(+), 28 deletions(-) diff --git a/customizations/LLM Router/README.md b/customizations/LLM Router/README.md index afb29d6..1def4bd 100644 --- a/customizations/LLM Router/README.md +++ b/customizations/LLM Router/README.md @@ -634,7 +634,6 @@ echo "DYNAMO_API_KEY: ${DYNAMO_API_KEY:-'NOT SET (optional for local deployments --- - ### Deployment Overview
@@ -670,28 +669,16 @@ graph LR ```bash -# 1. Set environment (core variables for platform deployment) -export NAMESPACE=dynamo-kubernetes -export DYNAMO_VERSION=0.4.1 # Choose your Dynamo version from NGC catalog -export NGC_API_KEY=your-ngc-api-key - -# 2. Clone repository -git clone https://github.com/ai-dynamo/dynamo.git -cd dynamo - -# 3. Login to NGC -docker login nvcr.io --username '$oauthtoken' --password $NGC_API_KEY - -# 4. Install CRDs (use 'upgrade' instead of 'install' if already installed) +# 1. Install CRDs (use 'upgrade' instead of 'install' if already installed) helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-${DYNAMO_VERSION}.tgz helm install dynamo-crds dynamo-crds-${DYNAMO_VERSION}.tgz --namespace default -# 5. Install Platform (use 'upgrade' instead of 'install' if already installed) +# 2. Install Platform (use 'upgrade' instead of 'install' if already installed) kubectl create namespace ${NAMESPACE} helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-${DYNAMO_VERSION}.tgz helm install dynamo-platform dynamo-platform-${DYNAMO_VERSION}.tgz --namespace ${NAMESPACE} -# 6. Verify deployment +# 3. Verify deployment # Check CRDs kubectl get crd | grep dynamo # Check operator and platform pods @@ -712,15 +699,10 @@ kubectl get svc -n ${NAMESPACE} -Since our LLM Router routes to different models based on task complexity, we can deploy models using environment variables. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): +Since our LLM Router routes to different models based on task complexity, we can deploy models using the environment variables already set in Step 1. Following the official [vLLM backend deployment guide](https://github.com/ai-dynamo/dynamo/blob/main/components/backends/vllm/deploy/README.md#3-deploy): ```bash -# 1. Set up model deployment variables -export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct # Choose your model -export HF_TOKEN=your_hf_token -export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:${DYNAMO_VERSION} - -# Create Kubernetes secret for Hugging Face token +# 1. Create Kubernetes secret for Hugging Face token (using variables from Step 1) kubectl create secret generic hf-token-secret \ --from-literal=HF_TOKEN=${HF_TOKEN} \ -n ${NAMESPACE} @@ -907,11 +889,11 @@ docker push /llm-router-client:app # 3. Create router configuration ConfigMap with environment variable substitution -# Set DYNAMO_API_BASE for template substitution (DYNAMO_API_KEY comes from secret) +# Set DYNAMO_API_BASE for template substitution (DYNAMO_API_KEY comes from dynamo-api-secret) export DYNAMO_API_BASE="http://frontend-service.${NAMESPACE}.svc.cluster.local:8000" # Create ConfigMap with substituted values -envsubst < ../customizations/LLM\ Router/router-config-dynamo.yaml | \ +envsubst < router-config-dynamo.yaml | \ kubectl create configmap router-config-dynamo \ --from-file=config.yaml=/dev/stdin \ --namespace=llm-router @@ -1040,7 +1022,7 @@ llms: ``` The LLM Router controller: -1. Reads `DYNAMO_API_KEY` from the Kubernetes secret +1. Reads `DYNAMO_API_KEY` from the `dynamo-api-secret` Kubernetes secret 2. Replaces `${DYNAMO_API_KEY}` placeholders in the configuration 3. 
Uses the actual API key value for authentication with Dynamo services diff --git a/customizations/LLM Router/llm-router-values-override.yaml b/customizations/LLM Router/llm-router-values-override.yaml index 1a6931e..eaec97f 100644 --- a/customizations/LLM Router/llm-router-values-override.yaml +++ b/customizations/LLM Router/llm-router-values-override.yaml @@ -155,8 +155,8 @@ securityContext: runAsUser: 1000 # Image Pull Secrets (if needed for private registries) -imagePullSecrets: [] - # - name: nvcr-secret +imagePullSecrets: + - name: nvcr-secret # Cross-namespace service access rbac:
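The `nvcr-secret` referenced by `imagePullSecrets` above (and by the `--set imagePullSecrets[0].name=nvcr-secret` flag in the Helm deploy step) must exist in the `llm-router` namespace. A minimal sketch for creating it, assuming the same NGC credentials used for `docker login nvcr.io` earlier in this guide:

```bash
# Create the image pull secret referenced by the values override
kubectl create secret docker-registry nvcr-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}" \
  -n llm-router
```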