12 changes: 11 additions & 1 deletion docs/en/guide/recipes/create-workload.md
@@ -62,4 +62,14 @@ See all configuration options [Workload Configuration](/reference/workload-annot

## Step 3. Verify the App Status

1. Check that a new container named `inject-lib` appears in your pods.

2. Open a shell in the injected container (`tensor-fusion.ai/inject-container`) and run:

   ```bash
   nvidia-smi
   ```

3. Verify that:

- The command runs successfully

- The GPU memory quota has been updated to match your `tensor-fusion.ai/vram-limit` setting (see the example commands below)
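For a concrete check, a minimal sketch with `kubectl` is shown below; the pod name is a placeholder, and the injected container name (`inject-lib`, per step 1) may differ in your setup:

```bash
# List the containers in the pod and confirm the injected one is present
kubectl get pod <your-pod-name> -o jsonpath='{.spec.containers[*].name}'

# Run nvidia-smi inside the injected container
kubectl exec -it <your-pod-name> -c inject-lib -- nvidia-smi
```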
120 changes: 115 additions & 5 deletions docs/en/guide/recipes/migrate-existing.md
@@ -1,11 +1,121 @@

# Migrate Existing Workload to TensorFusion

This guide walks you through migrating your existing GPU workloads to TensorFusion's virtualized GPU infrastructure. The migration process is designed to be gradual and safe, allowing you to test the new setup before fully switching over.

## Prerequisites

- Existing workload running on physical GPUs
- TensorFusion cluster deployed and configured
- Access to your current workload's GPU specifications

## Step 1: Map Current GPU Requests to vGPU TFlops/VRAM Requests

Before migrating, you need to understand your current GPU resource requirements and map them to TensorFusion's vGPU specifications.

### 1.1 Identify Your GPU Instance Type

First, determine the GPU instance type currently used by your workload:

```bash
# Check your current pod's GPU requests
kubectl describe pod <your-pod-name> | grep -A 5 "Requests:"
```

### 1.2 Query GPU Specifications

Look up the total TFlops and VRAM specifications for your GPU instance type. You can find this information in:
- Cloud provider documentation (AWS, GCP, Azure)
- GPU manufacturer specifications (NVIDIA, AMD)
- Your cluster's node specifications (see the example below)
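For the last option, one quick way to see the GPU model on each node is through node labels. This is a sketch that assumes the NVIDIA GPU feature discovery labels (for example `nvidia.com/gpu.product`) are present in your cluster; the label key depends on your GPU operator setup:

```bash
# Show the GPU product label per node (label key is an assumption; adjust to your environment)
kubectl get nodes -L nvidia.com/gpu.product
```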

### 1.3 Configure Pod Annotations

Add the following annotations to your workload's pod specification to match your current GPU resources:

```yaml
metadata:
annotations:
tensor-fusion.ai/tflops-limit: "{total_tflops_of_instance_type}"
tensor-fusion.ai/tflops-request: "{total_tflops_of_instance_type}"
tensor-fusion.ai/vram-limit: "{total_vram_of_instance_type}"
tensor-fusion.ai/vram-request: "{total_vram_of_instance_type}"
```

**Example:**
```yaml
metadata:
annotations:
tensor-fusion.ai/tflops-limit: "312"
tensor-fusion.ai/tflops-request: "312"
tensor-fusion.ai/vram-limit: "24Gi"
tensor-fusion.ai/vram-request: "24Gi"
```

## Step 2: Deploy and Test New Workload with TensorFusion

Deploy a test version of your workload using TensorFusion's GPU pool to validate the migration.

### 2.1 Enable TensorFusion for Your Workload

Add the following configuration to enable TensorFusion:

**Labels:**
```yaml
metadata:
labels:
tensor-fusion.ai/enabled: "true"
```

**Annotations:**
```yaml
metadata:
annotations:
tensor-fusion.ai/enabled-replicas: "1" # Start with 1 replica for testing
```
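Combined with the Step 1 annotations, a minimal sketch of how these fields might look in a Deployment is shown below. The Deployment name, image, and the placement of the label and annotations on the pod template are illustrative assumptions; check the Workload Configuration reference for the exact placement TensorFusion expects:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gpu-app                  # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-gpu-app
  template:
    metadata:
      labels:
        app: my-gpu-app
        tensor-fusion.ai/enabled: "true"
      annotations:
        tensor-fusion.ai/enabled-replicas: "1"
        tensor-fusion.ai/tflops-request: "312"
        tensor-fusion.ai/tflops-limit: "312"
        tensor-fusion.ai/vram-request: "24Gi"
        tensor-fusion.ai/vram-limit: "24Gi"
    spec:
      containers:
        - name: app
          image: registry.example.com/my-gpu-app:latest   # illustrative image
```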

### 2.2 Deploy Test Workload

Deploy your workload with the TensorFusion configuration:

```bash
kubectl apply -f your-workload-with-tensorfusion.yaml
```
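After applying, you can watch the rollout before moving on to validation. A small sketch, assuming a Deployment named `my-gpu-app` (adjust the name and label selector to your workload):

```bash
# Wait for the TensorFusion-enabled pods to become ready
kubectl rollout status deployment/my-gpu-app

# List the new pods
kubectl get pods -l app=my-gpu-app -o wide
```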

### 2.3 Validate the Migration

Test your workload to ensure it functions correctly with virtualized GPUs:

- Verify GPU resource allocation
- Run your typical workload tests
- Monitor performance metrics
- Check for any compatibility issues
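For the checks above, a few illustrative spot checks, assuming the injected container is named `inject-lib` as in the create-workload guide (pod names are placeholders):

```bash
# Confirm the vGPU annotations are present on the test pod
kubectl get pod <test-pod-name> -o jsonpath='{.metadata.annotations}'

# Confirm the GPU is visible and the memory quota matches your vram-limit
kubectl exec -it <test-pod-name> -c inject-lib -- nvidia-smi

# Sample resource usage while running your usual tests (requires metrics-server)
kubectl top pod <test-pod-name>
```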

## Step 3: Gradual Traffic Migration

Once testing is successful, gradually shift traffic from your old workload to the new TensorFusion-enabled workload.

### 3.1 Control Traffic Distribution

Use the `enabled-replicas` annotation to control the percentage of pods using virtualized GPUs:

```yaml
metadata:
annotations:
tensor-fusion.ai/enabled-replicas: "{number_of_replicas_to_use_tensorfusion}"
```

**Migration Strategy:**
- Start with 25% of replicas: `tensor-fusion.ai/enabled-replicas: "2"` (if you have 8 total replicas)
- Gradually increase to 50%, 75%, and finally 100%
- Monitor performance and stability at each stage
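As a worked example of this strategy with 8 total replicas, the annotation value would step through the following stages (illustrative progression; re-apply your manifest and monitor at each step):

```yaml
# Stage 1 (25%): 2 of 8 replicas use TensorFusion
tensor-fusion.ai/enabled-replicas: "2"
---
# Stage 2 (50%): 4 of 8 replicas
tensor-fusion.ai/enabled-replicas: "4"
---
# Stage 3 (75%): 6 of 8 replicas
tensor-fusion.ai/enabled-replicas: "6"
---
# Stage 4 (100%): all 8 replicas
tensor-fusion.ai/enabled-replicas: "8"
```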

### 3.2 Complete Migration

When you're confident in the new setup, set all replicas to use TensorFusion:

```yaml
metadata:
annotations:
tensor-fusion.ai/enabled-replicas: "{total_replicas}"
```