This repository contains the VASA implementation separated from EMOPortraits, with all components properly configured for standalone training.
- Clean separation of VASA motion generation from EMOPortraits volumetric rendering
 - Bridge interface for easy swapping of volumetric avatar backends
 - XY/UV warping system for expression transfer and canonical view generation
 - Efficient caching with single-bucket preprocessing
 - Multi-mode training support (overfitting, full dataset)
 
- MCP Server Setup (for Claude integration):
 
```bash
# Add Weights & Biases MCP server for Claude
claude mcp add wandb -- uvx --from git+https://github.com/wandb/wandb-mcp-server wandb_mcp_server && uvx wandb login
```

- Clone the repository with submodules:
 
```bash
# Clone with submodules included
git clone --recurse-submodules https://github.com/johndpope/VASA-1-hack.git
cd VASA-1-hack

# Or if you already cloned without submodules:
git submodule update --init --recursive
```

```bash
# Install system dependencies
sudo apt-get update
sudo apt-get install -y ffmpeg git-lfs
```
```bash
# Install Miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
chmod +x ~/miniconda.sh
~/miniconda.sh
# Accept the license when prompted (type "yes")

# Create conda environment
conda create -n vasa python=3.12
conda activate vasa
```
```bash
# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129

# Install required packages
pip install omegaconf wandb opencv-python-headless pillow scipy matplotlib tqdm
pip install transformers diffusers accelerate einops
pip install facenet-pytorch insightface hsemotion-onnx
pip install mediapipe
pip install memory-profiler rich
pip install h5py scikit-learn seaborn python_speech_features
pip install onnxruntime-gpu lpips pytorch_msssim
```
```bash
# EMOPortraits setup
cd nemo
chmod +x ./bootstrap.sh
./bootstrap.sh
```
- Create necessary symlinks:
 
```bash
# Create symlink for repos (required for relative paths)
ln -s nemo/repos repos

# Create symlink for data directory (required for aligned keypoints)
ln -s nemo/data data

# Create symlink for losses directory (required for loss model weights)
ln -s nemo/losses losses
```

- Download pre-trained volumetric avatar model:
 
The pre-trained model should be placed in:
```
nemo/logs/Retrain_with_17_V1_New_rand_MM_SEC_4_drop_02_stm_10_CV_05_1_1/checkpoints/328_model.pth
```
- Prepare your training data:
 
```bash
# Create directories
mkdir -p junk cache checkpoints

# Place your training videos in the junk directory
# Videos should be .mp4 format
cp your_training_videos/*.mp4 junk/
```

Project structure:

```
VASA-1-hack/
├── nemo/                        # Git submodule: nemo repository (base EMOPortraits code)
│   ├── models/                  # Model implementations
│   ├── networks/                # Network architectures
│   ├── losses/                  # Loss functions
│   ├── datasets/                # Dataset loaders
│   ├── repos/                   # External repositories (face_par_off, etc.)
│   └── logs/                    # Pre-trained model checkpoints
│
├── vasa_*.py                    # VASA-specific implementations
│   ├── vasa_trainer.py          # Main training script
│   ├── vasa_model.py            # VASA model architecture
│   ├── vasa_dataset.py          # VASA dataset handler
│   ├── vasa_scheduler.py        # Diffusion scheduler
│   └── vasa_lip_normalizer.py   # Lip normalization utilities
│
├── vasa_config.yaml             # Main configuration file
├── video_tracker.py             # Video tracking utilities
├── syncnet.py                   # Sync network implementation
│
├── data/                        # Data files
│   └── aligned_keypoints_3d.npy
├── losses/                      # Loss model weights
│   └── loss_model_weights/
├── junk/                        # Training videos directory
├── cache/                       # Cache for processed data
├── checkpoints/                 # Model checkpoints
└── repos/                       # Symlink to nemo/repos
```
Edit vasa_config.yaml to configure paths and training parameters:
```yaml
paths:
  volumetric_model: "nemo/logs/[...]/328_model.pth"  # Pre-trained model
  volumetric_config: "nemo/models/stage_1/volumetric_avatar/va.yaml"
  data_dir: "data"
  video_folder: "junk"  # Your training videos directory
  cache_dir: "cache"
  checkpoint_dir: "checkpoints"

train:
  batch_size: 1
  num_epochs: 4000
  lr: 1e-3
  # ... other training parameters
```

Then verify the setup:

```bash
python test_vasa_setup.py
```

Expected output:
```
✓ Config loaded successfully
✓ All paths exist
✓ All modules import correctly
✓ Setup looks good! You can now run vasa_trainer.py
```
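The config is plain YAML, presumably read through OmegaConf (it is in the dependency list). A minimal sketch of loading it and spot-checking the paths, using the key names shown in the excerpt above:

```python
from pathlib import Path
from omegaconf import OmegaConf

cfg = OmegaConf.load("vasa_config.yaml")

# Spot-check the paths the trainer relies on (key names follow the config excerpt above)
for key in ["volumetric_model", "volumetric_config", "data_dir",
            "video_folder", "cache_dir", "checkpoint_dir"]:
    p = Path(cfg.paths[key])
    print(f"{key:20s} {p} {'OK' if p.exists() else 'MISSING'}")

print("epochs:", cfg.train.num_epochs, "lr:", cfg.train.lr)
```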
Test your setup and verify the model can train properly:
```bash
# Run overfitting test with optimized settings
python train_overfit.py
```

This uses `overfit_config.yaml` with:
- Single-bucket caching for fast data loading
 - Face attribute caching (gaze, emotion, head_distance)
 - Optimized batch sizes and learning rates
 - WandB integration for monitoring
 - Automatic checkpoint resumption
 
Use the standard configuration for training on your complete dataset:
```bash
# Uses vasa_config.yaml by default
python vasa_trainer.py

# Or explicitly specify the config
python vasa_trainer.py --config vasa_config.yaml
```

Key parameters in `vasa_config.yaml`:
- `window_size: 50` - Full 50-frame windows
- `n_layers: 8` - Full 8 transformer layers
- `num_steps: 1000` - Full 1000 diffusion steps
- `batch_size: 1` - Adjust based on GPU memory
- `num_epochs: 4000` - Full training schedule
Use the overfitting configuration via vasa_trainer:
```bash
# Use the overfitting configuration with vasa_trainer
python vasa_trainer.py --config overfit_config.yaml
```

Key differences in `overfit_config.yaml`:
- `window_size: 20` - Smaller windows for faster processing
- `n_layers: 2` - Reduced transformer depth (2x-4x faster)
- `num_steps: 100` - Reduced diffusion steps (10x faster)
- `batch_size: 4` - Larger batch for better GPU utilization
- `num_epochs: 100` - Shorter training for quick iteration
- `max_videos: 100` - Limited dataset size
- `num_workers: 8` - Multi-threaded data loading
- No augmentation - Pure overfitting test
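A small sketch for confirming which values the overfit run actually overrides, by diffing the two configs with OmegaConf (the flattening helper is illustrative; key nesting in the real files may differ):

```python
from omegaconf import OmegaConf

def flatten(d, prefix=""):
    """Flatten a nested dict into dotted keys."""
    out = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

base = flatten(OmegaConf.to_container(OmegaConf.load("vasa_config.yaml"), resolve=True))
over = flatten(OmegaConf.to_container(OmegaConf.load("overfit_config.yaml"), resolve=True))

# Show every setting the overfit config overrides (e.g., window_size, n_layers, num_steps)
for key in sorted(over):
    if base.get(key) != over[key]:
        print(f"{key}: {base.get(key)} -> {over[key]}")
```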
 
When to use overfitting mode:
- Testing new model architectures
 - Debugging training pipeline
 - Verifying data loading and caching
 - Quick convergence tests
 - Checking if model can overfit to small dataset (sanity check)
 
For faster training, preprocess all windows into a single cache file:
```bash
# Preprocess data for overfitting test (small dataset)
python preprocess_single_bucket.py --max_videos 100 --cache_dir cache_overfit

# Preprocess full dataset
python preprocess_single_bucket.py --max_videos 1000 --cache_dir cache_full
```

Benefits of single-bucket caching:
- 10x faster data loading - Direct index access to any window
 - Face attributes cached - Gaze, emotion, head_distance pre-computed
 - Better shuffling - Perfect for random sampling
 - Memory efficient - One H5 file instead of many
 - Self-contained windows - Context is cached, no video dependencies
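A minimal sketch of how a single-bucket cache could be read by index with `h5py` (the file name and dataset keys here are assumptions for illustration; see `preprocess_single_bucket.py` and `vasa_dataset.py` for the real schema):

```python
import h5py
import torch
from torch.utils.data import Dataset

class SingleBucketWindows(Dataset):
    """Reads preprocessed training windows directly by index from one H5 file."""

    def __init__(self, cache_path: str = "cache_overfit/single_bucket.h5"):  # hypothetical name
        self.h5 = h5py.File(cache_path, "r")
        self.num_windows = self.h5["xy_warps"].shape[0]  # hypothetical dataset key

    def __len__(self) -> int:
        return self.num_windows

    def __getitem__(self, idx: int) -> dict:
        # Direct index access: no video decoding or per-frame preprocessing at train time
        return {
            "xy_warps": torch.from_numpy(self.h5["xy_warps"][idx]),
            "uv_warps": torch.from_numpy(self.h5["uv_warps"][idx]),
            "gaze": torch.from_numpy(self.h5["gaze"][idx]),  # cached face attribute
        }
```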
 
The cache will be automatically used if:
- `use_single_bucket: true` is set in your config file
- The cache file exists in the specified `cache_dir`
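A quick Python check for both conditions (the `.h5` extension and the `cache_overfit` directory follow the single-H5-file description above; the exact file name is not assumed):

```python
from pathlib import Path

# Condition 1: the flag appears in the config file
print("flag set:", "use_single_bucket" in Path("overfit_config.yaml").read_text())

# Condition 2: a preprocessed cache file is present in cache_dir
print("cache files:", list(Path("cache_overfit").glob("*.h5")))
```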
Both training modes support WandB logging:
```bash
# View training progress
# Visit the URL printed at training start, e.g.:
# wandb: 🚀 View run at https://wandb.ai/your-username/vasa/runs/run-id
```

For overfitting mode, runs are grouped as "overfit-experiments" in WandB for easy comparison.
To use a different dataset (e.g., CelebV-HQ):
```bash
# Edit the config file or create a custom one
# Update video_folder path in the config:
# video_folder: "/path/to/your/dataset"

# For example, using CelebV-HQ:
# video_folder: "/media/12TB/Downloads/CelebV-HQ/celebvhq/35666"
```

The trainer will:
- Load the pre-trained volumetric avatar model
 - Process videos from the configured directory
 - Cache processed windows for faster subsequent epochs
 - Save checkpoints periodically based on `save_freq`
 - Save checkpoints to `checkpoints/` (or `checkpoints_overfit/` for overfitting mode)
 - Log to Weights & Biases (if enabled)
 
| Parameter | Vanilla Training | Overfitting Mode | Speedup | 
|---|---|---|---|
| Window Size | 50 frames | 20 frames | 2.5x | 
| Transformer Layers | 8 | 2 | 4x | 
| Diffusion Steps | 1000 | 100 | 10x | 
| Batch Size | 1 | 4 | 4x | 
| Workers | 0 | 8 | Parallel loading | 
| Epoch Time (RTX 5090) | ~5 min | ~1.5 min | 3.3x | 
| Convergence | 1000+ epochs | 10-20 epochs | 50x+ | 
The project includes several debugging pipelines for analyzing face swap and identity preservation issues:
```bash
# Test with video (uses joint extraction to prevent identity drift)
python nemo/pipeline3.py --target nemo/data/VID_1.mp4 --max-frames 10

# Test with single image
python nemo/pipeline3.py --target nemo/data/IMG_2.png

# Use custom source identity
python nemo/pipeline3.py --source path/to/source.png --target path/to/target.mp4

# Swap identity mode (use driver's identity with source's expression)
python nemo/pipeline3.py --default-video --swap-identity
# This is useful when the model is extracting the wrong identity
```

Features:
- Joint extraction: Processes source+first_driver_frame together to calibrate embeddings
 - Identity swapping: `--swap-identity` flag to use driver's identity with source's expression
 - Comprehensive tracing: Every step logged with images and tensors
 - Comparison grids: Side-by-side visualization of results
 - Warp visualization: XY/UV warp magnitude heatmaps
 - Debug output: All intermediates saved to `debug_pipeline3/`
```bash
# The reference pipeline that produces correct results
python nemo/pipeline2.py
```

This is the baseline implementation that pipeline3.py was designed to match.
Various analysis scripts for specific debugging:
- `check_identity_confusion.py` - Analyze identity preservation
- `debug_identity_extraction.py` - Test identity feature extraction
- `test_polished_face_swap.py` - Test face swap quality
- `extract_and_apply_warps_properly.py` - Analyze warp field application
The volumetric avatar system uses two types of warps:
- XY Warps (Rigid + Non-rigid 3D warping)
  - Transform from posed face → canonical (neutral) space
  - Removes head pose and expression from source
  - Creates identity-preserving canonical volume
 
- UV Warps (Expression transfer)
  - Transform from canonical → target expression
  - Applies target's expression and pose
  - Preserves source identity while adopting target motion
 
 
- Problem: Generated face morphs away from source identity
  - Cause: Solo extraction (processing source alone without driver context)
  - Solution: Joint extraction - process source+first_driver_frame together
- Problem: Male faces (e.g., IMG_1.png) appear feminine in results
  - Cause: Identity embeddings not properly calibrated to driver motion space
  - Solution: Joint extraction ensures embeddings are aligned with driver poses
```
debug_pipeline3/
├── trace_YYYYMMDD_HHMMSS.json    # Complete execution trace
├── step_NNNN_*.png                # Intermediate images at each step
├── step_NNNN_*.pt                 # Tensor checkpoints
├── frame_NNN_result.png           # Final output frames
└── video_comparison.png           # Grid comparison of all frames
```
The trace files contain detailed information about each processing step:
- Entry/exit points for all major functions
 - Tensor shapes and statistics
 - Mask generation and compositing steps
 - Warp field generation and application
 
Use the trace to identify where identity drift or other issues occur in the pipeline.
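A minimal sketch for skimming the newest trace (the JSON layout, including the `steps` and `name` keys, is an assumption; adapt it to what your trace actually contains):

```python
import glob
import json

# Pick the most recent trace written by pipeline3
trace_path = sorted(glob.glob("debug_pipeline3/trace_*.json"))[-1]
with open(trace_path) as f:
    trace = json.load(f)

# Print a coarse overview of the recorded steps (keys assumed for illustration)
for step in trace.get("steps", []):
    print(step.get("name"), step.get("tensor_shape"))
```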
The VASA model uses a sophisticated two-stage warping system to separate identity from expression, enabling clean expression transfer between faces.
- Coordinate System: XY refers to spatial coordinates (X=width, Y=height) in the 3D volume space (16×64×64 grid)
 - Direction: FROM current expression → TO canonical (neutral)
 - Purpose: Expression normalization - removes the current expression to get back to a neutral state
 - Effect: "Undoes" expressions (e.g., moves smiling mouth corners back to neutral positions)
 - Applied to: The source volume before any target expression is added
 
- Coordinate System: UV uses texture/surface coordinates (0-1 normalized range)
 - Direction: FROM canonical → TO target expression
 - Purpose: Expression application - adds the desired expression to the neutral volume
 - Effect: Deforms canonical volume to create new expressions (smile, frown, surprise, etc.)
 - Applied to: The volume after XY warping (canonical state)
 
Source Face (😊) → [XY Warp] → Canonical (😐) → [UV Warp] → Target Face (😮)
- Stage 1 (XY Warping): Normalizes any expression to canonical
 - Stage 2 (UV Warping): Applies target expression to canonical
 
This separation enables:
- Clean expression transfer between any source and target
 - Identity preservation while changing expressions
 - Consistent canonical representation for all faces
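To make the two-stage idea concrete, here is a minimal PyTorch sketch of applying dense 3D warp fields to a feature volume with `grid_sample`. The channel count, the interpolation settings, and the assumption that warps are stored as normalized sampling grids are illustrative; the actual pipeline may compose the rigid and non-rigid warps differently.

```python
import torch
import torch.nn.functional as F

def apply_warp(volume: torch.Tensor, warp: torch.Tensor) -> torch.Tensor:
    """Resample a feature volume with a dense 3D warp field.

    volume: [N, C, D, H, W] feature volume (e.g., features on the 16x64x64 grid)
    warp:   [N, D, H, W, 3] sampling grid in normalized [-1, 1] coordinates,
            as expected by torch.nn.functional.grid_sample
    """
    return F.grid_sample(volume, warp, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Two-stage warping: source expression -> canonical -> target expression
source_volume = torch.randn(1, 96, 16, 64, 64)         # hypothetical channel count
xy_warp = torch.rand(1, 16, 64, 64, 3) * 2 - 1         # removes the source expression
uv_warp = torch.rand(1, 16, 64, 64, 3) * 2 - 1         # applies the target expression

canonical_volume = apply_warp(source_volume, xy_warp)  # Stage 1: normalize to canonical
target_volume = apply_warp(canonical_volume, uv_warp)  # Stage 2: add target expression
```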
 
The warps are extracted during dataset preprocessing:
```python
# In vasa_dataset.py - extract warps for training
motion_data = {
    'xy_warps': xy_warps,        # [T, 16, 64, 64, 3] - normalizes to canonical
    'rigid_warps': rigid_warps,  # [T, 16, 64, 64, 3] - head pose alignment
    'uv_warps': uv_warps,        # [T, 16, 64, 64, 3] - applies target expression
    'source_theta': thetas       # [T, 3, 4] - pose matrices
}
```

To cleanly separate VASA from the volumetric avatar implementation, we've developed a bridge interface that abstracts all EMOPortraits-specific details.
Abstract interface that any volumetric avatar backend must implement:
```python
class VolumetricAvatarBridgeInterface:
    def extract_warps_for_window(frames, identity_frame_idx) -> WindowWarpData
    def extract_warps_for_frame(identity_frame, target_frame) -> FrameWarpData
    def generate_canonical_view(identity_frame) -> canonical_image
    def get_identity_embedding(identity_frame) -> identity_embed
```

Concrete implementation for EMOPortraits/MegaPortraits models:
- Handles all model-specific details internally
 - Provides clean warp extraction API
 - Manages caching for efficiency
 - Supports batch processing for entire windows
 
```python
from vasa_emo_bridge_interface import create_bridge

# Create bridge (abstracts all EMO details)
bridge = create_bridge("emoportraits", emo_model)

# Extract warps for entire window at once
window_warps = bridge.extract_warps_for_window(
    frames=frames,           # [T, C, H, W]
    identity_frame_idx=0     # Use first frame as identity
)

# Access extracted warps
xy_warps = window_warps.xy_warps        # [T, D, H, W, 3]
rigid_warps = window_warps.rigid_warps  # [T, D, H, W, 3]
uv_warps = window_warps.uv_warps        # [T, D, H, W, 3]

# Generate canonical view
canonical = bridge.generate_canonical_view(identity_frame)
```

Benefits of the bridge interface:
- Clean Separation: VASA code doesn't need to know EMOPortraits internals
 - Easy Swapping: Can replace volumetric backend without changing VASA
 - Batch Efficiency: Process entire windows at once
 - Automatic Caching: Identity embeddings cached automatically
 - Type Safety: Clear data structures with type hints
 
The system can generate canonical (neutral, front-facing) views from any input expression:
A canonical view represents a person in a standardized state:
- Neutral expression (no smile, closed mouth)
 - Front-facing pose (no head rotation)
 - Consistent lighting and appearance
 
The canonical view is generated as follows:
- Extract identity embedding from the source frame
 - Create canonical pose (identity matrix = no rotation)
 - Process through volumetric model to get canonical volume
 - Decode with minimal warping to get neutral view
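Continuing the bridge usage example above, a small sketch of this process; the `[3, 4]` pose parameterization follows `source_theta` in the motion data, and treating the identity transform as the canonical pose mirrors step 2:

```python
import torch

# Step 2: canonical pose = identity rotation with zero translation
# ([3, 4] affine matrix, matching the source_theta format used in the motion data)
canonical_theta = torch.cat([torch.eye(3), torch.zeros(3, 1)], dim=1)

# Steps 1, 3 and 4 are wrapped by the bridge: extract the identity embedding,
# build the canonical volume, and decode it with minimal warping
canonical_image = bridge.generate_canonical_view(identity_frame)
```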
 
Canonical views are used for:
- Reference frame generation for consistent motion synthesis
 - Expression normalization for training
 - Identity preservation during expression transfer
 - Quality evaluation of the volumetric model
 
When given different expressions as input, the canonical generation produces nearly identical neutral views:
- Average difference between canonical views: < 0.1 (excellent consistency)
 - Identity fully preserved
 - All expressions normalized to neutral
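A minimal sketch of this consistency check, assuming canonical views come back as image tensors on a [0, 1] scale and reusing the bridge from the earlier example:

```python
import torch

def canonical_consistency(bridge, frames) -> float:
    """Average absolute difference between canonical views generated from different expressions."""
    canonicals = [bridge.generate_canonical_view(f) for f in frames]
    diffs = [(a - b).abs().mean().item()
             for i, a in enumerate(canonicals)
             for b in canonicals[i + 1:]]
    return sum(diffs) / len(diffs)

# Values well below 0.1 indicate all expressions collapse to the same neutral view
```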
 
The project uses Python's logging module with three configurable levels defined in nemo/logger.py:28-30:
```python
# log_level = logging.WARNING    # Minimal output - only warnings and errors
log_level = logging.INFO         # Standard output - informational messages (default)
# log_level = logging.DEBUG      # Verbose output - detailed debugging information
```

Logging Levels Explained:
- WARNING (`logging.WARNING`)
  - Shows only warnings, errors, and critical messages
  - Use when you want minimal console output during training
  - Best for production runs where you only need to know about issues
- INFO (`logging.INFO`) - Currently Active
  - Shows informational messages, warnings, and errors
  - Provides training progress, epoch updates, and key metrics
  - Default and recommended level for normal training runs
  - Balances visibility with readability
- DEBUG (`logging.DEBUG`)
  - Shows all messages including detailed debugging information
  - Includes tensor shapes, gradient information, and internal state
  - Use when troubleshooting model issues or understanding data flow
  - Can be verbose - recommended only for debugging sessions
 
 
To change the logging level:
- Edit `nemo/logger.py` line 29
- Uncomment the desired level and comment out the others
- The change takes effect on next run
 
Additional Features:
- Logs are saved to the `project.log` file for later review
- Rich formatting with color-coded output and timestamps
- Third-party library logging is suppressed to reduce noise
- TorchDebugger class available for advanced PyTorch debugging
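A minimal usage sketch with the standard `logging` module, assuming `nemo/logger.py` configures the root logger (console formatting plus the `project.log` file handler) as described:

```python
import logging

log = logging.getLogger(__name__)

def report_epoch(epoch: int, loss: float) -> None:
    # Only visible when log_level is DEBUG
    log.debug("epoch %d: detailed tensor/gradient diagnostics go here", epoch)
    # Visible at the default INFO level
    log.info("epoch %d finished, loss=%.4f", epoch, loss)
    # Always visible at WARNING and above
    if loss != loss:  # NaN check
        log.warning("epoch %d produced a NaN loss", epoch)
```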
 
- `ModuleNotFoundError: No module named 'logger'`
  - The logger module is in `nemo`; paths are already configured. If you still have issues, check that `nemo` is cloned properly.
- `FileNotFoundError: './repos/face_par_off/res/cp/79999_iter.pth'`
  - Ensure the symlink exists: `ln -s nemo/repos repos`
- `ValueError: num_samples should be a positive integer value, but got num_samples=0`
  - No videos found. Add videos to the `junk/` directory: `cp your_video.mp4 junk/`
- `FileNotFoundError: Config file not found at channel_config.yaml`
  - Copy it from EMOPortraits or create a basic one
- `CUDA out of memory`
  - Reduce `batch_size` in `vasa_config.yaml`
  - Enable gradient checkpointing
  - Reduce `sequence_length` in the dataset config
- FFmpeg warnings
  - These can be safely ignored if not processing audio
  - To fix: `pip install ffmpeg-python`
 
If you're missing files, you'll need these from EMOPortraits:
- `channel_config.yaml` - Channel configuration
- `syncnet.py` - Sync network implementation
- `data/aligned_keypoints_3d.npy` - 3D keypoint alignments
- `losses/loss_model_weights/*.pth` - Pre-trained loss models
- Pre-trained volumetric avatar checkpoint
 
Training progress is logged to:
- Console: Real-time training metrics
 - Weights & Biases: Detailed metrics and visualizations (if enabled)
 - Checkpoints: Saved every N epochs to `checkpoints/`
Monitor training:
```bash
# Watch training logs
tail -f project.log

# Check W&B dashboard
# https://wandb.ai/YOUR_USERNAME/vasa/
```

Code organization:
- VASA-specific code: Root directory (`vasa_*.py`)
- Base EMOPortraits code: `nemo/` directory
- Configuration: `vasa_config.yaml`
- Training data: `junk/` directory
- Model outputs: `checkpoints/` directory
- Separated VASA components from EMOPortraits codebase
 - Fixed all hardcoded paths to be relative or configurable
 - Proper module imports with sys.path management
 - Configurable paths via vasa_config.yaml
 - Auto-detection of project directories in nemo code
 - Clean separation between VASA-specific and base code
 
Update nemo to latest version:
```bash
cd nemo
git pull origin main
cd ..
git add nemo
git commit -m "Update nemo submodule to latest"
```

Lock to specific nemo version:
```bash
cd nemo
git checkout <commit-hash>
cd ..
git add nemo
git commit -m "Lock nemo to specific version"
```

- The volumetric model must be pre-trained (from EMOPortraits)
 - Training requires at least one video in the `junk/` directory
 - All paths in configs are relative to the project root
 - The `repos` symlink is required for backward compatibility
- Training requires significant GPU memory (recommended: 24GB+)
 - Some imports show FFmpeg warnings (can be ignored)
 - Initial dataset processing can be slow (cached afterward)
 
This project is licensed under the MIT License - see the LICENSE file for details.
Note: The nemo submodule and other dependencies may have their own licenses.
- EMOPortraits team for the base implementation
 - VASA paper authors for the architecture design
 - Contributors to the nemo repository