Mobile Video VAEs Trained with Only One RTX 4090 GPU!
Ya Zou*, Jingfeng Yao*, Siyuan Yu, Shuai Zhang, Wenyu Liu, Xinggang Wang📧
Huazhong University of Science and Technology (HUST)
(* equal contribution, 📧 corresponding author: [email protected])
- [2025.09.04] We have released the pre-trained weights!
- [2025.08.25] We have released the code; the weights will be available soon.
- [2025.08.13] We have released our paper on arXiv.
There is growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) is one of the major computational bottlenecks: large parameter counts and compute kernels poorly matched to mobile hardware cause out-of-memory errors or extremely slow inference. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices.
(1) We analyze the redundancy in existing VAE architectures and derive empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the parameter count.
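To see why this reduces parameters, here is a minimal PyTorch sketch of a 3D depthwise separable convolution; the layer names and channel sizes below are illustrative assumptions, not the exact Turbo-VAED block:

```python
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        # groups=in_ch makes the 3D convolution depthwise:
        # each channel is filtered independently across (T, H, W).
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        # A 1x1x1 pointwise convolution then mixes information across channels.
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.pointwise(self.depthwise(x))

# Parameter count vs. a dense 3D convolution (128 -> 128 channels, 3x3x3 kernel):
dense = nn.Conv3d(128, 128, 3, padding=1)
sep = DepthwiseSeparableConv3d(128, 128)
print(sum(p.numel() for p in dense.parameters()))  # 442,496
print(sum(p.numel() for p in sep.parameters()))    # 20,096
```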
(2) We observe that the upsampling operators in mainstream video VAEs are poorly suited to mobile hardware and form the main latency bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that substantially reduces end-to-end latency. Building on these components, we develop a universal mobile-oriented VAE decoder, Turbo-VAED.
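A minimal sketch of the general technique, assuming standard channel-to-space and channel-to-time rearrangements (the exact decoupling used in Turbo-VAED may differ): spatial and temporal upsampling are performed as two separate pixel-shuffle steps rather than one fused 3D rearrangement:

```python
import torch
import torch.nn.functional as F

def spatial_pixel_shuffle_3d(x, r):
    """Channels-to-space for 5D tensors: (B, C*r*r, T, H, W) -> (B, C, T, H*r, W*r)."""
    b, c, t, h, w = x.shape
    # Fold time into the batch dimension so 2D pixel_shuffle can be reused per frame.
    x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    x = F.pixel_shuffle(x, r)
    c2, h2, w2 = x.shape[1:]
    return x.reshape(b, t, c2, h2, w2).permute(0, 2, 1, 3, 4)

def temporal_pixel_shuffle_3d(x, r):
    """Channels-to-time: (B, C*r, T, H, W) -> (B, C, T*r, H, W)."""
    b, c, t, h, w = x.shape
    x = x.reshape(b, c // r, r, t, h, w)   # split channels into (C, r) groups
    x = x.permute(0, 1, 3, 2, 4, 5)        # (B, C, T, r, H, W)
    return x.reshape(b, c // r, t * r, h, w)

x = torch.randn(1, 16, 4, 8, 8)
print(spatial_pixel_shuffle_3d(x, 2).shape)   # torch.Size([1, 4, 4, 16, 16])
print(temporal_pixel_shuffle_3d(x, 2).shape)  # torch.Size([1, 8, 8, 8, 8])
```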
(3) We propose an efficient VAE decoder training method. Since only the decoder is used at deployment time, we distill the original decoder into Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss.
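A minimal sketch of one distillation step, assuming a diffusers-style VAE interface (`encode(...).latent_dist` / `decode(...).sample`) and a simple MSE objective; the actual loss terms and training schedule in the paper may differ:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher_vae, student_decoder, optimizer, video):
    # Encode with the frozen teacher; only the student decoder is trained.
    with torch.no_grad():
        latents = teacher_vae.encode(video).latent_dist.sample()
        teacher_recon = teacher_vae.decode(latents).sample
    student_recon = student_decoder(latents)
    # Match the student both to the ground-truth video and to the teacher output.
    loss = F.mse_loss(student_recon, video) + F.mse_loss(student_recon, teacher_recon)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```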
To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time, and it is widely applicable to most video VAEs. When integrated into four representative models, with a training cost as low as $95, it accelerates the original VAEs by up to 84.5× at 720p resolution on GPUs, uses as little as 17.5% of the original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9× FPS speedup and better reconstruction quality on the iPhone 16 Pro.
conda create -n turbovaed python=3.10.0
conda activate turbovaed
pip install -r requirements.txt
- Download Video Datasets & Teacher Models
You can download video datasets such as VidGen and UCF-101. The video data should be placed in a root directory, which may consist of multiple subdirectories.
You can download LTX-VAE, Hunyuan-VAE, CogVideoX-VAE, or any other video VAE you want to distill.
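For example, a teacher VAE can be loaded through diffusers; the repo id and subfolder below follow the public CogVideoX release and are shown only as an illustration (the LTX-Video and HunyuanVideo VAEs have analogous diffusers classes):

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# Load the CogVideoX VAE from the Hugging Face Hub as a teacher.
teacher_vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda")
teacher_vae.requires_grad_(False)  # the teacher stays frozen during distillation
```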
- (Optional) You can pre-generate and save latents for small video datasets to reduce the computational cost of encoding during training.
python train_vae/generate_latents.py
You can then use the dataset implementation in train_vae/dataset/video_latent_dataset.py for training.
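Conceptually, pre-generating latents looks like the sketch below; the loader, dtype handling, and file layout here are assumptions, and train_vae/generate_latents.py is the authoritative script:

```python
import os
import torch

@torch.no_grad()
def cache_latents(teacher_vae, video_loader, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    teacher_vae.eval()
    for i, video in enumerate(video_loader):  # video: (B, C, T, H, W)
        video = video.to(device=teacher_vae.device, dtype=teacher_vae.dtype)
        # Encode once with the frozen teacher and save the latents to disk,
        # so training no longer pays the encoder's cost every step.
        latents = teacher_vae.encode(video).latent_dist.sample()
        torch.save(latents.cpu(), os.path.join(out_dir, f"latents_{i:06d}.pt"))
```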
- Modify the necessary paths in train.sh as required.
- Run the following command to start training:
bash train.sh
- Try using the trained model to reconstruct videos! Run the following command:
python validation_videos.py
- Calculate metrics.
Download the pretrained weights required for computing rFVD, update the paths for loading the model weights and the validation dataset directory in the code, and run the following command to compute the rFVD, PSNR, LPIPS, and SSIM metrics.
torchrun --nnodes=1 --nproc_per_node=1 train_vae/validation_metrics.py
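For reference, the per-frame metrics can be sketched with torchmetrics as below; this is only an illustration of the metric definitions, not the repo's validation_metrics.py (rFVD, which needs the pretrained weights mentioned above, is omitted):

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg")

def video_metrics(pred, target):
    """pred/target: (B, C, T, H, W) videos in [0, 1]; metrics averaged over frames."""
    b, c, t, h, w = pred.shape
    pred_f = pred.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    tgt_f = target.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    return {
        "psnr": psnr(pred_f, tgt_f).item(),
        "ssim": ssim(pred_f, tgt_f).item(),
        # LPIPS expects inputs scaled to [-1, 1].
        "lpips": lpips(pred_f * 2 - 1, tgt_f * 2 - 1).item(),
    }
```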
The Turbo-VAED code is built mainly on Open-Sora-Plan and diffusers. Thanks for these great works!
If you find Turbo-VAED useful, please consider giving us a star 🌟 and citing it as follows:
@misc{zou2025turbovaedfaststabletransfer,
title={Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices},
author={Ya Zou and Jingfeng Yao and Siyuan Yu and Shuai Zhang and Wenyu Liu and Xinggang Wang},
year={2025},
eprint={2508.09136},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.09136},
}