NaVILA is a two-level framework that combines vision-language-action models (VLAs) with locomotion skills for navigation. It generates high-level language-based commands, while a real-time locomotion policy ensures obstacle avoidance.
- Release model/weights/evaluation.
- Release training code. (around June 30th)
- Release YouTube Human Touring dataset. (around June 30th)
- Release Isaac Sim evaluation; please see here.
To build the environment for training NaVILA, please run the following:
./environment_setup.sh navila
conda activate navila

Optional: If you plan to use TensorBoard for logging, install tensorboardX via pip.
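A minimal sketch of that optional install, run inside the navila environment:

```bash
# Optional dependency for TensorBoard logging.
pip install tensorboardX
```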
For general VQA datasets like video_chatgpt, sharegpt_video, sharegpt4v_sft, please follow the data preparation instructions in NVILA.
We provide annotations for envdrop, scanqa, r2r, rxr, and human on Hugging Face.
Please download the repo and extract the tar.gz files in their respective subfolders.
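The exact download commands are not spelled out above; a minimal sketch using huggingface-cli, where the dataset repo name and local directory are assumptions:

```bash
# Assumed repo name; replace with the actual Hugging Face dataset repo if it differs.
huggingface-cli download a8cheng/NaVILA-Dataset --repo-type dataset --local-dir NaVILA-Dataset

# Extract every tar.gz archive into the subfolder that contains it.
find NaVILA-Dataset -name "*.tar.gz" -execdir tar -xzf {} \;
```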
- YouTube Human Touring: Due to copyright restrictions, raw videos/images are not released; we provide video IDs and annotations. You can download the videos using yt-dlp and extract frames using scripts/extract_rawframes.py (see the sketch after this list).
- EnvDrop: Due to the large number of videos, we provide annotations only. Please download the R2R augmented split from R2R_VLNCE_v1-3_preprocessed.zip and render the corresponding videos using VLN-CE.
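A minimal sketch of the YouTube Human Touring download and frame extraction, assuming the layout shown below and one video ID per line in video_ids.txt; the arguments of scripts/extract_rawframes.py are not documented here, so check the script before running it:

```bash
# Download each video listed in video_ids.txt with yt-dlp.
mkdir -p NaVILA-Dataset/Human/videos
while read -r vid; do
    yt-dlp -f mp4 -o "NaVILA-Dataset/Human/videos/${vid}.mp4" "https://www.youtube.com/watch?v=${vid}"
done < NaVILA-Dataset/Human/video_ids.txt

# Extract raw frames from the downloaded videos; see the script for its expected arguments.
python scripts/extract_rawframes.py
```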
The data should be structured as follows:
NaVILA-Dataset
├─ EnvDrop
|   ├─ videos
|   |    ├─ 1.mp4
|   |    ├─ ...
|   ├─ annotations.json
├─ Human
|   ├─ raw_frames
|   |    ├─ Aei0GpsWNys
|   |    |    ├─ 0001.jpg
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ videos
|   |    ├─ Aei0GpsWNys.mp4
|   |    ├─ ...
|   ├─ annotations.json
|   ├─ video_ids.txt
├─ R2R
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ RxR
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ ScanQA
|   ├─ videos
|   |    ├─ scene0760_00.mp4
|   |    ├─ ...
|   ├─ annotations
|   |    ├─ ScanQA_v1.0_train_reformat.json
|   |    ├─ ...

The pretrained model to start from is provided in a8cheng/navila-siglip-llama3-8b-v1.5-pretrain. Please modify the data paths in llava/data/datasets_mixture.py and use the script scripts/train/sft_8frames.sh to launch the training.
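A minimal sketch of fetching the pretrained checkpoint and launching training; the local checkpoint directory is an assumption, and sft_8frames.sh may expect arguments (e.g., an output path), so check the script:

```bash
# Fetch the pretrained checkpoint to start SFT from (local directory is an assumption).
huggingface-cli download a8cheng/navila-siglip-llama3-8b-v1.5-pretrain \
    --local-dir checkpoints/navila-siglip-llama3-8b-v1.5-pretrain

# After editing the dataset paths in llava/data/datasets_mixture.py, launch fine-tuning.
bash scripts/train/sft_8frames.sh
```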
This repository builds on VLN-CE, which relies on older versions of Habitat-Lab and Habitat-Sim. The installation process requires several modifications and can be complex.
- Create a Conda Environment with Python 3.10
conda create -n navila-eval python=3.10
conda activate navila-eval

- Build Habitat-Sim & Lab (v0.1.7) from Source
Follow the VLN-CE setup guide. To resolve NumPy compatibility issues, apply the following hotfix:
python evaluation/scripts/habitat_sim_autofix.py # replace habitat_sim/utils/common.py

- Install VLN-CE Dependencies
pip install -r evaluation/requirements.txt

- Install VILA Dependencies
# Install FlashAttention2
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# Install VILA (assuming you are in the repo root)
pip install -e .
pip install -e ".[train]"
pip install -e ".[eval]"
# Install HF's Transformers
pip install git+https://github.com/huggingface/[email protected]
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
cp -rv ./llava/train/deepspeed_replace/* $site_pkg_path/deepspeed/

- Fix WebDataset Version for VLN-CE Compatibility
pip install webdataset==0.1.103

Please follow VLN-CE and download the R2R and RxR annotations, as well as the scene data, into the evaluation/data folder. The data should be structured as follows:
data/datasets
├─ RxR_VLNCE_v0
|   ├─ train
|   |    ├─ train_guide.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen_guide.json.gz
|   |    ├─ ...
|   ├─ ...
├─ R2R_VLNCE_v1-3_preprocessed
|   ├─ train
|   |    ├─ train.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen.json.gz
|   |    ├─ ...
data/scene_datasets
├─ mp3d
|   ├─ 17DRP5sb8fy
|   |    ├─ 17DRP5sb8fy.glb
|   |    ├─ ...
|   ├─ ...

- Download the checkpoint from a8cheng/navila-llama3-8b-8f.
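A minimal sketch of that download with huggingface-cli (the local directory is an assumption):

```bash
# Download the evaluation checkpoint (target directory is an assumption).
huggingface-cli download a8cheng/navila-llama3-8b-8f --local-dir checkpoints/navila-llama3-8b-8f
```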
- Run evaluation on R2R using:
cd evaluation
bash scripts/eval/r2r.sh CKPT_PATH NUM_CHUNKS CHUNK_START_IDX "GPU_IDS"

Examples:
- Single GPU:
bash scripts/eval/r2r.sh CKPT_PATH 1 0 "0"
- Multiple GPUs (e.g., 8 GPUs):
bash scripts/eval/r2r.sh CKPT_PATH 8 0 "0,1,2,3,4,5,6,7"
- Visualized videos are saved in ./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen/videos
- Aggregate the per-chunk results with:
python scripts/eval_jsons.py ./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen NUM_CHUNKS

If you find NaVILA useful, please cite:
@inproceedings{cheng2025navila,
        title={NaVILA: Legged Robot Vision-Language-Action Model for Navigation},
        author={Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Gongye, Zaitian and Zou, Xueyan and Kautz, Jan and B{\i}y{\i}k, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
        booktitle={RSS},
        year={2025}
}



