
NaVILA: Legged Robot Vision-Language-Action Model for Navigation (RSS'25)

Website | arXiv | Hugging Face | Locomotion Code

💡 Introduction

NaVILA is a two-level framework that combines vision-language-action models (VLAs) with locomotion skills for navigation. The VLA generates high-level language-based commands, while a real-time locomotion policy executes them and ensures obstacle avoidance.

TODO

  • Release model weights and evaluation.
  • Release training code. (around June 30th)
  • Release YouTube Human Touring dataset. (around June 30th)
  • Release Isaac Sim evaluation (please see here).

🚀 Training

Installation

To build the environment for training NaVILA, please run the following:

./environment_setup.sh navila
conda activate navila

Optional: If you plan to use TensorBoard for logging, install tensorboardX via pip.
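
For example:

pip install tensorboardX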

Dataset

For general VQA datasets such as video_chatgpt, sharegpt_video, and sharegpt4v_sft, please follow the data preparation instructions in NVILA. We provide annotations for envdrop, scanqa, r2r, rxr, and human on Hugging Face. Please download the repository and extract the tar.gz files into their respective subfolders.
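
For reference, a minimal download sketch (the repository ID below is a placeholder; use the NaVILA annotation repository linked above):

# Placeholder repo ID -- substitute the actual Hugging Face annotation repository
huggingface-cli download <HF_ANNOTATION_REPO> --repo-type dataset --local-dir NaVILA-Dataset
# Unpack every tar.gz in place, keeping each archive inside its subfolder
find NaVILA-Dataset -name "*.tar.gz" -execdir tar -xzf {} \;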

  • YouTube Human Touring:
    Due to copyright restrictions, raw videos/images are not released. We provide the video IDs and annotations. You can download the videos with yt-dlp and extract frames with scripts/extract_rawframes.py (see the example commands after this list).

  • EnvDrop:
    Due to the large number of videos, we provide annotations only. Please download the R2R augmented split from R2R_VLNCE_v1-3_preprocessed.zip and render the corresponding videos using VLN-CE.
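
As a rough sketch, the touring videos can be downloaded and split into frames as follows. The arguments of scripts/extract_rawframes.py are an assumption here; check the script for its actual interface.

# Download each touring video listed in video_ids.txt
while read -r vid; do
  yt-dlp -f mp4 -o "NaVILA-Dataset/Human/videos/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=${vid}"
done < NaVILA-Dataset/Human/video_ids.txt

# Extract raw frames (argument names are illustrative, not the script's confirmed interface)
python scripts/extract_rawframes.py --video_dir NaVILA-Dataset/Human/videos --output_dir NaVILA-Dataset/Human/raw_frames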

The data should be structured as follows:

NaVILA-Dataset
├─ EnvDrop
|   ├─ videos
|   |    ├─ 1.mp4
|   |    ├─ ...
|   ├─ annotations.json
├─ Human
|   ├─ raw_frames
|   |    ├─ Aei0GpsWNys
|   |    |    ├─ 0001.jpg
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ videos
|   |    ├─ Aei0GpsWNys.mp4
|   |    ├─ ...
|   ├─ annotations.json
|   ├─ video_ids.txt
├─ R2R
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ RxR
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ ScanQA
|   ├─ videos
|   |    ├─ scene0760_00.mp4
|   |    ├─ ...
|   ├─ annotations
|   |    ├─ ScanQA_v1.0_train_reformat.json
|   |    ├─ ...

Training

The pretrained model to start from is provided at a8cheng/navila-siglip-llama3-8b-v1.5-pretrain. Please modify the data paths in llava/data/datasets_mixture.py and use the script scripts/train/sft_8frames.sh to launch training.
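
A minimal launch sketch (the script is assumed here to take no extra arguments; check scripts/train/sft_8frames.sh for the paths and options it actually expects):

conda activate navila
# Point each dataset entry in llava/data/datasets_mixture.py at your local NaVILA-Dataset copies first
bash scripts/train/sft_8frames.sh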

📊 Evaluation

Installation

This repository builds on VLN-CE, which relies on older versions of Habitat-Lab and Habitat-Sim. The installation process requires several modifications and can be complex.

  1. Create a Conda Environment with Python 3.10
conda create -n navila-eval python=3.10
conda activate navila-eval
  2. Build Habitat-Sim & Habitat-Lab (v0.1.7) from Source

Follow the VLN-CE setup guide. To resolve NumPy compatibility issues, apply the following hotfix:

python evaluation/scripts/habitat_sim_autofix.py # replace habitat_sim/utils/common.py
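
For reference, a sketch of the source build that precedes this hotfix, following the VLN-CE guide and assuming a headless machine (adjust flags to your setup):

# Habitat-Sim v0.1.7 (headless build from source)
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim && pip install -r requirements.txt && python setup.py install --headless && cd ..

# Habitat-Lab v0.1.7
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab && python setup.py develop --all && cd ..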
  3. Install VLN-CE Dependencies
pip install -r evaluation/requirements.txt
  4. Install VILA Dependencies
# Install FlashAttention2
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Install VILA (assuming you are in the repository root)
pip install -e .
pip install -e ".[train]"
pip install -e ".[eval]"

# Install HF's Transformers
pip install git+https://github.com/huggingface/[email protected]
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
cp -rv ./llava/train/deepspeed_replace/* $site_pkg_path/deepspeed/
  5. Fix WebDataset Version for VLN-CE Compatibility
pip install webdataset==0.1.103

Data

Please follow VLN-CE to download the R2R and RxR annotations and the scene data into the evaluation/data folder. The data should be structured as follows:

data/datasets
├─ RxR_VLNCE_v0
|   ├─ train
|   |    ├─ train_guide.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen_guide.json.gz
|   |    ├─ ...
|   ├─ ...
├─ R2R_VLNCE_v1-3_preprocessed
|   ├─ train
|   |    ├─ train.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen.json.gz
|   |    ├─ ...
data/scene_datasets
├─ mp3d
|   ├─ 17DRP5sb8fy
|   |    ├─ 17DRP5sb8fy.glb
|   |    ├─ ...
|   ├─ ...
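
As an illustrative placement sketch (the RxR archive name is a placeholder; obtain the actual downloads by following the VLN-CE instructions and the official Matterport3D download script):

cd evaluation
mkdir -p data/datasets data/scene_datasets/mp3d
# Episode annotations
unzip R2R_VLNCE_v1-3_preprocessed.zip -d data/datasets/
unzip RxR_VLNCE_v0.zip -d data/datasets/   # placeholder archive name
# Matterport3D scenes go under data/scene_datasets/mp3d/<scene_id>/<scene_id>.glb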

Running Evaluation

  1. Download the checkpoint from a8cheng/navila-llama3-8b-8f.
  2. Run evaluation on R2R using:
cd evaluation
bash scripts/eval/r2r.sh CKPT_PATH NUM_CHUNKS CHUNK_START_IDX "GPU_IDS"

Examples:

  • Single GPU:
    bash scripts/eval/r2r.sh CKPT_PATH 1 0 "0"
  • Multiple GPUs (e.g., 8 GPUs):
    bash scripts/eval/r2r.sh CKPT_PATH 8 0 "0,1,2,3,4,5,6,7"
  3. Visualized videos are saved in:
./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen/videos

  4. Aggregate results and view the scores:
python scripts/eval_jsons.py ./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen NUM_CHUNKS

📜 Citation

@inproceedings{cheng2025navila,
        title={Navila: Legged robot vision-language-action model for navigation},
        author={Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Gongye, Zaitian and Zou, Xueyan and Kautz, Jan and B{\i}y{\i}k, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
        booktitle={RSS},
        year={2025}
}
