
NaVILA: Legged Robot Vision-Language-Action Model for Navigation (RSS'25)

Website | arXiv | Hugging Face | Locomotion Code

💡 Introduction

NaVILA is a two-level framework that combines vision-language-action models (VLAs) with locomotion skills for navigation. The VLA generates high-level language-based commands, while a real-time locomotion policy executes them and ensures obstacle avoidance.

TODO

  • Release model weights and evaluation.
  • Release training code. (around June 30th)
  • Release YouTube Human Touring dataset. (around June 30th)
  • Release Isaac Sim evaluation (please see here).

🚀 Training

Installation

To build the environment for training NaVILA, please run the following:

./environment_setup.sh navila
conda activate navila

Optional: If you plan to use TensorBoard for logging, install tensorboardX via pip.
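
For example:

pip install tensorboardX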

Dataset

For general VQA datasets such as video_chatgpt, sharegpt_video, and sharegpt4v_sft, please follow the data preparation instructions in NVILA. We provide annotations for envdrop, scanqa, r2r, rxr, and human on Hugging Face. Please download the repository and extract the tar.gz files into their respective subfolders.
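
For reference, a minimal download sketch (the repository ID below is a placeholder; use the NaVILA annotation repository linked above):

# Placeholder repo ID -- substitute the actual Hugging Face annotation repository
huggingface-cli download <HF_ANNOTATION_REPO> --repo-type dataset --local-dir NaVILA-Dataset
# Unpack every tar.gz in place, keeping each archive inside its subfolder
find NaVILA-Dataset -name "*.tar.gz" -execdir tar -xzf {} \;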

  • YouTube Human Touring:
    Due to copyright restrictions, raw videos/images are not released. We provide the video IDs and annotations. You can download the videos with yt-dlp and extract frames with scripts/extract_rawframes.py (see the example commands after this list).

  • EnvDrop:
    Due to the large number of videos, we provide annotations only. Please download the R2R augmented split from R2R_VLNCE_v1-3_preprocessed.zip and render the corresponding videos using VLN-CE.
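
As a rough sketch, the touring videos can be downloaded and split into frames as follows. The arguments of scripts/extract_rawframes.py are an assumption here; check the script for its actual interface.

# Download each touring video listed in video_ids.txt
while read -r vid; do
  yt-dlp -f mp4 -o "NaVILA-Dataset/Human/videos/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=${vid}"
done < NaVILA-Dataset/Human/video_ids.txt

# Extract raw frames (argument names are illustrative, not the script's confirmed interface)
python scripts/extract_rawframes.py --video_dir NaVILA-Dataset/Human/videos --output_dir NaVILA-Dataset/Human/raw_frames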

The data should be structured as follows:

NaVILA-Dataset
├─ EnvDrop
|   ├─ videos
|   |    ├─ 1.mp4
|   |    ├─ ...
|   ├─ annotations.json
├─ Human
|   ├─ raw_frames
|   |    ├─ Aei0GpsWNys
|   |    |    ├─ 0001.jpg
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ videos
|   |    ├─ Aei0GpsWNys.mp4
|   |    ├─ ...
|   ├─ annotations.json
|   ├─ video_ids.txt
├─ R2R
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ RxR
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ ScanQA
|   ├─ videos
|   |    ├─ scene0760_00.mp4
|   |    ├─ ...
|   ├─ annotations
|   |    ├─ ScanQA_v1.0_train_reformat.json
|   |    ├─ ...

Training

The pretrained model to start from is provided at a8cheng/navila-siglip-llama3-8b-v1.5-pretrain. Please modify the data paths in llava/data/datasets_mixture.py and use the script scripts/train/sft_8frames.sh to launch training.
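
A minimal launch sketch (the script is assumed here to take no extra arguments; check scripts/train/sft_8frames.sh for the paths and options it actually expects):

conda activate navila
# Point each dataset entry in llava/data/datasets_mixture.py at your local NaVILA-Dataset copies first
bash scripts/train/sft_8frames.sh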

📊 Evaluation

Installation

This repository builds on VLN-CE, which relies on older versions of Habitat-Lab and Habitat-Sim. The installation process requires several modifications and can be complex.

  1. Create a Conda Environment with Python 3.10
conda create -n navila-eval python=3.10
conda activate navila-eval
  2. Build Habitat-Sim & Habitat-Lab (v0.1.7) from Source

Follow the VLN-CE setup guide. To resolve NumPy compatibility issues, apply the following hotfix:

python evaluation/scripts/habitat_sim_autofix.py # replace habitat_sim/utils/common.py
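
For reference, a sketch of the source build that precedes this hotfix, following the VLN-CE guide and assuming a headless machine (adjust flags to your setup):

# Habitat-Sim v0.1.7 (headless build from source)
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim && pip install -r requirements.txt && python setup.py install --headless && cd ..

# Habitat-Lab v0.1.7
git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab && python setup.py develop --all && cd ..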
  3. Install VLN-CE Dependencies
pip install -r evaluation/requirements.txt
  4. Install VILA Dependencies
# Install FlashAttention2
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Install VILA (assuming you are in the repository root)
pip install -e .
pip install -e ".[train]"
pip install -e ".[eval]"

# Install HF's Transformers
pip install git+https://github.com/huggingface/[email protected]
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
cp -rv ./llava/train/deepspeed_replace/* $site_pkg_path/deepspeed/
  5. Fix WebDataset Version for VLN-CE Compatibility
pip install webdataset==0.1.103

Data

Please follow VLN-CE to download the R2R and RxR annotations and the scene data into the evaluation/data folder. The data should be structured as follows:

data/datasets
├─ RxR_VLNCE_v0
|   ├─ train
|   |    ├─ train_guide.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen_guide.json.gz
|   |    ├─ ...
|   ├─ ...
├─ R2R_VLNCE_v1-3_preprocessed
|   ├─ train
|   |    ├─ train.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen.json.gz
|   |    ├─ ...
data/scene_datasets
├─ mp3d
|   ├─ 17DRP5sb8fy
|   |    ├─ 17DRP5sb8fy.glb
|   |    ├─ ...
|   ├─ ...
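
As an illustrative placement sketch (the RxR archive name is a placeholder; obtain the actual downloads by following the VLN-CE instructions and the official Matterport3D download script):

cd evaluation
mkdir -p data/datasets data/scene_datasets/mp3d
# Episode annotations
unzip R2R_VLNCE_v1-3_preprocessed.zip -d data/datasets/
unzip RxR_VLNCE_v0.zip -d data/datasets/   # placeholder archive name
# Matterport3D scenes go under data/scene_datasets/mp3d/<scene_id>/<scene_id>.glb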

Running Evaluation

  1. Download the checkpoint from a8cheng/navila-llama3-8b-8f.
  2. Run evaluation on R2R using:
cd evaluation
bash scripts/eval/r2r.sh CKPT_PATH NUM_CHUNKS CHUNK_START_IDX "GPU_IDS"

Examples:

  • Single GPU:
    bash scripts/eval/r2r.sh CKPT_PATH 1 0 "0"
  • Multiple GPUs (e.g., 8 GPUs):
    bash scripts/eval/r2r.sh CKPT_PATH 8 0 "0,1,2,3,4,5,6,7"
  3. Visualized videos are saved in:
./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen/videos

  4. Aggregate results and view the scores:
python scripts/eval_jsons.py ./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen NUM_CHUNKS

📜 Citation

@inproceedings{cheng2025navila,
        title={Navila: Legged robot vision-language-action model for navigation},
        author={Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Gongye, Zaitian and Zou, Xueyan and Kautz, Jan and B{\i}y{\i}k, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
        booktitle={RSS},
        year={2025}
}
