This is the official repo for the paper: "MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training".
- [2024-09-20]: To better reflect the generality of our proposed method, we have renamed it RagVL.
- [2024-08-05]: Code of RagVL (RagLLaVA) released.
- [2024-07-31]: Paper of RagVL (RagLLaVA) available online.
The libraries required to run RagVL are listed in requirements.txt. We recommend following LLaVA's setup instructions to configure your environment.
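A minimal setup sketch, assuming a conda-based workflow similar to LLaVA's (the environment name and Python version below are illustrative, not prescribed by the repo):

```bash
# Create and activate an environment, then install the pinned dependencies.
conda create -n ragvl python=3.10 -y
conda activate ragvl
pip install -r requirements.txt
```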
Before running RagVL, please:
- Download the datasets and checkpoints from Google Drive.
- Download the image files from WebQA and MultimodalQA.
- Unzip the file. Place `checkpoints/` and `datasets/` into `RagVL/`.
- Place `tasks/` into `RagVL/finetune/`.
- Place `MMQA_imgs/` and `train_img/` into `RagVL/finetune/tasks/`.
- Place `val_image/` into `RagVL/datasets/` (the resulting layout is sketched below).
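After these steps, the relevant part of the directory tree should look roughly like this:

```
RagVL/
├── checkpoints/
├── datasets/
│   └── val_image/
└── finetune/
    └── tasks/
        ├── MMQA_imgs/
        └── train_img/
```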
- Reranker
| Models | Global Batch Size | Epochs |
|---|---|---|
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 1 (others) |
| Qwen-VL-Chat | 16 | 2 (WebQA) / 1 (others) |
| mPLUG-Owl2 | 16 | 2 (WebQA) / 1 (others) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
- Generator
| Models | Global Batch Size | Epochs |
|---|---|---|
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 3 (MMQA) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
Apart from the two hyperparameters above, all other settings follow the defaults of the respective models.
To finetune LLaVA-v1.5-13B, Qwen-VL-Chat, and mPLUG-Owl2, find the corresponding finetune script in RagVL/finetune/scripts/.
To finetune InternVL2-1B and InternVL2-2B, find the corresponding finetune script in RagVL/internvl_chat/shell/internvl2.0/2nd_finetune.
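For example, a finetuning run might be launched as follows. This is only a sketch: the script names are hypothetical placeholders, so substitute the actual script for your model and task from the directories above, and keep the global batch size and epochs consistent with the tables.

```bash
cd RagVL

# LLaVA-v1.5-13B / Qwen-VL-Chat / mPLUG-Owl2 (hypothetical script name):
bash finetune/scripts/finetune_reranker_webqa.sh

# InternVL2-1B / InternVL2-2B (hypothetical script name):
bash internvl_chat/shell/internvl2.0/2nd_finetune/finetune_internvl2_1b.sh
```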
To evaluate RagVL on WebQA / MultimodalQA, you can use the following command:

```bash
# mmqa_pipeline.py takes the same arguments.
# --reranker_model  : select the reranker
# --generator_model : select the generator
# --filter          : select the adaptive threshold
# --clip_topk       : number of candidates retrieved first (20 by default)
python webqa_pipeline.py \
    --reranker_model caption_lora \
    --generator_model noise_injected_lora \
    --filter 0 \
    --clip_topk 20
```
To evaluate the oracle settings on WebQA / MultimodalQA, you can use the following command:

```bash
# mmqa_oracle.py takes the same arguments.
python webqa_oracle.py
```
If you find this work interesting or inspiring, you can cite us with:
```bibtex
@article{chen2024mllm,
  title={MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training},
  author={Chen, Zhanpeng and Xu, Chengjin and Qi, Yiyan and Guo, Jian},
  journal={arXiv preprint arXiv:2407.21439},
  year={2024}
}
```

- LLaVA: Large Language and Vision Assistant
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
- InternVL: A Pioneering Open-Source Alternative to GPT-4o
- Visualized BGE: A universal multi-modal embedding model
- VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- CAL: Prioritizing Visual Correlation by Contrastive Alignment
