
Conversation

@paknikolai commented Jul 15, 2025

Description of changes

Added ColPali retriever:

  • added 3 models: vidore/colSmol-256M, vidore/colpali-v1.3, vidore/colqwen2-v1.0
  • added downloading of the model weights while building the docker image
  • added 2 more pools to process the ColPali model (1 for queries, 1 for images) so that it does not block the other, lighter embedding processing; according to measurements it is possible to run query processing + image processing in parallel without memory issues (see the sketch after this list).
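A rough picture of the two dedicated pools (an illustrative sketch only; the pool and function names are hypothetical, and the processor calls assume the colpali-engine API):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names; the actual wiring in the PR may differ.
colpali_query_pool = ThreadPoolExecutor(max_workers=1)
colpali_image_pool = ThreadPoolExecutor(max_workers=1)


async def embed_query_and_images(model, processor, query, images):
    loop = asyncio.get_running_loop()
    # Each call runs in its own dedicated pool, so the heavy ColPali work stays
    # off the pool used for the lighter embedding models, while query and image
    # embedding still run in parallel (which fits into GPU memory per the
    # measurements below).
    query_task = loop.run_in_executor(
        colpali_query_pool,
        lambda: model(**processor.process_queries([query]).to(model.device)),
    )
    image_task = loop.run_in_executor(
        colpali_image_pool,
        lambda: model(**processor.process_images(images).to(model.device)),
    )
    return await asyncio.gather(query_task, image_task)
```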

Added unit tests for the retriever and an end-to-end test:

  • test for retrieving the exact page from a document
  • test for e2e processing
  • tests for requests that run query and image processing in parallel

Added a separate Dockerfile that builds an image on top of the ai-dial-rag image and bakes the saved model weights for the specified model into the docker image (a download-script sketch follows below).
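The model download at build time can be just a small script invoked from that Dockerfile; a minimal sketch assuming huggingface_hub (the script name and paths are illustrative):

```python
# download_model.py -- hypothetical helper invoked from the extra Dockerfile,
# e.g. `RUN python download_model.py vidore/colpali-v1.3 /models/colpali-v1.3`.
import sys

from huggingface_hub import snapshot_download

if __name__ == "__main__":
    model_name, target_dir = sys.argv[1], sys.argv[2]
    # Download the full model repository at build time so the container can
    # later load the weights from the local folder without network access.
    snapshot_download(repo_id=model_name, local_dir=target_dir)
```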

Some measurements of memory consumption on a T4 (in Colab); batch size 8 seems to be the best option time/memory-wise. A sketch of how such measurements can be taken follows the tables.

| Model | Type | Batch | Time (ms) | Time/Unit (ms) | Peak VRAM (GB) |
|---|---|---|---|---|---|
| vidore/colSmol-256M | Image | 8 | 2508.89 | 313.61 | 3.99 |
| vidore/colpali-v1.3 | Image | 8 | 2693.39 | 336.67 | 7.42 |
| vidore/colqwen2-v1.0 | Image | 8 | 15800.93 | 1975.12 | 6.45 |

| Model | Type | Batch | Time (ms) | Time/Unit (ms) | Peak VRAM (GB) |
|---|---|---|---|---|---|
| vidore/colSmol-256M | Image | 1 | 341.87 | 341.87 | 0.89 |
| vidore/colSmol-256M | Image | 2 | 640.58 | 320.29 | 1.34 |
| vidore/colSmol-256M | Image | 4 | 1310.05 | 327.51 | 2.22 |
| vidore/colSmol-256M | Image | 8 | 2508.89 | 313.61 | 3.99 |

Query + images parallel run

| Metric | colSmol-256M | colpali-v1.3 | colqwen2-v1.0 |
|---|---|---|---|
| Model Memory (Post-load) | 0.45 GB | 5.52 GB | 4.18 GB |
| Peak Memory (PyTorch) | 4.00 GB | 7.45 GB | 6.48 GB |
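The numbers above can be measured roughly like this (a sketch, not the exact benchmarking code; it assumes a loaded ColPali model and its colpali-engine processor):

```python
import time

import torch


def benchmark_image_batch(model, processor, images, batch_size=8):
    # Embed one batch of page images and record wall time plus peak VRAM.
    batch = processor.process_images(images[:batch_size]).to(model.device)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(**batch)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return elapsed_ms, elapsed_ms / batch_size, peak_gb
```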

Model Optimization Experiments

Time was measured with vidore-benchmark, batch size 4 or 8.

0. PyTorch models

All PyTorch models were run on GPU with float16.
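For reference, loading one of the models in this setup looks roughly like this (a sketch assuming the colpali-engine API; the exact loading code in the PR may differ):

```python
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

# Load vidore/colpali-v1.3 in float16 on the GPU, matching the benchmark setup.
model = ColPali.from_pretrained(
    "vidore/colpali-v1.3",
    torch_dtype=torch.float16,
    device_map="cuda:0",
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3")
```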

1. ONNX Export

Overview

ONNX export was attempted to optimize model inference performance by converting the PyTorch models.

Export Results

  • vidore/colpali-v1.3 → Exportable with minor tweaks (kwargs → args, see the sketch below).
  • vidore/colqwen2-v1.0 → Exportable with more extensive tweaks (argument fixes, some code rewrites to fix graph tracing and enable export).
  • vidore/colSmol-256M → Not successful (the graph could not be traced; skipped to focus on the colpali and colqwen models).
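The kwargs → args tweak is essentially a thin wrapper that exposes a positional-args forward so the graph can be traced; a rough sketch of the export (illustrative only, not the exact export script; the wrapper name and the chosen input set are assumptions):

```python
import torch


class ColPaliExportWrapper(torch.nn.Module):
    """Hypothetical wrapper: turns the kwargs-based forward into a
    positional-args forward that torch.onnx.export can trace."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
        )


# model / processor / images as in the earlier loading sketch.
batch = processor.process_images(images).to(model.device)
torch.onnx.export(
    ColPaliExportWrapper(model),
    (batch["input_ids"], batch["attention_mask"], batch["pixel_values"]),
    "colpali_image.onnx",
    input_names=["input_ids", "attention_mask", "pixel_values"],
    output_names=["embeddings"],
    dynamic_axes={name: {0: "batch"} for name in
                  ["input_ids", "attention_mask", "pixel_values"]},
)
```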

Performance (T4 & L4 GPUs)

  • Small input (query) → 1.5–2× faster than PyTorch.
  • Large input (image) → ~20% slower due to memory allocations.

Optimizations Tried

  • Replacing CPU-only nodes with GPU nodes → no big improvement, no change in memory allocation time.
  • Porting attention from PyTorch as separate ops fused into kernels → the time spent on attention wasn't the bottleneck, so it didn't speed up inference much.
  • INT8 quantization on ONNX → no noticeable speedup; works a bit slower.
  • Different graph optimization settings → didn't result in much difference (see the session sketch below).
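For instance, the session setup for the graph optimization experiments looked roughly like this (a sketch; the exact session options may have differed):

```python
import onnxruntime as ort

# CUDA execution provider with the highest graph optimization level; other
# optimization levels gave no significant difference.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    "colpali_image.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```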

TensorRT

  • ONNX → TensorRT, ONNX with the TensorRT backend, and Torch-TensorRT all failed (unsupported operations).

2. Quantization (Original PyTorch Models)

  • Tried libraries: Quanto, Eetq, Hqq, TorchAo.
  • 4-bit → 4–10× slower (no GPU backend support on T4).
  • INT8 → 5–10% slower than the original PyTorch for all 3 models (see the Quanto sketch below).
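As an example, the INT8 path with Quanto is roughly the following (a sketch assuming the optimum-quanto API; the other libraries were tried in an analogous way):

```python
from optimum.quanto import freeze, qint8, quantize

# Quantize the weights of the already-loaded float16 model to INT8 and freeze
# them; in these experiments this came out 5-10% slower than the PyTorch baseline.
quantize(model, weights=qint8)
freeze(model)
```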

3. torch.compile

Backends Tested

cudagraphs, onnxrt, openxla, tvm, inductor

  • colpali & colsmol → Slightly faster for image processing, but text processing up to 10× slower.
  • colqwen → mostly failed; the runs that succeeded were 2–10× slower, and dynamic=False image inputs caused frequent recompilations (see the sketch below).
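One of the tried configurations, as a sketch (inductor backend with static shapes; `model` and `batch` as in the earlier sketches):

```python
import torch

# With dynamic=False every new input shape triggers a recompilation, which is
# what made the colqwen image inputs recompile frequently.
compiled_model = torch.compile(model, backend="inductor", dynamic=False)
with torch.no_grad():
    embeddings = compiled_model(**batch)
```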

@paknikolai requested a review from Allob as a code owner July 15, 2025 11:13

return self._run_model(inputs)

def calculate_images_embeddings(self, images: List[str]) -> List[Tensor]:
Collaborator

images: List[str] - is it actually str or pil image?

Author

Yes, a list of PIL images; corrected that.

colpali_index_config is not None
and colpali_index_config.enabled
and model_resource_config is not None
):
Collaborator

If one of these fields is not set, we will get a created instance of ColpaliModelResource which is not functional.

And we have to call _check_model_processor_device_is_set every time we need to use it, because it can be in 2 states: a valid one and an invalid one.

The code will be easier to read and use if the ColpaliModelResource could have only a valid state. I.e. if the ColpaliModelResourceConfig is not set, we can just not create an instance of ColpaliModelResource at all (this can be done in app.py). And if the ColpaliModelResourceConfig is set, we should assume that the resource is in a valid state to use when needed (otherwise it should throw an exception from the constructor if something goes wrong).

Author

Removed the invalid state and simplified the logic: now the ColpaliModelResource itself can be either None or valid, depending on the ColpaliModelResourceConfig which is set from the app config on start (roughly as in the sketch below).
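Roughly, the resulting pattern is the following (an illustrative sketch, not the actual code; create_colpali_model_resource is a hypothetical helper name):

```python
from typing import Optional


def create_colpali_model_resource(
    config: Optional[ColpaliModelResourceConfig],
) -> Optional[ColpaliModelResource]:
    # Called once on app start: either a fully functional resource is created,
    # or None when ColPali is not configured; there is no half-initialized state.
    if config is None:
        return None
    return ColpaliModelResource(config)  # the constructor raises if loading fails
```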

# if both are set then we can load model
if (
colpali_index_config is not None
and colpali_index_config.enabled
Collaborator

colpali_index_config and colpali_index_config.enabled values come from the part of the config which could be overridden on per-request basis: https://github.com/epam/ai-dial-rag/pull/25/files#diff-50338faca97bed0430a4330eb9445199dd51a3c636bf95bee711680786b234f4R211

I.e. if the default value for colpali_index_config.enabled is False and someone enables it in a particular request, the request will fail, because the ColpaliModelResource will not be in a valid usable state.

Author

Indeed. I have now removed the dependency on it from colpali_model_resource; it now depends only on ColpaliModelResourceConfig, it creates the model every time and its fields cannot be None anymore, but the ColpaliModelResource itself can now be None.

"""
# Patch create_retriever to use only ColPali retriever
with patch(
"aidial_rag.retrieval_chain.create_retriever", new=mock_create_retriever
Collaborator

I do not understand why we need to mock create_retriever here. The run_e2e_test function already replaces the ColpaliModelResource with CachedColpaliModelResource. Shouldn't it be enough to get the responses from cache?

Author

I have simplified mock_create_retriever; the ColPali model resource is now taken from the arguments, and it should be the CachedColpaliModelResource that was set in the app.

There is a problem with create_retriever: if the pdf is small, it will use AllDocumentsRetriever instead, and if the pdf is big enough, it will add the semantic retriever anyway, so here I explicitly return only the ColPali retriever (simplified sketch below).
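Roughly along these lines (a simplified sketch; the argument names and the retriever constructor are illustrative, not the exact signatures):

```python
def mock_create_retriever(model_resource, document_records, *args, **kwargs):
    # Bypass the size-based retriever selection (AllDocumentsRetriever for small
    # pdfs, semantic retriever added for big ones) and always return only the
    # ColPali retriever for the test.
    return ColpaliRetriever(model_resource, document_records)
```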

cached_app_class = create_cached_app_config()

# Patch the app creation to use cached model resource
with patch("aidial_rag.app.DialRAGApplication", new=cached_app_class):
Collaborator

We can just add a parameter to DialRAGApplication that would allow injecting the ColpaliModelResource dependency to be used here, instead of patching.

Author

Removed the cached app class.

Passing the class to be used as a parameter doesn't seem like a good decision, and adding the resource as a parameter to the app also seems odd.
I also tried replacing the resource instance after creation of DialRAGApplication, but getting its instance from the app is not that simple, so instead I just patched the class that is used when the ColPali resource is created, so it uses the cached version instead of the standard one (roughly as in the sketch below).
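I.e. the approach is roughly the following (a sketch; the exact import path of the patched class may differ in the test code, and app_config stands for whatever the application is normally constructed with):

```python
from unittest.mock import patch

# Replace the class used to create the ColPali resource with the cached test
# double for the duration of the e2e test.
with patch(
    "aidial_rag.app.ColpaliModelResource", new=CachedColpaliModelResource
):
    app = DialRAGApplication(app_config)  # the resource is created as the cached version
```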

Collaborator

Looks like you forgot to include the name of the file with the tests in the path. This may cause collisions for tests with the same name in different files.
See tests/cache/test_colpali_retriever/test_colpali_retrieval_e2e/1adc73f3b7d87d911383d346295d6d66.response.

Author

Moved it to a separate folder, but these embeddings are for the one pdf which I use in the tests, and it is shared, so I can't use a test name in it.

local_files_only=model_path
is not None, # if set use only local files from folder
).eval()
processor = processor_class.from_pretrained(model_name)
Collaborator

Why is the cache_dir not needed for processor_class.from_pretrained?

Author

Here the cache dir has the base weights for the image feature extractor and the LLM, while all of the processor's configs are in the model folder which was downloaded using snapshot_download, so the cache is not needed in this case.
