
Conversation

@paknikolai commented Jul 15, 2025

Description of changes

Added ColPali retriever:

  • added 3 models: vidore/colSmol-256M, vidore/colpali-v1.3, vidore/colqwen2-v1.0
  • added downloading of the model weights while building the docker image
  • added 2 more pools to process the ColPali model (1 for queries, 1 for images) so that it does not block the other, lighter embedding processing; according to measurements it is possible to run query processing + image processing in parallel without memory issues (see the sketch after this list).
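A rough picture of the two dedicated pools (an illustrative sketch only; the pool and function names are hypothetical, and the processor calls assume the colpali-engine API):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names; the actual wiring in the PR may differ.
colpali_query_pool = ThreadPoolExecutor(max_workers=1)
colpali_image_pool = ThreadPoolExecutor(max_workers=1)


async def embed_query_and_images(model, processor, query, images):
    loop = asyncio.get_running_loop()
    # Each call runs in its own dedicated pool, so the heavy ColPali work stays
    # off the pool used for the lighter embedding models, while query and image
    # embedding still run in parallel (which fits into GPU memory per the
    # measurements below).
    query_task = loop.run_in_executor(
        colpali_query_pool,
        lambda: model(**processor.process_queries([query]).to(model.device)),
    )
    image_task = loop.run_in_executor(
        colpali_image_pool,
        lambda: model(**processor.process_images(images).to(model.device)),
    )
    return await asyncio.gather(query_task, image_task)
```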

Added unit tests for the retriever and an end-to-end test:

  • test for retrieving the exact page from a document
  • test for e2e processing
  • tests for requests that run query and image processing in parallel

Added a separate Dockerfile that builds an image on top of the ai-dial-rag image and bakes the saved model weights for the specified model into the docker image (a download-script sketch follows below).
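The model download at build time can be just a small script invoked from that Dockerfile; a minimal sketch assuming huggingface_hub (the script name and paths are illustrative):

```python
# download_model.py -- hypothetical helper invoked from the extra Dockerfile,
# e.g. `RUN python download_model.py vidore/colpali-v1.3 /models/colpali-v1.3`.
import sys

from huggingface_hub import snapshot_download

if __name__ == "__main__":
    model_name, target_dir = sys.argv[1], sys.argv[2]
    # Download the full model repository at build time so the container can
    # later load the weights from the local folder without network access.
    snapshot_download(repo_id=model_name, local_dir=target_dir)
```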

Some measurements of memory consumption on a T4 (in Colab); batch size 8 seems to be the best option time/memory-wise. A sketch of how such measurements can be taken follows the tables.

| Model | Type | Batch | Time (ms) | Time/Unit (ms) | Peak VRAM (GB) |
|---|---|---|---|---|---|
| vidore/colSmol-256M | Image | 8 | 2508.89 | 313.61 | 3.99 |
| vidore/colpali-v1.3 | Image | 8 | 2693.39 | 336.67 | 7.42 |
| vidore/colqwen2-v1.0 | Image | 8 | 15800.93 | 1975.12 | 6.45 |

| Model | Type | Batch | Time (ms) | Time/Unit (ms) | Peak VRAM (GB) |
|---|---|---|---|---|---|
| vidore/colSmol-256M | Image | 1 | 341.87 | 341.87 | 0.89 |
| vidore/colSmol-256M | Image | 2 | 640.58 | 320.29 | 1.34 |
| vidore/colSmol-256M | Image | 4 | 1310.05 | 327.51 | 2.22 |
| vidore/colSmol-256M | Image | 8 | 2508.89 | 313.61 | 3.99 |

Query + images parallel run

| Metric | colSmol-256M | colpali-v1.3 | colqwen2-v1.0 |
|---|---|---|---|
| Model Memory (Post-load) | 0.45 GB | 5.52 GB | 4.18 GB |
| Peak Memory (PyTorch) | 4.00 GB | 7.45 GB | 6.48 GB |
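The numbers above can be measured roughly like this (a sketch, not the exact benchmarking code; it assumes a loaded ColPali model and its colpali-engine processor):

```python
import time

import torch


def benchmark_image_batch(model, processor, images, batch_size=8):
    # Embed one batch of page images and record wall time plus peak VRAM.
    batch = processor.process_images(images[:batch_size]).to(model.device)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(**batch)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return elapsed_ms, elapsed_ms / batch_size, peak_gb
```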

Model Optimization Experiments

Time was measured with vidore-benchmark, batch size 4 or 8.

0. PyTorch models

All PyTorch models were run on GPU with float16.
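For reference, loading one of the models in this setup looks roughly like this (a sketch assuming the colpali-engine API; the exact loading code in the PR may differ):

```python
import torch
from colpali_engine.models import ColPali, ColPaliProcessor

# Load vidore/colpali-v1.3 in float16 on the GPU, matching the benchmark setup.
model = ColPali.from_pretrained(
    "vidore/colpali-v1.3",
    torch_dtype=torch.float16,
    device_map="cuda:0",
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.3")
```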

1. ONNX Export

Overview

ONNX export was attempted to optimize model inference performance by converting the PyTorch models.

Export Results

  • vidore/colpali-v1.3 → Exportable with minor tweaks (kwargs → args, see the sketch below).
  • vidore/colqwen2-v1.0 → Exportable with more extensive tweaks (argument fixes, some code rewrites to fix graph tracing and enable export).
  • vidore/colSmol-256M → Not successful (the graph could not be traced; skipped to focus on the colpali and colqwen models).
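The kwargs → args tweak is essentially a thin wrapper that exposes a positional-args forward so the graph can be traced; a rough sketch of the export (illustrative only, not the exact export script; the wrapper name and the chosen input set are assumptions):

```python
import torch


class ColPaliExportWrapper(torch.nn.Module):
    """Hypothetical wrapper: turns the kwargs-based forward into a
    positional-args forward that torch.onnx.export can trace."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, pixel_values):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
        )


# model / processor / images as in the earlier loading sketch.
batch = processor.process_images(images).to(model.device)
torch.onnx.export(
    ColPaliExportWrapper(model),
    (batch["input_ids"], batch["attention_mask"], batch["pixel_values"]),
    "colpali_image.onnx",
    input_names=["input_ids", "attention_mask", "pixel_values"],
    output_names=["embeddings"],
    dynamic_axes={name: {0: "batch"} for name in
                  ["input_ids", "attention_mask", "pixel_values"]},
)
```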

Performance (T4 & L4 GPUs)

  • Small input (query) → 1.5–2× faster than PyTorch.
  • Large input (image) → ~20% slower due to memory allocations.

Optimizations Tried

  • Replacing CPU-only nodes with GPU nodes → no big improvement, no change in memory allocation time.
  • Porting attention from PyTorch as separate ops fused into kernels → the time spent on attention wasn't the bottleneck, so it didn't speed up inference much.
  • INT8 quantization on ONNX → no noticeable speedup; works a bit slower.
  • Different graph optimization settings → didn't result in much difference (see the session sketch below).
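For instance, the session setup for the graph optimization experiments looked roughly like this (a sketch; the exact session options may have differed):

```python
import onnxruntime as ort

# CUDA execution provider with the highest graph optimization level; other
# optimization levels gave no significant difference.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession(
    "colpali_image.onnx",
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```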

TensorRT

  • ONNX → TensorRT, ONNX with the TensorRT backend, and Torch-TensorRT all failed (unsupported operations).

2. Quantization (Original PyTorch Models)

  • Tried libraries: Quanto, Eetq, Hqq, TorchAo.
  • 4-bit → 4–10× slower (no GPU backend support on T4).
  • INT8 → 5–10% slower than the original PyTorch for all 3 models (see the Quanto sketch below).
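As an example, the INT8 path with Quanto is roughly the following (a sketch assuming the optimum-quanto API; the other libraries were tried in an analogous way):

```python
from optimum.quanto import freeze, qint8, quantize

# Quantize the weights of the already-loaded float16 model to INT8 and freeze
# them; in these experiments this came out 5-10% slower than the PyTorch baseline.
quantize(model, weights=qint8)
freeze(model)
```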

3. torch.compile

Backends Tested

cudagraphs, onnxrt, openxla, tvm, inductor

  • colpali & colsmol → Slightly faster for image processing, but text processing up to 10× slower.
  • colqwen → mostly failed; the runs that succeeded were 2–10× slower, and dynamic=False image inputs caused frequent recompilations (see the sketch below).
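One of the tried configurations, as a sketch (inductor backend with static shapes; `model` and `batch` as in the earlier sketches):

```python
import torch

# With dynamic=False every new input shape triggers a recompilation, which is
# what made the colqwen image inputs recompile frequently.
compiled_model = torch.compile(model, backend="inductor", dynamic=False)
with torch.no_grad():
    embeddings = compiled_model(**batch)
```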

@paknikolai requested a review from Allob as a code owner July 15, 2025 11:13

return self._run_model(inputs)

def calculate_images_embeddings(self, images: List[str]) -> List[Tensor]:
Collaborator

images: List[str] - is it actually str or pil image?

Author

Yes, a list of PIL images; corrected that.

colpali_index_config is not None
and colpali_index_config.enabled
and model_resource_config is not None
):
Collaborator

If one of these fields is not set, we will get a created instance of ColpaliModelResource which is not functional.

And we have to call _check_model_processor_device_is_set every time we need to use it, because it can be in 2 states: a valid one and an invalid one.

The code will be easier to read and use if the ColpaliModelResource could have only a valid state. I.e. if the ColpaliModelResourceConfig is not set, we can just not create an instance of ColpaliModelResource at all (this can be done in app.py). And if the ColpaliModelResourceConfig is set, we should assume that the resource is in a valid state to use when needed (otherwise it should throw an exception from the constructor if something goes wrong).

Author

Removed the invalid state and simplified the logic: now the ColpaliModelResource itself can be either None or valid, depending on the ColpaliModelResourceConfig which is set from the app config on start (roughly as in the sketch below).
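Roughly, the resulting pattern is the following (an illustrative sketch, not the actual code; create_colpali_model_resource is a hypothetical helper name):

```python
from typing import Optional


def create_colpali_model_resource(
    config: Optional[ColpaliModelResourceConfig],
) -> Optional[ColpaliModelResource]:
    # Called once on app start: either a fully functional resource is created,
    # or None when ColPali is not configured; there is no half-initialized state.
    if config is None:
        return None
    return ColpaliModelResource(config)  # the constructor raises if loading fails
```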

# if both are set then we can load model
if (
colpali_index_config is not None
and colpali_index_config.enabled
Collaborator

colpali_index_config and colpali_index_config.enabled values come from the part of the config which could be overridden on per-request basis: https://github.com/epam/ai-dial-rag/pull/25/files#diff-50338faca97bed0430a4330eb9445199dd51a3c636bf95bee711680786b234f4R211

I.e. if the default value for colpali_index_config.enabled is False and someone enables it in a particular request, the request will fail, because the ColpaliModelResource will not be in a valid usable state.

Author

Indeed. I have now removed the dependency on it from colpali_model_resource; it now depends only on ColpaliModelResourceConfig, it creates the model every time and its fields cannot be None anymore, but the ColpaliModelResource itself can now be None.

"""
# Patch create_retriever to use only ColPali retriever
with patch(
"aidial_rag.retrieval_chain.create_retriever", new=mock_create_retriever
Collaborator

I do not understand why we need to mock create_retriever here. The run_e2e_test function already replaces the ColpaliModelResource with CachedColpaliModelResource. Shouldn't it be enough to get the responses from cache?

Author

I have simplified mock_create_retriever; the ColPali model resource is now taken from the arguments, and it should be the CachedColpaliModelResource that was set in the app.

There is a problem with create_retriever: if the pdf is small, it will use AllDocumentsRetriever instead, and if the pdf is big enough, it will add the semantic retriever anyway, so here I explicitly return only the ColPali retriever (simplified sketch below).
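Roughly along these lines (a simplified sketch; the argument names and the retriever constructor are illustrative, not the exact signatures):

```python
def mock_create_retriever(model_resource, document_records, *args, **kwargs):
    # Bypass the size-based retriever selection (AllDocumentsRetriever for small
    # pdfs, semantic retriever added for big ones) and always return only the
    # ColPali retriever for the test.
    return ColpaliRetriever(model_resource, document_records)
```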

cached_app_class = create_cached_app_config()

# Patch the app creation to use cached model resource
with patch("aidial_rag.app.DialRAGApplication", new=cached_app_class):
Collaborator

We can just add a parameter to DialRAGApplication that would allow injecting the ColpaliModelResource dependency to be used here, instead of patching.

Author

Removed the cached app class.

Passing the class to be used as a parameter doesn't seem like a good decision, and adding the resource as a parameter to the app also seems odd.
I also tried replacing the resource instance after creation of DialRAGApplication, but getting its instance from the app is not that simple, so instead I just patched the class that is used when the ColPali resource is created, so it uses the cached version instead of the standard one (roughly as in the sketch below).
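I.e. the approach is roughly the following (a sketch; the exact import path of the patched class may differ in the test code, and app_config stands for whatever the application is normally constructed with):

```python
from unittest.mock import patch

# Replace the class used to create the ColPali resource with the cached test
# double for the duration of the e2e test.
with patch(
    "aidial_rag.app.ColpaliModelResource", new=CachedColpaliModelResource
):
    app = DialRAGApplication(app_config)  # the resource is created as the cached version
```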

Collaborator

Looks like you forgot to include the name of the file with the tests in the path. This may cause collisions for tests with the same name in different files.
See tests/cache/test_colpali_retriever/test_colpali_retrieval_e2e/1adc73f3b7d87d911383d346295d6d66.response.

Author

Moved it to a separate folder, but these embeddings are for the one pdf which I use in the tests, and it is shared, so I can't use a test name in it.

local_files_only=model_path
is not None, # if set use only local files from folder
).eval()
processor = processor_class.from_pretrained(model_name)
Collaborator

Why is the cache_dir not needed for processor_class.from_pretrained?

Author

Here the cache dir has the base weights for the image feature extractor and the LLM, while all of the processor's configs are in the model folder which was downloaded using snapshot_download, so the cache is not needed in this case.
