
Conversation

@luke-lin-vmc (Contributor) commented Sep 23, 2025

Whisper sample code to enable model caching on GPU and NPU

This is a follow-up to #2751.

Sample code references:
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/visual_language_chat/encrypted_model_vlm.py#L87
https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/cpp/text_generation/encrypted_model_causal_lm.cpp#L52

OPTIMIZE_SIZE and encryption are not included. The main performance concern for Whisper is pipeline speed: since Whisper models are much smaller than LLMs, size optimization offers only marginal savings while potentially adding latency, and model encryption can likewise introduce additional latency.
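
For reference, the excluded option would look roughly like the sketch below. This is only a hypothetical illustration based on OpenVINO's `ov::cache_mode` property; the cache directory name `"whisper_cache"` is an assumption and not part of this PR. The actual change in this PR, shown after it, only sets `ov::cache_dir` on GPU/NPU.

```cpp
// Hypothetical sketch of the excluded option, NOT part of this PR.
// OPTIMIZE_SIZE produces smaller cache blobs at the cost of potentially
// slower cache loading, which is why it is skipped for Whisper.
#include "openvino/openvino.hpp"  // ov::AnyMap, ov::cache_dir, ov::cache_mode

ov::AnyMap ov_config;
ov_config.insert(ov::cache_dir("whisper_cache"));                // illustrative cache path
ov_config.insert(ov::cache_mode(ov::CacheMode::OPTIMIZE_SIZE));  // smaller blobs, extra latency
```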

```cpp
ov::AnyMap ov_config;
if (device == "NPU" || device.find("GPU") != std::string::npos) { // need to handle cases like "GPU", "GPU.0" and "GPU.1"
    // Cache compiled models on disk for GPU and NPU to save time on the
    // next run. It's not beneficial for CPU.
```
Collaborator commented:

Why is it not beneficial for CPU?

@luke-lin-vmc (Contributor Author) replied Sep 23, 2025:

  1. This comment was simply copied from the reference sample code.
  2. AFAIK, the CPU plugin's "compile" step is mostly graph rewrites and primitive selection; it typically takes milliseconds to a few hundred milliseconds, not seconds to minutes as on GPU/NPU.
  3. Most importantly, enabling model caching on CPU causes the Whisper pipeline to crash. This looks like a bug that needs further investigation, so for now model caching is enabled only on GPU and NPU to avoid the issue (see the sketch below).
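
For the record, here is a minimal, self-contained sketch of how the device check and cache configuration fit around the Whisper pipeline. The cache directory name `"whisper_cache"`, the command-line handling, and the silent one-second input are illustrative assumptions, not the exact code in this PR:

```cpp
// Minimal sketch (assumptions noted above), not the exact PR diff.
#include <iostream>
#include <string>
#include <vector>

#include "openvino/genai/whisper_pipeline.hpp"
#include "openvino/openvino.hpp"

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <MODELS_PATH> [DEVICE]\n";
        return 1;
    }
    std::string models_path = argv[1];
    std::string device = (argc > 2) ? argv[2] : "CPU";  // e.g. "CPU", "GPU", "GPU.1", "NPU"

    ov::AnyMap ov_config;
    // Handle "GPU", "GPU.0", "GPU.1", ... as well as "NPU".
    if (device == "NPU" || device.find("GPU") != std::string::npos) {
        // Cache compiled models on disk for GPU and NPU so subsequent runs
        // skip the expensive compile step.
        ov_config.insert(ov::cache_dir("whisper_cache"));  // illustrative path
    }

    ov::genai::WhisperPipeline pipeline(models_path, device, ov_config);

    // Whisper expects 16 kHz mono float PCM; WAV loading is omitted here,
    // so one second of silence stands in for real audio.
    ov::genai::RawSpeechInput raw_speech(16000, 0.0f);
    auto result = pipeline.generate(raw_speech);
    std::cout << result << std::endl;
    return 0;
}
```

On the first run with GPU or NPU, the plugin writes compiled blobs into the cache directory; later runs load those blobs instead of recompiling, which is where the startup savings come from.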
