Quant tool: Don't store original outputs in MinMax Calibrator #26321
Conversation
You can commit the suggested changes from lintrunner.
LGTM.
Here is an AI-generated analysis:
This change is correct, effective, and well-implemented. It directly addresses the stated problem of high memory consumption by preventing the storage of unnecessary model outputs.
Code Review
Logic and Correctness: The core of the change is the list comprehension that filters the outputs of self.infer_session.run().
zip(self.infer_session.get_outputs(), self.infer_session.run(None, inputs), strict=False) correctly pairs each output's metadata (which contains the name) with its corresponding numerical value (the numpy array).
The condition if sess_o.name not in self.model_original_outputs accurately identifies which outputs are the augmented min/max values and which are the original (and potentially very large) model outputs like logits or KV cache.
By replacing the large original output arrays with None, the change effectively releases that memory while preserving the list's structure.
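For concreteness, here is a minimal sketch of the described change inside the calibrator's data-collection loop. The comprehension mirrors the snippet quoted above; the surrounding loop and the `data_reader` interface are reconstructed from context, not quoted from the PR:

```python
def collect_data(self, data_reader):
    """Run the augmented model and keep only what calibration needs."""
    while True:
        inputs = data_reader.get_next()
        if not inputs:
            break
        # Pair each output's metadata with its value. Augmented ReduceMin/ReduceMax
        # outputs are kept; original model outputs (logits, KV cache, ...) are
        # replaced with None so the large arrays can be garbage-collected.
        self.intermediate_outputs.append(
            [
                output if sess_o.name not in self.model_original_outputs else None
                for sess_o, output in zip(
                    self.infer_session.get_outputs(),
                    self.infer_session.run(None, inputs),
                    strict=False,
                )
            ]
        )
```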
Memory Optimization: This is an excellent optimization. Storing the full output tensors from a large language model on every calibration step can quickly lead to out-of-memory errors. Replacing these large tensors with None reduces the memory footprint of self.intermediate_outputs to only what is needed for calibration—the small min/max scalar values.
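As a rough back-of-the-envelope illustration (all sizes hypothetical), retaining a full logits tensor per calibration step adds up quickly:

```python
# Hypothetical LLM output sizes: batch=1, seq_len=2048, vocab=32000, float32
logits_bytes = 1 * 2048 * 32000 * 4      # 250 MiB of logits per calibration step
steps = 100
print(f"~{logits_bytes * steps / 2**30:.1f} GiB retained")  # ~24.4 GiB before the fix
```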
Downstream Compatibility: The change is safe. The compute_data method, which consumes self.intermediate_outputs, is already designed to handle this. The following line specifically filters the dictionary to only include the augmented outputs, thereby ignoring the entries that now have None as their value:
merged_added_output_dict = {
i: merged_output_dict[i] for i in merged_output_dict if i not in self.model_original_outputs
}
This ensures that no errors will occur later in the computation.
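A toy, self-contained illustration of that filter (the output names and values are hypothetical, chosen only to mimic the augmented-graph naming):

```python
import numpy as np

model_original_outputs = {"logits"}
# One merged step of intermediate outputs after the fix: the original output
# slot holds None, while the tiny augmented min/max values are intact.
merged_output_dict = {
    "logits": None,  # large tensor no longer retained
    "layer0_out_ReduceMin": np.float32(-3.2),
    "layer0_out_ReduceMax": np.float32(4.7),
}
merged_added_output_dict = {
    i: merged_output_dict[i] for i in merged_output_dict if i not in model_original_outputs
}
print(merged_added_output_dict)  # only the ReduceMin/ReduceMax entries survive
```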
Conclusion
The pull request is a definite improvement. The change is targeted, logically sound, and solves a significant potential memory issue without introducing any negative side effects. It's a necessary optimization for calibrating large models.
Description
The MinMax calibrator stores all the outputs of each session run as intermediate outputs, but it only needs the augmented min and max values. This causes memory usage to grow unnecessarily with each iteration over the calibration data, especially for LLMs, which have large outputs (logits and KV cache).
This PR fixes the issue by not storing the values of the model's original outputs, which are not needed for calibration.
Motivation and Context
Improve memory usage.