
Conversation

Contributor

@jambayk jambayk commented Oct 16, 2025

Description

The MinMax calibrator stores all outputs of each session run as intermediate outputs, even though it only needs the augmented min and max values. As a result, memory usage grows unnecessarily with each iteration over the calibration data, especially for LLMs, whose outputs (logits and KV cache) are large.

This PR fixes the issue by no longer storing values from the model's original outputs, which are not needed for calibration.

Motivation and Context

Improve memory usage.

Contributor

@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Contributor

@tianleiwu tianleiwu left a comment


LGTM.

Here is an analysis from AI:

This change is correct, effective, and well-implemented. It directly addresses the stated problem of high memory consumption by preventing the storage of unnecessary model outputs.

Code Review
Logic and Correctness: The core of the change is the list comprehension that filters the outputs of self.infer_session.run().

zip(self.infer_session.get_outputs(), self.infer_session.run(None, inputs), strict=False) correctly pairs each output's metadata (which contains the name) with its corresponding numerical value (the numpy array).

The condition if sess_o.name not in self.model_original_outputs accurately identifies which outputs are the augmented min/max values and which are the original (and potentially very large) model outputs like logits or KV cache.

By replacing the large original output arrays with None, the change effectively releases that memory while preserving the list's structure.
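The filtering pattern described above can be sketched as follows. This is a minimal, self-contained illustration of the reviewed logic, not the PR's actual code: `MockSession`, the output names, and `model_original_outputs` are hypothetical stand-ins for the real `InferenceSession` and calibrator state.

```python
from types import SimpleNamespace

# Hypothetical stand-in for an ONNX Runtime InferenceSession whose graph was
# augmented with ReduceMin/ReduceMax outputs alongside the original outputs.
class MockSession:
    def get_outputs(self):
        # Output metadata: one large original output plus two augmented ones.
        return [SimpleNamespace(name=n)
                for n in ("logits", "conv_out_ReduceMin", "conv_out_ReduceMax")]

    def run(self, output_names, inputs):
        # Values in the same order as get_outputs().
        return [[0.1] * 1000, [0.0], [1.0]]

sess = MockSession()
model_original_outputs = {"logits"}  # names of the model's original outputs

# The reviewed pattern: keep the augmented min/max values, but replace each
# original model output with None so its (potentially huge) array is freed
# while the list keeps one entry per output.
filtered = [
    value if sess_o.name not in model_original_outputs else None
    for sess_o, value in zip(sess.get_outputs(), sess.run(None, {}))
]
```

After this step, `filtered` holds `None` in place of the logits array and only the small min/max values survive in memory.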

Memory Optimization: This is an excellent optimization. Storing the full output tensors from a large language model on every calibration step can quickly lead to out-of-memory errors. Replacing these large tensors with None reduces the memory footprint of self.intermediate_outputs to only what is needed for calibration—the small min/max scalar values.

Downstream Compatibility: The change is safe. The compute_data method, which consumes self.intermediate_outputs, is already designed to handle this. The following line specifically filters the dictionary to only include the augmented outputs, thereby ignoring the entries that now have None as their value:

merged_added_output_dict = {
    i: merged_output_dict[i] for i in merged_output_dict if i not in self.model_original_outputs
}

This ensures that no errors will occur later in the computation.
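A small runnable demo of that downstream filter, under the same illustrative names as above (the keys and `model_original_outputs` are hypothetical; only the dict-comprehension pattern comes from the quoted code):

```python
model_original_outputs = {"logits"}

# After the PR, entries for original model outputs hold None instead of
# large arrays; the augmented min/max entries keep their values.
merged_output_dict = {
    "logits": None,
    "conv_out_ReduceMin": [0.0],
    "conv_out_ReduceMax": [1.0],
}

# The filter quoted above drops original-output entries by name, so the
# None placeholders never reach the min/max computation.
merged_added_output_dict = {
    i: merged_output_dict[i]
    for i in merged_output_dict
    if i not in model_original_outputs
}
```

Because the filter keys off output names rather than values, the `None` placeholders are simply skipped, which is why the change is safe for `compute_data`.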

Conclusion

The pull request is a definite improvement. The change is targeted, logically sound, and solves a significant potential memory issue without introducing any negative side effects. It's a necessary optimization for calibrating large models.

@jambayk jambayk merged commit bbd0a80 into main Oct 16, 2025
95 of 96 checks passed
@jambayk jambayk deleted the jambayk/calib-minmax branch October 16, 2025 18:29
JonathanC-ARM pushed a commit to JonathanC-ARM/onnxruntime that referenced this pull request Oct 24, 2025
…oft#26321)

fs-eire pushed a commit that referenced this pull request Oct 24, 2025