Quant tool: Don't store original outputs in MinMax Calibrator #26321
Conversation
You can commit the suggested changes from lintrunner.
LGTM.
Here is an AI-generated analysis:
This change is correct, effective, and well-implemented. It directly addresses the stated problem of high memory consumption by preventing the storage of unnecessary model outputs.
Code Review
Logic and Correctness: The core of the change is the list comprehension that filters the outputs of self.infer_session.run().
zip(self.infer_session.get_outputs(), self.infer_session.run(None, inputs), strict=False) correctly pairs each output's metadata (which contains the name) with its corresponding numerical value (the numpy array).
The condition if sess_o.name not in self.model_original_outputs accurately identifies which outputs are the augmented min/max values and which are the original (and potentially very large) model outputs like logits or KV cache.
By replacing the large original output arrays with None, the change effectively releases that memory while preserving the list's structure.
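For concreteness, here is a minimal sketch of the described change inside the calibrator's data-collection loop. The comprehension mirrors the snippet quoted above; the surrounding loop and the `data_reader` interface are reconstructed from context, not quoted from the PR:

```python
def collect_data(self, data_reader):
    """Run the augmented model and keep only what calibration needs."""
    while True:
        inputs = data_reader.get_next()
        if not inputs:
            break
        # Pair each output's metadata with its value. Augmented ReduceMin/ReduceMax
        # outputs are kept; original model outputs (logits, KV cache, ...) are
        # replaced with None so the large arrays can be garbage-collected.
        self.intermediate_outputs.append(
            [
                output if sess_o.name not in self.model_original_outputs else None
                for sess_o, output in zip(
                    self.infer_session.get_outputs(),
                    self.infer_session.run(None, inputs),
                    strict=False,
                )
            ]
        )
```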
Memory Optimization: This is an excellent optimization. Storing the full output tensors from a large language model on every calibration step can quickly lead to out-of-memory errors. Replacing these large tensors with None reduces the memory footprint of self.intermediate_outputs to only what is needed for calibration—the small min/max scalar values.
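As a rough back-of-the-envelope illustration (all sizes hypothetical), retaining a full logits tensor per calibration step adds up quickly:

```python
# Hypothetical LLM output sizes: batch=1, seq_len=2048, vocab=32000, float32
logits_bytes = 1 * 2048 * 32000 * 4      # 250 MiB of logits per calibration step
steps = 100
print(f"~{logits_bytes * steps / 2**30:.1f} GiB retained")  # ~24.4 GiB before the fix
```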
Downstream Compatibility: The change is safe. The compute_data method, which consumes self.intermediate_outputs, is already designed to handle this. The following line specifically filters the dictionary to only include the augmented outputs, thereby ignoring the entries that now have None as their value:
merged_added_output_dict = {
i: merged_output_dict[i] for i in merged_output_dict if i not in self.model_original_outputs
}
This ensures that no errors will occur later in the computation.
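A toy, self-contained illustration of that filter (the output names and values are hypothetical, chosen only to mimic the augmented-graph naming):

```python
import numpy as np

model_original_outputs = {"logits"}
# One merged step of intermediate outputs after the fix: the original output
# slot holds None, while the tiny augmented min/max values are intact.
merged_output_dict = {
    "logits": None,  # large tensor no longer retained
    "layer0_out_ReduceMin": np.float32(-3.2),
    "layer0_out_ReduceMax": np.float32(4.7),
}
merged_added_output_dict = {
    i: merged_output_dict[i] for i in merged_output_dict if i not in model_original_outputs
}
print(merged_added_output_dict)  # only the ReduceMin/ReduceMax entries survive
```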
Conclusion
The pull request is a definite improvement. The change is targeted, logically sound, and solves a significant potential memory issue without introducing any negative side effects. It's a necessary optimization for calibrating large models.
Description
The MinMax calibrator stores all the outputs of each session run as intermediate outputs, but it only needs the augmented min and max values. This causes memory usage to grow unnecessarily with each iteration over the calibration data, especially for LLMs, which have large outputs (logits and KV cache).
This PR fixes the issue by not storing the values of the model's original outputs, which are not needed for calibration.
Motivation and Context
Improve memory usage.