[Tests] Add recovery-based validation to LM-Eval tests #1750
base: main
Conversation
- Evaluate base model first to establish baseline
- Use recovery threshold (default 95%) instead of absolute values
- Maintain backward compatibility with warnings for absolute thresholds
- Fixes false positives from upstream model changes

Signed-off-by: Rahul Tuli <[email protected]>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement to the LM-Eval testing framework by implementing recovery-based validation. Instead of relying on fixed performance thresholds, tests now dynamically compare the accuracy of quantized models against their original, unquantized base models. This strategic shift aims to make the test suite more robust and less susceptible to external regressions in base models, thereby reducing false positives in continuous integration pipelines. The change ensures that the quality of quantization is maintained relative to the base model's performance, providing a more reliable and adaptive testing environment.
Highlights
- Recovery-Based Validation for LM-Eval Tests: LM-Eval tests now employ a recovery-based validation strategy, comparing the performance of quantized models directly against their base (unquantized) counterparts. This ensures that tests pass if the quantized model maintains a specified percentage (defaulting to 98%) of the base model's performance.
- Improved Test Robustness and Reduced False Positives: This change significantly enhances test resilience against external factors, such as regressions in upstream base models (e.g., due to transformers library updates). By focusing on relative performance recovery, the system avoids false positives that previously blocked CI, streamlining the development workflow.
- Backward Compatibility with Absolute Thresholds: The new validation mechanism is backward compatible. Existing test configurations that specify absolute performance thresholds will continue to function, but these thresholds will now serve as warnings rather than strict assertions, allowing for a smooth transition.
- Minimal Performance Impact: While the new approach introduces an additional evaluation of the base model per test, adding approximately 2-3 minutes, the results are cached within the test session. This minimizes the overall performance impact relative to the time required for quantization.
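The summary above mentions that base-model results are cached within the test session. For illustration only, that caching could be arranged with simple memoization; the function name and placeholder score below are hypothetical and not the PR's actual implementation:

```python
from functools import lru_cache

# Memoize base-model evaluation so each (model, task) pair is scored only once
# per test session; later tests reuse the cached metric.
@lru_cache(maxsize=None)
def get_base_metric(model_id: str, task: str) -> float:
    # In the real tests this would run lm-eval on the unquantized model;
    # a fixed placeholder score keeps the sketch self-contained.
    return 0.62
```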
Code Review
This pull request introduces a robust recovery-based validation system for LM-Eval tests, which is a great improvement for test stability against upstream model changes. The implementation is well-structured, with the evaluation logic nicely refactored into a reusable _evaluate_model function. I've identified a couple of areas for improvement: there's an inconsistency in the default recovery threshold value that should be aligned, and a bug in the recovery calculation when the base model's score is zero, which could lead to spurious test failures. Addressing these points will make the new validation logic even more reliable.
continue

# Calculate recovery
recovery = quant_val / base_val if base_val != 0 else 0
The current logic for calculating recovery when base_val is 0 is incorrect. If base_val is 0, recovery is set to 0. This will cause the assertion to fail even in cases of perfect recovery (e.g., quant_val is also 0) or infinite recovery (quant_val > 0), leading to incorrect test failures.
To fix this, handle the base_val == 0 case explicitly. If both values are 0, recovery is 100% (1.0). If only base_val is 0, it's an infinite improvement, so a value like 2.0 (200% recovery) ensures the assertion passes without causing issues in the degradation calculation.
recovery = quant_val / base_val if base_val != 0 else (1.0 if quant_val == 0 else 2.0)
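As a quick sanity check of the suggested expression, with illustrative metric values that are not from the test suite:

```python
def recovery(quant_val: float, base_val: float) -> float:
    # Suggested handling: both scores zero counts as perfect recovery; a zero
    # baseline with a nonzero quantized score is treated as 200% recovery.
    return quant_val / base_val if base_val != 0 else (1.0 if quant_val == 0 else 2.0)

assert recovery(0.0, 0.0) == 1.0  # both zero: full recovery
assert recovery(0.3, 0.0) == 2.0  # zero baseline, nonzero quantized score
assert recovery(0.5, 1.0) == 0.5  # ordinary case: plain ratio
```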
# Default to 95% recovery if not specified
if "recovery_threshold" not in lmeval_dict:
    lmeval_dict["recovery_threshold"] = 0.98
There's an inconsistency in the default recovery threshold. The code sets it to 0.98, while the comment and the PR description state 0.95. This is confusing. Please align the code, comment, and PR description to a single value (I'd recommend 0.95 to match the PR description).
You can also use dict.setdefault() for a more concise way to set the default value.
# Default to 95% recovery if not specified
lmeval_dict.setdefault("recovery_threshold", 0.95)
Modifies LM-Eval tests to use recovery-based validation, comparing quantized model performance against base model performance rather than fixed thresholds. This makes tests resilient to upstream model changes while maintaining quantization quality standards.
Motivation
Tests currently fail when base models regress due to external changes (e.g., transformers updates), even when quantization recovery remains excellent. This creates false positives that block CI and require manual investigation.
Example: Qwen2.5-VL tests fail with transformers >= 4.54.0 due to a base model regression (~10% accuracy drop), despite quantization maintaining the same relative performance.
Changes
Core Implementation
- tests/lmeval/test_lmeval.py: evaluate base model before quantization
Technical Details
Recovery Calculation:
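The details of the recovery calculation are collapsed here; the sketch below pieces it together from the snippets quoted in the review above. The base_results and quant_results dicts are assumed names for the per-metric scores, and the surrounding test plumbing is omitted:

```python
# Hypothetical per-metric scores; in the real tests these come from lm-eval
# runs of the unquantized and quantized models.
base_results = {"exact_match": 0.60}
quant_results = {"exact_match": 0.58}
recovery_threshold = 0.95

for metric, base_val in base_results.items():
    quant_val = quant_results[metric]
    # Zero-baseline guard as suggested in the review comment above.
    recovery = quant_val / base_val if base_val != 0 else (1.0 if quant_val == 0 else 2.0)
    assert recovery >= recovery_threshold, (
        f"{metric}: recovered {recovery:.1%} of the base score, "
        f"below the {recovery_threshold:.0%} threshold"
    )
```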
Config Support:
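For illustration, a test config could opt into a custom threshold; only the recovery_threshold key appears in the diff above, and the rest of this dict is assumed:

```python
# Hypothetical lmeval config entry; only "recovery_threshold" is taken from the PR's diff.
lmeval_dict = {
    "model": "hf",
    "task": "gsm8k",
    "recovery_threshold": 0.95,  # quantized model must keep >= 95% of the base score
}

# If the key is absent, the test code falls back to a default.
lmeval_dict.setdefault("recovery_threshold", 0.95)
```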
Backward Compatibility
Performance Impact
Files Changed
- tests/lmeval/test_lmeval.py - Added recovery validation logic
Type of Change:
Testing: