Conversation

rahul-tuli
Collaborator

Modifies LM-Eval tests to use recovery-based validation, comparing quantized model performance against base model performance rather than fixed thresholds. This makes tests resilient to upstream model changes while maintaining quantization quality standards.

Motivation

Tests currently fail when base models regress due to external changes (e.g., transformers updates), even when quantization recovery remains excellent. This creates false positives that block CI and require manual investigation.

Example: Qwen2.5-VL tests fail with transformers >= 4.54.0 due to a base model regression (~10% accuracy drop), despite quantization maintaining the same relative performance.

Changes

Core Implementation

  • Modified tests/lmeval/test_lmeval.py to evaluate the base model before quantization
  • Added recovery validation: tests pass if the quantized model maintains ≥95% of the base model's performance
  • Preserved backward compatibility: absolute thresholds are still checked, but now emit warnings instead of failures

Technical Details

Recovery Calculation:

recovery = quantized_accuracy / base_accuracy
assert recovery >= 0.95  # Default threshold
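
A minimal sketch of how this check could be applied across all tracked metrics (the helper name and dictionary shapes are illustrative, not the actual code in tests/lmeval/test_lmeval.py):

def assert_recovery(base_metrics, quant_metrics, threshold=0.95):
    """Fail if any metric recovers less than `threshold` of its base score."""
    for metric, base_val in base_metrics.items():
        quant_val = quant_metrics[metric]
        # Assumes base_val > 0; the zero-baseline edge case is discussed
        # in the review comments further down.
        recovery = quant_val / base_val
        assert recovery >= threshold, (
            f"{metric}: recovery {recovery:.1%} is below the required {threshold:.1%}"
        )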

Config Support:

lmeval:
  recovery_threshold: 0.95  # Optional, defaults to 0.95
  metrics:
    acc,none: 0.8333  # Now used as warning threshold
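
For illustration, the threshold could be read from the parsed config like this (lmeval_dict matches the name visible in the review snippet further down; the file path is hypothetical):

import yaml

with open("test_config.yaml") as f:
    lmeval_dict = yaml.safe_load(f)["lmeval"]

# Optional key; fall back to the 0.95 default when absent
threshold = lmeval_dict.get("recovery_threshold", 0.95)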

Backward Compatibility

  • Existing configs work without modification
  • Default recovery threshold: 95%
  • Absolute thresholds still checked, but as warnings only (see the sketch after this list)
  • No breaking changes to test infrastructure
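
A sketch of the warning-only path for legacy absolute thresholds, using the standard warnings module (the exact mechanism in the test file may differ):

import warnings

def check_absolute_thresholds(quant_metrics, expected_metrics):
    """Warn, rather than fail, when a legacy absolute threshold is missed."""
    for metric, expected in expected_metrics.items():
        actual = quant_metrics[metric]
        if actual < expected:
            warnings.warn(
                f"{metric}={actual:.4f} is below the legacy absolute threshold "
                f"{expected:.4f}; not failing because recovery validation "
                "is now the gating check"
            )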

Performance Impact

  • Additional base model evaluation per test (~2-3 minutes)
  • Results are cached within the test session (see the fixture sketch after this list)
  • Overall impact minimal compared to quantization time
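
One plausible way to cache the baseline for a whole session is a session-scoped pytest fixture; this assumes the refactored _evaluate_model helper mentioned in the review takes a model identifier, and the model name is only an example:

import pytest

# Assumed import; the helper's real signature may differ.
from tests.lmeval.test_lmeval import _evaluate_model

@pytest.fixture(scope="session")
def base_model_metrics():
    # Evaluated once per session and reused by every test needing the baseline
    return _evaluate_model("Qwen/Qwen2.5-VL-7B-Instruct")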

Files Changed

  • tests/lmeval/test_lmeval.py - Added recovery validation logic
  • No config changes required

Type of Change:

  • Bug fix
  • New feature
  • Test improvement
  • Documentation

Testing:

  • Tests pass locally
  • Validated with affected VL models
  • Backward compatibility verified

- Evaluate base model first to establish baseline
- Use recovery threshold (default 95%) instead of absolute values
- Maintain backward compatibility with warnings for absolute thresholds
- Fixes false positives from upstream model changes

Signed-off-by: Rahul Tuli <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.


@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the LM-Eval testing framework by implementing recovery-based validation. Instead of relying on fixed performance thresholds, tests now dynamically compare the accuracy of quantized models against their original, unquantized base models. This strategic shift aims to make the test suite more robust and less susceptible to external regressions in base models, thereby reducing false positives in continuous integration pipelines. The change ensures that the quality of quantization is maintained relative to the base model's performance, providing a more reliable and adaptive testing environment.

Highlights

  • Recovery-Based Validation for LM-Eval Tests: LM-Eval tests now employ a recovery-based validation strategy, comparing the performance of quantized models directly against their base (unquantized) counterparts. This ensures that tests pass if the quantized model maintains a specified percentage (defaulting to 98%) of the base model's performance.
  • Improved Test Robustness and Reduced False Positives: This change significantly enhances test resilience against external factors, such as regressions in upstream base models (e.g., due to transformers library updates). By focusing on relative performance recovery, the system avoids false positives that previously blocked CI, streamlining the development workflow.
  • Backward Compatibility with Absolute Thresholds: The new validation mechanism is backward compatible. Existing test configurations that specify absolute performance thresholds will continue to function, but these thresholds will now serve as warnings rather than strict assertions, allowing for a smooth transition.
  • Minimal Performance Impact: While the new approach introduces an additional evaluation of the base model per test, adding approximately 2-3 minutes, the results are cached within the test session. This minimizes the overall performance impact relative to the time required for quantization.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a robust recovery-based validation system for LM-Eval tests, which is a great improvement for test stability against upstream model changes. The implementation is well-structured, with the evaluation logic nicely refactored into a reusable _evaluate_model function. I've identified a couple of areas for improvement: there's an inconsistency in the default recovery threshold value that should be aligned, and a bug in the recovery calculation when the base model's score is zero, which could lead to false negatives in tests. Addressing these points will make the new validation logic even more reliable.

continue

# Calculate recovery
recovery = quant_val / base_val if base_val != 0 else 0

Severity: high

The current logic for calculating recovery when base_val is 0 is incorrect. If base_val is 0, recovery is set to 0. This will cause the assertion to fail even in cases of perfect recovery (e.g., quant_val is also 0) or infinite recovery (quant_val > 0), leading to incorrect test failures.

To fix this, you should handle the base_val == 0 case explicitly. If both values are 0, recovery is 100% (1.0). If only base_val is 0, it's an infinite improvement, so we can use a value like 2.0 (200% recovery) to ensure the assertion passes without causing issues in the degradation calculation.

recovery = quant_val / base_val if base_val != 0 else (1.0 if quant_val == 0 else 2.0)

Comment on lines +87 to +89
# Default to 95% recovery if not specified
if "recovery_threshold" not in lmeval_dict:
    lmeval_dict["recovery_threshold"] = 0.98

Severity: medium

There's an inconsistency in the default recovery threshold. The code sets it to 0.98, while the comment and the PR description state 0.95. This is confusing. Please align the code, comment, and PR description to a single value (I'd recommend 0.95 to match the PR description).

You can also use dict.setdefault() for a more concise way to set a default value.

# Default to 95% recovery if not specified
lmeval_dict.setdefault("recovery_threshold", 0.95)

@dsikka self-requested a review on August 18, 2025 at 17:13