Conversation

rahul-tuli
Collaborator

Modifies LM-Eval tests to use recovery-based validation, comparing quantized model performance against base model performance rather than fixed thresholds. This makes tests resilient to upstream model changes while maintaining quantization quality standards.

Motivation

Tests currently fail when base models regress due to external changes (e.g., transformers updates), even when quantization recovery remains excellent. This creates false positives that block CI and require manual investigation.

Example: Qwen2.5-VL tests fail with transformers >= 4.54.0 due to a base model regression (~10% accuracy drop), despite quantization maintaining the same relative performance.

Changes

Core Implementation

  • Modified tests/lmeval/test_lmeval.py to evaluate the base model before quantization
  • Added recovery validation: tests pass if the quantized model maintains ≥95% of the base model's performance
  • Preserved backward compatibility: absolute thresholds are still checked, but now emit warnings instead of failures

Technical Details

Recovery Calculation:

recovery = quantized_accuracy / base_accuracy
assert recovery >= 0.95  # Default threshold
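
A minimal sketch of how this check could be applied across all tracked metrics (the helper name and dictionary shapes are illustrative, not the actual code in tests/lmeval/test_lmeval.py):

def assert_recovery(base_metrics, quant_metrics, threshold=0.95):
    """Fail if any metric recovers less than `threshold` of its base score."""
    for metric, base_val in base_metrics.items():
        quant_val = quant_metrics[metric]
        # Assumes base_val > 0; the zero-baseline edge case is discussed
        # in the review comments further down.
        recovery = quant_val / base_val
        assert recovery >= threshold, (
            f"{metric}: recovery {recovery:.1%} is below the required {threshold:.1%}"
        )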

Config Support:

lmeval:
  recovery_threshold: 0.95  # Optional, defaults to 0.95
  metrics:
    acc,none: 0.8333  # Now used as warning threshold
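
For illustration, the threshold could be read from the parsed config like this (lmeval_dict matches the name visible in the review snippet further down; the file path is hypothetical):

import yaml

with open("test_config.yaml") as f:
    lmeval_dict = yaml.safe_load(f)["lmeval"]

# Optional key; fall back to the 0.95 default when absent
threshold = lmeval_dict.get("recovery_threshold", 0.95)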

Backward Compatibility

  • Existing configs work without modification
  • Default recovery threshold: 95%
  • Absolute thresholds still checked, but as warnings only (see the sketch after this list)
  • No breaking changes to test infrastructure
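
A sketch of the warning-only path for legacy absolute thresholds, using the standard warnings module (the exact mechanism in the test file may differ):

import warnings

def check_absolute_thresholds(quant_metrics, expected_metrics):
    """Warn, rather than fail, when a legacy absolute threshold is missed."""
    for metric, expected in expected_metrics.items():
        actual = quant_metrics[metric]
        if actual < expected:
            warnings.warn(
                f"{metric}={actual:.4f} is below the legacy absolute threshold "
                f"{expected:.4f}; not failing because recovery validation "
                "is now the gating check"
            )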

Performance Impact

  • Additional base model evaluation per test (~2-3 minutes)
  • Results are cached within the test session (see the fixture sketch after this list)
  • Overall impact minimal compared to quantization time
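
One plausible way to cache the baseline for a whole session is a session-scoped pytest fixture; this assumes the refactored _evaluate_model helper mentioned in the review takes a model identifier, and the model name is only an example:

import pytest

# Assumed import; the helper's real signature may differ.
from tests.lmeval.test_lmeval import _evaluate_model

@pytest.fixture(scope="session")
def base_model_metrics():
    # Evaluated once per session and reused by every test needing the baseline
    return _evaluate_model("Qwen/Qwen2.5-VL-7B-Instruct")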

Files Changed

  • tests/lmeval/test_lmeval.py - Added recovery validation logic
  • No config changes required

Type of Change:

  • Bug fix
  • New feature
  • Test improvement
  • Documentation

Testing:

  • Tests pass locally
  • Validated with affected VL models
  • Backward compatibility verified

- Evaluate base model first to establish baseline
- Use recovery threshold (default 95%) instead of absolute values
- Maintain backward compatibility with warnings for absolute thresholds
- Fixes false positives from upstream model changes

Signed-off-by: Rahul Tuli <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.


@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the LM-Eval testing framework by implementing recovery-based validation. Instead of relying on fixed performance thresholds, tests now dynamically compare the accuracy of quantized models against their original, unquantized base models. This strategic shift aims to make the test suite more robust and less susceptible to external regressions in base models, thereby reducing false positives in continuous integration pipelines. The change ensures that the quality of quantization is maintained relative to the base model's performance, providing a more reliable and adaptive testing environment.

Highlights

  • Recovery-Based Validation for LM-Eval Tests: LM-Eval tests now employ a recovery-based validation strategy, comparing the performance of quantized models directly against their base (unquantized) counterparts. This ensures that tests pass if the quantized model maintains a specified percentage (defaulting to 98%) of the base model's performance.
  • Improved Test Robustness and Reduced False Positives: This change significantly enhances test resilience against external factors, such as regressions in upstream base models (e.g., due to transformers library updates). By focusing on relative performance recovery, the system avoids false positives that previously blocked CI, streamlining the development workflow.
  • Backward Compatibility with Absolute Thresholds: The new validation mechanism is backward compatible. Existing test configurations that specify absolute performance thresholds will continue to function, but these thresholds will now serve as warnings rather than strict assertions, allowing for a smooth transition.
  • Minimal Performance Impact: While the new approach introduces an additional evaluation of the base model per test, adding approximately 2-3 minutes, the results are cached within the test session. This minimizes the overall performance impact relative to the time required for quantization.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a robust recovery-based validation system for LM-Eval tests, which is a great improvement for test stability against upstream model changes. The implementation is well-structured, with the evaluation logic nicely refactored into a reusable _evaluate_model function. I've identified a couple of areas for improvement: there's an inconsistency in the default recovery threshold value that should be aligned, and a bug in the recovery calculation when the base model's score is zero, which could lead to false negatives in tests. Addressing these points will make the new validation logic even more reliable.

continue

# Calculate recovery
recovery = quant_val / base_val if base_val != 0 else 0

Severity: high

The current logic for calculating recovery when base_val is 0 is incorrect. If base_val is 0, recovery is set to 0. This will cause the assertion to fail even in cases of perfect recovery (e.g., quant_val is also 0) or infinite recovery (quant_val > 0), leading to incorrect test failures.

To fix this, you should handle the base_val == 0 case explicitly. If both values are 0, recovery is 100% (1.0). If only base_val is 0, it's an infinite improvement, so we can use a value like 2.0 (200% recovery) to ensure the assertion passes without causing issues in the degradation calculation.

recovery = quant_val / base_val if base_val != 0 else (1.0 if quant_val == 0 else 2.0)

Comment on lines +87 to +89
# Default to 95% recovery if not specified
if "recovery_threshold" not in lmeval_dict:
    lmeval_dict["recovery_threshold"] = 0.98

Severity: medium

There's an inconsistency in the default recovery threshold. The code sets it to 0.98, while the comment and the PR description state 0.95. This is confusing. Please align the code, comment, and PR description to a single value (I'd recommend 0.95 to match the PR description).

You can also use dict.setdefault() for a more concise way to set a default value.

# Default to 95% recovery if not specified
lmeval_dict.setdefault("recovery_threshold", 0.95)

@dsikka self-requested a review on August 18, 2025 at 17:13