Added GitHub Actions to Test LLM Accuracy Scripts #2206


Draft · wants to merge 8 commits into master

Conversation

SridharRambhatla

For issue #1898

Added .github/workflows/llm_accuracy_script_test.yml, which tests the accuracy scripts of four LLM models.

Script Improvements

  • language/llama3.1-405b/evaluate-accuracy.py - Added mock dataset support
  • language/mixtral-8x7b/evaluate-accuracy.py - Added mock dataset + error handling
  • language/llama2-70b/evaluate-accuracy.py - Fixed pandas import

Models Tested

  • Llama3.1-405b: Multi-task evaluation (ROUGE, NIAH, QA)
  • Mixtral-8x7b: Three evaluation types (OpenOrca, GSM8K, MBXP)
  • Llama2-70b: Text generation quality (ROUGE metrics)
  • DeepSeek-R1: Academic benchmarks (import testing)

Testing

Install act locally (e.g. choco install act-cli on Windows), then run the jobs:

act -j test-llama3-accuracy
act -j test-mixtral-accuracy
act -j test-llama2-accuracy
act -j test-deepseek-accuracy

All tests pass successfully.
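
Roughly, each job follows the same shape as the simplified sketch below; the dependency list, helper script name, and evaluate-accuracy.py flags are placeholders rather than the exact contents of llm_accuracy_script_test.yml:

```yaml
# Simplified sketch of one job in .github/workflows/llm_accuracy_script_test.yml.
# Step names, dependencies, and script arguments below are placeholders.
name: Test LLM accuracy scripts
on: [pull_request]

jobs:
  test-llama3-accuracy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install dependencies  # assumed package set
        run: pip install pandas numpy rouge-score nltk evaluate
      - name: Generate sample accuracy log for Llama3.1  # mock data stand-in for a real run
        run: python generate_mock_accuracy_log.py  # hypothetical helper script
      - name: Run accuracy checker against the mock log
        # Flag names are placeholders; the script's real CLI may differ.
        run: |
          python language/llama3.1-405b/evaluate-accuracy.py \
            --mlperf-accuracy-file mock_results/mlperf_log_accuracy.json \
            --dataset-file mock_data/sample_dataset.pkl \
            --dtype int32
```

With act installed, running act -j test-llama3-accuracy executes just this job locally.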

@SridharRambhatla requested a review from a team as a code owner on June 10, 2025, 17:07
Contributor

github-actions bot commented Jun 10, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@arjunsuresh
Contributor

Thank you Sridhar for your contribution. Can you please sign the MLCommons CLA?

@anandhu-eng is it possible to replace the dummy accuracy text content in this PR with real values from the actual accuracy log?

@anandhu-eng
Contributor

Hi @arjunsuresh , since results for the LLM models are absent in our unofficial test submission repository, I think we have two options:

  1. Rerun any workflow and obtain the full accuracy log
  2. Obtain a truncated accuracy log from any of the previous submissions (with proper reference)

Also, to use an original accuracy log, the original datasets would have to be downloaded (I think the combined size of the three datasets would not exceed the storage provided by the GitHub runner).

How about using the MLCFlow accuracy script for testing, as it handles both the dataset download and the accuracy check with a single-line command?

@arjunsuresh
Contributor

Thanks @anandhu-eng. We probably don't need the full accuracy log, but, say, the accuracy log for 10 inputs and the expected accuracy metric value for those. That should be good enough to validate that the scripts are working fine. And yes, for the dependency part, we can make use of MLCFlow.

@anandhu-eng
Contributor

That seems doable, thanks @arjunsuresh

How do we proceed on this? @SridharRambhatla would you be interested in doing the changes? If yes, I can help with whatever information you need.

@SridharRambhatla
Author

Hi @anandhu-eng and @arjunsuresh , sure, I can make the required changes. Could you please share the accuracy logs and any other information that is needed?
We can also connect over Discord if needed; happy to hop on a call to get this right :)

Also, I'll sign the MLCommons CLA. Thanks!

@anandhu-eng
Contributor

anandhu-eng commented Jun 15, 2025

Hi @SridharRambhatla ,

In inference, we have an automation framework named MLCFlow which simplifies various stages of benchmark runs. When checking accuracy, we typically require the dataset, a set of libraries to run the accuracy checker, the accuracy log file, and sometimes the model itself. These can be considered the dependencies needed for performing an accuracy check.

You can visualize MLC scripts as individual scripts, each handling a specific task (e.g., one for downloading the dataset, another for running the checker, etc.). These scripts can be reused and called as dependencies from other scripts. You can find how scripts are used as dependencies for the LLaMA 2 accuracy check here.

Thanks to this modular design, users only need to call the top-level script to run the accuracy check. For example, refer to this usage.
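
Conceptually, a script's meta file just lists the other scripts it needs, something like the sketch below; the field names and tags are only indicative rather than copied from the repo, so please refer to the actual LLaMA 2 accuracy-check script for the real definition:

```yaml
# Rough idea only: an MLC script's meta file pulls other scripts in as dependencies.
# Field names and tags here are indicative; see the real accuracy-check script in
# mlperf-automations for the actual definition.
deps:
  - tags: get,python3                                   # sets up Python
  - tags: get,dataset,openorca                          # downloads the dataset
  - tags: get,generic-python-lib,_package.rouge_score   # library needed by the checker
```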

What needs to be done:

  • Get sample accuracy log files from previous submissions. Ensure proper attribution is given for each source.
  • Create separate directories for each benchmark and copy the respective mlperf_log_accuracy.json files into them.
  • Follow the automation command for the accuracy checker from each benchmark’s README and provide the directory path using the --result_dir argument (a rough example follows this list).
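
For example, a step for one benchmark could look roughly like the sketch below. Only --result_dir is the essential part here; the mlcr tags and paths are placeholders, and the exact command should be taken from the benchmark's README:

```yaml
# Hypothetical workflow step for one benchmark. The tags passed to mlcr are
# placeholders; use the real accuracy-checker command from the benchmark's README.
# Assumes MLCFlow (the mlcr CLI) and the automation scripts are already installed.
- name: Run MLCFlow accuracy checker for Llama2-70b
  run: |
    mlcr process,mlperf,accuracy,_open-orca \
      --result_dir=$GITHUB_WORKSPACE/accuracy-logs/llama2-70b
```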

@arjunsuresh , I have two proposals:

  1. How about we create this GitHub Action in the mlperf-automations repo and trigger it from here, similar to how it's done for submission generation? (A sketch of the workflow_call wiring follows this list.)
  2. Let’s skip calling the download-model dependency during accuracy checks for LLMs if the Hugging Face repo path is explicitly provided by the user. Previously, we used download-model to fetch tokenizers needed by the accuracy checker. But based on the recent PR, it seems we can just pass the Hugging Face repo path, which will only download the tokenizer and skip model weights.
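
To make proposal 1 concrete, the wiring could look roughly like this (file names, inputs, and job contents are placeholders; neither workflow exists yet):

```yaml
# Sketch of the reusable-workflow idea in proposal 1. File names, inputs, and
# job contents are placeholders; neither of these workflows exists yet.

# In mlcommons/mlperf-automations: .github/workflows/test-llm-accuracy.yml
name: Test LLM accuracy scripts (reusable)
on:
  workflow_call:
    inputs:
      benchmark:
        required: true
        type: string
jobs:
  accuracy-check:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the ${{ inputs.benchmark }} accuracy checker here"
---
# In mlcommons/inference: a thin caller workflow that triggers the one above
name: LLM accuracy checks
on: [pull_request]
jobs:
  llama2-70b:
    uses: mlcommons/mlperf-automations/.github/workflows/test-llm-accuracy.yml@main
    with:
      benchmark: llama2-70b
```

The caller in this repo stays a few lines long, while the actual job logic lives in mlperf-automations.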

Also, @SridharRambhatla — Discord might be a good option for further sync. You can join the Discord channel through this link.

@arjunsuresh
Contributor

@anandhu-eng yes, we can skip the download-model step.

But is there any advantage in having the GitHub Action under the mlperf-automations repository?

@SridharRambhatla
Author

Hi @anandhu-eng, got it. I'll take the mlperf_log_accuracy.json files you mentioned and add them to this workflow in place of the dummy data currently generated at the "Generate sample accuracy log for Llama3.1" step (and the same for the other models).
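
Roughly, the swap could look like this for each model, assuming the truncated logs get checked into the repo under something like tests/accuracy-logs/ (the path, step name, and result directory below are just placeholders):

```yaml
# Hypothetical replacement for the "Generate sample accuracy log for Llama3.1" step:
# copy a truncated, properly attributed log from the repo instead of generating dummy data.
# The tests/accuracy-logs/ location and the results/ directory are placeholders.
- name: Use sample accuracy log for Llama3.1
  run: |
    mkdir -p results/llama3_1-405b
    cp tests/accuracy-logs/llama3_1-405b/mlperf_log_accuracy.json results/llama3_1-405b/
```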

Looks like the channel invite is invalid now. I'm already part of the MLCommons server on Discord; maybe we can schedule a call whenever you're free? We can walk through the approach once and clear up any questions to get this right. Thanks!

@anandhu-eng
Contributor

But is there any advantage in having the GitHub Action under the mlperf-automations repository?

Hi @arjunsuresh , the following points make me lean toward keeping the tests in the mlperf-automations repo and triggering them through a workflow call whenever we want to use them in any other repository within the MLCommons organization:

  • mlperf-automations would be the umbrella repository that houses everything related to MLC scripts for MLCommons
  • We might be able to catch errors faster, as the number of PRs we do per day is higher there
  • It reduces bug-fix time, as we need not hop between multiple repositories and wait for PR approvals

maybe we can schedule a call whenever you're free?

Sure @SridharRambhatla , how about we communicate through email for that? Please e-mail [email protected]

@arjunsuresh
Contributor

@anandhu-eng sure. If you feel that adds convenience, then that's fine.

@anandhu-eng
Contributor

recheck

@anandhu-eng marked this pull request as draft on July 22, 2025, 16:27