Added GitHub Actions to Test LLM Accuracy Scripts #2206


Draft · wants to merge 8 commits into master

Conversation

SridharRambhatla

For issue #1898

Added .github/workflows/llm_accuracy_script_test.yml, which tests the accuracy scripts of four LLM models.

Script Improvements

  • language/llama3.1-405b/evaluate-accuracy.py - Added mock dataset support
  • language/mixtral-8x7b/evaluate-accuracy.py - Added mock dataset + error handling
  • language/llama2-70b/evaluate-accuracy.py - Fixed pandas import

Models Tested

  • Llama3.1-405b: Multi-task evaluation (ROUGE, NIAH, QA)
  • Mixtral-8x7b: Three evaluation types (OpenOrca, GSM8K, MBXP)
  • Llama2-70b: Text generation quality (ROUGE metrics)
  • DeepSeek-R1: Academic benchmarks (import testing)

Testing

Install act locally (e.g. choco install act-cli on Windows), then run the jobs:

act -j test-llama3-accuracy
act -j test-mixtral-accuracy
act -j test-llama2-accuracy
act -j test-deepseek-accuracy

All tests pass successfully.
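
Roughly, each job follows the same shape as the simplified sketch below; the dependency list, helper script name, and evaluate-accuracy.py flags are placeholders rather than the exact contents of llm_accuracy_script_test.yml:

```yaml
# Simplified sketch of one job in .github/workflows/llm_accuracy_script_test.yml.
# Step names, dependencies, and script arguments below are placeholders.
name: Test LLM accuracy scripts
on: [pull_request]

jobs:
  test-llama3-accuracy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install dependencies  # assumed package set
        run: pip install pandas numpy rouge-score nltk evaluate
      - name: Generate sample accuracy log for Llama3.1  # mock data stand-in for a real run
        run: python generate_mock_accuracy_log.py  # hypothetical helper script
      - name: Run accuracy checker against the mock log
        # Flag names are placeholders; the script's real CLI may differ.
        run: |
          python language/llama3.1-405b/evaluate-accuracy.py \
            --mlperf-accuracy-file mock_results/mlperf_log_accuracy.json \
            --dataset-file mock_data/sample_dataset.pkl \
            --dtype int32
```

With act installed, running act -j test-llama3-accuracy executes just this job locally.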

@SridharRambhatla requested a review from a team as a code owner on June 10, 2025, 17:07
Contributor

github-actions bot commented Jun 10, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@arjunsuresh
Contributor

Thank you Sridhar for your contribution. Can you please sign the MLCommons CLA?

@anandhu-eng is it possible to replace the dummy accuracy text content in this PR with real values from the actual accuracy log?

@anandhu-eng
Contributor

Hi @arjunsuresh , since results for the LLM models are absent in our unofficial test submission repository, I think we have two options:

  1. Rerun any workflow and obtain the full accuracy log
  2. Obtain a truncated accuracy log from any of the previous submissions (with proper reference)

Also, to use an original accuracy log, the original datasets would have to be downloaded (I think the combined size of the three datasets would not exceed the storage provided by the GitHub runner).

How about using the MLCFlow accuracy script for testing, as it handles both the dataset download and the accuracy check with a single-line command?

@arjunsuresh
Contributor

Thanks @anandhu-eng. We probably don't need the full accuracy log, but, say, the accuracy log for 10 inputs and the expected accuracy metric value for those. That should be good enough to validate that the scripts are working fine. And yes, for the dependency part, we can make use of MLCFlow.

@anandhu-eng
Contributor

That seems doable, thanks @arjunsuresh

How do we proceed on this? @SridharRambhatla would you be interested in doing the changes? If yes, I can help with whatever information you need.

@SridharRambhatla
Author

Hi @anandhu-eng and @arjunsuresh , sure, I can make the required changes. Could you please share the accuracy logs and any other information that is needed?
We can also connect over Discord if needed; happy to hop on a call to get this right :)

Also, I'll sign the MLCommons CLA. Thanks!

@anandhu-eng
Contributor

anandhu-eng commented Jun 15, 2025

Hi @SridharRambhatla ,

In inference, we have an automation framework named MLCFlow which simplifies various stages of benchmark runs. When checking accuracy, we typically require the dataset, a set of libraries to run the accuracy checker, the accuracy log file, and sometimes the model itself. These can be considered the dependencies needed for performing an accuracy check.

You can visualize MLC scripts as individual scripts, each handling a specific task (e.g., one for downloading the dataset, another for running the checker, etc.). These scripts can be reused and called as dependencies from other scripts. You can find how scripts are used as dependencies for the LLaMA 2 accuracy check here.

Thanks to this modular design, users only need to call the top-level script to run the accuracy check. For example, refer to this usage.
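
Conceptually, a script's meta file just lists the other scripts it needs, something like the sketch below; the field names and tags are only indicative rather than copied from the repo, so please refer to the actual LLaMA 2 accuracy-check script for the real definition:

```yaml
# Rough idea only: an MLC script's meta file pulls other scripts in as dependencies.
# Field names and tags here are indicative; see the real accuracy-check script in
# mlperf-automations for the actual definition.
deps:
  - tags: get,python3                                   # sets up Python
  - tags: get,dataset,openorca                          # downloads the dataset
  - tags: get,generic-python-lib,_package.rouge_score   # library needed by the checker
```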

What needs to be done:

  • Get sample accuracy log files from previous submissions. Ensure proper attribution is given for each source.
  • Create separate directories for each benchmark and copy the respective mlperf_log_accuracy.json files into them.
  • Follow the automation command for the accuracy checker from each benchmark’s README and provide the directory path using the --result_dir argument (a rough example follows this list).
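
For example, a step for one benchmark could look roughly like the sketch below. Only --result_dir is the essential part here; the mlcr tags and paths are placeholders, and the exact command should be taken from the benchmark's README:

```yaml
# Hypothetical workflow step for one benchmark. The tags passed to mlcr are
# placeholders; use the real accuracy-checker command from the benchmark's README.
# Assumes MLCFlow (the mlcr CLI) and the automation scripts are already installed.
- name: Run MLCFlow accuracy checker for Llama2-70b
  run: |
    mlcr process,mlperf,accuracy,_open-orca \
      --result_dir=$GITHUB_WORKSPACE/accuracy-logs/llama2-70b
```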

@arjunsuresh , I have two proposals:

  1. How about we create this GitHub Action in the mlperf-automations repo and trigger it from here, similar to how it's done for submission generation? (A sketch of the workflow_call wiring follows this list.)
  2. Let’s skip calling the download-model dependency during accuracy checks for LLMs if the Hugging Face repo path is explicitly provided by the user. Previously, we used download-model to fetch tokenizers needed by the accuracy checker. But based on the recent PR, it seems we can just pass the Hugging Face repo path, which will only download the tokenizer and skip model weights.
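
To make proposal 1 concrete, the wiring could look roughly like this (file names, inputs, and job contents are placeholders; neither workflow exists yet):

```yaml
# Sketch of the reusable-workflow idea in proposal 1. File names, inputs, and
# job contents are placeholders; neither of these workflows exists yet.

# In mlcommons/mlperf-automations: .github/workflows/test-llm-accuracy.yml
name: Test LLM accuracy scripts (reusable)
on:
  workflow_call:
    inputs:
      benchmark:
        required: true
        type: string
jobs:
  accuracy-check:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the ${{ inputs.benchmark }} accuracy checker here"
---
# In mlcommons/inference: a thin caller workflow that triggers the one above
name: LLM accuracy checks
on: [pull_request]
jobs:
  llama2-70b:
    uses: mlcommons/mlperf-automations/.github/workflows/test-llm-accuracy.yml@main
    with:
      benchmark: llama2-70b
```

The caller in this repo stays a few lines long, while the actual job logic lives in mlperf-automations.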

Also, @SridharRambhatla — Discord might be a good option for further sync. You can join the Discord channel through this link.

@arjunsuresh
Contributor

@anandhu-eng yes, we can skip the download-model step.

But is there any advantage in having the GitHub Action under the mlperf-automations repository?

@SridharRambhatla
Author

Hi @anandhu-eng, got it. I'll take the mlperf_log_accuracy.json files you mentioned and add them to this workflow in place of the dummy data currently generated at the "Generate sample accuracy log for Llama3.1" step (and the same for the other models).
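
Roughly, the swap could look like this for each model, assuming the truncated logs get checked into the repo under something like tests/accuracy-logs/ (the path, step name, and result directory below are just placeholders):

```yaml
# Hypothetical replacement for the "Generate sample accuracy log for Llama3.1" step:
# copy a truncated, properly attributed log from the repo instead of generating dummy data.
# The tests/accuracy-logs/ location and the results/ directory are placeholders.
- name: Use sample accuracy log for Llama3.1
  run: |
    mkdir -p results/llama3_1-405b
    cp tests/accuracy-logs/llama3_1-405b/mlperf_log_accuracy.json results/llama3_1-405b/
```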

Looks like the channel invite is invalid now. I'm already part of the MLCommons server on Discord; maybe we can schedule a call whenever you're free? We can walk through the approach once and clear up any questions to get this right. Thanks!

@anandhu-eng
Contributor

But is there any advantage in having the GitHub Action under the mlperf-automations repository?

Hi @arjunsuresh , the following points make me lean toward keeping the tests in the mlperf-automations repo and triggering them through a workflow call whenever we want to use them in any other repository within the MLCommons organization:

  • mlperf-automations would be the umbrella repository that houses everything related to MLC scripts for MLCommons
  • We might be able to catch errors faster, as the number of PRs we do per day is higher there
  • It reduces bug-fix time, as we need not hop between multiple repositories and wait for PR approvals

maybe we can schedule a call whenever you're free?

Sure @SridharRambhatla , how about we communicate through email for that? Please e-mail [email protected]

@arjunsuresh
Contributor

@anandhu-eng sure. If you feel that adds convenience, then that's fine.

@anandhu-eng
Contributor

recheck

@anandhu-eng marked this pull request as draft on July 22, 2025, 16:27