From 9f96ca54d176ce9b439b8da5f49686f8ad053dad Mon Sep 17 00:00:00 2001
From: v-shobhit <161510941+v-shobhit@users.noreply.github.com>
Date: Mon, 5 May 2025 21:39:24 -0400
Subject: [PATCH] Update README.md

Cosmetic changes for ease of reproduction
---
 closed/NVIDIA/README.md | 66 ++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 24 deletions(-)

diff --git a/closed/NVIDIA/README.md b/closed/NVIDIA/README.md
index 3ae3ffebc..798bbc704 100644
--- a/closed/NVIDIA/README.md
+++ b/closed/NVIDIA/README.md
@@ -1,9 +1,12 @@
 # MLPerf Inference v5.0 NVIDIA-Optimized Implementations
 
 This is a repository of NVIDIA-optimized implementations for the [MLPerf](https://mlcommons.org/en/) Inference Benchmark.
-This README is a quickstart tutorial on how to use our code as a public / external user.
+This README is a quickstart tutorial on how to use our code as a public / external user.
+TL;DR: For quick steps to reproduce the benchmark, skip ahead to [Quick repro steps](#quick-repro-steps).
 
 ---
+
+
 ### MLPerf Inference Policies and Terminology
 This is a new-user guide to learn how to use NVIDIA's MLPerf Inference submission repo. **To get started with MLPerf Inference, first familiarize yourself with the [MLPerf Inference Policies, Rules, and Terminology](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc)**. This is a document from the MLCommons committee that runs the MLPerf benchmarks, and the rest of all MLPerf Inference guides will assume that you have read and familiarized yourself with its contents.
 The most important sections of the document to know are:
@@ -14,27 +17,6 @@ This is a new-user guide to learn how to use NVIDIA's MLPerf Inference submissio
 
 - [LoadGen Operation](https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#51-loadgen-operation)
-
-### Quick Start on computelab
-
-Rrequest gpu nodes on computelab
-
-- [Machines on computelab](https://confluence.nvidia.com/display/GCA/MLPerf-Inference+v5.0+Machines)
-
-`export MLPERF_SCRATCH_PATH=/path/to/scratch/space`: set mlperf scratch space
-
-`make prebuild`: builds and launch the container.
-
-`make build`: builds plugins and binaries.
-
-`make generate_engines RUN_ARGS="--benchmarks= --scenarios=`: generates engines.
-
-`make run_harness RUN_ARGS="--benchmarks= --scenarios=`: runs the harness to get perf results.
-
-`make run_harness RUN_ARGS="--benchmarks= --scenarios= --test_mode=AccuracyOnly`: runs the harness to get accuracy results.
-
-Add --config_ver=high_accuracy to run with high accuracy target.
-
 ### NVIDIA's Submission
 
 NVIDIA submits with multiple systems, each of which are in either the datacenter category, edge category, or both. In general, multi-GPU systems are submitted in datacenter, and single-GPU systems are submitted in edge.
 
@@ -62,8 +44,6 @@ Make sure that your user is in docker group already. If you get permission issue
 
 ### Software Dependencies
 
-### Datacenter systems
-
 Our submission uses Docker to set up the environment. Requirements are:
 
 - [Docker CE](https://docs.docker.com/engine/install/)
@@ -579,3 +559,41 @@ More specific documentation and for debugging:
 - documentation/submission_guide.md - Documentation on officially submitting our repo to MLPerf Inference
 - documentation/calibration.md - Documentation on how we use calibration and quantization for MLPerf Inference
+
+### Quick repro steps
+1. From `repo_root/closed/NVIDIA`, run:
+```bash
+make prebuild
+```
+
+2. To build third-party software dependencies:
+```bash
+make build
+```
+Optionally, for the Triton harnesses:
+```bash
+make clone_triton && make build_triton
+```
+
+3. To build inference engines (taking `llama2-70b` as an example):
+```bash
+make generate_engines RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline,Server"
+```
+
+4. To run the benchmark:
+```bash
+make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline,Server --test_mode=PerformanceOnly" # Performance run
+make run_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline,Server --test_mode=AccuracyOnly" # Accuracy run
+```
+
+5. To run compliance tests:
+```bash
+make run_audit_harness RUN_ARGS="--benchmarks=llama2-70b --scenarios=Offline,Server"
+```
+
+#### More info:
+- [documentation/performance_tuning_guide.md](documentation/performance_tuning_guide.md) - Documentation related to tuning and benchmarks via configuration changes
+- [documentation/commands.md](documentation/commands.md) - Documentation on commonly used Make targets and RUN_ARGS options
+- [documentation/FAQ.md](documentation/FAQ.md) - An FAQ on common errors or issues that have popped up in the past
+- [documentation/submission_guide.md](documentation/submission_guide.md) - Documentation on officially submitting our repo to MLPerf Inference
+- [documentation/calibration.md](documentation/calibration.md) - Documentation on how we use calibration and quantization for MLPerf Inference