add stuff

terrykong · terrykong · commit 89a0b3162e70 · 2025-09-28T21:59:10.000-07:00
Signed-off-by: Terry Kong <terryk@nvidia.com>
diff --git a/docs/testing.md b/docs/testing.md
@@ -181,18 +181,119 @@ in Docker with this script:
 CONTAINER=... bash run_functional_in_docker.sh functional/sft.sh
 ```
 
+The required `CONTAINER` can be built by following the instructions in the [Docker documentation](docker.md).
+
+## Bisecting Failing Tests
+
+### Bisecting Unit/Functional Tests
+
+Use `tools/bisect-run.sh` to automatically run your test command across a commit range and find the first bad commit. It forces venv rebuilds so dependencies match each commit.
+
+Basic usage:
+
+```sh
+GOOD=<good_ref> BAD=<bad_ref> \
+  tools/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py::test_case
+```
+
+Examples:
+
+```sh
+GOOD=56a6225 BAD=32faafa \
+  tools/bisect-run.sh uv run --group dev pre-commit run --all-files
+
+GOOD=464ed38 BAD=c843f1b \
+  tools/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py
+```
+
+Notes:
+
+- Exit codes drive the classification: 0 = good, 125 = skip, any other code from 1 to 127 = bad. Codes above 127 abort the bisect.
+- The script pre-verifies that `GOOD` is actually good by running your command on it.
+- On failure or interruption, it saves a timestamped `git bisect log` to `<repo>/bisect-logs/`. You can resume later with `BISECT_REPLAY_LOG` (see below).
+- Set `BISECT_NO_RESET=1` to keep the bisect state after the script exits.
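
The exit-code convention above can be sketched as a small wrapper. This is a minimal illustration, not part of `tools/bisect-run.sh`: the function name `bisect_classify` is hypothetical, and folding every failure into exit code 1 is an assumed design choice.

```sh
# Hypothetical helper: map a test command's outcome onto the exit codes
# that `git bisect run` understands.
# Usage: bisect_classify <required_path> <command> [args ...]
bisect_classify() {
  _bc_required_path=$1
  shift
  # The required file does not exist at this commit, so the commit cannot
  # be judged: ask bisect to skip it.
  if [ ! -e "$_bc_required_path" ]; then
    return 125
  fi
  _bc_rc=0
  "$@" || _bc_rc=$?
  # Assumed policy: fold every failure (including a stray 125 or a
  # signal-style code above 127) into 1 so the commit is marked bad
  # instead of being skipped or aborting the bisect.
  if [ "$_bc_rc" -ne 0 ]; then
    return 1
  fi
  return 0
}
```

A wrapper script could call `bisect_classify tests/unit/test_foobar.py uv run --group test pytest tests/unit/test_foobar.py` as its last command, so that commits predating the test file are skipped rather than marked bad.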
+
+Resume from a saved bisect log:
+
+```sh
+BISECT_REPLAY_LOG=/abs/path/to/bisect-2025....log \
+  tools/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py
+```
+
+### Bisecting nightlies
 
-## Static Type Checking with [MyPy](https://mypy-lang.org/)
-Static type checking can be run with no GPU resources:
+Nightly training scripts can be bisected using the same driver plus a helper that sets up hermetic runs on Slurm.
+
+Vanilla flow:
+
+```sh
+# Copy bisect utilities outside of VCS to ensure a stable runner
+rsync -ahP --delete tools/ tools.bisect/
+
+TEST_CASE=tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
+
+HF_HOME=... \
+HF_DATASETS_CACHE=... \
+CONTAINER=... \
+MOUNTS=... \
+ACCOUNT=... \
+PARTITION=... \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
+
+::::{note}
+The command `GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE")` selects the commit that introduced the test script. Because the path is typically added only once, this yields the introduction commit to use as the known good baseline.
+::::
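
If the path was ever deleted and re-added, the `git log` command prints more than one hash and `GOOD` would no longer hold a single ref. A minimal guard, assuming the most recent addition is the baseline you want (the function name `pick_good_ref` is hypothetical):

```sh
# Hypothetical guard: read `git log --format="%h" --diff-filter=A` output
# on stdin and print exactly one ref.
pick_good_ref() {
  # git log lists commits newest first, so the first line is the most
  # recent commit that added the path.
  _pg_ref=$(head -n 1)
  if [ -z "$_pg_ref" ]; then
    echo "pick_good_ref: no commit adds this path" >&2
    return 1
  fi
  printf '%s\n' "$_pg_ref"
}
```

It would slot into the flow above as `GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE" | pick_good_ref)`.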
+
+- `tools.bisect/launch-bisect.sh` ensures each commit runs in a fresh venv, creates an isolated code snapshot per commit, blocks until metrics are checked, and returns a suitable exit code for bisect.
+
+Progressively more advanced cases:
+
+1) Adjusting the test case on the fly with `SED_CLAUSES`
+
+- If a test script needs small textual edits during the bisect (e.g., to relax a threshold, or to drop a noisy metric that is irrelevant when you are bisecting convergence rather than perf), provide a sed script via `SED_CLAUSES`. You can also use it to adjust runtime controls such as `MAX_STEPS`, `STEPS_PER_RUN`, or `NUM_MINUTES`, so that runs slowed down by a perf regression still complete and emit metrics. The helper applies the sed script and automatically restores the test script after the run.
+
+```sh
+SED_CLAUSES=$(cat <<'SED'
+s#mean(data\["timing/train/total_step_time"\], -6, -1) < 0\.6#mean(data["timing/train/total_step_time"], -6, -1) < 0.63#
+/ray\/node\.0\.gpu\.0\.mem_gb/d
+SED
+) \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
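
Before starting a long bisect, it can help to preview what the clauses do to the test script. The sketch below assumes the clauses are valid as a single `sed -e` script; how the helper itself invokes sed is not shown here, so treat this as an approximation. The function name `preview_sed_clauses` is hypothetical.

```sh
# Hypothetical preview: print the test script as it would look after the
# sed clauses are applied. The file on disk is left untouched.
# Usage: preview_sed_clauses <script_path> <sed_clauses>
preview_sed_clauses() {
  _ps_path=$1
  _ps_clauses=$2
  # A multi-line -e argument is treated as one sed command per line.
  sed -e "$_ps_clauses" "$_ps_path"
}
```

With `SED_CLAUSES` set in the current shell, `preview_sed_clauses "$TEST_CASE" "$SED_CLAUSES" | diff "$TEST_CASE" -` shows only the lines that would change. An empty diff suggests a clause did not match.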
+
+2) Passing extra script arguments
+
+- If the nightly script supports Hydra/CLI overrides, pass them via `EXTRA_SCRIPT_ARGS` so each run adopts those overrides (e.g., to work around a transient incompatibility):
+
+:::{important}
+Changing script arguments can materially affect performance characteristics and/or convergence behavior. This may influence the validity of the bisect outcome relative to your baseline configuration. Prefer the smallest, clearly-justified overrides, keep them consistent across all commits, and document them alongside your results so conclusions are interpreted correctly.
+:::
 
 ```sh
-uv run --group test mypy {program}.py
+EXTRA_SCRIPT_ARGS="++data.num_workers=1" \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
 ```
 
-For example,
+3) Resuming from an earlier interrupted or misclassified session
+
+- Use `BISECT_REPLAY_LOG` with the bisect driver to replay prior markings and continue running. This is handy if a run failed for an unrelated reason or you manually edited a log to change `bad` → `skip` or to drop an incorrect line.
+
 ```sh
-uv run --group test mypy examples/run_grpo_math.py
-uv run --group test mypy examples/run_sft.py
+BISECT_REPLAY_LOG=/abs/path/to/bisect-logs/bisect-YYYYmmdd-HHMMSS-<sha>.log \
+HF_HOME=... HF_DATASETS_CACHE=... CONTAINER=... MOUNTS=... ACCOUNT=... PARTITION=... \
+  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
 ```
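
Editing a saved log by hand can be error-prone. As a sketch, assuming the log uses the standard `git bisect log` format in which each marking appears as a line such as `git bisect bad <sha>`, a helper like the hypothetical `mark_skip_in_log` rewrites one marking and prints the edited log:

```sh
# Hypothetical helper: print a copy of a bisect log in which the commit
# matching <sha_prefix> is marked `skip` instead of `bad`.
# Usage: mark_skip_in_log <log_path> <sha_prefix>
mark_skip_in_log() {
  _ms_log=$1
  _ms_sha=$2
  # Accept only lowercase hex so the prefix is safe to embed in a regex.
  case $_ms_sha in
    '' | *[!0-9a-f]*)
      echo "mark_skip_in_log: expected a hex sha prefix" >&2
      return 2
      ;;
  esac
  # Only command lines are rewritten; '# bad: ...' comment lines stay as-is.
  sed "s/^git bisect bad \(${_ms_sha}[0-9a-f]*\)\$/git bisect skip \1/" "$_ms_log"
}
```

Redirect the output to a new file and pass that file as `BISECT_REPLAY_LOG`, keeping the original log as a backup.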
 
-mypy.ini controls the configuration of mypy.
+Tips and conventions:
+
+- Exit code 125 means “skip this commit” in git bisect; our helper returns 125 if required environment variables are missing or if it needs to abort safely.
+- Submodules must be clean. The bisect script enforces `submodule.recurse=true` and `fetch.recurseSubmodules=on-demand` so submodules follow commit checkouts.
+- Each commit uses a fresh code snapshot directory and a separate Megatron checkpoint dir to avoid cross-commit contamination.
+- On failure/interrupt, a timestamped bisect log is saved under `<repo>/bisect-logs/`. Use it with `BISECT_REPLAY_LOG` to resume.
diff --git a/tools/bisect-run.sh b/tools/bisect-run.sh
@@ -20,7 +20,7 @@ set -euo pipefail
 export NRL_FORCE_REBUILD_VENVS=true
 print_usage() {
   cat <<'EOF'
-Usage: GOOD=<good_ref> BAD=<bad_ref> tools/bisect-script.sh [command ...]
+Usage: GOOD=<good_ref> BAD=<bad_ref> tools/bisect-run.sh [command ...]
 
 Runs a git bisect session between GOOD and BAD to find the first bad commit.
 Sets NRL_FORCE_REBUILD_VENVS=true to ensure test environments are rebuilt to match commit's uv.lock.
@@ -30,8 +30,8 @@ commit to verify it actually passes. If it does not, the script aborts early so
 you can pick a truly good baseline.
 
 Examples:
-  GOOD=56a6225 BAD=32faafa tools/bisect-script.sh uv run --group dev pre-commit run --all-files
-  GOOD=464ed38 BAD=c843f1b tools/bisect-script.sh uv run --group test pytest tests/unit/test_foobar.py
+  GOOD=56a6225 BAD=32faafa tools/bisect-run.sh uv run --group dev pre-commit run --all-files
+  GOOD=464ed38 BAD=c843f1b tools/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py
 
   # Example output:
   #    1. Will run until hits the first bad commit.
@@ -85,7 +85,7 @@ SED
 ) \
 GOOD=$(git log --format="%h" --diff-filter=A -- $TEST_CASE) \
 BAD=5b9ab15799c35428c557ab6f8687ec461b69383e \
-  tools.bisect/bisect-script.sh tools.bisect/launch-bisect-helper.sh $TEST_CASE
+  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh $TEST_CASE
 
 Requirements (ensure submodules update when switching commits):
   Per-repo (recommended inside this repo):
@@ -111,7 +111,7 @@ Additional features:
     to '<repo_root>/bisect-logs/'. Override with BISECT_SAVE_DIR.
   - Resume from a prior bisect log via replay:
         BISECT_REPLAY_LOG=/path/to/bisect-YYYYmmdd-HHMMSS-<sha>.log \
-          tools.bisect/bisect-script.sh [command ...]
+          tools.bisect/bisect-run.sh [command ...]
     This will 'git bisect replay' the provided log, then continue with 'git bisect run'.
   - Set BISECT_NO_RESET=1 to keep the bisect state after the script exits.
     By default, the script resets the bisect on exit.
@@ -214,7 +214,7 @@ on_interrupt_or_error() {
       local saved
       saved=$(save_bisect_log "interrupt") || true
       if [[ -n "$saved" ]]; then
-        iecho "[bisect] To resume later: BISECT_REPLAY_LOG=$saved <other_env_vars>... tools.bisect/bisect-script.sh ${USER_CMD[@]}"
+        iecho "[bisect] To resume later: BISECT_REPLAY_LOG=$saved <other_env_vars>... tools.bisect/bisect-run.sh ${USER_CMD[@]}"
       fi
       iecho "[bisect] Restoring original state with 'git bisect reset' on exit."
     fi
@@ -362,7 +362,7 @@ fi
 if [[ $RUN_STATUS -ne 0 ]]; then
   saved_after_run=$(save_bisect_log "run-exit-${RUN_STATUS}") || true
   if [[ -n "$saved_after_run" ]]; then
-    iecho "[bisect] To resume later: BISECT_REPLAY_LOG=$saved_after_run tools.bisect/bisect-script.sh ${USER_CMD[@]}"
+    iecho "[bisect] To resume later: BISECT_REPLAY_LOG=$saved_after_run tools.bisect/bisect-run.sh ${USER_CMD[@]}"
   fi
 fi
 
diff --git a/tools/launch-bisect.sh b/tools/launch-bisect.sh