@terrykong terrykong commented Sep 29, 2025

Follow-up to #1215 to enable bisecting nightly tests.

Example 1 (SFT failures)

Here's an example invocation that found one regression with tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh:

rsync -ahP --delete tools/ tools.bisect/  # This copies bisect utilities outside of VCS so we always run the latest copy
TEST_CASE=tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh

HF_HOME=... \
HF_DATASETS_CACHE=... \
CONTAINER=... \
MOUNTS=... \
ACCOUNT=... \
PARTITION=... \
SED_CLAUSES=$(cat <<'SED'
s@'mean(data\["timing/train/total_step_time@#'mean(data["timing/train/total_step_time@
/ray\/node\.0\.gpu\.0\.mem_gb/d
SED
) \
EXTRA_SCRIPT_ARGS="++data.num_workers=1" \
  GOOD=$(git log --format="%h" --diff-filter=A -- $TEST_CASE) \
  BAD=HEAD \
  tools.bisect/bisect-script.sh tools.bisect/launch-bisect-helper.sh $TEST_CASE 2>&1 | tee -a bisect.log
https://wandb.ai/nvidia/nemo-rl?nw=fnhia71y43d

The invocation above produces this git bisect log:

# bad: [5b9ab15799c35428c557ab6f8687ec461b69383e] fix all logs glob
# good: [ac7469ffabf6eebe0b014b3baa04551474a3a66b] test: Add Megatron tests (#713)
git bisect start '5b9ab15799c35428c557ab6f8687ec461b69383e' 'ac7469ff'
# good: [5a9f7acc59ed70e6eb52dd065a55ec015c895204] feat: Expose async vLLM engine as HTTP server (#1110)
git bisect good 5a9f7acc59ed70e6eb52dd065a55ec015c895204
# good: [3a1ca3fee69ac139d2b68fef89b749200e6daa00] perf: Remove empty_cache for performance optimization (#1071)
git bisect good 3a1ca3fee69ac139d2b68fef89b749200e6daa00
# good: [ef60b3341c2ea1b6c3d046f2ea2e381e4535e54c] ci: Run nightly Github tests (#1172)
git bisect good ef60b3341c2ea1b6c3d046f2ea2e381e4535e54c
# good: [42aa41b6617b355865038ed24511118d4fb1c0d6] feat: add async RL support (#1098)
git bisect good 42aa41b6617b355865038ed24511118d4fb1c0d6
# bad: [c01f9d7ceb53a7f0246ae53c09ccb054cdcbcdd7] ci: Add status badge and prevent merging if no tests ran (#1192)
git bisect bad c01f9d7ceb53a7f0246ae53c09ccb054cdcbcdd7
# bad: [e22a340b515f2814b6b19e8d7805c94c15a46b6f] docs: Restructure README with backend-specific quick start and setup guides (#1091)
git bisect bad e22a340b515f2814b6b19e8d7805c94c15a46b6f
# bad: [051c2f761a0a4606517bfe3bff84ddcc9b3291ce] fix: Add check for world size and parallelism enabled (#1190)
git bisect bad 051c2f761a0a4606517bfe3bff84ddcc9b3291ce
# bad: [64ee0d030246d0ea04fc49e5a1513fe84082ee70] feat: support chat_template_kwargs in tokenizer config (#1165)
git bisect bad 64ee0d030246d0ea04fc49e5a1513fe84082ee70
# bad: [cde2acd6e4d9a9514ee4646f384b8aba3bcc8b62] perf: Add a field in SFT data config to modify num_workers for loading data (#1143)
git bisect bad cde2acd6e4d9a9514ee4646f384b8aba3bcc8b62
# first bad commit: [cde2acd6e4d9a9514ee4646f384b8aba3bcc8b62] perf: Add a field in SFT data config to modify num_workers for loading data (#1143)

The first bad commit (#1143) turned out to be flagged as a regression only because the metric check was too noisy. In particular, the check 'data["train/loss"]["250"] < 0.5' started failing once we began using num_workers=1, which changed the determinism of the run.
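The SED_CLAUSES value in the example 1 invocation deals with noisy checks in a similar spirit: the first clause comments out the total_step_time check and the second deletes any line mentioning ray/node.0.gpu.0.mem_gb. Below is a minimal sketch of that effect. It assumes the clauses are applied to the test script as an ordinary sed script; the `check_metrics` command, the thresholds, and the mem_gb expression are hypothetical stand-ins, not copied from the real test.

```shell
#!/bin/sh
set -eu

# Hypothetical excerpt of a test script's metric checks (not the real file).
checks=$(cat <<'EOF'
check_metrics \
    'data["train/loss"]["250"] < 0.5' \
    'max(data["ray/node.0.gpu.0.mem_gb"]) < 25' \
    'mean(data["timing/train/total_step_time"], 2) < 10'
EOF
)

# The two clauses from the example 1 invocation, unchanged.
SED_CLAUSES=$(cat <<'SED'
s@'mean(data\["timing/train/total_step_time@#'mean(data["timing/train/total_step_time@
/ray\/node\.0\.gpu\.0\.mem_gb/d
SED
)

# Assumption: the bisect helper feeds the clauses to sed as a script.
patched=$(printf '%s\n' "$checks" | sed -e "$SED_CLAUSES")
printf '%s\n' "$patched"
```

In the patched text the timing check is prefixed with `#` and the mem_gb line is gone, while the loss check is untouched.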

Example 2 (FP8 bisect)

rm -rf code_snapshots_bisect/
rsync -ahP --delete tools/ tools.bisect/  # This copies bisect utilities outside of VCS so we always run the latest copy
TEST_CASE=tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8.sh

HF_HOME=... \
HF_DATASETS_CACHE=... \
CONTAINER=... \
MOUNTS=... \
ACCOUNT=... \
PARTITION=... \
EXTRA_SCRIPT_ARGS="++data.num_workers=1 cluster.num_nodes=1" \
  GOOD=$(git log --format="%h" --diff-filter=A -- $TEST_CASE) \
  BAD=f521459c5848b0f7c804e8df2551242d96b48369 \
  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh $TEST_CASE 2>&1 | tee -a bisect.log

This bisect found that even the initial commit (GOOD, the commit that added the test script) failed.
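Both examples set GOOD with `git log --format="%h" --diff-filter=A -- $TEST_CASE`, which prints the abbreviated hash of the commit that added the test file. The sketch below reproduces that lookup in a throwaway repository. The repository contents, the commit messages, and the `tests/demo-case.sh` path are made up for illustration.

```shell
#!/bin/sh
set -eu

# Build a scratch repository so the lookup can be tried in isolation.
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git config commit.gpgsign false

# Throwaway identity so commits work without a global git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

TEST_CASE=tests/demo-case.sh   # hypothetical test path

echo "readme" > README
git add README
git commit -q -m "initial commit"

# This is the commit that adds the test file.
mkdir -p tests
echo "exit 0" > "$TEST_CASE"
git add "$TEST_CASE"
git commit -q -m "test: add demo case"
added=$(git rev-parse --short HEAD)

# A later modification must not be picked up by --diff-filter=A.
echo "exit 1" > "$TEST_CASE"
git add "$TEST_CASE"
git commit -q -m "fix: modify demo case"

# Same lookup as in the invocations above.
GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE")
echo "GOOD=$GOOD added=$added"
```

If a test file was ever deleted and re-added, this lookup would likely print more than one hash (newest first), so GOOD may need trimming to a single commit in that case.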

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 29, 2025
add passing foobar

Signed-off-by: Terry Kong <[email protected]>

add bisect

Signed-off-by: Terry Kong <[email protected]>

add better error

Signed-off-by: Terry Kong <[email protected]>

some error messagin

Signed-off-by: Terry Kong <[email protected]>

try --

Signed-off-by: Terry Kong <[email protected]>

ok try this

Signed-off-by: Terry Kong <[email protected]>

try visualize

Signed-off-by: Terry Kong <[email protected]>

could you try this

Signed-off-by: Terry Kong <[email protected]>

2==2

Signed-off-by: Terry Kong <[email protected]>

3==4

Signed-off-by: Terry Kong <[email protected]>

3==4 (n=5)

Signed-off-by: Terry Kong <[email protected]>

3==4 (n=8)

Signed-off-by: Terry Kong <[email protected]>

try echo

Signed-off-by: Terry Kong <[email protected]>

go

Signed-off-by: Terry Kong <[email protected]>

ok

Signed-off-by: Terry Kong <[email protected]>

copywrite bisect

Signed-off-by: Terry Kong <[email protected]>

get rid of foobar test

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

coderabbit

Signed-off-by: Terry Kong <[email protected]>

slurm bisect changes

Signed-off-by: Terry Kong <[email protected]>

comment fix

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

more info

Signed-off-by: Terry Kong <[email protected]>

submodules

Signed-off-by: Terry Kong <[email protected]>

add a check to see if submodule dirty and also clean the submodules
after the GOOD commit check

Signed-off-by: Terry Kong <[email protected]>

account for  NRL_MEGATRON_CHECKPOINT_DIR

Signed-off-by: Terry Kong <[email protected]>

chose a better dir location

Signed-off-by: Terry Kong <[email protected]>

switch to an explicit test case

Signed-off-by: Terry Kong <[email protected]>

add sed clause stuff

Signed-off-by: Terry Kong <[email protected]>

fix all logs glob

Signed-off-by: Terry Kong <[email protected]>

allow replay log

Signed-off-by: Terry Kong <[email protected]>

set bad commit to not head

Signed-off-by: Terry Kong <[email protected]>

put the mcore ckpt dir inside the experiment dir

Signed-off-by: Terry Kong <[email protected]>

fix up submodule reset within bisect

Signed-off-by: Terry Kong <[email protected]>

EXTRA_SCRIPT_ARGS

Signed-off-by: Terry Kong <[email protected]>

make check metrics standalone

Signed-off-by: Terry Kong <[email protected]>

check metrics docstring

Signed-off-by: Terry Kong <[email protected]>

add stuff

Signed-off-by: Terry Kong <[email protected]>