
Conversation

@katec846 (Contributor) commented Oct 1, 2025

What does this PR do?

Adds a field to megatron_cfg to enable bias_activation_fusion.

Issues

This PR is related to #917: it reduces the train step time.
This flag, together with policy.megatron_cfg.apply_rope_fusion=True and policy.megatron_cfg.sequence_parallel=True, can yield a ~25% speedup.

Usage

  megatron_cfg:
    bias_activation_fusion: True 

or override the parameter on the command line:

++policy.megatron_cfg.bias_activation_fusion=True
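
For example, the three flags from the speedup note above can be passed together in one run (flag paths as given in this PR's description):

policy.megatron_cfg.apply_rope_fusion=True policy.megatron_cfg.sequence_parallel=True ++policy.megatron_cfg.bias_activation_fusion=True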

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Introduced a new configuration option: bias_activation_fusion for Megatron-based models.
    • Enabled bias-activation fusion in example configs to improve training efficiency.
  • Documentation
    • Added an inline note in the config explaining the expected training performance speedup.

@katec846 katec846 requested a review from guyueh1 October 1, 2025 01:14
@katec846 katec846 self-assigned this Oct 1, 2025
@katec846 katec846 requested review from a team as code owners October 1, 2025 01:14
coderabbitai bot (Contributor) commented Oct 1, 2025

📝 Walkthrough

Added a new bias_activation_fusion option to the SFT config and propagated it into the Megatron model configuration during policy worker initialization, alongside the existing apply_rope_fusion setting. No control-flow changes.

Changes

  • Config updates — examples/configs/sft.yaml: introduced megatron_cfg.bias_activation_fusion: True with a comment; retained the existing apply_rope_fusion.
  • Megatron policy worker config propagation — nemo_rl/models/policy/megatron_policy_worker.py: assigned model_cfg.bias_activation_fusion from the Megatron config during initialization, adjacent to the apply_rope_fusion assignment.
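
A minimal sketch of that propagation, assuming it mirrors the adjacent apply_rope_fusion assignment (the exact variable names in megatron_policy_worker.py may differ):

  # Hypothetical sketch; the surrounding code and names in the actual file
  # may differ. Values are read directly from the YAML-backed config, per
  # the repo guideline that YAML is the single source of truth for defaults.
  model_cfg.apply_rope_fusion = megatron_cfg["apply_rope_fusion"]
  model_cfg.bias_activation_fusion = megatron_cfg["bias_activation_fusion"]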

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately and concisely describes the primary change (adding a bias_activation_fusion field to megatron_cfg for performance), reflecting the main diff without unnecessary detail.
  • Test Results For Major Changes ✅ Passed — The PR only adds an optional configuration flag that forwards bias_activation_fusion into the Megatron policy setup without altering existing defaults or behavior; such a configuration-plumbing change is minor, so explicit test results are not required in the description.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/configs/sft.yaml (1)

93-94: Improve the comment to document the performance benefit and dependencies.

The comment should be more descriptive, matching the detail provided for apply_rope_fusion above. Per the PR description, this flag provides a ~25% speedup when used together with apply_rope_fusion=True and sequence_parallel=True.

Consider updating the comment:

-    # gives training perf speedup
-    bias_activation_fusion: True 
+    # Gives ~25% training perf speedup when combined with apply_rope_fusion=True and sequence_parallel=True
+    bias_activation_fusion: True

As per coding guidelines.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b50bfca and 9817e88.

📒 Files selected for processing (2)
  • examples/configs/sft.yaml (1 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults.
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml.

Files:

  • examples/configs/sft.yaml
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/policy/megatron_policy_worker.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/policy/megatron_policy_worker.py
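
A hypothetical illustration of the last two guidelines applied to this PR's key (the actual TypedDict name and neighboring fields in nemo_rl may differ):

  from typing import NotRequired, TypedDict

  class MegatronConfig(TypedDict):
      # Existing fusion flag; its default lives in the exemplar YAMLs, not here.
      apply_rope_fusion: bool
      # New key: fuses the bias add and the activation function into one
      # kernel for a training-perf speedup. Recommended default in the
      # exemplar YAMLs: True. Optional, hence NotRequired.
      bias_activation_fusion: NotRequired[bool]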
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

@guyueh1 guyueh1 requested a review from terrykong October 1, 2025 02:38
@guyueh1 (Contributor) commented Oct 1, 2025

@katec846 please also add this field to examples/configs/sft_openmathinstruct2_megatron.yaml and examples/configs/grpo_math_1B.yaml.

@katec846 katec846 force-pushed the add_bias_activation_fusion branch from 9817e88 to 55954aa on October 1, 2025 03:41
@katec846 katec846 requested a review from a team as a code owner October 1, 2025 03:41
@guyueh1 guyueh1 added the CI:L0 Run doctests and unit tests label Oct 1, 2025
@guyueh1 guyueh1 linked an issue Oct 6, 2025 that may be closed by this pull request
@katec846 katec846 force-pushed the add_bias_activation_fusion branch from 55954aa to f8eb594 on October 8, 2025 17:14
@katec846 katec846 requested a review from a team as a code owner October 8, 2025 17:14
@katec846 (Contributor, Author) commented Oct 8, 2025

@terrykong @guyueh1 can you help me review this PR?

@terrykong do you know how I can fix this pipeline error? I didn't see anything failing; the results are all either success or skipped.

@terrykong (Contributor) commented:

@katec846 it just means the tests weren't run (I just triggered them for you by adding the label). @chtruong814 maybe we need to give a better error message on the status-check job name in this case.

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Oct 8, 2025
@terrykong terrykong enabled auto-merge (squash) October 8, 2025 17:49
@terrykong (Contributor) commented:

Set to auto-merge pending @guyueh1's review.

guyueh1 previously approved these changes Oct 8, 2025
@guyueh1 (Contributor) commented Oct 8, 2025

@katec846 please also add the field to some of the default config dicts in the tests; refer to #1242.

@joyang-nv (Member) left a comment

Good stuff. I am involved as one of the automodel reviewers. I support this change, but I don't think I am the right person to approve.

@katec846 katec846 force-pushed the add_bias_activation_fusion branch from a8a2a65 to f547c0d on October 9, 2025 22:18
@katec846 katec846 requested a review from a team as a code owner October 9, 2025 22:18
@katec846 katec846 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 9, 2025
@terrykong terrykong merged commit d975e39 into NVIDIA-NeMo:main Oct 10, 2025
41 of 42 checks passed

Labels

CI:L1 Run doctests, unit tests, and functional tests

Development

Successfully merging this pull request may close these issues.

Performance issues for SFT with megatron backend

4 participants