megatron checkpoint conversion failed

**Describe the bug**

Running `examples/converters/convert_megatron_to_hf.py` on `main` fails with `ValueError: Default process group has not been initialized`.

**Steps/Code to reproduce bug**

```
uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
    --config /path/to/config.yaml \
    --megatron-ckpt-path /path/to/policy/weights/iter_0000000 \
    --hf-ckpt-path /path/to/hf
```

**Expected behavior**

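The conversion should complete and write the Hugging Face checkpoint to the `--hf-ckpt-path` directory. Instead, it fails with the following error: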
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/app/nemo-rl/examples/converters/convert_megatron_to_hf.py", line 74, in <module>
[rank0]:     main()
[rank0]:   File "/app/nemo-rl/examples/converters/convert_megatron_to_hf.py", line 65, in main
[rank0]:     export_model_from_megatron(
[rank0]:   File "/app/nemo-rl/nemo_rl/models/megatron/community_import.py", line 109, in export_model_from_megatron
[rank0]:     bridge.save_hf_pretrained(megatron_model, output_path)
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 386, in save_hf_pretrained
[rank0]:     self.save_hf_weights(model, path, show_progress)
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 434, in save_hf_weights
[rank0]:     self.hf_pretrained.state.source.save_generator(generator, path)
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/hf_pretrained/state.py", line 729, in save_generator
[rank0]:     for name, tensor in generator:
[rank0]:                         ^^^^^^^^^
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/model_bridge.py", line 573, in stream_weights_megatron_to_hf
[rank0]:     conversion_tasks = self.build_conversion_tasks(hf_pretrained, megatron_model)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/model_bridge.py", line 781, in build_conversion_tasks
[rank0]:     pp_rank = parallel_state.get_pipeline_model_parallel_rank()
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/parallel_state.py", line 1536, in get_pipeline_model_parallel_rank
[rank0]:     rank = torch.distributed.get_rank()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo_rl_venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2291, in get_rank
[rank0]:     default_pg = _get_default_group()
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo_rl_venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1298, in _get_default_group
[rank0]:     raise ValueError(
[rank0]: ValueError: Default process group has not been initialized, please make sure to call init_process_group.
```
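The traceback ends in `torch.distributed.get_rank()`, which raises because no default process group exists when `build_conversion_tasks` asks for the pipeline-parallel rank. As a possible workaround (not a confirmed fix), a single-process group could be created before the export runs. The sketch below is illustrative only: the helper names `single_process_dist_env` and `ensure_default_process_group`, the `gloo` backend, and the port are assumptions, and Megatron's own model-parallel state may need additional initialization that is not shown here.

```python
import os


def single_process_dist_env(port: int = 29500) -> dict:
    """Rendezvous settings for a one-process job (hypothetical helper; port is an arbitrary choice)."""
    return {
        "MASTER_ADDR": "127.0.0.1",
        "MASTER_PORT": str(port),
        "RANK": "0",
        "WORLD_SIZE": "1",
    }


def ensure_default_process_group(backend: str = "gloo") -> bool:
    """Create a world-size-1 default process group if none exists (hypothetical helper)."""
    # Imported lazily so the env helper above stays usable without torch installed.
    import torch.distributed as dist

    if dist.is_available() and not dist.is_initialized():
        # Keep any values already provided by a launcher; only fill in the gaps.
        for key, value in single_process_dist_env().items():
            os.environ.setdefault(key, value)
        dist.init_process_group(backend=backend, rank=0, world_size=1)
    return dist.is_available() and dist.is_initialized()
```

Calling `ensure_default_process_group()` at the top of `main()` in the converter script, before `export_model_from_megatron(...)`, would be one place to try this. Whether that alone is enough for the bridge to finish the export has not been verified.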

**Environment overview (please complete the following information)**

 - Environment location: Docker
 - Method of install: Building the Release docker container (https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile)



megatron checkpoint conversion failed #1102