megatron checkpoint conversion failed #1102

@xxman-google

Describe the bug

Running examples/converters/convert_megatron_to_hf.py fails on main.

Steps/Code to reproduce bug

uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
    --config /path/to/config.yaml \
    --megatron-ckpt-path /path/to/policy/weights/iter_0000000 \
    --hf-ckpt-path /path/to/hf

Expected behavior

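The conversion should complete and write the Hugging Face checkpoint to the --hf-ckpt-path directory. Instead, it fails with the following error: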

[rank0]: Traceback (most recent call last):
[rank0]:   File "/app/nemo-rl/examples/converters/convert_megatron_to_hf.py", line 74, in <module>
[rank0]:     main()
[rank0]:   File "/app/nemo-rl/examples/converters/convert_megatron_to_hf.py", line 65, in main
[rank0]:     export_model_from_megatron(
[rank0]:   File "/app/nemo-rl/nemo_rl/models/megatron/community_import.py", line 109, in export_model_from_megatron
[rank0]:     bridge.save_hf_pretrained(megatron_model, output_path)
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 386, in save_hf_pretrained
[rank0]:     self.save_hf_weights(model, path, show_progress)
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 434, in save_hf_weights
[rank0]:     self.hf_pretrained.state.source.save_generator(generator, path)
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/hf_pretrained/state.py", line 729, in save_generator
[rank0]:     for name, tensor in generator:
[rank0]:                         ^^^^^^^^^
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/model_bridge.py", line 573, in stream_weights_megatron_to_hf
[rank0]:     conversion_tasks = self.build_conversion_tasks(hf_pretrained, megatron_model)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/model_bridge.py", line 781, in build_conversion_tasks
[rank0]:     pp_rank = parallel_state.get_pipeline_model_parallel_rank()
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/parallel_state.py", line 1536, in get_pipeline_model_parallel_rank
[rank0]:     rank = torch.distributed.get_rank()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo_rl_venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2291, in get_rank
[rank0]:     default_pg = _get_default_group()
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/app/nemo_rl_venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1298, in _get_default_group
[rank0]:     raise ValueError(
[rank0]: ValueError: Default process group has not been initialized, please make sure to call init_process_group.
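
Possible workaround (untested against this repository): the ValueError says that torch.distributed.get_rank() was called before any default process group existed. A minimal sketch of creating a single-process group before calling export_model_from_megatron is shown below. The helper names single_process_env and ensure_default_process_group are hypothetical, the address and port defaults are assumptions, and the "gloo" backend is chosen only because it does not need a GPU. This covers only the default process group; Megatron's own parallel state may need separate initialization.

```python
import os


def single_process_env(environ=None):
    """Fill in the rendezvous variables that torch.distributed's env:// init reads.

    Existing values are kept; the defaults below are assumptions for a
    single-process run on one machine.
    """
    env = os.environ if environ is None else environ
    env.setdefault("MASTER_ADDR", "127.0.0.1")
    env.setdefault("MASTER_PORT", "29500")
    env.setdefault("RANK", "0")
    env.setdefault("WORLD_SIZE", "1")
    return env


def ensure_default_process_group(backend="gloo"):
    """Create the default process group if none exists yet (hypothetical helper)."""
    # Imported lazily so the environment helper can be used without torch installed.
    import torch.distributed as dist

    if dist.is_available() and not dist.is_initialized():
        env = single_process_env()
        dist.init_process_group(
            backend=backend,
            rank=int(env["RANK"]),
            world_size=int(env["WORLD_SIZE"]),
        )
```

If this approach applies, ensure_default_process_group() would be called in main() of convert_megatron_to_hf.py before the export step. Whether the converter is meant to initialize the process group itself, or is expected to be launched through a distributed launcher, is not something the traceback settles.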

