Describe the bug
convert_megatron_to_hf.py fails on the current main branch with "Default process group has not been initialized".
Steps/Code to reproduce bug
uv run --extra mcore python examples/converters/convert_megatron_to_hf.py --config /path/to/config.yaml --megatron-ckpt-path /path/to/policy/weights/iter_0000000 --hf-ckpt-path /path/to/hf
Expected behavior
The conversion should complete and write the HF checkpoint to --hf-ckpt-path. Instead, it fails with the following error:
[rank0]: Traceback (most recent call last):
[rank0]: File "/app/nemo-rl/examples/converters/convert_megatron_to_hf.py", line 74, in <module>
[rank0]: main()
[rank0]: File "/app/nemo-rl/examples/converters/convert_megatron_to_hf.py", line 65, in main
[rank0]: export_model_from_megatron(
[rank0]: File "/app/nemo-rl/nemo_rl/models/megatron/community_import.py", line 109, in export_model_from_megatron
[rank0]: bridge.save_hf_pretrained(megatron_model, output_path)
[rank0]: File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 386, in save_hf_pretrained
[rank0]: self.save_hf_weights(model, path, show_progress)
[rank0]: File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/auto_bridge.py", line 434, in save_hf_weights
[rank0]: self.hf_pretrained.state.source.save_generator(generator, path)
[rank0]: File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/hf_pretrained/state.py", line 729, in save_generator
[rank0]: for name, tensor in generator:
[rank0]: ^^^^^^^^^
[rank0]: File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/model_bridge.py", line 573, in stream_weights_megatron_to_hf
[rank0]: conversion_tasks = self.build_conversion_tasks(hf_pretrained, megatron_model)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/nemo-rl/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge/src/megatron/bridge/models/conversion/model_bridge.py", line 781, in build_conversion_tasks
[rank0]: pp_rank = parallel_state.get_pipeline_model_parallel_rank()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/nemo-rl/3rdparty/Megatron-LM-workspace/Megatron-LM/megatron/core/parallel_state.py", line 1536, in get_pipeline_model_parallel_rank
[rank0]: rank = torch.distributed.get_rank()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/nemo_rl_venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2291, in get_rank
[rank0]: default_pg = _get_default_group()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/nemo_rl_venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 1298, in _get_default_group
[rank0]: raise ValueError(
[rank0]: ValueError: Default process group has not been initialized, please make sure to call init_process_group.
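For reference, below is a minimal sketch of the initialization that the traceback suggests never happens before bridge.save_hf_pretrained() calls parallel_state.get_pipeline_model_parallel_rank(). The helper name and its placement are assumptions for illustration only, not the project's actual fix; running it as a single-process workaround is untested on my side:

# Sketch only: initialize a 1-rank default process group plus Megatron
# parallel state so parallel_state.get_pipeline_model_parallel_rank()
# can resolve a rank. Names and placement are hypothetical.
import os
import torch.distributed as dist
from megatron.core import parallel_state

def ensure_single_process_group() -> None:
    if not dist.is_initialized():
        # Single-process defaults; a real launcher (e.g. torchrun) would set these.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group(backend="gloo", rank=0, world_size=1)
    if not parallel_state.model_parallel_is_initialized():
        parallel_state.initialize_model_parallel(
            tensor_model_parallel_size=1,
            pipeline_model_parallel_size=1,
        )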
Environment overview (please complete the following information)
- Environment location: Docker
- Method of install: Building the release Docker container (https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile)
- If method of install is [Docker], provide the docker pull & docker run commands used
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Additional context
Add any other context about the problem here.
Example: GPU model