Conversation

yaoyu-33 (Contributor) commented Oct 1, 2025

No description provided.

copy-pr-bot (bot) commented Oct 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines +172 to +188
```python
# Handle Hugging Face GenerationConfig / PretrainedConfig by converting to a callable dict
# compatible with our YAML representer logic
try:
    from transformers import GenerationConfig, PretrainedConfig  # type: ignore

    if isinstance(val_to_convert, (GenerationConfig, PretrainedConfig)):
        cfg_class = val_to_convert.__class__
        target = f"{inspect.getmodule(cfg_class).__name__}.{cfg_class.__qualname__}.from_dict"
        logger.debug(f"Converting {cfg_class.__qualname__} at {current_path} to callable dict")
        return {
            "_target_": target,
            "_call_": True,
            "config_dict": val_to_convert.to_dict(),
        }
except ModuleNotFoundError:
    # transformers is optional; if unavailable, fall through to other handlers
    pass
```
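
For context, a minimal sketch of the round trip this encoding enables, assuming a resolver that honors `_target_`/`_call_`; the module path shown is what `inspect.getmodule` would produce for `GenerationConfig`:

```python
# Hedged sketch: encode an HF GenerationConfig as a callable dict, then rebuild it.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(max_new_tokens=64, temperature=0.7)
encoded = {
    "_target_": "transformers.generation.configuration_utils.GenerationConfig.from_dict",
    "_call_": True,
    "config_dict": gen_cfg.to_dict(),
}
# A resolver that imports `_target_` and calls it would effectively do:
rebuilt = GenerationConfig.from_dict(encoded["config_dict"])
assert rebuilt.max_new_tokens == 64
```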
Contributor:

This should be handled here:

```python
# Try to add GenerationConfig representer if available
try:
    from transformers import GenerationConfig

    yaml.SafeDumper.add_representer(GenerationConfig, _generation_config_representer)
except ModuleNotFoundError:
    pass

# Try to add PretrainedConfig representer if available (generic for HF configs)
try:
    from transformers import PretrainedConfig

    # Use multi-representer so subclasses of PretrainedConfig are also handled
    yaml.SafeDumper.add_multi_representer(PretrainedConfig, _pretrained_config_representer)
except ModuleNotFoundError:
    pass
```

Were you seeing serialization issues outside of this?
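
The representer functions referenced above are not shown in this thread; a minimal sketch of what such a PyYAML representer could look like, assuming it serializes the config via `to_dict()`:

```python
import yaml

def _pretrained_config_representer(dumper: yaml.Dumper, data) -> yaml.Node:
    # Hypothetical body: emit the HF config object as a plain YAML mapping.
    return dumper.represent_dict(data.to_dict())
```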

Author (yaoyu-33):

It's because we are using transformers' model config inside our config; that's what is causing the issues here.

Author (yaoyu-33):

I don't think the YAML representer would work here; this path isn't for saving, it's just for OmegaConf creation.
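
A minimal illustration of that failure mode, assuming standard OmegaConf behavior (it rejects arbitrary objects inside containers):

```python
from omegaconf import OmegaConf
from transformers import GenerationConfig

raw = {"generation_config": GenerationConfig(max_new_tokens=8)}
# OmegaConf.create() accepts primitives, dicts/lists, and dataclasses, so the
# raw HF object would raise omegaconf.errors.UnsupportedValueType:
# OmegaConf.create(raw)
# After the callable-dict conversion the structure is plain data and works:
ok = OmegaConf.create({
    "generation_config": {
        "_target_": "transformers.generation.configuration_utils.GenerationConfig.from_dict",
        "_call_": True,
        "config_dict": raw["generation_config"].to_dict(),
    }
})
```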

@@ -0,0 +1,233 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor:

Should we offer higher-level functions like pretrain_gpt / finetune_gpt / pretrain_qwen25_vl / finetune_qwen25_vl that automatically select the right step for users? Or how else should we associate the custom forward step function with the recipe? Otherwise it might not be obvious to users that an implementation exists.

Previously this was bound to the model config, but we want to keep the model providers independent of the training side.

Author (yaoyu-33):

Good point. What do you think about having the recipe return both the cfg and the step function?

Contributor:

Returning the config and the step function would be a breaking change for existing users.
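
One non-breaking alternative suggested by this thread: keep recipes returning only the config and expose the step function through a separate lookup. A sketch, with all names hypothetical rather than the repo's actual API:

```python
from typing import Callable, Dict, Optional

# Hypothetical registry associating recipe names with custom forward steps.
_FORWARD_STEPS: Dict[str, Callable] = {}

def register_forward_step(recipe_name: str) -> Callable:
    def decorator(fn: Callable) -> Callable:
        _FORWARD_STEPS[recipe_name] = fn
        return fn
    return decorator

def get_forward_step(recipe_name: str) -> Optional[Callable]:
    # Returns None for recipes using the default step, so existing callers
    # that never ask for a custom step are unaffected.
    return _FORWARD_STEPS.get(recipe_name)

@register_forward_step("pretrain_qwen25_vl")
def qwen25_vl_forward_step(state, data_iterator, model):
    ...
```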

Comment on lines 25 to 28
```python
from megatron.bridge.training.gpt_step import (
    _create_loss_function,
    get_packed_seq_params,
)
```
Contributor:

We can split these functions out of gpt_step to indicate they are more generic.

Author (yaoyu-33):

I feel gpt_step is the general one lol.

Author (yaoyu-33):

Moved them out to utils.
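
A sketch of how such a move can stay non-breaking, with the utils module path assumed for illustration:

```python
# megatron/bridge/training/gpt_step.py (hypothetical post-move shim):
# thin re-exports keep existing `from ...gpt_step import ...` lines working
# after the functions live in the shared utils module.
from megatron.bridge.training.utils import (  # noqa: F401
    _create_loss_function,
    get_packed_seq_params,
)
```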

```python
def forward_step(
    state: GlobalState, data_iterator: Iterable, model: GPTModel, return_schedule_plan: bool = False
```
Contributor:

Is it still a GPTModel?
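
If the answer is no, one option is to widen the annotation to Megatron's common module base class. A sketch; the `MegatronModule` import path is Megatron-Core's, while the `GlobalState` path is assumed:

```python
from typing import Iterable

from megatron.bridge.training.state import GlobalState  # path assumed
from megatron.core.transformer.module import MegatronModule

def forward_step(
    state: GlobalState,
    data_iterator: Iterable,
    model: MegatronModule,  # was GPTModel; also covers VL model variants
    return_schedule_plan: bool = False,
):
    ...
```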
