Qwen 25vl training pipeline #830
base: main
Conversation
```python
# Handle Hugging Face GenerationConfig / PretrainedConfig by converting to a callable dict
# compatible with our YAML representer logic
try:
    from transformers import GenerationConfig, PretrainedConfig  # type: ignore

    if isinstance(val_to_convert, (GenerationConfig, PretrainedConfig)):
        cfg_class = val_to_convert.__class__
        target = f"{inspect.getmodule(cfg_class).__name__}.{cfg_class.__qualname__}.from_dict"
        logger.debug(f"Converting {cfg_class.__qualname__} at {current_path} to callable dict")
        return {
            "_target_": target,
            "_call_": True,
            "config_dict": val_to_convert.to_dict(),
        }
except ModuleNotFoundError:
    # transformers is optional; if unavailable, fall through to other handlers
    pass
```
This should be handled here:
Megatron-Bridge/src/megatron/bridge/utils/yaml_utils.py, lines 62 to 77 in f9aad1f:
```python
# Try to add GenerationConfig representer if available
try:
    from transformers import GenerationConfig

    yaml.SafeDumper.add_representer(GenerationConfig, _generation_config_representer)
except ModuleNotFoundError:
    pass

# Try to add PretrainedConfig representer if available (generic for HF configs)
try:
    from transformers import PretrainedConfig

    # Use multi-representer so subclasses of PretrainedConfig are also handled
    yaml.SafeDumper.add_multi_representer(PretrainedConfig, _pretrained_config_representer)
except ModuleNotFoundError:
    pass
```
Were you seeing serialization issues outside of this?
It's because we are using transformers' model config inside our config; that's what's causing issues here.
I don't think the YAML representer would work here; this path isn't for saving, it's only for OmegaConf creation.
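To make the distinction concrete, here is a minimal sketch (not the PR's exact helper) of why a YAML representer alone doesn't cover this path: the representer only matters when dumping YAML, whereas the conversion in the diff above is needed so the value can be placed inside an OmegaConf container at all. The `_target_` / `_call_` dict layout mirrors the diff; everything else is illustrative.

```python
# Minimal sketch: OmegaConf rejects raw transformers config objects, so the
# value has to be converted to a plain dict before OmegaConf creation; a YAML
# representer only helps later, when dumping to YAML.
import inspect

from omegaconf import OmegaConf
from omegaconf.errors import UnsupportedValueType
from transformers import GenerationConfig

gen_cfg = GenerationConfig(max_new_tokens=32)

try:
    # Fails: OmegaConf only accepts primitives, containers, and structured configs.
    OmegaConf.create({"generation_config": gen_cfg})
except UnsupportedValueType:
    print("raw GenerationConfig is rejected by OmegaConf")

# Converting to a callable dict (same layout as the diff above) keeps the value
# OmegaConf-friendly and re-instantiable later via GenerationConfig.from_dict.
cfg_class = gen_cfg.__class__
as_dict = {
    "_target_": f"{inspect.getmodule(cfg_class).__name__}.{cfg_class.__qualname__}.from_dict",
    "_call_": True,
    "config_dict": gen_cfg.to_dict(),
}
conf = OmegaConf.create({"generation_config": as_dict})
print(OmegaConf.to_yaml(conf))
```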
```
@@ -0,0 +1,233 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
Should we offer higher-level functions like `pretrain_gpt` / `finetune_gpt` / `pretrain_qwen25_vl` / `finetune_qwen25_vl` that automatically select the right step function for users? Or how else should we associate the custom forward step function with the recipe? Otherwise it might not be obvious to users that an implementation exists.
Previously this was bound to the model config, but we want to keep the model providers independent of the training side.
Good point. What do you think about having the recipe return both the cfg and the step function?
Returning the config and the step function would be a breaking change for existing users.
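One non-breaking option, sketched below under assumed names: existing recipes keep returning only the config, and a new thin wrapper pairs the Qwen2.5-VL recipe with its custom forward step so users don't have to discover it themselves. The module and function names here (`pretrain_qwen25_vl`, `qwen25_vl_pretrain_config`, `qwen25_vl_forward_step`, `pretrain`) are illustrative assumptions, not the actual Megatron-Bridge API.

```python
# Hedged sketch of a non-breaking option: the recipe still returns only the
# config, and an explicitly named entry point binds the VL-specific step.
# NOTE: all module/function names below are hypothetical placeholders.

def pretrain_qwen25_vl(**overrides):
    from megatron.bridge.recipes.qwen_vl import qwen25_vl_pretrain_config  # hypothetical
    from megatron.bridge.training.pretrain import pretrain  # hypothetical entry point
    from megatron.bridge.training.qwen_vl_step import qwen25_vl_forward_step  # hypothetical

    cfg = qwen25_vl_pretrain_config(**overrides)
    # The wrapper, not the model provider, binds the VL-specific step function,
    # so model providers stay independent of the training side.
    return pretrain(cfg, forward_step_func=qwen25_vl_forward_step)
```

This would keep the recipe signature unchanged for existing GPT users while still making the VL step discoverable.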
```python
from megatron.bridge.training.gpt_step import (
    _create_loss_function,
    get_packed_seq_params,
)
```
We can split these functions out of `gpt_step` to indicate they are more generic.
I feel `gpt_step` is the general one lol
Moved them out to utils.
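For illustration, the updated import after the move might look like the sketch below; the exact utils module path is an assumption, not the final location in the PR.

```python
# Sketch only: the destination module path is assumed, not confirmed by the PR.
from megatron.bridge.training.utils.step_utils import (  # hypothetical path
    _create_loss_function,
    get_packed_seq_params,
)
```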
```python
def forward_step(
    state: GlobalState, data_iterator: Iterable, model: GPTModel, return_schedule_plan: bool = False
```
Is it still a `GPTModel`?
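If the VL pipeline drives its own model class, the signature could say so explicitly; a rough sketch follows, where `Qwen25VLModel` is a placeholder name rather than a confirmed class in this PR.

```python
# Rough sketch of a more explicit type hint; Qwen25VLModel is a placeholder
# name, and string annotations avoid importing the heavy model classes here.
from typing import Iterable, Union


def forward_step(
    state: "GlobalState",
    data_iterator: Iterable,
    model: Union["GPTModel", "Qwen25VLModel"],  # placeholder union; actual VL class may differ
    return_schedule_plan: bool = False,
):
    ...
```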
No description provided.