Skip to content

Conversation

Edenzzzz
Copy link
Collaborator

@Edenzzzz Edenzzzz commented Jul 3, 2025

Part of #572. Please check that this doesn't cause multi-node training to hang.
cc @SolitaryThinker

@SolitaryThinker SolitaryThinker added the go Trigger Buildkite CI label Jul 3, 2025
@SolitaryThinker SolitaryThinker self-requested a review July 3, 2025 04:31
@SolitaryThinker
Copy link
Collaborator

need to rebase for the buildkite CI fix

@Edenzzzz Edenzzzz force-pushed the offload_text_enc branch from 6c5f09f to 62ec225 Compare July 3, 2025 17:12
@Edenzzzz
Copy link
Collaborator Author

Edenzzzz commented Jul 3, 2025

@SolitaryThinker Done

@SolitaryThinker
Copy link
Collaborator

thanks, testing now on multi-node

Copy link
Collaborator

@SolitaryThinker SolitaryThinker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the actual logic for offloading text encoders? I think you forgot to include that logic in this PR?


# Load the module
return loader.load(component_model_path, architecture, fastvideo_args)
return loader.load(component_model_path, architecture, fastvideo_args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this?


# choose `assign=True` since we cannot call `copy_` on meta tensor
return model.load_state_dict(sharded_sd, strict=strict, assign=True)
return model.load_state_dict(sharded_sd, strict=strict, assign=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here


# If there were no matches, return the untouched param name
return name
return name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

@Edenzzzz
Copy link
Collaborator Author

Edenzzzz commented Jul 3, 2025

where is the actual logic for offloading text encoders? I think you forgot to include that logic in this PR?

Added

@SolitaryThinker SolitaryThinker merged commit ed1e8d6 into hao-ai-lab:main Jul 4, 2025
1 check passed
@Edenzzzz Edenzzzz deleted the offload_text_enc branch July 4, 2025 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go Trigger Buildkite CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants