Conversation

Collaborator

JerryZhou54 commented May 28, 2025

  • Save to parquet files periodically
  • Allow resuming from the middle (see the sketch below)
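
A minimal sketch of how the periodic-save-plus-resume flow could look (illustrative only; the helper names, shard layout, and pandas usage here are assumptions, not the actual implementation in this PR):

import glob
import os

import pandas as pd

def existing_shards(out_dir):
    # Resume support: count shards already written by a previous, interrupted run.
    return len(glob.glob(os.path.join(out_dir, "shard_*.parquet")))

def flush_shard(rows, out_dir, shard_idx):
    # Periodic save: write the buffered samples as one numbered parquet shard.
    os.makedirs(out_dir, exist_ok=True)
    pd.DataFrame(rows).to_parquet(
        os.path.join(out_dir, f"shard_{shard_idx:05d}.parquet"))

def run(samples, out_dir, shard_size=1024):
    shard_idx = existing_shards(out_dir)
    buffer = []
    # Skip samples already covered by previously written shards.
    for sample in samples[shard_idx * shard_size:]:
        buffer.append(sample)
        if len(buffer) == shard_size:
            flush_shard(buffer, out_dir, shard_idx)
            buffer, shard_idx = [], shard_idx + 1
    if buffer:
        flush_shard(buffer, out_dir, shard_idx)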

Collaborator Author

JerryZhou54 commented May 28, 2025

Need to merge #438 first, because this PR requires v1/datasets

JerryZhou54 force-pushed the wei/preprocess branch 2 times, most recently from 93fd4ac to 5c3ec38 on May 28, 2025 06:00
Comment on lines 20 to 21
local_dir=os.path.join(
    'data', BASE_MODEL_PATH))
Collaborator

I'm not sure why we should use local_dir; it can make cache sharing more complicated. I've opened a PR to remove it.
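
For reference, a minimal sketch of the alternative (assuming huggingface_hub handles the download; omitting local_dir keeps the files in the shared hub cache):

from huggingface_hub import snapshot_download

# Without local_dir, the snapshot lands in the default HF hub cache
# (HF_HOME / ~/.cache/huggingface/hub), which is easy to share across jobs.
model_dir = snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-1.3B-Diffusers")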

Collaborator

It may have been me who added the hardcoded path. @JerryZhou54, perhaps let's just use the model_path arg here and have the registry detect the correct pipeline config directly from the HF string instead of a local path?

# export WANDB_MODE="offline"
GPU_NUM=1 # 2,4,8
MODEL_PATH="Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
TEXT_ENCODER_PATH="/Wan-AI/Wan2.1-T2V-1.3B-Diffusers/tokenizer"
Collaborator

instead of hardcoding this TEXT_ENCODER_PATH, we can simply do:

path = maybe_download_model(args.model_path)
encoder_path = os.path.join(path, 'tokenizer')


logger = init_logger(__name__)

BASE_MODEL_PATH = "/workspace/data/Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
Collaborator

let's just use args.model_path here instead?

@SolitaryThinker
Collaborator

feel free to merge after addressing my comments

JerryZhou54 merged commit a004408 into main May 29, 2025
6 checks passed
SolitaryThinker deleted the wei/preprocess branch May 29, 2025 21:31
if __name__ == "__main__":
parser = argparse.ArgumentParser()
# dataset & dataloader
parser.add_argument("--model_path", type=str, default="data/mochi")
Collaborator

it's better not to use data/ except for testers running on Runpod; just use HF's default cache path
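
A small illustrative tweak (using the Wan repo id from this PR purely as an example default):

# Defaulting --model_path to an HF repo id rather than a local data/ path lets
# the weights resolve through the standard hub cache; Runpod testers can still
# pass a local directory explicitly.
parser.add_argument("--model_path", type=str,
                    default="Wan-AI/Wan2.1-T2V-1.3B-Diffusers")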
