82 Commits

Author SHA1 Message Date
OleehyO
0e78f20629 Merge remote-tracking branch 'upstream/CogVideoX_dev' into dev 2025-01-14 04:00:11 +00:00
Yuxuan Zhang
4615479b51 move to tools 2025-01-14 11:33:02 +08:00
Yuxuan Zhang
7993670957 zero_to_bf16 2025-01-14 11:31:25 +08:00
OleehyO
4878edd0cf fix: correct do_validation argument parsing 2025-01-13 12:48:21 +00:00
Yuxuan Zhang
78275b0480 add comments to bash scripts 2025-01-13 20:02:06 +08:00
OleehyO
455b44a7b5 chore: code cleanup and parameter optimization
- Remove redundant comments and debug information
- Adjust default parameters in training scripts
- Clean up code in lora_trainer and trainer implementations
2025-01-13 11:56:28 +00:00
zR
1534bf33eb add pipeline 2025-01-12 19:27:21 +08:00
OleehyO
70c899f444 chore: update default training configurations 2025-01-12 08:50:15 +00:00
OleehyO
b362663679 fix: normalize image tensors in I2VDataset 2025-01-12 06:01:48 +00:00
OleehyO
30ba1085ff Merge remote-tracking branch 'upstream/main' into dev 2025-01-12 05:58:07 +00:00
OleehyO
f5169385bd docs: add SFT support documentation in multilingual README 2025-01-12 05:53:13 +00:00
OleehyO
795dd144a4 Rename lora training scripts as ddp 2025-01-12 05:36:32 +00:00
OleehyO
fdb9820949 feat: support DeepSpeed ZeRO-3 and optimize peak memory usage
- Add DeepSpeed ZeRO-3 configuration support
- Optimize memory usage during training
- Rename training scripts to reflect ZeRO usage
- Update related configuration files and trainers
2025-01-12 05:33:56 +00:00
Zheng Guang Cong
09a49d3546
fix i2v bug: video is already in [0, 255]
The video is already in [0, 255] and should not be multiplied by 255 again.
2025-01-11 17:29:27 +08:00
Zheng Guang Cong
cd861bbe1e
Update i2v_dataset.py
image should also be transformed to [-1, 1]
2025-01-11 17:24:35 +08:00
Zheng Guang Cong
35383e2db3
fix potential bug of i2v
Image values are in [0, 255] and should be transformed into [-1, 1], as is done for video.
2025-01-11 17:08:25 +08:00
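The three i2v fixes above all concern value ranges: inputs arrive in [0, 255] and are expected in [-1, 1]. A minimal sketch of one common way to do that mapping, written in plain Python rather than the tensor code the dataset uses. The helper name `to_signed_unit_range` is hypothetical and does not come from the repository.

```python
from typing import List


def to_signed_unit_range(pixels: List[float]) -> List[float]:
    """Map pixel values from [0, 255] to [-1, 1] (hypothetical helper)."""
    # Dividing by 127.5 maps [0, 255] onto [0, 2]; subtracting 1 centers it on 0.
    return [p / 127.5 - 1.0 for p in pixels]
```

Applying a further multiplication by 255 to values that are already in [0, 255], as the 09a49d3546 message describes, would push the result far outside [-1, 1].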
OleehyO
caa24bdc36 feat: add SFT support with ZeRO optimization strategies
- Add SFT (Supervised Fine-Tuning) trainers for all model variants:
  - CogVideoX I2V and T2V
  - CogVideoX-1.5 I2V and T2V
- Add DeepSpeed ZeRO configuration files:
  - ZeRO-2 with and without CPU offload
  - ZeRO-3 with and without CPU offload
- Add base accelerate config for distributed training
- Update trainer.py to support SFT training mode

This enables full-parameter fine-tuning with memory-efficient distributed training using DeepSpeed ZeRO optimization.
2025-01-11 02:13:32 +00:00
OleehyO
e213b6c083 fix: pad latent frames to match patch_size_t requirements 2025-01-11 02:08:07 +00:00
OleehyO
f6d722cec7 fix: remove copying first video frame as conditioning image 2025-01-09 15:52:51 +00:00
OleehyO
07766001f6 feat(dataset): pad short videos by repeating last frame
When loading videos with fewer frames than max_num_frames, repeat the last
frame to reach the required length instead of failing. This ensures consistent
tensor dimensions across the dataset while preserving as much original video
content as possible.
2025-01-08 02:14:56 +00:00
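The padding behavior described in 07766001f6 can be sketched as follows. This is an illustration on plain Python lists, not the repository's implementation; the function name `pad_by_repeating_last` is hypothetical, and truncating clips that are longer than `max_num_frames` is an assumption made only for this sketch.

```python
from typing import List, TypeVar

T = TypeVar("T")


def pad_by_repeating_last(frames: List[T], max_num_frames: int) -> List[T]:
    """Return exactly max_num_frames frames, repeating the last one if the clip is short."""
    if not frames:
        raise ValueError("cannot pad an empty video")
    if len(frames) >= max_num_frames:
        # Assumption for this sketch: longer clips are truncated.
        return frames[:max_num_frames]
    missing = max_num_frames - len(frames)
    # Original frames are kept intact; only the tail is filled with copies of the last frame.
    return frames + [frames[-1]] * missing
```

Every item then has the same length, which is what allows samples to be stacked into a batch with consistent dimensions.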
OleehyO
249fadfb76 docs: add hardware requirements for model training
Add a table in README files showing hardware requirements for training
different CogVideoX models, including:
- Memory requirements for each model variant
- Supported training types (LoRA)
- Training resolutions
- Mixed precision settings

Updated in all language versions (EN/ZH/JA).
2025-01-08 01:39:37 +00:00
OleehyO
10de04fc08 perf: cast VAE and text encoder to target dtype before precomputing cache
Before precomputing the latent cache and text embeddings, cast the VAE and
text encoder to the target training dtype (fp16/bf16) instead of keeping them
in fp32. This reduces memory usage during the precomputation phase.

The change occurs in prepare_dataset() where the models are moved to device
and cast to weight_dtype before being used to generate the cache.
2025-01-08 01:38:13 +00:00
OleehyO
0e21d41b12 Merge remote-tracking branch 'upstream/CogVideoX_dev' into dev 2025-01-07 09:51:48 +00:00
OleehyO
392e37021a Add video path to error message for better debugging 2025-01-07 09:50:21 +00:00
zR
11935892ae remove --image_column 2025-01-07 16:37:11 +08:00
OleehyO
ee1f666206 docs: update READMEs with auto first-frame extraction feature 2025-01-07 06:45:10 +00:00
OleehyO
e084a4a270 feat: auto-extract first frames as conditioning images for i2v model
When training i2v models without specifying image_column, automatically extract
and use first frames from training videos as conditioning images. This includes:

- Add load_images_from_videos() utility function to extract and cache first frames
- Update BaseI2VDataset to support auto-extraction when image_column is None
- Add validation and warning message in Args schema for i2v without image_column

The first frames are extracted once and cached to avoid repeated video loading.
2025-01-07 06:43:26 +00:00
OleehyO
96e511b413 feat: add warning for fp16 mixed precision training 2025-01-07 06:00:38 +00:00
OleehyO
36427274d6 style: format import statements across finetune module 2025-01-07 05:54:52 +00:00
zR
1789f07256 format and check fp16 for cogvideox2b 2025-01-07 13:16:18 +08:00
OleehyO
9157e0cbc8 Adapt dataset for text embeddings and add noise padding
- Add text embedding support in dataset collation
- Pad 2 random noise frames at the beginning of latent space during training
2025-01-06 10:44:58 +00:00
OleehyO
49dc370de6 fix: remove pipeline hooks after validation
- Add pipe.remove_all_hooks() after validation to prevent memory leaks
- Clean up validation pipeline properly to avoid potential issues in subsequent training steps
2025-01-04 06:21:17 +00:00
OleehyO
93b906b3fb docs: clarify train_frames includes padding frame
Add docstring to train_frames field in State schema to explicitly indicate
that it includes one image padding frame
2025-01-04 06:20:25 +00:00
OleehyO
7e1ac76847 feat(cogvideox): add prompt embedding caching support
This change enables caching of prompt embeddings in the CogVideoX text-to-video
LoRA trainer, which can improve training efficiency by avoiding redundant text
encoding operations.
2025-01-04 06:17:56 +00:00
OleehyO
66e4ba2592 fix(cogvideox): add prompt embedding caching and fix frame padding
- Add support for cached prompt embeddings in dataset
- Fix bug where first frame wasn't properly padded in latent space
2025-01-04 06:16:42 +00:00
OleehyO
de5bef6611 feat(args): add train_resolution validation for video frames and resolution
- Add validation to ensure (frames - 1) is a multiple of 8
- Add specific resolution check (480x720) for cogvideox-5b models
- Add error handling for invalid resolution format
2025-01-04 06:16:42 +00:00
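The checks listed in de5bef6611 can be sketched as a small parser. This is an illustration under stated assumptions, not the validator in the Args schema: the function name `parse_train_resolution` and the `required_hw` parameter are hypothetical, the `frames x height x width` string format is taken from the a88c1ede69 message, and the value `49x480x720` in the usage note is only an example.

```python
from typing import Optional, Tuple


def parse_train_resolution(
    value: str, required_hw: Optional[Tuple[int, int]] = None
) -> Tuple[int, int, int]:
    """Parse 'frames x height x width' and apply the checks named in the commit."""
    parts = value.split("x")
    if len(parts) != 3:
        raise ValueError(f"expected 'frames x height x width', got {value!r}")
    try:
        frames, height, width = (int(p) for p in parts)
    except ValueError:
        raise ValueError(f"non-integer component in {value!r}") from None
    # (frames - 1) must be a multiple of 8.
    if frames < 1 or (frames - 1) % 8 != 0:
        raise ValueError(f"(frames - 1) must be a multiple of 8, got frames={frames}")
    # Optional fixed-resolution check, e.g. (480, 720) for cogvideox-5b models.
    if required_hw is not None and (height, width) != required_hw:
        raise ValueError(f"resolution must be {required_hw}, got {(height, width)}")
    return frames, height, width
```

For example, `parse_train_resolution("49x480x720", required_hw=(480, 720))` passes all three checks, while `"48x480x720"` fails the frame-count check.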
OleehyO
ffb6ee36b4 docs: update finetune documentation in all languages 2025-01-04 06:16:42 +00:00
OleehyO
c817e7f062 chore: update default training parameters for t2v and i2v scripts 2025-01-04 06:16:42 +00:00
OleehyO
e5b8f9a2ee feat: add caching for prompt embeddings
- Add caching for prompt embeddings
- Store cached files using safetensors format
- Add cache directory structure under data_root/cache
- Optimize memory usage by moving tensors to CPU after caching
- Add debug logging for cache hits
- Add info logging for cache writes

The caching system helps reduce redundant computation and memory usage during training by:
1. Caching prompt embeddings based on prompt text hash
2. Caching encoded video latents based on video filename
3. Moving tensors to CPU after caching to free GPU memory
2025-01-04 06:16:31 +00:00
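A minimal sketch of the hash-keyed caching idea from e5b8f9a2ee, using only the standard library. Several details are assumptions for illustration: the commit stores files in safetensors format, whereas this sketch writes JSON; SHA-256 stands in for whichever hash the repository uses; and the `prompt_embeddings` subdirectory and both function names are hypothetical. Only the `data_root/cache` location and keying on the prompt text hash come from the commit message.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def prompt_cache_path(data_root: Path, prompt: str) -> Path:
    """Build a cache file path keyed on a hash of the prompt text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    # 'prompt_embeddings' is a hypothetical subdirectory under data_root/cache.
    return data_root / "cache" / "prompt_embeddings" / f"{digest}.json"


def get_prompt_embedding(
    data_root: Path, prompt: str, encode: Callable[[str], List[float]]
) -> List[float]:
    """Return the cached embedding if present; otherwise encode, store, and return it."""
    path = prompt_cache_path(data_root, prompt)
    if path.exists():
        # Cache hit: the text encoder is not called.
        return json.loads(path.read_text())
    embedding = encode(prompt)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(embedding))
    return embedding
```

Because the key depends only on the prompt text, identical prompts across samples or runs are encoded once and read from disk afterwards.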
OleehyO
f731c35f70 Add unload_model function 2025-01-03 08:21:27 +00:00
OleehyO
a88c1ede69 feat(args): add validation for training resolution
- Add validation check to ensure the number of frames is a multiple of 8
- Add format validation for train_resolution string (frames x height x width)
2025-01-02 03:12:09 +00:00
OleehyO
362b7bf273 docs: update README in multiple languages 2025-01-02 03:07:34 +00:00
OleehyO
7fa1bb48be refactor: remove deprecated training scripts 2025-01-01 15:56:14 +00:00
OleehyO
48ad178818 Reorganize training script arguments 2025-01-01 15:52:39 +00:00
OleehyO
6e79472417 feat: add training launch scripts for I2V and T2V models
Add two shell scripts to simplify model training:
- accelerate_train_i2v.sh: Launch script for Image-to-Video training
- accelerate_train_t2v.sh: Launch script for Text-to-Video training

Both scripts provide comprehensive configurations for:
- Model settings
- Data pipeline
- Training parameters
- System resources
- Checkpointing
- Validation
2025-01-01 15:10:55 +00:00
OleehyO
26b87cd4ff feat(args): add validation and arg interface for training parameters
- Add field validators for model type and validation settings
- Implement command line argument parsing with argparse
- Add type hints and documentation for training parameters
- Support configuration of model, training, and validation parameters
2025-01-01 15:10:55 +00:00
OleehyO
04a60e7435 Change logger name to trainer 2025-01-01 15:10:55 +00:00
OleehyO
a001842834 feat: implement CogVideoX trainers for I2V and T2V tasks
Add and refactor trainers for CogVideoX model variants:
- Implement CogVideoXT2VLoraTrainer for text-to-video generation
- Refactor CogVideoXI2VLoraTrainer for image-to-video generation

Both trainers support LoRA fine-tuning with proper handling of:
- Model components loading and initialization
- Video encoding and batch collation
- Loss computation with noise prediction
- Validation step for generation
2025-01-01 15:10:54 +00:00
OleehyO
91d79fd9a4 feat: add schemas module for configuration and state management
Add Pydantic models to handle:
- CLI arguments and configuration (Args)
- Model components and pipeline (Components)
- Training state and parameters (State)
2025-01-01 15:10:54 +00:00
OleehyO
45d40450a1 refactor: simplify dataset implementation and add latent precomputation
- Replace bucket-based dataset with simpler resize-based implementation
- Add video latent precomputation during dataset initialization
- Improve code readability and user experience
- Remove complexity of bucket sampling for better maintainability

This change makes the codebase more straightforward and easier to use while
maintaining functionality through resize-based video processing.
2025-01-01 15:10:54 +00:00