# CogVideoX diffusers Fine-tuning Guide
If you are looking for the SAT-version fine-tuning instructions, see [here](../sat/README.md); note that its dataset format differs from the one used here.

This tutorial walks you through quickly fine-tuning the diffusers version of the CogVideoX model.
### Hardware Requirements
+ CogVideoX-2B LoRA: 1 * A100
+ CogVideoX-2B SFT: 8 * A100
+ CogVideoX-5B / 5B-I2V: fine-tuning not yet supported
### Prepare the Dataset
First, you need to prepare the dataset in the following layout: `prompts.txt` contains one text prompt per line, and `videos.txt` contains the corresponding paths (one per line) to the video files in the `videos` directory.
```
.
├── prompts.txt
├── videos
└── videos.txt
```
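For illustration, the two text files might look like this (the prompts and file names below are made-up placeholders; line N of `prompts.txt` describes the video on line N of `videos.txt`):

```
prompts.txt:
    A black-and-white animated scene featuring a mouse steering a ship's wheel.
    A black-and-white animated scene featuring a bear walking through a forest.

videos.txt:
    videos/00000.mp4
    videos/00001.mp4
```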
### Configuration Files and Execution
`accelerate` configuration files are provided for both setups:
+ `accelerate_config_machine_multi.yaml` for multi-GPU training
+ `accelerate_config_machine_single.yaml` for single-GPU training
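If you need to regenerate or adapt one of these files for your own machine, `accelerate` can create it interactively; the command below simply reuses the single-GPU file name from this repository:

```shell
# Answer the interactive questions (number of GPUs, mixed precision, etc.);
# the resulting configuration is written to the given file.
accelerate config --config_file accelerate_config_machine_single.yaml
```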
The `finetune` scripts run the following command (shown here for single-GPU LoRA training); each parameter is explained below the block:

```shell
# Use segment-based (expandable) CUDA memory allocation to help prevent OOM errors.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

accelerate launch --config_file accelerate_config_machine_single.yaml \
  train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir ~/.cache \
  --enable_tiling \
  --enable_slicing \
  --instance_data_root ~/disney/ \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --validation_prompt "Mickey with the captain and friends:::Mickey and the bear" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 2 \
  --seed 3407 \
  --rank 128 \
  --mixed_precision bf16 \
  --output_dir cogvideox-lora-single-gpu \
  --height 480 \
  --width 720 \
  --fps 8 \
  --max_num_frames 49 \
  --skip_frames_start 0 \
  --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 10 \
  --checkpointing_steps 500 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --optimizer AdamW \
  --adam_beta1 0.9 \
  --adam_beta2 0.95
```

Parameter descriptions:

+ `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`: switches PyTorch's CUDA allocator to expandable segments, which helps prevent OOM (out-of-memory) errors.
+ `accelerate launch --config_file accelerate_config_machine_single.yaml`: starts training with Accelerate using the single-GPU configuration file; for multi-GPU training, use `accelerate_config_machine_multi.yaml` and add `--multi_gpu`.
+ `train_cogvideox_lora.py`: the training script for LoRA fine-tuning of the CogVideoX model.
+ `--pretrained_model_name_or_path`: the path or Hub ID of the pretrained model to fine-tune, here `THUDM/CogVideoX-2b`.
+ `--cache_dir`: the directory where downloaded models and datasets are stored.
+ `--enable_tiling`: enables VAE tiling, which reduces memory usage by decoding in smaller tiles.
+ `--enable_slicing`: enables VAE slicing, which splits decoding into slices to save memory.
+ `--instance_data_root`: the root directory of the dataset used during training.
+ `--caption_column`: the column or file containing the instance prompts (text descriptions), here the `prompts.txt` file.
+ `--video_column`: the column or file containing the video paths, here the `videos.txt` file.
+ `--validation_prompt`: the prompt(s) used for validation; multiple prompts are separated by the delimiter below.
+ `--validation_prompt_separator`: the delimiter between validation prompts, here `:::`.
+ `--num_validation_videos`: the number of videos generated during validation, here 1.
+ `--validation_epochs`: run validation every N epochs, here every 2 epochs.
+ `--seed`: the random seed for reproducible training, here 3407.
+ `--rank`: the dimension of the LoRA update matrices, which controls the number of LoRA parameters, here 128.
+ `--mixed_precision`: the mixed-precision mode; `bf16` (bfloat16) reduces memory usage and speeds up training.
+ `--output_dir`: the directory where checkpoints and model predictions are stored.
+ `--height` / `--width`: the resolution all input videos are resized to, here 480 x 720 pixels.
+ `--fps`: the frame rate input videos are processed at, here 8 frames per second.
+ `--max_num_frames`: the maximum number of frames per video; longer videos are truncated to 49 frames.
+ `--skip_frames_start` / `--skip_frames_end`: the number of frames to skip at the beginning and end of each video, here 0 (no frames skipped).
+ `--train_batch_size`: the training batch size per device, here 1.
+ `--num_train_epochs`: the total number of training epochs, here 10.
+ `--checkpointing_steps`: save a checkpoint every 500 steps.
+ `--gradient_accumulation_steps`: the number of gradient accumulation steps; 1 means the weights are updated after every step.
+ `--learning_rate`: the initial learning rate, here 1e-4.
+ `--optimizer`: the optimizer type, here AdamW.
+ `--adam_beta1` / `--adam_beta2`: the beta1 and beta2 parameters of the Adam optimizer, here 0.9 and 0.95.
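Because the exact set of arguments may differ between versions of the training script, you can always list everything it accepts (and the default values) with:

```shell
python train_cogvideox_lora.py --help
```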
### Run the Script to Start Fine-tuning
Single GPU fine-tuning:
```shell
bash finetune_single_gpu.sh
```
Multi-GPU fine-tuning:
```shell
bash finetune_multi_gpus_1.sh # needs to be run on each node
```
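If you prefer to call `accelerate launch` yourself rather than edit the wrapper script, a multi-node launch might look like the sketch below. The IP address, port, node count, and `--output_dir` are illustrative placeholders, the training arguments are a subset of those shown earlier (the rest fall back to the script's defaults), and `--machine_rank` must be changed on each node:

```shell
# Sketch: node 0 of a two-node run; on the second node use --machine_rank 1.
# The networking values below are placeholders for your cluster.
accelerate launch --config_file accelerate_config_machine_multi.yaml \
  --multi_gpu \
  --num_machines 2 \
  --machine_rank 0 \
  --main_process_ip 192.168.0.1 \
  --main_process_port 29500 \
  train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --instance_data_root ~/disney/ \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --mixed_precision bf16 \
  --train_batch_size 1 \
  --num_train_epochs 10 \
  --output_dir cogvideox-lora-multi-gpu
```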
### Best Practices
+ Our test runs used 70 training videos with a resolution of `200 x 480 x 720` (frames x height x width). By skipping frames during data preprocessing, we created two smaller datasets of 49 and 16 frames to speed up experimentation, since the maximum frame count recommended by the CogVideoX team is 49. We split the 70 videos into three groups of 10, 25, and 50 videos, all conceptually similar.
+ 25 or more videos work best when training new concepts and styles.
+ Using an identifier token specified through `--id_token` improves training results. This is similar to Dreambooth training, but regular fine-tuning without the token also works.
+ The original repository uses a `lora_alpha` of 1. We found this value ineffective across multiple runs, likely due to differences in the model backend and training setup. Our recommendation is to set `lora_alpha` equal to rank, or to rank // 2.
+ Using settings with a rank of 64 or above is recommended.
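Putting these recommendations together, a run that applies them might look like the sketch below. The `BW_STYLE` identifier token is a made-up example, and `--lora_alpha` is an assumption; include it only if your copy of `train_cogvideox_lora.py` actually defines that argument (check with `--help`). The remaining arguments fall back to the script's defaults:

```shell
# Sketch of the best-practice overrides: rank >= 64, lora_alpha equal to rank,
# and an identifier token prepended to every prompt.
accelerate launch --config_file accelerate_config_machine_single.yaml \
  train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --instance_data_root ~/disney/ \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --id_token BW_STYLE \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision bf16 \
  --output_dir cogvideox-lora-single-gpu \
  --train_batch_size 1 \
  --num_train_epochs 10
```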