CogVideoX diffusers Fine-tuning Guide

If you want to see fine-tuning for the SAT version, please check here. Note that its dataset format differs from this version's.

This tutorial aims to quickly fine-tune the diffusers version of the CogVideoX model.

Hardware Requirements

  • CogVideoX-2B LoRA: 1 * A100
  • CogVideoX-2B SFT: 8 * A100
  • CogVideoX-5B / CogVideoX-5B-I2V: not yet supported

Prepare the Dataset

First, you need to prepare the dataset. The dataset layout is as follows: prompts.txt contains one text prompt per line, and videos.txt contains the corresponding paths to the videos in the videos directory.

.
├── prompts.txt
├── videos
└── videos.txt
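
As a hypothetical illustration (the prompt text and file names below are placeholders, not part of the released dataset), each line of prompts.txt pairs with the same line of videos.txt:

$ cat prompts.txt
A black and white animated scene of a ship sailing through stormy seas.
A cartoon mouse waving at his friends on the dock.

$ cat videos.txt
videos/00000.mp4
videos/00001.mp4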

Configuration Files and Execution

accelerate configuration files are as follows:

  • accelerate_config_machine_multi.yaml for multi-GPU use
  • accelerate_config_machine_single.yaml for single-GPU use
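
For reference, below is a minimal sketch of what the single-GPU file might contain; the actual files shipped with the repository may set additional options:

compute_environment: LOCAL_MACHINE
distributed_type: "NO"   # the multi-GPU file would use MULTI_GPU and a higher num_processes
mixed_precision: bf16
num_machines: 1
num_processes: 1
use_cpu: false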

The finetune script configuration is as follows:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# This sets PyTorch's CUDA allocator to use expandable segments, which helps prevent OOM (Out of Memory) errors caused by memory fragmentation.

accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu \

# Use Accelerate to start training, specifying the `accelerate_config_machine_single.yaml` configuration file, and using multiple GPUs.

train_cogvideox_lora.py \

# This is the training script you will execute for LoRA fine-tuning of the CogVideoX model.

--pretrained_model_name_or_path THUDM/CogVideoX-2b \

# The path to the pretrained model, pointing to the CogVideoX-2b model you want to fine-tune.

--cache_dir ~/.cache \

# The directory where downloaded models and datasets will be stored.

--enable_tiling \

# Enable VAE tiling functionality, which reduces memory usage by processing smaller blocks of the image.

--enable_slicing \

# Enable VAE slicing functionality, which splits the input batch into slices that are decoded one at a time to save memory.

--instance_data_root ~/disney/ \

# The root directory of the instance data, the folder of the dataset used during training.

--caption_column prompts.txt \

# Specifies the column or file containing instance prompts (text descriptions), in this case, the `prompts.txt` file.

--video_column videos.txt \

# Specifies the column or file containing paths to videos, in this case, the `videos.txt` file.

--validation_prompt "Mickey with the captain and friends:::Mickey and the bear" \

# The prompt(s) used for validation; multiple prompts are separated by the specified delimiter (`:::`).

--validation_prompt_separator ::: \

# The delimiter for validation prompts, set here as `:::`.

--num_validation_videos 1 \

# The number of videos to be generated during validation, set to 1.

--validation_epochs 2 \

# How often to run validation, in epochs; set to validate every 2 epochs.

--seed 3407 \

# Sets the random seed for reproducible training, set to 3407.

--rank 128 \

# The dimension of the LoRA update matrices, controlling the size of the LoRA layer parameters, set to 128.

--mixed_precision bf16 \

# Use mixed precision training, set to `bf16` (bfloat16), which can reduce memory usage and speed up training.

--output_dir cogvideox-lora-single-gpu \

# Output directory, where model predictions and checkpoints will be stored.

--height 480 \

# The height of input videos, all videos will be resized to 480 pixels.

--width 720 \

# The width of input videos, all videos will be resized to 720 pixels.

--fps 8 \

# The frame rate of input videos, all videos will be processed at 8 frames per second.

--max_num_frames 49 \

# The maximum number of frames for input videos; longer videos will be truncated to 49 frames.

--skip_frames_start 0 \

# The number of frames to skip at the beginning of each video, set to 0, indicating no frames are skipped.

--skip_frames_end 0 \

# The number of frames to skip at the end of each video, set to 0, indicating no frames are skipped.

--train_batch_size 1 \

# The batch size for training, set to 1 per device.

--num_train_epochs 10 \

# The total number of epochs for training, set to 10.

--checkpointing_steps 500 \

# Save a checkpoint every 500 steps.

--gradient_accumulation_steps 1 \

# The number of gradient accumulation steps; set to 1, so a gradient update is performed after every step.

--learning_rate 1e-4 \

# The initial learning rate, set to 1e-4.

--optimizer AdamW \

# The optimizer to use, set to AdamW.

--adam_beta1 0.9 \

# The beta1 parameter for the Adam optimizer, set to 0.9.

--adam_beta2 0.95 \

# The beta2 parameter for the Adam optimizer, set to 0.95.
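
For reference, here is the same invocation assembled as a single uninterrupted command (the comment lines above would break the backslash line continuations if pasted verbatim):

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu \
  train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir ~/.cache \
  --enable_tiling \
  --enable_slicing \
  --instance_data_root ~/disney/ \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --validation_prompt "Mickey with the captain and friends:::Mickey and the bear" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 2 \
  --seed 3407 \
  --rank 128 \
  --mixed_precision bf16 \
  --output_dir cogvideox-lora-single-gpu \
  --height 480 \
  --width 720 \
  --fps 8 \
  --max_num_frames 49 \
  --skip_frames_start 0 \
  --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 10 \
  --checkpointing_steps 500 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --optimizer AdamW \
  --adam_beta1 0.9 \
  --adam_beta2 0.95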

Run the script to start fine-tuning

Single GPU fine-tuning:

bash finetune_single_gpu.sh

Multi-GPU fine-tuning:

bash finetune_multi_gpus_1.sh # needs to be run on each node

Best Practices

  • Our experiments included 70 training videos with a resolution of 200 x 480 x 720 (frames x height x width). By skipping frames during data preprocessing, we created two smaller datasets of 49 and 16 frames to speed up experimentation, since the CogVideoX team recommends a maximum of 49 frames. We split the 70 videos into three groups of 10, 25, and 50 videos; the videos are conceptually similar.
  • 25 or more videos work best when training new concepts and styles.
  • Using an identifier token specified through --id_token improves training results (see the example after this list). This is similar to Dreambooth training, but regular fine-tuning without this token also works.
  • The original repository uses lora_alpha set to 1. We found this value to be ineffective across multiple runs, likely due to differences in the model backend and training setups. Our recommendation is to set lora_alpha equal to rank or rank // 2.
  • Using settings with a rank of 64 or above is recommended.
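
To illustrate the identifier-token recommendation above, the flag can be added to the launch command in the same way as the other flags; the token value below is only an illustrative placeholder, and the intent is for the token to mark the new concept in every training prompt:

--id_token BW_STYLE \

# Identifier token added to each training prompt to mark the new concept; BW_STYLE is a placeholder, so pick any rare token that does not collide with natural text.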