# CogVideoX diffusers Fine-tuning Guide

If you want to see the SAT version fine-tuning, please check [here](../sat/README.md). Its dataset format differs from the one used here.

This tutorial aims to quickly fine-tune the diffusers version of the CogVideoX model.

### Hardware Requirements

+ CogVideoX-2B LoRA: 1 * A100
+ CogVideoX-2B SFT: 8 * A100
+ CogVideoX-5B / 5B-I2V: not yet supported

### Prepare the Dataset

First, you need to prepare the dataset. The layout is as follows: `prompts.txt` contains one text prompt per line, and `videos.txt` contains, on the matching line, the relative path of the corresponding clip inside the `videos` directory.

```
.
├── prompts.txt
├── videos
└── videos.txt
```
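For illustration, here is a minimal sketch of how the two files line up, assuming a dataset root of `~/disney/` (the same directory passed to `--instance_data_root` below); the clip names and captions are placeholders:

```python
from pathlib import Path

# Minimal sketch: the dataset root, clip names, and captions are placeholders.
# Line i of prompts.txt is the caption for the clip listed on line i of videos.txt.
root = Path("~/disney").expanduser()
root.mkdir(parents=True, exist_ok=True)

prompts = [
    "A black-and-white cartoon mouse whistles while steering a ship's wheel.",
    "A cartoon bear dances on the deck of a steamboat.",
]
clips = [
    "videos/00000.mp4",
    "videos/00001.mp4",
]

(root / "prompts.txt").write_text("\n".join(prompts) + "\n")
(root / "videos.txt").write_text("\n".join(clips) + "\n")
```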
### Configuration Files and Execution

`accelerate` configuration files are provided for both setups:

+ `accelerate_config_machine_multi.yaml`: for multi-GPU training
+ `accelerate_config_machine_single.yaml`: for single-GPU training

The `finetune` script configuration is as follows:

```shell
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Use expandable-segment CUDA memory allocation to reduce fragmentation and help prevent OOM (Out of Memory) errors.

accelerate launch --config_file accelerate_config_machine_single.yaml \ # Launch training with Accelerate using the single-GPU configuration file; for multi-GPU training use accelerate_config_machine_multi.yaml and add --multi_gpu.
  train_cogvideox_lora.py \ # The training script for LoRA fine-tuning of the CogVideoX model.
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \ # Path or Hub ID of the pretrained model to fine-tune, here CogVideoX-2b.
  --cache_dir ~/.cache \ # Directory where downloaded models and datasets are stored.
  --enable_tiling \ # Enable VAE tiling, which reduces memory usage by decoding frames in smaller tiles.
  --enable_slicing \ # Enable VAE slicing, which splits the input into slices and decodes them sequentially to save memory.
  --instance_data_root ~/disney/ \ # Root directory of the instance data, i.e. the dataset folder used for training.
  --caption_column prompts.txt \ # The column or file containing the instance prompts (text descriptions), here the prompts.txt file.
  --video_column videos.txt \ # The column or file containing the video paths, here the videos.txt file.
  --validation_prompt "Mickey with the captain and friends:::Mickey and the bear" \ # Prompt(s) used for validation; multiple prompts are separated by the delimiter below.
  --validation_prompt_separator ::: \ # Delimiter for validation prompts, set here to :::.
  --num_validation_videos 1 \ # Number of videos generated during validation, set to 1.
  --validation_epochs 2 \ # Run validation every 2 epochs.
  --seed 3407 \ # Random seed for reproducible training, set to 3407.
  --rank 128 \ # Dimension of the LoRA update matrices, controlling the size of the LoRA parameters, set to 128.
  --mixed_precision bf16 \ # Use mixed-precision training with bf16 (bfloat16), which reduces memory usage and speeds up training.
  --output_dir cogvideox-lora-single-gpu \ # Output directory where model predictions and checkpoints are stored.
  --height 480 \ # Height of the input videos; all videos are resized to 480 pixels.
  --width 720 \ # Width of the input videos; all videos are resized to 720 pixels.
  --fps 8 \ # Frame rate of the input videos; all videos are processed at 8 frames per second.
  --max_num_frames 49 \ # Maximum number of frames per input video; longer videos are truncated to 49 frames.
  --skip_frames_start 0 \ # Number of frames to skip at the beginning of each video, set to 0 (no frames skipped).
  --skip_frames_end 0 \ # Number of frames to skip at the end of each video, set to 0 (no frames skipped).
  --train_batch_size 1 \ # Training batch size per device, set to 1.
  --num_train_epochs 10 \ # Total number of training epochs, set to 10.
  --checkpointing_steps 500 \ # Save a checkpoint every 500 steps.
  --gradient_accumulation_steps 1 \ # Number of gradient accumulation steps; a gradient update is performed every step.
  --learning_rate 1e-4 \ # Initial learning rate, set to 1e-4.
  --optimizer AdamW \ # Optimizer type, here AdamW.
  --adam_beta1 0.9 \ # Beta1 parameter of the Adam optimizer, set to 0.9.
  --adam_beta2 0.95 # Beta2 parameter of the Adam optimizer, set to 0.95.
```

### Run the script to start fine-tuning

Single-GPU fine-tuning:

```shell
bash finetune_single_gpu.sh
```

Multi-GPU fine-tuning:

```shell
bash finetune_multi_gpus_1.sh # needs to be run on each node
```

### Best Practices

+ Our experiments included 70 training videos with a resolution of `200 x 480 x 720` (frames x height x width). By skipping frames during data preprocessing, we created two smaller datasets of 49 and 16 frames to speed up experimentation, since the CogVideoX team suggests a maximum frame count of 49. We divided the 70 videos into three groups of 10, 25, and 50 videos, all conceptually similar.
+ 25 or more videos work best when training new concepts and styles.
+ Training with an identifier token specified through `--id_token` improves results. This is similar to Dreambooth-style training, but regular fine-tuning without such a token also works.
+ The original repository uses a `lora_alpha` of 1. We found this value ineffective across multiple runs, likely due to differences in the model backend and training setup. Our recommendation is to set `lora_alpha` equal to `rank` or `rank // 2` (see the sketch after this list).
+ A rank of 64 or above is recommended.
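In `peft` terms, the `lora_alpha` recommendation above maps to keeping `lora_alpha` tied to `r` when the adapter configuration is built. A minimal sketch follows; the `target_modules` list is an assumption about which attention projections are adapted and should be checked against the training script you are actually running.

```python
from peft import LoraConfig

rank = 128  # value passed to the training script via --rank

# Keep alpha equal to the rank (or rank // 2) rather than the original value of 1.
# target_modules is an assumption about the adapted attention projections.
transformer_lora_config = LoraConfig(
    r=rank,
    lora_alpha=rank,  # or: rank // 2
    init_lora_weights=True,
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
```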
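To sanity-check a finished run, the saved LoRA weights can typically be loaded back into the diffusers pipeline. The following is a minimal sketch, assuming the adapter was written to the `--output_dir` above under the default file name `pytorch_lora_weights.safetensors`; the adapter name and generation settings are illustrative.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the base model and attach the fine-tuned LoRA adapter (paths and names are assumptions).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights(
    "cogvideox-lora-single-gpu",
    weight_name="pytorch_lora_weights.safetensors",
    adapter_name="cogvideox-lora",
)
pipe.set_adapters(["cogvideox-lora"], [1.0])

# Generate a short clip with the fine-tuned style.
frames = pipe(
    "Mickey with the captain and friends",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(frames, "sample.mp4", fps=8)
```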