diff --git a/README.md b/README.md index a27782c..4e83ec2 100644 --- a/README.md +++ b/README.md @@ -22,7 +22,10 @@ Experience the CogVideoX-5B model online at CogVideoX-2B CogVideoX-5B CogVideoX-5B-I2V + CogVideoX1.5-5B + CogVideoX1.5-5B-I2V - Model Description - Entry-level model, balancing compatibility. Low cost for running and secondary development. - Larger model with higher video generation quality and better visual effects. - CogVideoX-5B image-to-video version. + Release Date + August 6, 2024 + August 27, 2024 + September 19, 2024 + November 8, 2024 + November 8, 2024 + + + Video Resolution + 720 * 480 + 1360 * 768 + 256 <= W <=1360
256 <= H <=768
W,H % 16 == 0 Inference Precision FP16*(recommended), BF16, FP32, FP8*, INT8, not supported: INT4 - BF16 (recommended), FP16, FP32, FP8*, INT8, not supported: INT4 + BF16(recommended), FP16, FP32, FP8*, INT8, not supported: INT4 + BF16 - Single GPU Memory Usage
-
SAT FP16: 18GB
diffusers FP16: from 4GB*
diffusers INT8 (torchao): from 3.6GB* - SAT BF16: 26GB
diffusers BF16: from 5GB*
diffusers INT8 (torchao): from 4.4GB* + Single GPU Memory Usage + SAT FP16: 18GB
diffusers FP16: from 4GB*
diffusers INT8(torchao): from 3.6GB* + SAT BF16: 26GB
diffusers BF16: from 5GB*
diffusers INT8(torchao): from 4.4GB* + SAT BF16: 66GB
- Multi-GPU Inference Memory Usage + Multi-GPU Memory Usage FP16: 10GB* using diffusers
BF16: 15GB* using diffusers
+ Not supported
Inference Speed
(Step = 50, FP/BF16) Single A100: ~90 seconds
Single H100: ~45 seconds Single A100: ~180 seconds
Single H100: ~90 seconds - - - Fine-tuning Precision - FP16 - BF16 - - - Fine-tuning Memory Usage - 47 GB (bs=1, LORA)
61 GB (bs=2, LORA)
62GB (bs=1, SFT) - 63 GB (bs=1, LORA)
80 GB (bs=2, LORA)
75GB (bs=1, SFT)
- 78 GB (bs=1, LORA)
75GB (bs=1, SFT, 16GPU)
+ Single A100: ~1000 seconds (5-second video)
Single H100: ~550 seconds (5-second video) Prompt Language - English* + English* - Maximum Prompt Length + Prompt Token Limit 226 Tokens + 224 Tokens Video Length - 6 Seconds + 6 seconds + 5 or 10 seconds Frame Rate - 8 Frames / Second + 8 frames / second + 16 frames / second - Video Resolution - 720 x 480, no support for other resolutions (including fine-tuning) - - - Position Encoding + Positional Encoding 3d_sincos_pos_embed 3d_sincos_pos_embed 3d_rope_pos_embed + learnable_pos_embed + 3d_sincos_pos_embed + 3d_rope_pos_embed + learnable_pos_embed Download Link (Diffusers) 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel + Coming Soon Download Link (SAT) - SAT + SAT + 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel @@ -422,7 +430,7 @@ hands-on practice on text-to-video generation. *The original input is in Chinese We welcome your contributions! You can click [here](resources/contribute.md) for more information. -## License Agreement +## Model-License The code in this repository is released under the [Apache 2.0 License](LICENSE). diff --git a/README_ja.md b/README_ja.md index 69b46b6..aa7ae37 100644 --- a/README_ja.md +++ b/README_ja.md @@ -1,6 +1,6 @@ # CogVideo & CogVideoX -[Read this in English](./README_zh.md) +[Read this in English](./README.md) [䞭文阅读](./README_zh.md) @@ -22,9 +22,14 @@ ## 曎新ずニュヌス -- 🔥🔥 **ニュヌス**: ```2024/10/13```: コスト削枛のため、単䞀の4090 GPUで`CogVideoX-5B` +- 🔥🔥 ニュヌス: ```2024/11/08```: `CogVideoX1.5` モデルをリリヌスしたした。CogVideoX1.5 は CogVideoX オヌプン゜ヌスモデルのアップグレヌドバヌゞョンです。 +CogVideoX1.5-5B シリヌズモデルは、10秒 長の動画ずより高い解像床をサポヌトしおおり、`CogVideoX1.5-5B-I2V` は任意の解像床での動画生成に察応しおいたす。 +SAT コヌドはすでに曎新されおおり、`diffusers` バヌゞョンは珟圚適応䞭です。 +SAT バヌゞョンのコヌドは [こちら](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) からダりンロヌドできたす。 +- 🔥 **ニュヌス**: ```2024/10/13```: コスト削枛のため、単䞀の4090 GPUで`CogVideoX-5B` を埮調敎できるフレヌムワヌク [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory) - がリリヌスされたした。耇数の解像床での埮調敎に察応しおいたす。ぜひご利甚ください- 🔥**ニュヌス**: ```2024/10/10```: + がリリヌスされたした。耇数の解像床での埮調敎に察応しおいたす。ぜひご利甚ください +- 🔥**ニュヌス**: ```2024/10/10```: 技術報告曞を曎新し、より詳现なトレヌニング情報ずデモを远加したした。 - 🔥 **ニュヌス**: ```2024/10/10```: 技術報告曞を曎新したした。[こちら](https://arxiv.org/pdf/2408.06072) をクリックしおご芧ください。さらにトレヌニングの詳现ずデモを远加したした。デモを芋るには[こちら](https://yzy-thu.github.io/CogVideoX-demo/) @@ -34,7 +39,7 @@ - 🔥**ニュヌス**: ```2024/9/19```: CogVideoXシリヌズの画像生成ビデオモデル **CogVideoX-5B-I2V** をオヌプン゜ヌス化したした。このモデルは、画像を背景入力ずしお䜿甚し、プロンプトワヌドず組み合わせおビデオを生成するこずができ、より高い制埡性を提䟛したす。これにより、CogVideoXシリヌズのモデルは、テキストからビデオ生成、ビデオの継続、画像からビデオ生成の3぀のタスクをサポヌトするようになりたした。オンラむンでの[䜓隓](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space) をお楜しみください。 -- 🔥🔥 **ニュヌス**: ```2024/9/19```: +- 🔥 **ニュヌス**: ```2024/9/19```: CogVideoXのトレヌニングプロセスでビデオデヌタをテキスト蚘述に倉換するために䜿甚されるキャプションモデル 
[CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption) をオヌプン゜ヌス化したした。ダりンロヌドしおご利甚ください。 - 🔥 ```2024/8/27```: CogVideoXシリヌズのより倧きなモデル **CogVideoX-5B** @@ -63,11 +68,10 @@ - [プロゞェクト構造](#プロゞェクト構造) - [掚論](#掚論) - [sat](#sat) - - [ツヌル](#ツヌル) -- [プロゞェクト蚈画](#プロゞェクト蚈画) -- [モデルラむセンス](#モデルラむセンス) + - [ツヌル](#ツヌル) - [CogVideo(ICLR'23)モデル玹介](#CogVideoICLR23) - [匕甚](#匕甚) +- [ラむセンス契玄](#ラむセンス契玄) ## クむックスタヌト @@ -156,79 +160,91 @@ pip install -r requirements.txt CogVideoXは、[枅圱](https://chatglm.cn/video?fr=osm_cogvideox) ず同源のオヌプン゜ヌス版ビデオ生成モデルです。 以䞋の衚に、提䟛しおいるビデオ生成モデルの基本情報を瀺したす: - +
- + + + + + + + + + + + + + + + + + - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + - + - + + + - - - - - + + + + + - + +
モデル名 CogVideoX-2B CogVideoX-5BCogVideoX-5B-I2V CogVideoX-5B-I2VCogVideoX1.5-5BCogVideoX1.5-5B-I2V
リリヌス日2024幎8月6日2024幎8月27日2024幎9月19日2024幎11月8日2024幎11月8日
ビデオ解像床720 * 4801360 * 768256 <= W <=1360
256 <= H <=768
W,H % 16 == 0
掚論粟床 FP16*(掚奚), BF16, FP32, FP8*, INT8, INT4は非察応 BF16(掚奚), FP16, FP32, FP8*, INT8, INT4は非察応
単䞀GPUのメモリ消費
SAT FP16: 18GB
diffusers FP16: 4GBから*
diffusers INT8(torchao): 3.6GBから*
SAT BF16: 26GB
diffusers BF16: 5GBから*
diffusers INT8(torchao): 4.4GBから*
マルチGPUのメモリ消費FP16: 10GB* using diffusers
BF16: 15GB* using diffusers
掚論速床
(ステップ = 50, FP/BF16)
単䞀A100: 箄90秒
単䞀H100: 箄45秒
単䞀A100: 箄180秒
単䞀H100: 箄90秒
ファむンチュヌニング粟床FP16 BF16
ファむンチュヌニング時のメモリ消費47 GB (bs=1, LORA)
61 GB (bs=2, LORA)
62GB (bs=1, SFT)
63 GB (bs=1, LORA)
80 GB (bs=2, LORA)
75GB (bs=1, SFT)
78 GB (bs=1, LORA)
75GB (bs=1, SFT, 16GPU)
シングルGPUメモリ消費SAT FP16: 18GB
diffusers FP16: 4GBから*
diffusers INT8(torchao): 3.6GBから*
SAT BF16: 26GB
diffusers BF16: 5GBから*
diffusers INT8(torchao): 4.4GBから*
SAT BF16: 66GB
マルチGPUメモリ消費FP16: 10GB* using diffusers
BF16: 15GB* using diffusers
サポヌトなし
掚論速床
(ステップ数 = 50, FP/BF16)
単䞀A100: 箄90秒
単䞀H100: 箄45秒
単䞀A100: 箄180秒
単䞀H100: 箄90秒
単䞀A100: 箄1000秒(5秒動画)
単䞀H100: 箄550秒(5秒動画)
プロンプト蚀語英語*英語*
プロンプトの最倧トヌクン数プロンプトトヌクン制限 226トヌクン224トヌクン
ビデオの長さ 6秒5秒たたは10秒
フレヌムレヌト8フレヌム/秒
ビデオ解像床720 * 480、他の解像床は非察応(ファむンチュヌニング含む)8 フレヌム / 秒16 フレヌム / 秒
䜍眮゚ンコヌディング 3d_sincos_pos_embed 3d_sincos_pos_embed 3d_rope_pos_embed + learnable_pos_embed3d_sincos_pos_embed3d_rope_pos_embed + learnable_pos_embed
ダりンロヌドリンク (Diffusers) 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel
🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel
🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel
近日公開
ダりンロヌドリンク (SAT)SATSAT🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel
diff --git a/README_zh.md b/README_zh.md index 9f84f84..3574e7d 100644 --- a/README_zh.md +++ b/README_zh.md @@ -1,10 +1,9 @@ # CogVideo & CogVideoX -[Read this in English](./README_zh.md) +[Read this in English](./README.md) [日本語で読む](./README_ja.md) -
@@ -23,7 +22,9 @@ ## 项目曎新 -- 🔥🔥 **News**: ```2024/10/13```: 成本曎䜎单卡4090可埮调`CogVideoX-5B` +- 🔥🔥 **News**: ```2024/11/08```: 我们发垃 `CogVideoX1.5` 暡型。CogVideoX1.5 是 CogVideoX 匀源暡型的升级版本。 +CogVideoX1.5-5B 系列暡型支持 **10秒** 长床的视频和曎高的分蟚率其䞭 `CogVideoX1.5-5B-I2V` 支持 **任意分蟚率** 的视频生成SAT代码已经曎新。`diffusers`版本还圚适配䞭。SAT版本代码前埀 [这里](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) 䞋蜜。 +- 🔥**News**: ```2024/10/13```: 成本曎䜎单卡4090可埮调 `CogVideoX-5B` 的埮调框架[cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory)已经掚出倚种分蟚率埮调欢迎䜿甚。 - 🔥 **News**: ```2024/10/10```: 我们曎新了我们的技术报告,请点击 [这里](https://arxiv.org/pdf/2408.06072) 查看附䞊了曎倚的训练细节和demo关于demo点击[这里](https://yzy-thu.github.io/CogVideoX-demo/) 查看。 @@ -58,10 +59,9 @@ - [Inference](#inference) - [SAT](#sat) - [Tools](#tools) -- [匀源项目规划](#匀源项目规划) -- [暡型协议](#暡型协议) - [CogVideo(ICLR'23)暡型介绍](#cogvideoiclr23) - [匕甚](#匕甚) +- [暡型协议](#暡型协议) ## 快速匀始 @@ -157,62 +157,72 @@ CogVideoX是 [枅圱](https://chatglm.cn/video?fr=osm_cogvideox) 同源的匀源 CogVideoX-2B CogVideoX-5B CogVideoX-5B-I2V + CogVideoX1.5-5B + CogVideoX1.5-5B-I2V + + + 发垃时闎 + 2024幎8月6日 + 2024幎8月27日 + 2024幎9月19日 + 2024幎11月8日 + 2024幎11月8日 + + + 视频分蟚率 + 720 * 480 + 1360 * 768 + 256 <= W <=1360
256 <= H <=768
W,H % 16 == 0 掚理粟床 FP16*(掚荐), BF16, FP32FP8*INT8䞍支持INT4 BF16(掚荐), FP16, FP32FP8*INT8䞍支持INT4 + BF16 单GPU星存消耗
SAT FP16: 18GB
diffusers FP16: 4GBèµ·*
diffusers INT8(torchao): 3.6GB起* SAT BF16: 26GB
diffusers BF16: 5GBèµ·*
diffusers INT8(torchao): 4.4GB起* + SAT BF16: 66GB
倚GPU掚理星存消耗 FP16: 10GB* using diffusers
BF16: 15GB* using diffusers
+ 䞍支持
掚理速床
(Step = 50, FP/BF16) 单卡A100: ~90秒
单卡H100: ~45秒 单卡A100: ~180秒
单卡H100: ~90秒 - - - 埮调粟床 - FP16 - BF16 - - - 埮调星存消耗 - 47 GB (bs=1, LORA)
61 GB (bs=2, LORA)
62GB (bs=1, SFT) - 63 GB (bs=1, LORA)
80 GB (bs=2, LORA)
75GB (bs=1, SFT)
- 78 GB (bs=1, LORA)
75GB (bs=1, SFT, 16GPU)
+ 单卡A100: ~1000秒(5秒视频)
单卡H100: ~550秒(5秒视频) 提瀺词语蚀 - English* + English* 提瀺词长床䞊限 226 Tokens + 224 Tokens 视频长床 6 秒 + 5 秒 或 10 秒 垧率 8 垧 / 秒 + 16 垧 / 秒 - 视频分蟚率 - 720 * 480䞍支持其他分蟚率(含埮调) - - 䜍眮猖码 3d_sincos_pos_embed - 3d_sincos_pos_embed + 3d_sincos_pos_embed + 3d_rope_pos_embed + learnable_pos_embed + 3d_sincos_pos_embed 3d_rope_pos_embed + learnable_pos_embed @@ -220,10 +230,13 @@ CogVideoX是 [枅圱](https://chatglm.cn/video?fr=osm_cogvideox) 同源的匀源 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel + 即将掚出 䞋蜜铟接 (SAT) SAT + 🀗 HuggingFace
🀖 ModelScope
🟣 WiseModel + diff --git a/sat/README.md b/sat/README.md index 48c4552..c67e15c 100644 --- a/sat/README.md +++ b/sat/README.md @@ -1,29 +1,39 @@ -# SAT CogVideoX-2B +# SAT CogVideoX -[䞭文阅读](./README_zh.md) +[Read this in English.](./README_zh.md) [日本語で読む](./README_ja.md) -This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the -fine-tuning code for SAT weights. +This folder contains inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights, along with fine-tuning code for SAT weights. -This code is the framework used by the team to train the model. It has few comments and requires careful study. +This code framework was used by our team during model training. There are few comments, so careful study is required. ## Inference Model -### 1. Ensure that you have correctly installed the dependencies required by this folder. +### 1. Make sure you have installed all dependencies in this folder -```shell +``` pip install -r requirements.txt ``` -### 2. Download the model weights +### 2. Download the Model Weights -### 2. Download model weights +First, download the model weights from the SAT mirror. -First, go to the SAT mirror to download the model weights. For the CogVideoX-2B model, please download as follows: +#### CogVideoX1.5 Model -```shell +``` +git lfs install +git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT +``` + +This command downloads three models: Transformers, VAE, and T5 Encoder. 
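Either clone leaves several large files on disk, and a quick existence check before launching inference can catch an incomplete download early. A minimal sketch, assuming the `CogVideoX-2b-sat` layout shown below — the `missing_sat_files` helper and its `EXPECTED` list are our own illustration, not part of the official tooling:

```python
from pathlib import Path

# Paths the SAT inference code expects under the weight folder, per the layout
# described in this README. This list is an illustrative assumption; adjust it
# to match the model variant you downloaded.
EXPECTED = ["transformer/latest", "vae/3d-vae.pt"]

def missing_sat_files(root: str) -> list[str]:
    """Return the expected paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]
```

Running this against the weight folder before `bash inference.sh` turns a cryptic load error into an explicit list of missing files.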
+ +#### CogVideoX Model + +For the CogVideoX-2B model, download as follows: + +``` mkdir CogVideoX-2b-sat cd CogVideoX-2b-sat wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1 @@ -34,13 +44,12 @@ mv 'index.html?dl=1' transformer.zip unzip transformer.zip ``` -For the CogVideoX-5B model, please download the `transformers` file as follows link: -(VAE files are the same as 2B) +Download the `transformers` file for the CogVideoX-5B model (the VAE file is the same as for 2B): + [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) + [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list) -Next, you need to format the model files as follows: +Arrange the model files in the following structure: ``` . @@ -52,20 +61,24 @@ Next, you need to format the model files as follows: └── 3d-vae.pt ``` -Due to large size of model weight file, using `git lfs` is recommended. Installation of `git lfs` can be -found [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing) +Since model weight files are large, it’s recommended to use `git lfs`. +See [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing) for `git lfs` installation. -Next, clone the T5 model, which is not used for training and fine-tuning, but must be used. -> T5 model is available on [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) as well. +``` +git lfs install +``` -```shell -git clone https://huggingface.co/THUDM/CogVideoX-2b.git +Next, clone the T5 model, which is used as an encoder and doesn’t require training or fine-tuning. +> You may also use the model file location on [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b). 
+ +``` +git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Huggingface +# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from Modelscope mkdir t5-v1_1-xxl mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl ``` -By following the above approach, you will obtain a safetensor format T5 file. Ensure that there are no errors when -loading it into Deepspeed in Finetune. +This will yield a safetensor format T5 file that can be loaded without error during Deepspeed fine-tuning. ``` ├── added_tokens.json @@ -80,11 +93,11 @@ loading it into Deepspeed in Finetune. 0 directories, 8 files ``` -### 3. Modify the file in `configs/cogvideox_2b.yaml`. +### 3. Modify `configs/cogvideox_*.yaml` file. ```yaml model: - scale_factor: 1.15258426 + scale_factor: 1.55258426 disable_first_stage_autocast: true log_keys: - txt @@ -160,14 +173,14 @@ model: ucg_rate: 0.1 target: sgm.modules.encoders.modules.FrozenT5Embedder params: - model_dir: "t5-v1_1-xxl" # Absolute path to the CogVideoX-2b/t5-v1_1-xxl weights folder + model_dir: "t5-v1_1-xxl" # absolute path to CogVideoX-2b/t5-v1_1-xxl weight folder max_length: 226 first_stage_config: target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper params: cp_size: 1 - ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # Absolute path to the CogVideoX-2b-sat/vae/3d-vae.pt folder + ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # absolute path to CogVideoX-2b-sat/vae/3d-vae.pt file ignore_keys: [ 'loss' ] loss_config: @@ -239,48 +252,46 @@ model: num_steps: 50 ``` -### 4. Modify the file in `configs/inference.yaml`. +### 4. Modify `configs/inference.yaml` file. 
```yaml args: latent_channels: 16 mode: inference - load: "{absolute_path/to/your}/transformer" # Absolute path to the CogVideoX-2b-sat/transformer folder + load: "{absolute_path/to/your}/transformer" # Absolute path to CogVideoX-2b-sat/transformer folder # load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter batch_size: 1 - input_type: txt # You can choose txt for pure text input, or change to cli for command line input - input_file: configs/test.txt # Pure text file, which can be edited - sampling_num_frames: 13 # Must be 13, 11 or 9 + input_type: txt # You can choose "txt" for plain text input or change to "cli" for command-line input + input_file: configs/test.txt # Plain text file, can be edited + sampling_num_frames: 13 # For CogVideoX1.5-5B it must be 42 or 22. For CogVideoX-5B / 2B, it must be 13, 11, or 9. sampling_fps: 8 fp16: True # For CogVideoX-2B - # bf16: True # For CogVideoX-5B + # bf16: True # For CogVideoX-5B output_dir: outputs/ force_inference: True ``` -+ Modify `configs/test.txt` if multiple prompts is required, in which each line makes a prompt. -+ For better prompt formatting, refer to [convert_demo.py](../inference/convert_demo.py), for which you should set the - OPENAI_API_KEY as your environmental variable. -+ Modify `input_type` in `configs/inference.yaml` if use command line as prompt iuput. ++ If using a text file to save multiple prompts, modify `configs/test.txt` as needed. One prompt per line. If you are unsure how to write prompts, use [this code](../inference/convert_demo.py) to call an LLM for refinement. ++ To use command-line input, modify: -```yaml +``` input_type: cli ``` -This allows input from the command line as prompts. +This allows you to enter prompts from the command line. 
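The `sampling_num_frames` rule above is the constraint most easily missed when switching between model families. A small guard run before launching a long job makes it explicit — the helper name and model labels here are illustrative, not identifiers used by the SAT code:

```python
# Allowed sampling_num_frames values per model family, as documented above.
VALID_NUM_FRAMES = {
    "CogVideoX1.5-5B": {22, 42},
    "CogVideoX-5B": {9, 11, 13},
    "CogVideoX-2B": {9, 11, 13},
}

def check_sampling_num_frames(model: str, num_frames: int) -> None:
    """Raise early if an inference.yaml value would be rejected at runtime."""
    allowed = VALID_NUM_FRAMES[model]
    if num_frames not in allowed:
        raise ValueError(
            f"{model}: sampling_num_frames={num_frames}; "
            f"expected one of {sorted(allowed)}"
        )
```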
-Change `output_dir` if you wish to modify the address of the output video +To modify the output video location, change: -```yaml +``` output_dir: outputs/ ``` -It is saved by default in the `.outputs/` folder. +The default location is the `.outputs/` folder. -### 5. Run the inference code to perform inference. +### 5. Run the Inference Code to Perform Inference -```shell +``` bash inference.sh ``` @@ -288,95 +299,91 @@ bash inference.sh ### Preparing the Dataset -The dataset format should be as follows: +The dataset should be structured as follows: ``` . ├── labels -│   ├── 1.txt -│   ├── 2.txt -│   ├── ... +│ ├── 1.txt +│ ├── 2.txt +│ ├── ... └── videos ├── 1.mp4 ├── 2.mp4 ├── ... ``` -Each text file shares the same name as its corresponding video, serving as the label for that video. Videos and labels -should be matched one-to-one. Generally, a single video should not be associated with multiple labels. +Each txt file should have the same name as the corresponding video file and contain the label for that video. The videos and labels should correspond one-to-one. Generally, avoid using one video with multiple labels. -For style fine-tuning, please prepare at least 50 videos and labels with similar styles to ensure proper fitting. +For style fine-tuning, prepare at least 50 videos and labels with a similar style to facilitate fitting. -### Modifying Configuration Files +### Modifying the Configuration File -We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Please note that both methods only fine-tune -the `transformer` part and do not modify the `VAE` section. `T5` is used solely as an Encoder. Please modify -the `configs/sft.yaml` (for full-parameter fine-tuning) file as follows: +We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Note that both methods only fine-tune the `transformer` part. The `VAE` part is not modified, and `T5` is only used as an encoder. 
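Conceptually, Lora leaves the transformer weights frozen and learns a low-rank additive update on top of each attention projection, which is why it fits in far less memory than full fine-tuning. A tiny self-contained sketch of that idea — toy shapes only; the real config uses rank `r: 256`:

```python
# Minimal LoRA idea: keep W frozen, learn a low-rank update B @ A on top of it.
d, r = 4, 1

W = [[float(i == j) for j in range(d)] for i in range(d)]  # frozen weight (identity here)
A = [[0.5, -0.5, 0.0, 1.0]]                                # trainable factor A (r x d)
B = [[0.0] for _ in range(d)]                              # trainable factor B (d x r), zero-init

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x):
    # Frozen path plus low-rank adapter path; only A and B would be trained.
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

# With B zero-initialized, the adapter is a no-op at the start of training.
assert lora_forward([1.0, 2.0, 3.0, 4.0]) == matvec(W, [1.0, 2.0, 3.0, 4.0])
```

Only `A` and `B` receive gradients; because `B` starts at zero, fine-tuning begins exactly at the pretrained model's behavior.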
+Modify the files in `configs/sft.yaml` (full fine-tuning) as follows: -``` - # checkpoint_activations: True ## Using gradient checkpointing (Both checkpoint_activations in the config file need to be set to True) +```yaml + # checkpoint_activations: True ## using gradient checkpointing (both `checkpoint_activations` in the config file need to be set to True) model_parallel_size: 1 # Model parallel size - experiment_name: lora-disney # Experiment name (do not modify) - mode: finetune # Mode (do not modify) - load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path - no_load_rng: True # Whether to load random seed + experiment_name: lora-disney # Experiment name (do not change) + mode: finetune # Mode (do not change) + load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to Transformer model + no_load_rng: True # Whether to load random number seed train_iters: 1000 # Training iterations eval_iters: 1 # Evaluation iterations eval_interval: 100 # Evaluation interval eval_batch_size: 1 # Evaluation batch size - save: ckpts # Model save path - save_interval: 100 # Model save interval + save: ckpts # Model save path + save_interval: 100 # Save interval log_interval: 20 # Log output interval train_data: [ "your train data path" ] - valid_data: [ "your val data path" ] # Training and validation datasets can be the same - split: 1,0,0 # Training, validation, and test set ratio - num_workers: 8 # Number of worker threads for data loader - force_train: True # Allow missing keys when loading checkpoint (T5 and VAE are loaded separately) - only_log_video_latents: True # Avoid memory overhead caused by VAE decode + valid_data: [ "your val data path" ] # Training and validation sets can be the same + split: 1,0,0 # Proportion for training, validation, and test sets + num_workers: 8 # Number of data loader workers + force_train: True # Allow missing keys when loading checkpoint (T5 and VAE loaded separately) + only_log_video_latents: True # Avoid memory usage 
from VAE decoding deepspeed: bf16: - enabled: False # For CogVideoX-2B set to False and for CogVideoX-5B set to True + enabled: False # For CogVideoX-2B Turn to False and For CogVideoX-5B Turn to True fp16: - enabled: True # For CogVideoX-2B set to True and for CogVideoX-5B set to False + enabled: True # For CogVideoX-2B Turn to True and For CogVideoX-5B Turn to False ``` -If you wish to use Lora fine-tuning, you also need to modify the `cogvideox__lora` file: +``` To use Lora fine-tuning, you also need to modify `cogvideox__lora` file: -Here, take `CogVideoX-2B` as a reference: +Here's an example using `CogVideoX-2B`: ``` model: - scale_factor: 1.15258426 + scale_factor: 1.55258426 disable_first_stage_autocast: true - not_trainable_prefixes: [ 'all' ] ## Uncomment + not_trainable_prefixes: [ 'all' ] ## Uncomment to unlock log_keys: - - txt' + - txt - lora_config: ## Uncomment + lora_config: ## Uncomment to unlock target: sat.model.finetune.lora2.LoraMixin params: r: 256 ``` -### Modifying Run Scripts +### Modify the Run Script -Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` to select the configuration file. Below are two examples: +Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` and select the config file. Below are two examples: -1. If you want to use the `CogVideoX-2B` model and the `Lora` method, you need to modify `finetune_single_gpu.sh` - or `finetune_multi_gpus.sh`: +1. If you want to use the `CogVideoX-2B` model with `Lora`, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows: ``` run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM" ``` -2. If you want to use the `CogVideoX-2B` model and the `full-parameter fine-tuning` method, you need to - modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`: +2. 
If you want to use the `CogVideoX-2B` model with full fine-tuning, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows: ``` run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM" ``` -### Fine-Tuning and Evaluation +### Fine-tuning and Validation Run the inference code to start fine-tuning. @@ -385,45 +392,42 @@ bash finetune_single_gpu.sh # Single GPU bash finetune_multi_gpus.sh # Multi GPUs ``` -### Using the Fine-Tuned Model +### Using the Fine-tuned Model -The fine-tuned model cannot be merged; here is how to modify the inference configuration file `inference.sh`: +The fine-tuned model cannot be merged. Here’s how to modify the inference configuration file `inference.sh` ``` -run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42" +run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42" ``` -Then, execute the code: +Then, run the code: ``` bash inference.sh ``` -### Converting to Huggingface Diffusers Supported Weights +### Converting to Huggingface Diffusers-compatible Weights -The SAT weight format is different from Huggingface's weight format and needs to be converted. Please run: +The SAT weight format is different from Huggingface’s format and requires conversion. Run -```shell +``` python ../tools/convert_weight_sat2hf.py ``` -### Exporting Huggingface Diffusers lora LoRA Weights from SAT Checkpoints +### Exporting Lora Weights from SAT to Huggingface Diffusers -After completing the training using the above steps, we get a SAT checkpoint with LoRA weights. You can find the file -at `{args.save}/1000/1000/mp_rank_00_model_states.pt`. +Support is provided for exporting Lora weights from SAT to Huggingface Diffusers format. 
+ After training with the above steps, you’ll find the SAT model with Lora weights in {args.save}/1000/1000/mp_rank_00_model_states.pt -The script for exporting LoRA weights can be found in the CogVideoX repository at `tools/export_sat_lora_weight.py`. -After exporting, you can use `load_cogvideox_lora.py` for inference. +The export script `export_sat_lora_weight.py` is located in the CogVideoX repository under `tools/`. After exporting, use `load_cogvideox_lora.py` for inference. Export command: -```bash -python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/ +``` +python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/ ``` -This training mainly modified the following model structures. The table below lists the corresponding structure mappings -for converting to the HF (Hugging Face) format LoRA structure. As you can see, LoRA adds a low-rank weight to the -model's attention structure. +The following model structures were modified during training. Here is the mapping between SAT and HF Lora structures. Lora adds a low-rank weight to the attention structure of the model. ``` 'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight', @@ -436,5 +440,5 @@ model's attention structure. 'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight' ``` -Using export_sat_lora_weight.py, you can convert the SAT checkpoint into the HF LoRA format. -![alt text](../resources/hf_lora_weights.png) +Using `export_sat_lora_weight.py` will convert these to the HF format Lora structure. 
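For the keys listed above, the export is essentially a rename pass over the checkpoint's state dict. A toy sketch using just the two mappings shown here — the full table lives in `tools/export_sat_lora_weight.py`, so do not treat this subset as exhaustive:

```python
# Partial SAT -> HF Lora key map, taken from the table above. The complete
# mapping is in tools/export_sat_lora_weight.py; this subset is for illustration.
SAT_TO_HF = {
    "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
    "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight",
}

def rename_lora_keys(state_dict: dict) -> dict:
    """Rename key suffixes per SAT_TO_HF, preserving any layer prefix
    (e.g. 'transformer.layers.0.'); unmatched keys pass through unchanged."""
    renamed = {}
    for key, value in state_dict.items():
        for sat_suffix, hf_suffix in SAT_TO_HF.items():
            if key.endswith(sat_suffix):
                key = key[: -len(sat_suffix)] + hf_suffix
                break
        renamed[key] = value
    return renamed
```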
+![alt text](../resources/hf_lora_weights.png) \ No newline at end of file diff --git a/sat/README_ja.md b/sat/README_ja.md index ee1abcd..3685ba3 100644 --- a/sat/README_ja.md +++ b/sat/README_ja.md @@ -1,27 +1,37 @@ -# SAT CogVideoX-2B +# SAT CogVideoX -[Read this in English.](./README_zh) +[Read this in English.](./README.md) [䞭文阅读](./README_zh.md) -このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer) りェむトを䜿甚した掚論コヌドず、SAT -りェむトのファむンチュヌニングコヌドが含たれおいたす。 - -このコヌドは、チヌムがモデルをトレヌニングするために䜿甚したフレヌムワヌクです。コメントが少なく、泚意深く研究する必芁がありたす。 +このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer)の重みを䜿甚した掚論コヌドず、SAT重みのファむンチュヌニングコヌドが含たれおいたす。 +このコヌドは、チヌムがモデルを蚓緎する際に䜿甚したフレヌムワヌクです。コメントが少ないため、泚意深く確認する必芁がありたす。 ## 掚論モデル -### 1. このフォルダに必芁な䟝存関係が正しくむンストヌルされおいるこずを確認しおください。 +### 1. このフォルダ内の必芁な䟝存関係がすべおむンストヌルされおいるこずを確認しおください -```shell +``` pip install -r requirements.txt ``` -### 2. モデルりェむトをダりンロヌドしたす +### 2. モデルの重みをダりンロヌド + たず、SATミラヌからモデルの重みをダりンロヌドしおください。 -たず、SAT ミラヌに移動しおモデルの重みをダりンロヌドしたす。 CogVideoX-2B モデルの堎合は、次のようにダりンロヌドしおください。 +#### CogVideoX1.5 モデル -```shell +``` +git lfs install +git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT +``` + +これにより、Transformers、VAE、T5 Encoderの3぀のモデルがダりンロヌドされたす。 + +#### CogVideoX モデル + +CogVideoX-2B モデルに぀いおは、以䞋のようにダりンロヌドしおください + +``` mkdir CogVideoX-2b-sat cd CogVideoX-2b-sat wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1 @@ -32,12 +42,12 @@ mv 'index.html?dl=1' transformer.zip unzip transformer.zip ``` -CogVideoX-5B モデルの `transformers` ファむルを以䞋のリンクからダりンロヌドしおください VAE ファむルは 2B ず同じです +CogVideoX-5B モデルの `transformers` ファむルをダりンロヌドしおくださいVAEファむルは2Bず同じです + [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) + [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list) -次に、モデルファむルを以䞋の圢匏にフォヌマットする必芁がありたす +モデルファむルを以䞋のように配眮しおください ``` . 
@@ -49,24 +59,24 @@ CogVideoX-5B モデルの `transformers` ファむルを以䞋のリンクから └── 3d-vae.pt ``` -モデルの重みファむルが倧きいため、`git lfs`を䜿甚するこずをお勧めいたしたす。`git lfs` -のむンストヌルに぀いおは、[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)をご参照ください。 +モデルの重みファむルが倧きいため、`git lfs`の䜿甚をお勧めしたす。 +`git lfs`のむンストヌル方法は[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)を参照しおください。 -```shell +``` git lfs install ``` -次に、T5 モデルをクロヌンしたす。これはトレヌニングやファむンチュヌニングには䜿甚されたせんが、䜿甚する必芁がありたす。 -> モデルを耇補する際には、[Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)のモデルファむルの堎所もご䜿甚いただけたす。 +次に、T5モデルをクロヌンしたす。このモデルはEncoderずしおのみ䜿甚され、蚓緎やファむンチュヌニングは必芁ありたせん。 +> [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)䞊のモデルファむルも䜿甚可胜です。 -```shell -git clone https://huggingface.co/THUDM/CogVideoX-2b.git #ハギングフェむス(huggingface.org)からモデルをダりンロヌドいただきたす -# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git #Modelscopeからモデルをダりンロヌドいただきたす +``` +git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Huggingfaceからモデルをダりンロヌド +# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Modelscopeからダりンロヌド mkdir t5-v1_1-xxl mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl ``` -䞊蚘の方法に埓うこずで、safetensor 圢匏の T5 ファむルを取埗できたす。これにより、Deepspeed でのファむンチュヌニング䞭に゚ラヌが発生しないようにしたす。 +これにより、Deepspeedファむンチュヌニング䞭に゚ラヌなくロヌドできるsafetensor圢匏のT5ファむルが䜜成されたす。 ``` ├── added_tokens.json @@ -81,11 +91,11 @@ mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl 0 directories, 8 files ``` -### 3. `configs/cogvideox_2b.yaml` ファむルを倉曎したす。 +### 3. 
`configs/cogvideox_*.yaml`ファむルを線集 ```yaml model: - scale_factor: 1.15258426 + scale_factor: 1.55258426 disable_first_stage_autocast: true log_keys: - txt @@ -123,7 +133,7 @@ model: num_attention_heads: 30 transformer_args: - checkpoint_activations: True ## グラデヌション チェックポむントを䜿甚する + checkpoint_activations: True ## using gradient checkpointing vocab_size: 1 max_sequence_length: 64 layernorm_order: pre @@ -161,14 +171,14 @@ model: ucg_rate: 0.1 target: sgm.modules.encoders.modules.FrozenT5Embedder params: - model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxlフォルダの絶察パス + model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxl 重みフォルダの絶察パス max_length: 226 first_stage_config: target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper params: cp_size: 1 - ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.ptフォルダの絶察パス + ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.ptファむルの絶察パス ignore_keys: [ 'loss' ] loss_config: @@ -240,7 +250,7 @@ model: num_steps: 50 ``` -### 4. `configs/inference.yaml` ファむルを倉曎したす。 +### 4. 
`configs/inference.yaml`ファむルを線集 ```yaml args: @@ -250,38 +260,36 @@ args: # load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter batch_size: 1 - input_type: txt #TXTのテキストファむルを入力ずしお遞択されたり、CLIコマンドラむンを入力ずしお倉曎されたりいただけたす - input_file: configs/test.txt #テキストファむルのパスで、これに察しお線集がさせおいただけたす - sampling_num_frames: 13 # Must be 13, 11 or 9 + input_type: txt # "txt"でプレヌンテキスト入力、"cli"でコマンドラむン入力を遞択可胜 + input_file: configs/test.txt # プレヌンテキストファむル、線集可胜 + sampling_num_frames: 13 # CogVideoX1.5-5Bでは42たたは22、CogVideoX-5B / 2Bでは13, 11, たたは9 sampling_fps: 8 - fp16: True # For CogVideoX-2B - # bf16: True # For CogVideoX-5B + fp16: True # CogVideoX-2B甹 + # bf16: True # CogVideoX-5B甹 output_dir: outputs/ force_inference: True ``` -+ 耇数のプロンプトを保存するために txt を䜿甚する堎合は、`configs/test.txt` - を参照しお倉曎しおください。1行に1぀のプロンプトを蚘述したす。プロンプトの曞き方がわからない堎合は、最初に [このコヌド](../inference/convert_demo.py) - を䜿甚しお LLM によるリファむンメントを呌び出すこずができたす。 -+ コマンドラむンを入力ずしお䜿甚する堎合は、次のように倉曎したす。 ++ 耇数のプロンプトを含むテキストファむルを䜿甚する堎合、`configs/test.txt`を適宜線集しおください。1行に぀き1プロンプトです。プロンプトの曞き方が分からない堎合は、[こちらのコヌド](../inference/convert_demo.py)を䜿甚しおLLMで補正できたす。 ++ コマンドラむン入力を䜿甚する堎合、以䞋のように倉曎したす -```yaml +``` input_type: cli ``` これにより、コマンドラむンからプロンプトを入力できたす。 -出力ビデオのディレクトリを倉曎したい堎合は、次のように倉曎できたす +出力ビデオの保存堎所を倉曎する堎合は、以䞋を線集しおください -```yaml +``` output_dir: outputs/ ``` -デフォルトでは `.outputs/` フォルダに保存されたす。 +デフォルトでは`.outputs/`フォルダに保存されたす。 -### 5. 掚論コヌドを実行しお掚論を開始したす。 +### 5. 掚論コヌドを実行しお掚論を開始 -```shell +``` bash inference.sh ``` @@ -289,7 +297,7 @@ bash inference.sh ### デヌタセットの準備 -デヌタセットの圢匏は次のようになりたす +デヌタセットは以䞋の構造である必芁がありたす ``` . @@ -303,123 +311,215 @@ bash inference.sh ├── ... 
```

-各 txt ファむルは察応するビデオファむルず同じ名前であり、そのビデオのラベルを含んでいたす。各ビデオはラベルず䞀察䞀で察応する必芁がありたす。通垞、1぀のビデオに耇数のラベルを持たせるこずはありたせん。
+各txtファむルは察応するビデオファむルず同じ名前で、ビデオのラベルを含んでいたす。ビデオずラベルは䞀察䞀で察応させる必芁がありたす。通垞、1぀のビデオに耇数のラベルを䜿甚するこずは避けおください。

-スタむルファむンチュヌニングの堎合、少なくずも50本のスタむルが䌌たビデオずラベルを準備し、フィッティングを容易にしたす。
+スタむルのファむンチュヌニングの堎合、スタむルが䌌たビデオずラベルを少なくずも50本準備し、フィッティングを促進したす。

-### 蚭定ファむルの倉曎
+### 蚭定ファむルの線集

-`Lora` ずフルパラメヌタ埮調敎の2぀の方法をサポヌトしおいたす。䞡方の埮調敎方法は、`transformer` 郚分のみを埮調敎し、`VAE`
-郚分には倉曎を加えないこずに泚意しおください。`T5` ぱンコヌダヌずしおのみ䜿甚されたす。以䞋のように `configs/sft.yaml` (
-フルパラメヌタ埮調敎甚) ファむルを倉曎しおください。
+`Lora`ず党パラメヌタのファむンチュヌニングの2皮類をサポヌトしおいたす。どちらも`transformer`郚分のみをファむンチュヌニングし、`VAE`郚分は倉曎されず、`T5`ぱンコヌダヌずしおのみ䜿甚されたす。

以䞋のようにしお`configs/sft.yaml`党量ファむンチュヌニングファむルを線集しおください

```
-  # checkpoint_activations: True ## 募配チェックポむントを䜿甚する堎合 (蚭定ファむル内の2぀の checkpoint_activations を True に蚭定する必芁がありたす)
+  # checkpoint_activations: True ## using gradient checkpointing (configファむル内の2぀の`checkpoint_activations`ã‚’äž¡æ–¹Trueに蚭定)
  model_parallel_size: 1 # モデル䞊列サむズ
-  experiment_name: lora-disney # 実隓名 (倉曎しないでください)
-  mode: finetune # モヌド (倉曎しないでください)
-  load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer モデルのパス
-  no_load_rng: True # 乱数シヌドを読み蟌むかどうか
+  experiment_name: lora-disney # 実隓名倉曎䞍芁
+  mode: finetune # モヌド倉曎䞍芁
+  load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformerモデルのパス
+  no_load_rng: True # 乱数シヌドをロヌドするかどうか
  train_iters: 1000 # トレヌニングむテレヌション数
-  eval_iters: 1 # 評䟡むテレヌション数
-  eval_interval: 100 # 評䟡間隔
-  eval_batch_size: 1 # 評䟡バッチサむズ
-  save: ckpts # モデル保存パス
-  save_interval: 100 # モデル保存間隔
+  eval_iters: 1 # 怜蚌むテレヌション数
+  eval_interval: 100 # 怜蚌間隔
+  eval_batch_size: 1 # 怜蚌バッチサむズ
+  save: ckpts # モデル保存パス
+  save_interval: 100 # 保存間隔
  log_interval: 20 # ログ出力間隔
  train_data: [ "your train data path" ]
-  valid_data: [ "your val data path" ] # トレヌニングデヌタず評䟡デヌタは同じでも構いたせん
-  split: 1,0,0 # トレヌニングセット、評䟡セット、テストセットの割合
-  num_workers: 8 # デヌタロヌダヌのワヌカヌスレッド数
-  force_train: True # チェックポむントをロヌドするずきに欠萜したキヌを蚱可 (T5 ず VAE は別々にロヌドされたす)
-  only_log_video_latents: 
True # VAE のデコヌドによるメモリオヌバヌヘッドを回避 + valid_data: [ "your val data path" ] # トレヌニングセットず怜蚌セットは同じでも構いたせん + split: 1,0,0 # トレヌニングセット、怜蚌セット、テストセットの割合 + num_workers: 8 # デヌタロヌダヌのワヌカヌ数 + force_train: True # チェックポむントをロヌドする際に`missing keys`を蚱可T5ずVAEは別途ロヌド + only_log_video_latents: True # VAEのデコヌドによるメモリ䜿甚量を抑える deepspeed: bf16: - enabled: False # CogVideoX-2B の堎合は False に蚭定し、CogVideoX-5B の堎合は True に蚭定 + enabled: False # CogVideoX-2B 甚は False、CogVideoX-5B 甚は True に蚭定 fp16: - enabled: True # CogVideoX-2B の堎合は True に蚭定し、CogVideoX-5B の堎合は False に蚭定 + enabled: True # CogVideoX-2B 甚は True、CogVideoX-5B 甚は False に蚭定 +``` +```yaml +args: + latent_channels: 16 + mode: inference + load: "{absolute_path/to/your}/transformer" # Absolute path to CogVideoX-2b-sat/transformer folder + # load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter + + batch_size: 1 + input_type: txt # You can choose "txt" for plain text input or change to "cli" for command-line input + input_file: configs/test.txt # Plain text file, can be edited + sampling_num_frames: 13 # For CogVideoX1.5-5B it must be 42 or 22. For CogVideoX-5B / 2B, it must be 13, 11, or 9. + sampling_fps: 8 + fp16: True # For CogVideoX-2B + # bf16: True # For CogVideoX-5B + output_dir: outputs/ + force_inference: True ``` -Lora 埮調敎を䜿甚したい堎合は、`cogvideox__lora` ファむルも倉曎する必芁がありたす。 - -ここでは、`CogVideoX-2B` を参考にしたす。 ++ If using a text file to save multiple prompts, modify `configs/test.txt` as needed. One prompt per line. If you are unsure how to write prompts, use [this code](../inference/convert_demo.py) to call an LLM for refinement. ++ To use command-line input, modify: ``` +input_type: cli +``` + +This allows you to enter prompts from the command line. + +To modify the output video location, change: + +``` +output_dir: outputs/ +``` + +The default location is the `.outputs/` folder. + +### 5. 
Run the Inference Code to Perform Inference + +``` +bash inference.sh +``` + +## Fine-tuning the Model + +### Preparing the Dataset + +The dataset should be structured as follows: + +``` +. +├── labels +│ ├── 1.txt +│ ├── 2.txt +│ ├── ... +└── videos + ├── 1.mp4 + ├── 2.mp4 + ├── ... +``` + +Each txt file should have the same name as the corresponding video file and contain the label for that video. The videos and labels should correspond one-to-one. Generally, avoid using one video with multiple labels. + +For style fine-tuning, prepare at least 50 videos and labels with a similar style to facilitate fitting. + +### Modifying the Configuration File + +We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Note that both methods only fine-tune the `transformer` part. The `VAE` part is not modified, and `T5` is only used as an encoder. +Modify the files in `configs/sft.yaml` (full fine-tuning) as follows: + +```yaml + # checkpoint_activations: True ## using gradient checkpointing (both `checkpoint_activations` in the config file need to be set to True) + model_parallel_size: 1 # Model parallel size + experiment_name: lora-disney # Experiment name (do not change) + mode: finetune # Mode (do not change) + load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to Transformer model + no_load_rng: True # Whether to load random number seed + train_iters: 1000 # Training iterations + eval_iters: 1 # Evaluation iterations + eval_interval: 100 # Evaluation interval + eval_batch_size: 1 # Evaluation batch size + save: ckpts # Model save path + save_interval: 100 # Save interval + log_interval: 20 # Log output interval + train_data: [ "your train data path" ] + valid_data: [ "your val data path" ] # Training and validation sets can be the same + split: 1,0,0 # Proportion for training, validation, and test sets + num_workers: 8 # Number of data loader workers + force_train: True # Allow missing keys when loading checkpoint (T5 and VAE loaded separately) + 
only_log_video_latents: True # Avoid memory usage from VAE decoding
+  deepspeed:
+    bf16:
+      enabled: False # Set to False for CogVideoX-2B, True for CogVideoX-5B
+    fp16:
+      enabled: True # Set to True for CogVideoX-2B, False for CogVideoX-5B
+```
+
+To use Lora fine-tuning, you also need to modify the `cogvideox__lora` file:
+
+Here's an example using `CogVideoX-2B`:
+
+```yaml
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
-  not_trainable_prefixes: [ 'all' ] ## コメントを解陀
+  not_trainable_prefixes: [ 'all' ] ## Uncomment to unlock
  log_keys:
-    - txt'
+    - txt

-  lora_config: ## コメントを解陀
+  lora_config: ## Uncomment to unlock
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
```

-### 実行スクリプトの倉曎
+### Modify the Run Script

-蚭定ファむルを遞択するために `finetune_single_gpu.sh` たたは `finetune_multi_gpus.sh` を線集したす。以䞋に2぀の䟋を瀺したす。
+Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` and select the config file. Below are two examples:

-1. `CogVideoX-2B` モデルを䜿甚し、`Lora` 手法を利甚する堎合は、`finetune_single_gpu.sh` たたは `finetune_multi_gpus.sh`
-   を倉曎する必芁がありたす。
+1. If you want to use the `CogVideoX-2B` model with `Lora`, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:

```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```

-2. `CogVideoX-2B` モデルを䜿甚し、`フルパラメヌタ埮調敎` 手法を利甚する堎合は、`finetune_single_gpu.sh`
-   たたは `finetune_multi_gpus.sh` を倉曎する必芁がありたす。
+2. If you want to use the `CogVideoX-2B` model with full fine-tuning, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:

```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```

-### 埮調敎ず評䟡
+### Fine-tuning and Validation

-掚論コヌドを実行しお埮調敎を開始したす。
+Run the fine-tuning script to start fine-tuning.

``` -bash finetune_single_gpu.sh # シングルGPU -bash finetune_multi_gpus.sh # マルチGPU +bash finetune_single_gpu.sh # Single GPU +bash finetune_multi_gpus.sh # Multi GPUs ``` -### 埮調敎埌のモデルの䜿甚 +### Using the Fine-tuned Model -埮調敎されたモデルは統合できたせん。ここでは、掚論蚭定ファむル `inference.sh` を倉曎する方法を瀺したす。 +The fine-tuned model cannot be merged. Here’s how to modify the inference configuration file `inference.sh` ``` -run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42" +run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42" ``` -その埌、次のコヌドを実行したす。 +Then, run the code: ``` bash inference.sh ``` -### Huggingface Diffusers サポヌトのりェむトに倉換 +### Converting to Huggingface Diffusers-compatible Weights -SAT りェむト圢匏は Huggingface のりェむト圢匏ず異なり、倉換が必芁です。次のコマンドを実行しおください +The SAT weight format is different from Huggingface’s format and requires conversion. Run -```shell +``` python ../tools/convert_weight_sat2hf.py ``` -### SATチェックポむントからHuggingface Diffusers lora LoRAりェむトを゚クスポヌト +### Exporting Lora Weights from SAT to Huggingface Diffusers -䞊蚘のステップを完了するず、LoRAりェむト付きのSATチェックポむントが埗られたす。ファむルは `{args.save}/1000/1000/mp_rank_00_model_states.pt` にありたす。 +Support is provided for exporting Lora weights from SAT to Huggingface Diffusers format. +After training with the above steps, you’ll find the SAT model with Lora weights in {args.save}/1000/1000/mp_rank_00_model_states.pt -LoRAりェむトを゚クスポヌトするためのスクリプトは、CogVideoXリポゞトリの `tools/export_sat_lora_weight.py` にありたす。゚クスポヌト埌、`load_cogvideox_lora.py` を䜿甚しお掚論を行うこずができたす。 +The export script `export_sat_lora_weight.py` is located in the CogVideoX repository under `tools/`. After exporting, use `load_cogvideox_lora.py` for inference. 
-゚クスポヌトコマンド: +Export command: -```bash -python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/ +``` +python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/ ``` -このトレヌニングでは䞻に以䞋のモデル構造が倉曎されたした。以䞋の衚は、HF (Hugging Face) 圢匏のLoRA構造に倉換する際の察応関係を瀺しおいたす。ご芧の通り、LoRAはモデルの泚意メカニズムに䜎ランクの重みを远加しおいたす。 +The following model structures were modified during training. Here is the mapping between SAT and HF Lora structures. Lora adds a low-rank weight to the attention structure of the model. ``` 'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight', @@ -431,8 +531,6 @@ python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_nam 'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight', 'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight' ``` - -export_sat_lora_weight.py を䜿甚しお、SATチェックポむントをHF LoRA圢匏に倉換できたす。 - -![alt text](../resources/hf_lora_weights.png) +Using `export_sat_lora_weight.py` will convert these to the HF format Lora structure. 
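The table above maps SAT attention parameter names onto diffusers-style LoRA names, with the remaining q/k/v entries elided in this README. A minimal stand-alone sketch of that renaming follows; note this is not the repo's `export_sat_lora_weight.py`, the `remap_lora_keys` helper and the example key are hypothetical, and only the suffix pairs shown above are included:

```python
# Illustrative sketch of the SAT -> HF LoRA key remapping described above.
# Only the suffix pairs quoted in this README are listed; the real export
# script covers the full q/k/v set.
SAT_TO_HF = {
    "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
    "attention.dense.matrix_A.0": "attn1.to_out.0.lora_A.weight",
    "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight",
}

def remap_lora_keys(sat_state_dict: dict) -> dict:
    """Rename SAT-style LoRA keys to diffusers-style keys, keeping the layer prefix."""
    hf_state_dict = {}
    for key, tensor in sat_state_dict.items():
        for sat_suffix, hf_suffix in SAT_TO_HF.items():
            if key.endswith(sat_suffix):
                prefix = key[: -len(sat_suffix)]
                hf_state_dict[prefix + hf_suffix] = tensor
                break
        else:
            hf_state_dict[key] = tensor  # non-LoRA keys pass through unchanged
    return hf_state_dict

# Example with a single (hypothetical) checkpoint key:
sd = {"transformer.layers.0.attention.dense.matrix_A.0": "W"}
print(remap_lora_keys(sd))
# {'transformer.layers.0.attn1.to_out.0.lora_A.weight': 'W'}
```

In practice, `tools/export_sat_lora_weight.py` applies this kind of remapping to the full checkpoint before saving it in a diffusers-loadable layout.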
+![alt text](../resources/hf_lora_weights.png)
\ No newline at end of file
diff --git a/sat/README_zh.md b/sat/README_zh.md
index c605da8..c25c6b7 100644
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@@ -1,6 +1,6 @@
-# SAT CogVideoX-2B
+# SAT CogVideoX

-[Read this in English.](./README_zh)
+[Read this in English.](./README.md)

[日本語で読む](./README_ja.md)

@@ -20,6 +20,15 @@
pip install -r requirements.txt

銖先前埀 SAT 镜像䞋蜜暡型权重。

+#### CogVideoX1.5 æš¡åž‹
+
+```shell
+git lfs install
+git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
+```
+歀操䜜䌚䞋蜜 Transformers, VAE, T5 Encoder 这䞉䞪暡型。
+
+#### CogVideoX æš¡åž‹
对于 CogVideoX-2B 暡型请按照劂䞋方匏䞋蜜:

```shell
@@ -82,11 +91,11 @@ mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
0 directories, 8 files
```

-### 3. 修改`configs/cogvideox_2b.yaml`䞭的文件。
+### 3. 修改`configs/cogvideox_*.yaml`䞭的文件。

```yaml
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt
@@ -253,7 +262,7 @@ args:
  batch_size: 1
  input_type: txt #可以选择txt纯文字档䜜䞺蟓入或者改成cli呜什行䜜䞺蟓入
  input_file: configs/test.txt #纯文字档可以对歀做猖蟑
-  sampling_num_frames: 13 # Must be 13, 11 or 9
+  sampling_num_frames: 13 #CogVideoX1.5-5B 必须是 42 或 22。CogVideoX-5B / 2B 必须是 13、11 或 9。
  sampling_fps: 8
  fp16: True # For CogVideoX-2B
  # bf16: True # For CogVideoX-5B
@@ -346,7 +355,7 @@ Encoder 䜿甚。

```yaml
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ] ## 解陀泚释
  log_keys:
diff --git a/sat/configs/cogvideox1.5_5b.yaml b/sat/configs/cogvideox1.5_5b.yaml
new file mode 100644
index 0000000..0000ec2
--- /dev/null
+++ b/sat/configs/cogvideox1.5_5b.yaml
@@ -0,0 +1,149 @@
+model:
+  scale_factor: 0.7
+  disable_first_stage_autocast: true
+  latent_input: true
+  log_keys:
+    - txt
+
+  denoiser_config:
+    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
+    params:
+      num_idx: 1000
+      quantize_c_noise: False
+
+      weighting_config:
+        target: 
sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting + scaling_config: + target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling + discretization_config: + target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization + + network_config: + target: dit_video_concat.DiffusionTransformer + params: + time_embed_dim: 512 + elementwise_affine: True + num_frames: 81 + time_compressed_rate: 4 + latent_width: 300 + latent_height: 300 + num_layers: 42 + patch_size: [2, 2, 2] + in_channels: 16 + out_channels: 16 + hidden_size: 3072 + adm_in_channels: 256 + num_attention_heads: 48 + + transformer_args: + checkpoint_activations: True + vocab_size: 1 + max_sequence_length: 64 + layernorm_order: pre + skip_init: false + model_parallel_size: 1 + is_decoder: false + + modules: + pos_embed_config: + target: dit_video_concat.Rotary3DPositionEmbeddingMixin + params: + hidden_size_head: 64 + text_length: 224 + + patch_embed_config: + target: dit_video_concat.ImagePatchEmbeddingMixin + params: + text_hidden_size: 4096 + + adaln_layer_config: + target: dit_video_concat.AdaLNMixin + params: + qk_ln: True + + final_layer_config: + target: dit_video_concat.FinalLayerMixin + + conditioner_config: + target: sgm.modules.GeneralConditioner + params: + emb_models: + - is_trainable: false + input_key: txt + ucg_rate: 0.1 + target: sgm.modules.encoders.modules.FrozenT5Embedder + params: + model_dir: "google/t5-v1_1-xxl" + max_length: 224 + + + first_stage_config: + target : vae_modules.autoencoder.VideoAutoencoderInferenceWrapper + params: + cp_size: 1 + ckpt_path: "cogvideox-5b-sat/vae/3d-vae.pt" + ignore_keys: ['loss'] + + loss_config: + target: torch.nn.Identity + + regularizer_config: + target: vae_modules.regularizers.DiagonalGaussianRegularizer + + encoder_config: + target: vae_modules.cp_enc_dec.ContextParallelEncoder3D + params: + double_z: true + z_channels: 16 + resolution: 256 + in_channels: 3 + out_ch: 3 + ch: 128 + ch_mult: [1, 2, 2, 4] + 
attn_resolutions: [] + num_res_blocks: 3 + dropout: 0.0 + gather_norm: True + + decoder_config: + target: vae_modules.cp_enc_dec.ContextParallelDecoder3D + params: + double_z: True + z_channels: 16 + resolution: 256 + in_channels: 3 + out_ch: 3 + ch: 128 + ch_mult: [1, 2, 2, 4] + attn_resolutions: [] + num_res_blocks: 3 + dropout: 0.0 + gather_norm: True + + loss_fn_config: + target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss + params: + offset_noise_level: 0 + sigma_sampler_config: + target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling + params: + uniform_sampling: True + group_num: 40 + num_idx: 1000 + discretization_config: + target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization + + sampler_config: + target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler + params: + num_steps: 50 + verbose: True + + discretization_config: + target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization + guider_config: + target: sgm.modules.diffusionmodules.guiders.DynamicCFG + params: + scale: 6 + exp: 5 + num_steps: 50 diff --git a/sat/configs/cogvideox1.5_5b_i2v.yaml b/sat/configs/cogvideox1.5_5b_i2v.yaml new file mode 100644 index 0000000..c65f0b7 --- /dev/null +++ b/sat/configs/cogvideox1.5_5b_i2v.yaml @@ -0,0 +1,160 @@ +model: + scale_factor: 0.7 + disable_first_stage_autocast: true + latent_input: false + noised_image_input: true + noised_image_all_concat: false + noised_image_dropout: 0.05 + augmentation_dropout: 0.15 + log_keys: + - txt + + denoiser_config: + target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser + params: + num_idx: 1000 + quantize_c_noise: False + + weighting_config: + target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting + scaling_config: + target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling + discretization_config: + target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization + + network_config: + target: 
dit_video_concat.DiffusionTransformer + params: +# space_interpolation: 1.875 + ofs_embed_dim: 512 + time_embed_dim: 512 + elementwise_affine: True + num_frames: 81 + time_compressed_rate: 4 + latent_width: 300 + latent_height: 300 + num_layers: 42 + patch_size: [2, 2, 2] + in_channels: 32 + out_channels: 16 + hidden_size: 3072 + adm_in_channels: 256 + num_attention_heads: 48 + + transformer_args: + checkpoint_activations: True + vocab_size: 1 + max_sequence_length: 64 + layernorm_order: pre + skip_init: false + model_parallel_size: 1 + is_decoder: false + + modules: + pos_embed_config: + target: dit_video_concat.Rotary3DPositionEmbeddingMixin + params: + hidden_size_head: 64 + text_length: 224 + + patch_embed_config: + target: dit_video_concat.ImagePatchEmbeddingMixin + params: + text_hidden_size: 4096 + + + adaln_layer_config: + target: dit_video_concat.AdaLNMixin + params: + qk_ln: True + + final_layer_config: + target: dit_video_concat.FinalLayerMixin + + conditioner_config: + target: sgm.modules.GeneralConditioner + params: + emb_models: + + - is_trainable: false + input_key: txt + ucg_rate: 0.1 + target: sgm.modules.encoders.modules.FrozenT5Embedder + params: + model_dir: "google/t5-v1_1-xxl" + max_length: 224 + + + first_stage_config: + target : vae_modules.autoencoder.VideoAutoencoderInferenceWrapper + params: + cp_size: 1 + ckpt_path: "cogvideox-5b-i2v-sat/vae/3d-vae.pt" + ignore_keys: ['loss'] + + loss_config: + target: torch.nn.Identity + + regularizer_config: + target: vae_modules.regularizers.DiagonalGaussianRegularizer + + encoder_config: + target: vae_modules.cp_enc_dec.ContextParallelEncoder3D + params: + double_z: true + z_channels: 16 + resolution: 256 + in_channels: 3 + out_ch: 3 + ch: 128 + ch_mult: [1, 2, 2, 4] + attn_resolutions: [] + num_res_blocks: 3 + dropout: 0.0 + gather_norm: True + + decoder_config: + target: vae_modules.cp_enc_dec.ContextParallelDecoder3D + params: + double_z: True + z_channels: 16 + resolution: 256 + in_channels: 3 + 
out_ch: 3 + ch: 128 + ch_mult: [1, 2, 2, 4] + attn_resolutions: [] + num_res_blocks: 3 + dropout: 0.0 + gather_norm: True + + loss_fn_config: + target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss + params: + fixed_frames: 0 + offset_noise_level: 0.0 + sigma_sampler_config: + target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling + params: + uniform_sampling: True + group_num: 40 + num_idx: 1000 + discretization_config: + target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization + + sampler_config: + target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler + params: + fixed_frames: 0 + num_steps: 50 + verbose: True + + discretization_config: + target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization + + guider_config: + target: sgm.modules.diffusionmodules.guiders.DynamicCFG + params: + scale: 6 + exp: 5 + num_steps: 50 \ No newline at end of file diff --git a/sat/configs/test.txt b/sat/configs/test.txt index 8d035c0..94ad730 100644 --- a/sat/configs/test.txt +++ b/sat/configs/test.txt @@ -1,4 +1,4 @@ In the haunting backdrop of a warIn the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict. The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. 
The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
\ No newline at end of file