diff --git a/README_zh.md b/README_zh.md
index 2e26810..61453aa 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -21,24 +21,26 @@
## 项目更新
+- 🔥 **News**: ``2024/8/7``: CogVideoX 已经合并入 `diffusers` 0.30.0版本,单张3090可以推理,详情请见[代码](inference/cli_demo.py)。
- 🔥 **News**: ``2024/8/6``: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。
- 🔥 **News**: ``2024/8/6``: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
-- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于 Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
-**性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
+- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于
+ Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
+ **性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
## 目录
跳转到指定部分:
- [快速开始](#快速开始)
- - [SAT](#sat)
- - [Diffusers](#Diffusers)
+ - [SAT](#sat)
+ - [Diffusers](#Diffusers)
- [CogVideoX-2B 视频作品](#cogvideox-2b-视频作品)
- [CogVideoX模型介绍](#模型介绍)
- [完整项目代码结构](#完整项目代码结构)
- - [Inference](#inference)
- - [SAT](#sat)
- - [Tools](#tools)
+ - [Inference](#inference)
+ - [SAT](#sat)
+ - [Tools](#tools)
- [开源项目规划](#开源项目规划)
- [模型协议](#模型协议)
- [CogVideo(ICLR'23)模型介绍](#cogvideoiclr23)
@@ -53,8 +55,9 @@
### SAT
-查看sat文件夹下的[sat_demo](sat/README.md):包含了 SAT 权重的推理代码和微调代码,推荐基于此代码进行 CogVideoX 模型结构的改进,研究者使用该代码可以更好的进行快速的迭代和开发。
- (18 GB 推理, 40GB lora微调)
+查看sat文件夹下的[sat_demo](sat/README.md):包含了 SAT 权重的推理代码和微调代码,推荐基于此代码进行 CogVideoX
+模型结构的改进,研究者使用该代码可以更好地进行快速迭代和开发。
+(18GB 推理,40GB LoRA 微调)
### Diffusers
@@ -64,7 +67,6 @@ pip install -r requirements.txt
查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(36GB 推理,显存优化以及微调代码正在开发)
-
## CogVideoX-2B 视频作品
@@ -93,19 +95,19 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
下表战展示目前我们提供的视频生成模型列表,以及相关基础信息:
-| 模型名字 | CogVideoX-2B |
-|---------------------|--------------------------------------------------------------------------------------------------------------------------------------|
-| 提示词语言 | English |
-| 推理显存消耗 (FP-16) | 36GB using diffusers (will be optimized before the PR is merged) and 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
-| 微调显存消耗 (bs=1) | 42GB |
-| 提示词长度上限 | 226 Tokens |
-| 视频长度 | 6 seconds |
-| 帧率(每秒) | 8 frames |
-| 视频分辨率 | 720 * 480 |
-| 量化推理 | 不支持 |
-| 多卡推理 | 不支持 |
-| 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
-| 下载地址 (SAT 模型) | [SAT](./sat/README_zh.md) |
+| 模型名 | CogVideoX-2B |
+|---------------------|-------------------------------------------------------------------------------------------------------------------------------|
+| 提示词语言 | English |
+| 单GPU推理 (FP-16) 显存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) <br> 23.9GB using diffusers                                   |
+| 多GPU推理 (FP-16) 显存消耗 | 20GB minimum per GPU using diffusers |
+| 微调显存消耗 (bs=1) | 42GB |
+| 提示词长度上限 | 226 Tokens |
+| 视频长度 | 6 seconds |
+| 帧率(每秒) | 8 frames |
+| 视频分辨率 | 720 * 480 |
+| 量化推理 | 不支持 |
+| 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
+| 下载地址 (SAT 模型) | [SAT](./sat/README_zh.md) |
## 完整项目代码结构
@@ -115,7 +117,8 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
+ [diffusers_demo](inference/cli_demo.py): 更详细的推理代码讲解,常见参数的意义,在这里都会提及。
+ [diffusers_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码,目前需要71GB显存,将来会优化。
-+ [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合 CogVideoX的长输入。因为CogVideoX是在长文本上训练的,所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4,也可以替换为GPT、Gemini等任意大语言模型。
++ [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合
+ CogVideoX的长输入。因为CogVideoX是在长文本上训练的,所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4,也可以替换为GPT、Gemini等任意大语言模型。
+ [gradio_web_demo](inference/gradio_web_demo.py): 一个简单的gradio网页应用,展示如何使用 CogVideoX-2B 模型生成视频。
@@ -140,9 +143,10 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): 将 SAT 模型权重转换为 Huggingface 模型权重。
+ [caption_demo](tools/caption/README_zh.md): Caption 工具,对视频理解并用文字输出的模型。
+## CogVideo(ICLR'23)
-## CogVideo(ICLR'23)
- [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) 的官方repo位于[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)。
+[CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
+的官方repo位于[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)。
**CogVideo可以生成高帧率视频,下面展示了一个32帧的4秒视频。**
@@ -155,11 +159,12 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
-CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/)。您可以在这里体验文本到视频生成。*原始输入为中文。*
+CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/)。您可以在这里体验文本到视频生成。
+*原始输入为中文。*
## 引用
-🌟 如果您发现我们的工作有所帮助,欢迎引用我们的文章,留下宝贵的stars
+🌟 如果您发现我们的工作有所帮助,欢迎引用我们的文章,留下宝贵的stars
```
@article{yang2024cogvideox,
diff --git a/inference/cli_demo.py b/inference/cli_demo.py
index a1bb764..0358ce7 100644
--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -22,7 +22,7 @@ from diffusers import CogVideoXPipeline
def export_to_video_imageio(
- video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
+ video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
) -> str:
"""
Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX)
@@ -38,17 +38,34 @@ def export_to_video_imageio(
def generate_video(
- prompt: str,
- model_path: str,
- output_path: str = "./output.mp4",
- num_inference_steps: int = 50,
- guidance_scale: float = 6.0,
- num_videos_per_prompt: int = 1,
- device: str = "cuda",
- dtype: torch.dtype = torch.float16,
+ prompt: str,
+ model_path: str,
+ output_path: str = "./output.mp4",
+ num_inference_steps: int = 50,
+ guidance_scale: float = 6.0,
+ num_videos_per_prompt: int = 1,
+ device: str = "cuda",
+ dtype: torch.dtype = torch.float16,
):
+ """
+ Generates a video based on the given prompt and saves it to the specified path.
+
+ Parameters:
+ - prompt (str): The description of the video to be generated.
+ - model_path (str): The path of the pre-trained model to be used.
+ - output_path (str): The path where the generated video will be saved.
+ - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
+ - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
+ - num_videos_per_prompt (int): Number of videos to generate per prompt.
+ - device (str): The device to use for computation (e.g., "cuda" or "cpu").
+ - dtype (torch.dtype): The data type for computation (default is torch.float16).
+ """
+
# Load the pre-trained CogVideoX pipeline with the specified precision (float16) and move it to the specified device
- pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype).to(device)
+    # To enable multi-GPU inference (2 or more GPUs, each with more than 20GB of memory),
+    # pass device_map="balanced" to from_pretrained and remove `pipe.enable_model_cpu_offload()`.
+ pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
+ pipe.enable_model_cpu_offload()
# Encode the prompt to get the prompt embeddings
prompt_embeds, _ = pipe.encode_prompt(
@@ -60,18 +77,19 @@ def generate_video(
device=device, # Device to use for computation
dtype=dtype, # Data type for computation
)
- # Must enable model CPU offload to avoid OOM issue on GPU with 24GB memory
- pipe.enable_model_cpu_offload()
+
# Generate the video frames using the pipeline
video = pipe(
- num_inference_steps=num_inference_steps, # Number of inference steps
+ num_inference_steps=5, # Number of inference steps
guidance_scale=guidance_scale, # Guidance scale for classifier-free guidance
prompt_embeds=prompt_embeds, # Encoded prompt embeddings
negative_prompt_embeds=torch.zeros_like(prompt_embeds), # Not Supported negative prompt
).frames[0]
+
# Export the generated frames to a video file. fps must be 8
export_to_video_imageio(video, output_path, fps=8)
+
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate a video from a text prompt using CogVideoX")
parser.add_argument("--prompt", type=str, required=True, help="The description of the video to be generated")
diff --git a/inference/gradio_web_demo.py b/inference/gradio_web_demo.py
index 9f36254..4b4cad0 100644
--- a/inference/gradio_web_demo.py
+++ b/inference/gradio_web_demo.py
@@ -16,7 +16,8 @@ import PIL
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"
-pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype).to(device)
+pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype)
+pipe.enable_model_cpu_offload()
sys_prompt = """You are part of a team of bots that creates videos. You work with an assistant bot that will draw anything you say in square brackets.
@@ -104,7 +105,7 @@ def infer(
device=device,
dtype=dtype,
)
- pipe.enable_model_cpu_offload()
+
video = pipe(
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
diff --git a/inference/streamlit_web_demo.py b/inference/streamlit_web_demo.py
index 6df62db..342d85b 100644
--- a/inference/streamlit_web_demo.py
+++ b/inference/streamlit_web_demo.py
@@ -39,7 +39,9 @@ def load_model(model_path: str, dtype: torch.dtype, device: str) -> CogVideoXPip
Returns:
- CogVideoXPipeline: Loaded model pipeline.
"""
- return CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype).to(device)
+ pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
+ pipe.enable_model_cpu_offload()
+ return pipe
# Define a function to generate video based on the provided prompt and model path
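All three demos in this series switch from `pipe.to(device)` to `pipe.enable_model_cpu_offload()`. The sketch below (not part of the patch) illustrates that trade-off: offloading keeps only the currently executing sub-model on the GPU, which lowers peak VRAM (the READMEs quote roughly 24GB) at some cost in speed. The prompt is a placeholder and the printed figure depends on hardware.

```python
# Illustrative only -- the single-GPU, low-memory loading pattern used by the demos above.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)

# Instead of pipe.to("cuda"), which keeps the text encoder, transformer and VAE resident
# on the GPU for the whole run, model CPU offload moves each sub-model onto the GPU only
# while it executes and back to CPU RAM afterwards.
pipe.enable_model_cpu_offload()

video = pipe(prompt="A sailboat drifting at sunset.", num_inference_steps=50).frames[0]
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.1f} GB")
```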
From 9ffa0bea284935e37fe155e6b2a1c2db2f95c6fc Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:28:54 +0800
Subject: [PATCH 07/11] 2
---
README.md | 6 +++---
README_zh.md | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/README.md b/README.md
index fba09da..8ce33cf 100644
--- a/README.md
+++ b/README.md
@@ -20,11 +20,11 @@
## Update and News
-- 🔥 **News**: `2024/8/7`: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed
+- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed
on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
-- 🔥 **News**: ``2024/8/6``: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct
+- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct
the video almost losslessly.
-- 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**,the first model in the CogVideoX series of video
+- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video
generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now you can see in `CogVideo` branch),the **first**
open-sourced pretrained text-to-video model, and you can
diff --git a/README_zh.md b/README_zh.md
index 61453aa..274090a 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -21,9 +21,9 @@
## 项目更新
-- 🔥 **News**: ``2024/8/7``: CogVideoX 已经合并入 `diffusers` 0.30.0版本,单张3090可以推理,详情请见[代码](inference/cli_demo.py)。
-- 🔥 **News**: ``2024/8/6``: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。
-- 🔥 **News**: ``2024/8/6``: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
+- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers` 0.30.0版本,单张3090可以推理,详情请见[代码](inference/cli_demo.py)。
+- 🔥 **News**: ```2024/8/6```: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。
+- 🔥 **News**: ```2024/8/6```: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于
Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
**性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
From 71399f755812a05de3f59debc7bf0d8eac668ad3 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:43:50 +0800
Subject: [PATCH 08/11] update GPU memory to 23.9GB
---
README.md | 1 -
README_zh.md | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 8ce33cf..159ce21 100644
--- a/README.md
+++ b/README.md
@@ -73,7 +73,6 @@ pip install -r requirements.txt
Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
significance of common parameters.
-(36GB for inference, smaller memory and fine-tuned code are under development)
## CogVideoX-2B Gallery
diff --git a/README_zh.md b/README_zh.md
index 274090a..970d58c 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -65,7 +65,7 @@
pip install -r requirements.txt
```
-查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(36GB 推理,显存优化以及微调代码正在开发)
+查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。
## CogVideoX-2B 视频作品
From 54546d0f8907ebd568153d36a119a9368bbdbbe2 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:45:36 +0800
Subject: [PATCH 09/11] fix MODEL_LICENSE
---
Model_License => MODEL_LICENSE | 0
inference/cli_demo.py | 2 +-
2 files changed, 1 insertion(+), 1 deletion(-)
rename Model_License => MODEL_LICENSE (100%)
diff --git a/Model_License b/MODEL_LICENSE
similarity index 100%
rename from Model_License
rename to MODEL_LICENSE
diff --git a/inference/cli_demo.py b/inference/cli_demo.py
index 0358ce7..d069f02 100644
--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -80,7 +80,7 @@ def generate_video(
# Generate the video frames using the pipeline
video = pipe(
- num_inference_steps=5, # Number of inference steps
+ num_inference_steps=num_inference_steps, # Number of inference steps
guidance_scale=guidance_scale, # Guidance scale for classifier-free guidance
prompt_embeds=prompt_embeds, # Encoded prompt embeddings
negative_prompt_embeds=torch.zeros_like(prompt_embeds), # Not Supported negative prompt
From 6fc9de04dc99f12c88c5dcf1a9932d7d8c403b39 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:53:40 +0800
Subject: [PATCH 10/11] restore
---
README.md | 1 +
README_zh.md | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 159ce21..b85f813 100644
--- a/README.md
+++ b/README.md
@@ -73,6 +73,7 @@ pip install -r requirements.txt
Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
significance of common parameters.
+(24GB for inference, fine-tuning code is under development)
## CogVideoX-2B Gallery
diff --git a/README_zh.md b/README_zh.md
index 970d58c..bf97f15 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -65,7 +65,7 @@
pip install -r requirements.txt
```
-查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。
+查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(24GB 推理,微调代码正在开发)
## CogVideoX-2B 视频作品
From 8c0d0eb42712fa42d5a25a2f595cdb05fcc75fa6 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Fri, 9 Aug 2024 13:46:06 +0800
Subject: [PATCH 11/11] update multi gpus finetune script
---
README.md | 4 ++++
README_zh.md | 2 +-
sat/README.md | 11 ++++++++---
sat/README_zh.md | 7 +++++--
sat/data_video.py | 2 +-
sat/finetune_multi_gpus.sh | 10 ++++++++++
sat/{finetune.sh => finetune_single_gpu.sh} | 0
7 files changed, 29 insertions(+), 7 deletions(-)
create mode 100644 sat/finetune_multi_gpus.sh
rename sat/{finetune.sh => finetune_single_gpu.sh} (100%)
diff --git a/README.md b/README.md
index b85f813..0801638 100644
--- a/README.md
+++ b/README.md
@@ -60,6 +60,8 @@ the quality of the generated video.
### SAT
+**Please make sure your Python version is between 3.10 and 3.12, inclusive.**
+
Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform
rapid stacking and development.
@@ -67,6 +69,8 @@ rapid stacking and development.
### Diffusers
+**Please make sure your Python version is between 3.10 and 3.12, inclusive.**
+
```
pip install -r requirements.txt
```
diff --git a/README_zh.md b/README_zh.md
index bf97f15..5bd0e7e 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -93,7 +93,7 @@ pip install -r requirements.txt
CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源版本视频生成模型。
-下表战展示目前我们提供的视频生成模型列表,以及相关基础信息:
+下表展示目前我们提供的视频生成模型列表,以及相关基础信息:
| 模型名 | CogVideoX-2B |
|---------------------|-------------------------------------------------------------------------------------------------------------------------------|
diff --git a/sat/README.md b/sat/README.md
index 7325be0..f55445e 100644
--- a/sat/README.md
+++ b/sat/README.md
@@ -117,8 +117,12 @@ bash inference.sh
### Preparing the Environment
-Please note that currently, SAT needs to be installed from the source code for proper fine-tuning. We will address this
-issue in future stable releases.
+Please note that currently, SAT needs to be installed from the source code for proper fine-tuning.
+
+You need to install SAT from source because the features required for fine-tuning have not yet been
+released in the pip package.
+
+We will address this issue in future stable releases.
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
@@ -197,7 +201,8 @@ model:
1. Run the inference code to start fine-tuning.
```shell
-bash finetune.sh
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
```
### Converting to Huggingface Diffusers Supported Weights
diff --git a/sat/README_zh.md b/sat/README_zh.md
index 61f00f6..3335e52 100644
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@@ -112,7 +112,9 @@ bash inference.sh
### 准备环境
-请注意,目前,SAT需要从源码安装,才能正常微调, 我们将会在未来的稳定版本解决这个问题。
+请注意,目前,SAT需要从源码安装,才能正常微调。
+这是因为你需要使用尚未发布到 pip 包版本的最新代码所支持的功能。
+我们将会在未来的稳定版本解决这个问题。
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
@@ -189,7 +191,8 @@ model:
1. 运行推理代码,即可开始微调。
```shell
-bash finetune.sh
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
```
### 转换到 Huggingface Diffusers 库支持的权重
diff --git a/sat/data_video.py b/sat/data_video.py
index ccfea46..3783340 100644
--- a/sat/data_video.py
+++ b/sat/data_video.py
@@ -425,7 +425,7 @@ class SFTDataset(Dataset):
self.videos_list.append(tensor_frms)
# caption
- caption_path = os.path.join(root, filename.replace("videos", "labels").replace(".mp4", ".txt"))
+ caption_path = os.path.join(root, filename.replace(".mp4", ".txt")).replace("videos", "labels")
if os.path.exists(caption_path):
caption = open(caption_path, "r").read().splitlines()[0]
else:
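A quick illustrative check (not part of the patch) of why `caption_path` was reordered in `data_video.py`: when the `videos` directory sits in the path prefix rather than in the file name, the old expression only rewrites the file name, while the new one rewrites the whole joined path. The paths below are hypothetical.

```python
import os

root = "/data/videos/part0"   # hypothetical dataset root containing a videos/ component
filename = "clip_001.mp4"

# Old expression: replace() runs on the bare filename, so the directory stays under videos/.
old = os.path.join(root, filename.replace("videos", "labels").replace(".mp4", ".txt"))
# New expression: replace() runs on the joined path, so the directory becomes labels/.
new = os.path.join(root, filename.replace(".mp4", ".txt")).replace("videos", "labels")

print(old)  # /data/videos/part0/clip_001.txt
print(new)  # /data/labels/part0/clip_001.txt
```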
diff --git a/sat/finetune_multi_gpus.sh b/sat/finetune_multi_gpus.sh
new file mode 100644
index 0000000..d6b6383
--- /dev/null
+++ b/sat/finetune_multi_gpus.sh
@@ -0,0 +1,10 @@
+#! /bin/bash
+
+echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
+
+run_cmd="torchrun --standalone --nproc_per_node=4 train_video.py --base configs/cogvideox_2b_sft.yaml --seed $RANDOM“
+
+echo ${run_cmd}
+eval ${run_cmd}
+
+echo "DONE on `hostname`"
\ No newline at end of file
diff --git a/sat/finetune.sh b/sat/finetune_single_gpu.sh
similarity index 100%
rename from sat/finetune.sh
rename to sat/finetune_single_gpu.sh