Mirror of https://github.com/THUDM/CogVideo.git (synced 2025-06-27 20:59:24 +08:00)
Commit 66702c6240

README.md (32 changed lines)
@@ -24,10 +24,26 @@
the video almost losslessly.
- 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video
generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced CogVideo (now available in the `CogVideo` branch), the **first** open-source pretrained text-to-video model, and you can check the [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now available in the `CogVideo` branch), the **first** open-source pretrained text-to-video model, and you can check the [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.

**More powerful models with larger parameter sizes are on the way~ Stay tuned!**

## Quick Start

### SAT

Follow the instructions in [sat_demo](sat/README.md), which contains the inference and fine-tuning code for the SAT weights. It is recommended to build improvements on the CogVideoX model structure from this code; researchers can use it for rapid iteration and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)

### Diffusers

```
pip install -r requirements.txt
```

Then follow [diffusers_demo](inference/cli_demo.py) for a more detailed explanation of the inference code, including the meaning of common parameters.
(36 GB for inference; lower-memory inference and fine-tuning code are under development)

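As a rough orientation, the diffusers path boils down to loading the CogVideoXPipeline and exporting the generated frames. This is a minimal sketch, assuming the `THUDM/CogVideoX-2b` Hub checkpoint and illustrative sampling settings; see [diffusers_demo](inference/cli_demo.py) for the authoritative parameters.

```
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 2B checkpoint (assumed Hub ID) in half precision and move it to the GPU.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

# CogVideoX is trained on long, detailed captions, so longer prompts work best.
prompt = "A panda sits by a stream in a bamboo forest, strumming a small wooden guitar, soft afternoon light."

# Step count and guidance scale here are illustrative; cli_demo.py documents the recommended values.
video = pipe(prompt=prompt, num_inference_steps=50, guidance_scale=6.0).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
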
## CogVideoX-2B Gallery

<div align="center">
@@ -79,9 +95,15 @@ of the **CogVideoX** open-source model.

### Inference

+ [cli_demo](inference/cli_demo.py): A more detailed explanation of the inference code, covering the meaning of common parameters.
+ [cli_vae_demo](inference/cli_vae_demo.py): Runs the VAE inference code on its own; this currently requires 71GB of memory, but it will be optimized in the future.
+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, covering the meaning of common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Runs the VAE inference code on its own; this currently requires 71GB of memory, but it will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long captions, the input text needs to be rewritten by an LLM to match the training distribution. By default the script uses GLM4, but it can also be replaced with any other LLM such as GPT or Gemini (a minimal sketch of this rewrite step is shown below).
+ [gradio_demo](gradio_demo.py): A simple Gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos.

<div style="text-align: center;">
<img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
</div>
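
As referenced in the convert_demo entry above, the prompt-rewriting step is essentially a single chat completion against an OpenAI-compatible endpoint. This is a hedged sketch, not the script itself: the system prompt, model name, and endpoint below are illustrative stand-ins for what convert_demo.py actually ships.

```
from openai import OpenAI

# Hypothetical system prompt; convert_demo.py ships its own, more detailed instructions.
SYS_PROMPT = (
    "Expand the user's short video idea into one richly detailed caption "
    "of the kind CogVideoX was trained on. Reply with the caption only."
)

# Point base_url / api_key at a GLM4 endpoint (the project default) or any other
# OpenAI-compatible provider.
client = OpenAI()

def convert_prompt(short_prompt: str) -> str:
    # Ask the LLM to rewrite the terse input into a long, training-style caption.
    response = client.chat.completions.create(
        model="glm-4",  # assumed model name; swap for a GPT or Gemini equivalent
        messages=[
            {"role": "system", "content": SYS_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

print(convert_prompt("a cat playing the piano"))
```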

+ [web_demo](inference/web_demo.py): A simple Streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.

<div style="text-align: center;">
@@ -90,9 +112,7 @@ of the **CogVideoX** open-source model.

### sat

+ [sat_demo](sat/README.md): Contains the inference and fine-tuning code for the SAT weights. It is
recommended to build improvements on the CogVideoX model structure from this code; researchers can use it for
rapid iteration and development.
+ [sat_demo](sat/README.md): Contains the inference and fine-tuning code for the SAT weights. It is recommended to build improvements on the CogVideoX model structure from this code; researchers can use it for rapid iteration and development.

### Tools

README_zh.md (26 changed lines)
@@ -26,6 +26,22 @@
- 🌱 **Source**: ```2022/5/19```: We open-sourced the CogVideo video generation model (now available in the `CogVideo` branch), the first open-source Transformer-based large text-to-video model; see the [ICLR'23 paper](https://arxiv.org/abs/2205.15868) for technical details.
**More powerful models with larger parameter counts are on the way~ Stay tuned!**

## Quick Start

### SAT

See [sat_demo](sat/README.md) under the sat folder: it contains the inference and fine-tuning code for the SAT weights. It is recommended to build improvements on the CogVideoX model structure from this code; researchers can use it for rapid iteration and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)

### Diffusers

```
pip install -r requirements.txt
```

See [diffusers_demo](inference/cli_demo.py): a more detailed explanation of the inference code, covering the key parameters. (36 GB for inference; memory optimization and fine-tuning code are under development)

## CogVideoX-2B Gallery

<div align="center">
@@ -74,9 +90,15 @@ CogVideoX is the open-source model from the same origin as [清影](https://chatglm.cn/video?fr=osm_cogvideox)

### inference

+ [cli_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, explaining the meaning of common parameters.
+ [cli_vae_demo](inference/cli_vae_demo.py): Runs the VAE inference code on its own; it currently requires 71GB of GPU memory and will be optimized in the future.
+ [diffusers_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, explaining the meaning of common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Runs the VAE inference code on its own; it currently requires 71GB of GPU memory and will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into the long-form prompts CogVideoX expects. Because CogVideoX is trained on long captions, the input text needs to be rewritten by an LLM to match the training distribution. The script uses GLM4 by default, but it can be replaced with GPT, Gemini, or any other large language model.
+ [gradio_demo](gradio_demo.py): A simple Gradio web app showing how to generate videos with the CogVideoX-2B model.

<div style="text-align: center;">
<img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
</div>

+ [web_demo](inference/web_demo.py): A simple Streamlit web application showing how to generate videos with the CogVideoX-2B model.

<div style="text-align: center;">

@@ -9,7 +9,6 @@ import torch
from diffusers import CogVideoXPipeline
from datetime import datetime, timedelta
from openai import OpenAI
import spaces
import imageio
import moviepy.editor as mp
from typing import List, Union
@@ -88,7 +87,6 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
    return prompt


@spaces.GPU(duration=240)
def infer(
    prompt: str,
    num_inference_steps: int,

@@ -1,4 +1,5 @@
git+https://github.com/huggingface/diffusers.git@d1c575ad7ee0390c2735f50cc59a79aae666567a#egg=diffusers
SwissArmyTransformer
torch==2.4.0
torchvision==0.19.0
streamlit==1.37.0

resources/gradio_demo.png (new binary file, 597 KiB; binary file not shown)

@@ -1,7 +1,6 @@
# SAT CogVideoX-2B

This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
fine-tuning code for SAT weights.
This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the fine-tuning code for SAT weights.

This code is the framework used by the team to train the model. It has few comments and requires careful study.

@@ -100,6 +99,14 @@ bash inference.sh

## Fine-Tuning the Model

### Preparing the Environment

```
git clone https://github.com/THUDM/SwissArmyTransformer.git
cd SwissArmyTransformer
pip install -e .
```

### Preparing the Dataset

The dataset format should be as follows:
@@ -145,6 +152,8 @@ the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows.
valid_data: [ "your val data path" ] # Training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for data loading
force_train: True # Allow missing keys when loading the checkpoint (T5 and VAE are loaded separately)
only_log_video_latents: True # Avoid using the VAE decoder during eval to save memory
```

If you wish to use LoRA fine-tuning, you also need to modify:

@@ -99,6 +99,14 @@ bash inference.sh

## Fine-Tuning the Model

### Preparing the Environment

```
git clone https://github.com/THUDM/SwissArmyTransformer.git
cd SwissArmyTransformer
pip install -e .
```

### Preparing the Dataset

The dataset format should be as follows:
@@ -143,6 +151,8 @@ used by the Encoder.
valid_data: [ "your val data path" ] # The training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for the data loader
force_train: True # Allow missing keys when loading the checkpoint (T5 and VAE are loaded separately)
only_log_video_latents: True # Avoid the GPU memory overhead of VAE decoding
```

If you wish to use LoRA fine-tuning, you also need to modify:

@@ -1,4 +1,4 @@
git+https://github.com/THUDM/SwissArmyTransformer.git
SwissArmyTransformer==0.4.11
diffusers>=0.29.2
omegaconf>=2.3.0
torch>=2.3.1