diff --git a/README.md b/README.md
index e84a1a6..4b7f3af 100644
--- a/README.md
+++ b/README.md
@@ -24,10 +24,26 @@ the video almost losslessly.
 - 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**,the first model in the CogVideoX series of video generation models.
-- 🌱 **Source**: ```2022/5/19```: We have open-sourced CogVideo (now you can see in `CogVideo` branch),the **first** open-sourced pretrained text-to-video model, and you can check [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
+- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now available in the `CogVideo` branch), the **first** open-sourced pretrained text-to-video model; see the [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
 
 **More powerful models with larger parameter sizes are on the way~ Stay tuned!**
 
+## Quick Start
+
+### SAT
+
+Follow the instructions in [sat_demo](sat/README.md): it contains the inference and fine-tuning code for the SAT weights. Building on this code is recommended if you want to modify the CogVideoX model structure, and researchers can use it to iterate and prototype quickly.
+(18 GB for inference, 40 GB for LoRA fine-tuning)
+
+### Diffusers
+
+```
+pip install -r requirements.txt
+```
+
+Then follow [diffusers_demo](inference/cli_demo.py): a more detailed walkthrough of the inference code that explains the common parameters.
+(36 GB for inference; memory optimizations and fine-tuning code are under development)
+
 ## CogVideoX-2B Gallery
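As an aside on the Diffusers quick start added above: a minimal sketch of what the `diffusers`-based generation in `inference/cli_demo.py` roughly boils down to. The Hub id `THUDM/CogVideoX-2b`, the prompt, and the parameter values are assumptions for illustration; the demo script in the repo is the authoritative version.

```
# Minimal text-to-video sketch with the diffusers CogVideoXPipeline.
# Assumes the 2B weights are hosted as "THUDM/CogVideoX-2b"; see
# inference/cli_demo.py for the full, supported flow.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
).to("cuda")

frames = pipe(
    prompt="A panda playing a guitar in a bamboo forest, cinematic lighting",
    num_inference_steps=50,  # more steps: better quality, slower
    guidance_scale=6.0,      # classifier-free guidance strength
).frames[0]

export_to_video(frames, "output.mp4", fps=8)
```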
@@ -79,9 +95,15 @@ of the **CogVideoX** open-source model.
 
 ### Inference
 
-+ [cli_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters.
-+ [cli_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of memory, but it will be optimized in the future.
++ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters.
++ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of memory, but it will be optimized in the future.
 + [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long caption, we need to convert the input text to be consistent with the training distribution using a LLM. By default, the script uses GLM4, but it can also be replaced with any other LLM such as GPT, Gemini, etc.
++ [gradio_demo](gradio_demo.py): A simple Gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos.
+
+<div align="center">
+  <img src="resources/gradio_demo.png" alt="gradio_demo" />
+</div>
+
 + [web_demo](inference/web_demo.py): A simple streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.
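A hedged sketch of the prompt-conversion idea behind `convert_demo` and the `convert_prompt` helper that appears later in this diff: a chat LLM rewrites the short user prompt into a long, detailed caption that matches the training distribution. The model name, system prompt, and endpoint configuration below are placeholders, not the repo's exact wording; only the `convert_prompt(prompt, retry_times)` signature comes from the diff.

```
# Illustrative prompt "upsampling" via an OpenAI-compatible chat client.
# Assumes OPENAI_API_KEY (or a GLM-4-compatible base_url) is configured.
from openai import OpenAI

client = OpenAI()

def convert_prompt(prompt: str, retry_times: int = 3) -> str:
    system_msg = (
        "You are a prompt engineer for a text-to-video model. "
        "Expand the user's input into one detailed, fluent video caption."
    )
    for _ in range(retry_times):
        resp = client.chat.completions.create(
            model="glm-4",  # assumption: any chat-completion LLM works here
            messages=[
                {"role": "system", "content": system_msg},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,
        )
        text = resp.choices[0].message.content
        if text:
            return text.strip()
    return prompt  # fall back to the original prompt if the LLM call fails

print(convert_prompt("a cat chasing a butterfly"))
```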
@@ -90,9 +112,7 @@ of the **CogVideoX** open-source model.
 
 ### sat
 
-+ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
-  recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform
-  rapid stacking and development.
++ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.
 
 ### Tools
 
diff --git a/README_zh.md b/README_zh.md
index 13b509d..70419c9 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -26,6 +26,22 @@
 - 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于 Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
 
 **性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
 
+## 快速开始
+
+### SAT
+
+查看sat文件夹下的[sat_demo](sat/README.md):包含了 SAT 权重的推理代码和微调代码,推荐基于此代码进行 CogVideoX 模型结构的改进,研究者使用该代码可以更好的进行快速的迭代和开发。
+ (18 GB 推理, 40GB lora微调)
+
+### Diffusers
+
+```
+pip install -r requirements.txt
+```
+
+查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(36GB 推理,显存优化以及微调代码正在开发)
+
 ## CogVideoX-2B 视频作品
@@ -74,9 +90,15 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
 
 ### inference
 
-+ [cli_demo](inference/cli_demo.py): 更详细的推理代码讲解,常见参数的意义,在这里都会提及。
-+ [cli_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码,目前需要71GB显存,将来会优化。
++ [diffusers_demo](inference/cli_demo.py): 更详细的推理代码讲解,常见参数的意义,在这里都会提及。
++ [diffusers_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码,目前需要71GB显存,将来会优化。
 + [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合 CogVideoX的长输入。因为CogVideoX是在长文本上训练的,所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4,也可以替换为GPT、Gemini等任意大语言模型。
++ [gradio_demo](gradio_demo.py): 一个简单的gradio网页应用,展示如何使用 CogVideoX-2B 模型生成视频。
+
+<div align="center">
+  <img src="resources/gradio_demo.png" alt="gradio_demo" />
+</div>
+
 + [web_demo](inference/web_demo.py): 一个简单的streamlit网页应用,展示如何使用 CogVideoX-2B 模型生成视频。
diff --git a/gradio_demo.py b/gradio_demo.py
index 65eeb48..ea0b020 100644
--- a/gradio_demo.py
+++ b/gradio_demo.py
@@ -9,7 +9,6 @@ import torch
 from diffusers import CogVideoXPipeline
 from datetime import datetime, timedelta
 from openai import OpenAI
-import spaces
 import imageio
 import moviepy.editor as mp
 from typing import List, Union
@@ -88,7 +87,6 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
     return prompt
 
 
-@spaces.GPU(duration=240)
 def infer(
     prompt: str,
     num_inference_steps: int,
diff --git a/requirements.txt b/requirements.txt
index 438815a..bc64475 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,4 +1,5 @@
 git+https://github.com/huggingface/diffusers.git@d1c575ad7ee0390c2735f50cc59a79aae666567a#egg=diffusers
+SwissArmyTransformer
 torch==2.4.0
 torchvision==0.19.0
 streamlit==1.37.0
diff --git a/resources/gradio_demo.png b/resources/gradio_demo.png
new file mode 100644
index 0000000..a78b78c
Binary files /dev/null and b/resources/gradio_demo.png differ
diff --git a/sat/README.md b/sat/README.md
index 8d9ac8f..a2e69d6 100644
--- a/sat/README.md
+++ b/sat/README.md
@@ -1,7 +1,6 @@
 # SAT CogVideoX-2B
 
-This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
-fine-tuning code for SAT weights.
+This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the fine-tuning code for SAT weights.
 
 This code is the framework used by the team to train the model. It has few comments and requires careful study.
 
@@ -100,6 +99,14 @@ bash inference.sh
 
 ## Fine-Tuning the Model
 
+### Preparing the Environment
+
+```
+git clone https://github.com/THUDM/SwissArmyTransformer.git
+cd SwissArmyTransformer
+pip install -e .
+```
+
 ### Preparing the Dataset
 
 The dataset format should be as follows:
@@ -145,6 +152,8 @@ the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows.
   valid_data: [ "your val data path" ]  # Training and validation sets can be the same
   split: 1,0,0  # Ratio of training, validation, and test sets
   num_workers: 8  # Number of worker threads for data loading
+  force_train: True  # Allow missing keys when loading the checkpoint (the T5 encoder and VAE are loaded separately)
+  only_log_video_latents: True  # Skip the VAE decoder during evaluation to save memory
 ```
 
 If you wish to use Lora fine-tuning, you also need to modify:
diff --git a/sat/README_zh.md b/sat/README_zh.md
index ba301c8..e2d9be9 100644
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@@ -99,6 +99,14 @@ bash inference.sh
 
 ## 微调模型
 
+### 准备环境
+
+```
+git clone https://github.com/THUDM/SwissArmyTransformer.git
+cd SwissArmyTransformer
+pip install -e .
+```
+
 ### 准备数据集
 
 数据集格式应该如下:
@@ -143,6 +151,8 @@ Encoder 使用。
   valid_data: [ "your val data path" ]  # 训练集和验证集可以相同
   split: 1,0,0  # 训练集,验证集,测试集比例
   num_workers: 8  # 数据加载器的工作线程数
+  force_train: True  # 在加载checkpoint时允许missing keys (T5 和 VAE 单独加载)
+  only_log_video_latents: True  # 避免VAE decode带来的显存开销
 ```
 
 如果你希望使用 Lora 微调,你还需要修改:
diff --git a/sat/requirements.txt b/sat/requirements.txt
index bf84ae0..3ab573f 100644
--- a/sat/requirements.txt
+++ b/sat/requirements.txt
@@ -1,4 +1,4 @@
-git+https://github.com/THUDM/SwissArmyTransformer.git
+SwissArmyTransformer==0.4.11
 diffusers>=0.29.2
 omegaconf>=2.3.0
 torch>=2.3.1
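One note on the gradio_demo.py hunks above: with `import spaces` and the `@spaces.GPU(duration=240)` decorator removed, the demo behaves like an ordinary local Gradio app rather than a Hugging Face Spaces-only entry point. A rough, illustrative launch pattern is sketched below; only the `infer(prompt, num_inference_steps, ...)` signature fragment comes from the diff, and the UI wiring is hypothetical.

```
# Illustrative local-launch pattern for a Gradio demo once Spaces-specific
# decorators are removed; gradio_demo.py wires up more inputs than shown here.
import gradio as gr

def infer(prompt: str, num_inference_steps: int):
    # The real demo calls CogVideoXPipeline here and returns the path of the
    # rendered .mp4; the body is stubbed so this sketch stays self-contained.
    print(f"would generate {prompt!r} with {num_inference_steps} steps")
    return None

demo = gr.Interface(
    fn=infer,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Slider(1, 100, value=50, step=1, label="Inference steps"),
    ],
    outputs=gr.Video(label="Generated video"),
)

if __name__ == "__main__":
    demo.launch()  # e.g. `python gradio_demo.py`, then open the printed local URL
```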