Merge pull request #82 from THUDM/main

Supplementary development
This commit is contained in:
zR 2024-08-07 10:47:02 +08:00 committed by GitHub
commit 66702c6240
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
8 changed files with 73 additions and 13 deletions

View File

@@ -24,10 +24,26 @@
the video almost losslessly.
- 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video
generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced CogVideo (now visible in the `CogVideo` branch), the **first** open-source pretrained text-to-video model; see the [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now visible in the `CogVideo` branch), the **first** open-source pretrained text-to-video model; see the [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
**More powerful models with larger parameter sizes are on the way~ Stay tuned!**
## Quick Start
### SAT
Follow the instructions in [sat_demo](sat/README.md), which contains the inference and fine-tuning code for the SAT weights. This code is recommended as the basis for improving on the CogVideoX model structure; researchers can use it for rapid iteration and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)
### Diffusers
```
pip install -r requirements.txt
```
Then follow [diffusers_demo](inference/cli_demo.py), a more detailed walkthrough of the inference code that explains the common parameters.
(36 GB for inference; lower-memory inference and fine-tuning code are under development)
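For orientation, a minimal inference sketch (the `THUDM/CogVideoX-2b` Hub id and the generation settings below are assumptions to be checked against `inference/cli_demo.py`):
```
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Assumes the CogVideoX-2B weights are published under THUDM/CogVideoX-2b.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

video = pipe(
    prompt="A panda playing guitar in a bamboo forest.",
    num_inference_steps=50,  # illustrative defaults; cli_demo.py has the tuned values
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```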
## CogVideoX-2B Gallery
<div align="center">
@@ -79,9 +95,15 @@ of the **CogVideoX** open-source model.
### Inference
+ [cli_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, explaining the common parameters.
+ [cli_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code on its own currently requires 71 GB of memory; this will be optimized in the future.
+ [diffusers_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, explaining the common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code on its own currently requires 71 GB of memory; this will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long captions, the input text is rewritten with an LLM to match the training distribution. By default the script uses GLM-4, but any other LLM such as GPT or Gemini can be substituted (see the sketch after this list).
+ [gradio_demo](gradio_demo.py): A simple gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos.
<div style="text-align: center;">
<img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
</div>
+ [web_demo](inference/web_demo.py): A simple streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.
<div style="text-align: center;">
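As mentioned in the `convert_demo` item above, a sketch of the prompt-conversion idea (assuming an OpenAI-compatible endpoint; the helper name, model id, and system prompt are illustrative, not the script's actual ones):
```
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

def expand_prompt(user_prompt: str) -> str:
    # Illustrative helper, not the repo's convert_prompt: rewrites a short idea
    # into the long, detailed caption style that CogVideoX was trained on.
    response = client.chat.completions.create(
        model="glm-4",  # assumption; any capable LLM (GPT, Gemini, ...) can be swapped in
        messages=[
            {"role": "system",
             "content": "Rewrite the user's video idea as one long, richly detailed "
                        "English caption describing subjects, motion, scene, and style."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a cat playing the piano"))
```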
@@ -90,9 +112,7 @@ of the **CogVideoX** open-source model.
### sat
+ [sat_demo](sat/README.md): Contains the inference and fine-tuning code for the SAT weights. It is
recommended as the basis for improving on the CogVideoX model structure; researchers can use this code for
rapid iteration and development.
+ [sat_demo](sat/README.md): Contains the inference and fine-tuning code for the SAT weights. It is recommended as the basis for improving on the CogVideoX model structure; researchers can use this code for rapid iteration and development.
### Tools

View File

@@ -26,6 +26,22 @@
- 🌱 **Source**: ```2022/5/19```: We open-sourced the CogVideo video generation model (now visible in the `CogVideo` branch), the first open-source large Transformer-based text-to-video model; see the [ICLR'23 paper](https://arxiv.org/abs/2205.15868) for technical details.
**More powerful models with larger parameter counts are on the way~ stay tuned!**
## Quick Start
### SAT
See [sat_demo](sat/README.md) in the sat folder: it contains the inference and fine-tuning code for the SAT weights. This code is recommended as the basis for improving on the CogVideoX model structure; researchers can use it for rapid iteration and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)
### Diffusers
```
pip install -r requirements.txt
```
See [diffusers_demo](inference/cli_demo.py), a more detailed walkthrough of the inference code covering the key parameters. (36 GB for inference; memory optimization and fine-tuning code are under development)
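Until those memory optimizations land, one possible stopgap (a sketch, assuming diffusers' generic CPU-offload hook works with this pipeline; it requires the `accelerate` package) is to keep idle submodules on the CPU:
```
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
# Each submodule is moved to the GPU only while it runs, trading speed for VRAM.
pipe.enable_model_cpu_offload()
video = pipe(prompt="A panda playing guitar in a bamboo forest.").frames[0]
```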
## CogVideoX-2B Gallery
<div align="center">
@@ -74,9 +90,15 @@ CogVideoX is the open-source version sharing its origin with [清影](https://chatglm.cn/video?fr=osm_cogvideox)
### inference
+ [cli_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, explaining the common parameters.
+ [cli_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code on its own currently requires 71 GB of VRAM; this will be optimized in the future.
+ [diffusers_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, explaining the common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code on its own currently requires 71 GB of VRAM; this will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into the long-form input CogVideoX expects. Because CogVideoX is trained on long captions, the input text is rewritten with an LLM to match the training distribution. The script uses GLM-4 by default, but any other LLM such as GPT or Gemini can be substituted.
+ [gradio_demo](gradio_demo.py): A simple gradio web app demonstrating how to generate videos with the CogVideoX-2B model.
<div style="text-align: center;">
<img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
</div>
+ [web_demo](inference/web_demo.py): A simple streamlit web app demonstrating how to generate videos with the CogVideoX-2B model.
<div style="text-align: center;">

View File

@@ -9,7 +9,6 @@ import torch
from diffusers import CogVideoXPipeline
from datetime import datetime, timedelta
from openai import OpenAI
import spaces
import imageio
import moviepy.editor as mp
from typing import List, Union
@@ -88,7 +87,6 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
return prompt
@spaces.GPU(duration=240)
def infer(
prompt: str,
num_inference_steps: int,

View File

@@ -1,4 +1,5 @@
git+https://github.com/huggingface/diffusers.git@d1c575ad7ee0390c2735f50cc59a79aae666567a#egg=diffusers
SwissArmyTransformer
torch==2.4.0
torchvision==0.19.0
streamlit==1.37.0

BIN
resources/gradio_demo.png Normal file

Binary file not shown.


View File

@@ -1,7 +1,6 @@
# SAT CogVideoX-2B
This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
fine-tuning code for SAT weights.
This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the fine-tuning code for SAT weights.
This code is the framework used by the team to train the model. It has few comments and requires careful study.
@@ -100,6 +99,14 @@ bash inference.sh
## Fine-Tuning the Model
### Preparing the Environment
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
cd SwissArmyTransformer
pip install -e .
```
### Preparing the Dataset
The dataset format should be as follows:
@@ -145,6 +152,8 @@ the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows.
valid_data: [ "your val data path" ] # Training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for data loading
force_train: True # Allow missing keys when loading ckpt (refer to T5 and VAE which are loaded independently)
only_log_video_latents: True # Avoid using VAE decoder when eval to save memory
```
If you wish to use LoRA fine-tuning, you also need to modify:

View File

@@ -99,6 +99,14 @@ bash inference.sh
## Fine-Tuning the Model
### Preparing the Environment
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
cd SwissArmyTransformer
pip install -e .
```
### Preparing the Dataset
The dataset format should be as follows:
@@ -143,6 +151,8 @@ used by the Encoder.
valid_data: [ "your val data path" ] # Training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for the data loader
force_train: True # Allow missing keys when loading the checkpoint (T5 and VAE are loaded separately)
only_log_video_latents: True # Avoid the VRAM overhead of VAE decoding
```
If you wish to use LoRA fine-tuning, you also need to modify:

View File

@@ -1,4 +1,4 @@
git+https://github.com/THUDM/SwissArmyTransformer.git
SwissArmyTransformer==0.4.11
diffusers>=0.29.2
omegaconf>=2.3.0
torch>=2.3.1