commit ad855f622c
Author: zR
Date:   2024-08-06 02:41:08 +08:00
Parent: 2da594c831

    ADD PAPER

5 changed files with 46 additions and 41 deletions

@@ -4,11 +4,13 @@
 <div align="center">
 <img src=resources/logo.svg width="50%"/>
 </div>
 <p align="center">
 🤗 Experience on <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
 </p>
-</div>
+<p align="center">
+📚 Check here to view <a href="resources/CogVideoX.pdf" target="_blank">Paper</a>
+</p>
 <p align="center">
 👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
 </p>
@@ -55,18 +57,18 @@ to [清影](https://chatglm.cn/video).
 The table below shows the list of video generation models we currently provide,
 along with related basic information:

 | Model Name                                | CogVideoX-2B                                                 |
-|-------------------------------------------|--------------------------------------------------------------|
+|-------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
 | Prompt Language                           | English                                                      |
-| GPU Memory Required for Inference (FP16)  | 36GB (will be optimized before the PR is merged)             |
-| GPU Memory Required for Fine-tuning(bs=1) | 46.2GB                                                       |
+| GPU Memory Required for Inference (FP16)  | 36GB using diffusers (will be optimized before the PR is merged) and 25GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
+| GPU Memory Required for Fine-tuning(bs=1) | 42GB                                                         |
 | Prompt Max Length                         | 226 Tokens                                                   |
 | Video Length                              | 6 seconds                                                    |
 | Frames Per Second                         | 8 frames                                                     |
 | Resolution                                | 720 * 480                                                    |
 | Quantized Inference                       | Not Supported                                                |
 | Multi-card Inference                      | Not Supported                                                |
 | Download Link                             | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B) |

 ## Project Structure
@@ -89,7 +91,7 @@ of the **CogVideoX** open-source model.
 ### sat
-+ [sat_demo](sat/configs/README_zh.md): Contains the inference code and fine-tuning code of SAT weights. It is
++ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
   recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform
   rapid stacking and development.


@@ -5,11 +5,13 @@
 <div align="center">
 <img src=resources/logo.svg width="50%"/>
 </div>
 <p align="center">
 🤗 Experience the video generation model on the <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
 </p>
-</div>
+<p align="center">
+📚 View the <a href="resources/CogVideoX.pdf" target="_blank">Paper</a>
+</p>
 <p align="center">
 👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
 </p>
@@ -52,18 +54,18 @@ CogVideoX is the open-source video generation model sharing the same origin as [清影](https://chatglm.cn/video).
 The table below shows the list of video generation models we currently provide, along with related basic information:

 | Model Name                        | CogVideoX-2B                                                 |
-|-----------------------------------|--------------------------------------------------------------|
+|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
 | Prompt Language                   | English                                                      |
-| GPU Memory for Inference (FP16)   | 36GB                                                         |
-| GPU Memory for Fine-tuning (bs=1) | 46.2GB                                                       |
+| GPU Memory for Inference (FP16)   | 36GB using diffusers (will be optimized before the PR is merged) and 25GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
+| GPU Memory for Fine-tuning (bs=1) | 42GB                                                         |
 | Prompt Max Length                 | 226 Tokens                                                   |
 | Video Length                      | 6 seconds                                                    |
 | Frames Per Second                 | 8 frames                                                     |
 | Resolution                        | 720 * 480                                                    |
 | Quantized Inference               | Not Supported                                                |
 | Multi-card Inference              | Not Supported                                                |
 | Download Link                     | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B) |

 ## Project Structure
@@ -77,12 +79,12 @@ CogVideoX is the open-source video generation model sharing the same origin as [清影](https://chatglm.cn/video).
 + [web_demo](inference/web_demo.py): A simple streamlit web application showing how to use the CogVideoX-2B model to generate videos.
 <div style="text-align: center;">
-    <img src="resources/web_demo.png" style="width: 100%%; height: auto;" />
+    <img src="resources/web_demo.png" style="width: 100%; height: auto;" />
 </div>
 ### sat
-+ [sat_demo](sat/configs/README_zh.md): Contains the inference and fine-tuning code for SAT weights. It is recommended to improve
++ [sat_demo](sat/README_zh.md): Contains the inference and fine-tuning code for SAT weights. It is recommended to improve
   on the CogVideoX model structure; innovative researchers can use this code for rapid stacking and development.
 ### tools

BIN resources/CogVideoX.pdf (new file; binary not shown)


@@ -5,7 +5,7 @@ args:
   batch_size: 1
   input_type: txt
   input_file: test.txt
-  sampling_num_frames: 13
+  sampling_num_frames: 13 # Must be 11, 13 or 19
   sampling_fps: 8
   fp16: True
   output_dir: outputs/
@@ -82,13 +82,13 @@ model:
     target: sgm.modules.GeneralConditioner
     params:
       emb_models:
-      - is_trainable: false
-        input_key: txt
-        ucg_rate: 0.1
-        target: sgm.modules.encoders.modules.FrozenT5Embedder
-        params:
-          model_dir: "google/t5-v1_1-xxl"
-          max_length: 226
+        - is_trainable: false
+          input_key: txt
+          ucg_rate: 0.1
+          target: sgm.modules.encoders.modules.FrozenT5Embedder
+          params:
+            model_dir: "google/t5-v1_1-xxl"
+            max_length: 226
     first_stage_config:
       target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
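The inline comment added to `sampling_num_frames` encodes a hard constraint: only 11, 13, or 19 frames are valid. A minimal sketch of a load-time check one could wire into a config loader; `check_sampling_num_frames` and the constant are hypothetical names for illustration, not part of the repo:

```python
# Hypothetical helper: enforces the constraint documented by the config
# comment (sampling_num_frames must be 11, 13 or 19).
VALID_SAMPLING_NUM_FRAMES = (11, 13, 19)


def check_sampling_num_frames(n: int) -> int:
    """Fail fast at config-load time instead of deep inside sampling."""
    if n not in VALID_SAMPLING_NUM_FRAMES:
        raise ValueError(
            f"sampling_num_frames must be one of {VALID_SAMPLING_NUM_FRAMES}, got {n}"
        )
    return n


check_sampling_num_frames(13)  # the shipped default passes
```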


@@ -177,12 +177,13 @@ def sampling_main(args, model_cls):
                 latent = 1.0 / model.scale_factor * samples_z
                 recons = []
-                for i in range(6):
+                loop_num = (T - 1) // 2
+                for i in range(loop_num):
                     if i == 0:
                         start_frame, end_frame = 0, 3
                     else:
                         start_frame, end_frame = i * 2 + 1, i * 2 + 3
-                    if i == 5:
+                    if i == loop_num - 1:
                         clear_fake_cp_cache = True
                     else:
                         clear_fake_cp_cache = False
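The rewritten loop generalizes the previously hardcoded `range(6)` to any latent frame count `T`: the first decode window covers frames 0–3, every later window adds two frames, so `(T - 1) // 2` windows tile all `T` frames. A self-contained sketch of that windowing; the helper name is illustrative, not from the repo:

```python
def decode_windows(T):
    """Return the (start_frame, end_frame) windows the loop above produces.

    First chunk covers frames [0, 3); each later chunk covers two frames;
    the final chunk ends exactly at frame T.
    """
    loop_num = (T - 1) // 2
    windows = []
    for i in range(loop_num):
        if i == 0:
            start_frame, end_frame = 0, 3
        else:
            start_frame, end_frame = i * 2 + 1, i * 2 + 3
        windows.append((start_frame, end_frame))
    return windows


# With the default 13 latent frames, loop_num is 6 -- the old hardcoded bound:
print(decode_windows(13)[0])   # (0, 3)
print(decode_windows(13)[-1])  # (11, 13)
```

For `T = 13` the windows are contiguous and end at frame 13, which is why `range(6)` happened to work before; the new bound also handles `T = 11` and `T = 19`.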