Table of Contents

zR 2024-08-07 11:49:15 +08:00
parent 66702c6240
commit 6b2287b454
2 changed files with 109 additions and 56 deletions

README.md

@@ -24,16 +24,38 @@
the video almost losslessly.
- 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
- 🌱 **Source**: ``2022/5/19``: We have open-sourced **CogVideo** (now available on the `CogVideo` branch), the **first** open-source pretrained text-to-video model; see the [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
**More powerful models with larger parameter sizes are on the way~ Stay tuned!**
## Table of Contents
Jump to a specific section:
- [Quick Start](#Quick-Start)
- [SAT](#sat)
- [Diffusers](#Diffusers)
- [CogVideoX-2B Video Works](#cogvideox-2b-gallery)
- [Introduction to the CogVideoX Model](#Model-Introduction)
- [Full Project Structure](#project-structure)
- [Inference](#inference)
- [SAT](#sat)
- [Tools](#tools)
- [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23)
- [Citations](#Citation)
- [Open Source Project Plan](#Open-Source-Project-Plan)
- [Model License](#Model-License)
## Quick Start
### SAT
Follow the instructions in [sat_demo](sat/README.md): it contains the inference code and fine-tuning code for the SAT weights. Improving on the CogVideoX model structure is recommended; innovative researchers can use this code for rapid iteration and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)
### Diffusers
@@ -41,8 +63,9 @@ Follow instructions in [sat_demo](sat/README.md): Contains the inference code an
```
pip install -r requirements.txt
```
Then follow [diffusers_demo](inference/cli_demo.py): a more detailed explanation of the inference code, covering the significance of the common parameters.
(36 GB for inference; a lower-memory version and fine-tuning code are under development)
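For orientation, here is a minimal sketch of what the Diffusers call in [diffusers_demo](inference/cli_demo.py) roughly amounts to. It assumes the `CogVideoXPipeline` integration in `diffusers` and the `THUDM/CogVideoX-2b` checkpoint; the parameter values are illustrative rather than the script's actual defaults:

```
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 2B checkpoint in half precision to keep memory usage down.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

prompt = "A panda playing a guitar in a bamboo forest, cinematic lighting."
frames = pipe(
    prompt=prompt,
    num_inference_steps=50,  # illustrative values; see cli_demo.py for the real defaults
    guidance_scale=6.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=8)
```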
## CogVideoX-2B Gallery
@@ -95,16 +118,23 @@ of the **CogVideoX** open-source model.
### Inference
+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, covering the significance of the common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code on its own currently requires 71 GB of memory; it will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long captions, the input text needs to be rewritten with an LLM so that it matches the training distribution. By default the script uses GLM-4, but it can be replaced with any other LLM such as GPT or Gemini (a sketch of this step appears after this section).
+ [gradio_demo](gradio_demo.py): A simple Gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos (a minimal wrapper is sketched after this section).
<div style="text-align: center;">
<img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
</div>
+ [web_demo](inference/web_demo.py): A simple Streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.
<div style="text-align: center;">
<img src="resources/web_demo.png" style="width: 100%; height: auto;" />
@@ -112,7 +142,9 @@ of the **CogVideoX** open-source model.
### sat
+ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code for the SAT weights. Improving on the CogVideoX model structure is recommended; innovative researchers can use this code for rapid iteration and development.
### Tools
@@ -121,28 +153,11 @@ This folder contains some tools for model conversion / caption generation, etc.
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption): Caption tool, a model that understands videos and describes them in text.
## CogVideo(ICLR'23)
The official repo for the paper [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo).
**CogVideo is able to generate relatively high-frame-rate videos.**
A 4-second clip of 32 frames is shown below.
@@ -155,8 +170,8 @@ A 4-second clip of 32 frames is shown below.
</div>
The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can get hands-on practice with text-to-video generation. *The original input is in Chinese.*
## Citation
@@ -175,3 +190,23 @@ The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.a
year={2022}
}
```
## Open Source Project Plan
- [x] Open source CogVideoX model
- [x] Open source 3D Causal VAE used in CogVideoX.
- [x] CogVideoX model inference example (CLI / Web Demo)
- [x] CogVideoX online experience demo (Huggingface Space)
- [x] CogVideoX open source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite)
- [x] Release CogVideoX technical report
We welcome your contributions. You can click [here](resources/contribute.md) for more information.
## Model License
The code in this repository is released under the [Apache 2.0 License](LICENSE).
The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).

README_zh.md

@@ -26,6 +26,23 @@
- 🌱 **Source**: ``2022/5/19``: We open-sourced the CogVideo video generation model (now available on the `CogVideo` branch), the first open-source large Transformer-based text-to-video model; see the [ICLR'23 paper](https://arxiv.org/abs/2205.15868) for technical details.
**More powerful models with larger parameter counts are on the way~ Stay tuned!**
## Table of Contents
Jump to a specific section:
- [Quick Start](#快速开始)
- [SAT](#sat)
- [Diffusers](#Diffusers)
- [CogVideoX-2B Video Works](#cogvideox-2b-视频作品)
- [Introduction to the CogVideoX Model](#模型介绍)
- [Full Project Code Structure](#完整项目代码结构)
- [Inference](#inference)
- [SAT](#sat)
- [Tools](#tools)
- [Open Source Project Plan](#开源项目规划)
- [Model License](#模型协议)
- [Introduction to the CogVideo (ICLR'23) Model](#cogvideoiclr23)
- [Citation](#引用)
## Quick Start
@@ -84,7 +101,7 @@ CogVideoX is the open-source counterpart of [清影](https://chatglm.cn/video?fr=osm_cogvideox)
| Download Link (Diffusers model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
| Download Link (SAT model) | [SAT](./sat/README_zh.md) |
## Full Project Code Structure
This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the **CogVideoX** open-source model.
@@ -117,24 +134,6 @@ CogVideoX is the open-source counterpart of [清影](https://chatglm.cn/video?fr=osm_cogvideox)
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption/README_zh.md): Caption tool, a model that understands videos and describes them in text.
## CogVideo(ICLR'23)
The official repo for [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo).
@@ -169,3 +168,22 @@ The CogVideo demo is at [https://models.aminer.cn/cogvideo](https://models.amine
year={2022}
}
```
## Open Source Project Plan
- [x] Open-source the CogVideoX model
- [x] CogVideoX model inference example (CLI / Web Demo)
- [x] CogVideoX online experience demo (Huggingface Space)
- [x] CogVideoX open-source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open-source CogVideoX-Pro (adapted for the CogVideoX-2B suite)
- [x] Release the CogVideoX technical report
We welcome your contributions; click [here](resources/contribute_zh.md) for more information.
## Model License
The code in this repository is released under the [Apache 2.0 License](LICENSE).
The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).