diff --git a/README.md b/README.md
index 4b7f3af..b397e58 100644
--- a/README.md
+++ b/README.md
@@ -24,16 +24,38 @@ the video almost losslessly.
 - 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**,the first model in the CogVideoX series of video generation models.
-- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now you can see in `CogVideo` branch),the **first** open-sourced pretrained text-to-video model, and you can check [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
+- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now you can see in `CogVideo` branch),the **first**
+  open-sourced pretrained text-to-video model, and you can
+  check [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
 **More powerful models with larger parameter sizes are on the way~ Stay tuned!**
+## Table of Contents
+
+Jump to a specific section:
+
+- [Quick Start](#Quick-Start)
+  - [SAT](#sat)
+  - [Diffusers](#Diffusers)
+- [CogVideoX-2B Video Works](#cogvideox-2b-gallery)
+- [Introduction to the CogVideoX Model](#Model-Introduction)
+- [Full Project Structure](#project-structure)
+  - [Inference](#inference)
+  - [SAT](#sat)
+  - [Tools](#tools)
+- [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23)
+- [Citations](#Citation)
+- [Open Source Project Plan](#Open-Source-Project-Plan)
+- [Model License](#Model-License)
+
 ## Quick Start
 ### SAT
-Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.
- (18 GB for inference, 40GB for lora finetune)
+Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
+recommended to build on the CogVideoX model structure; researchers can use this code to iterate and develop rapidly.
+(18 GB for inference, 40 GB for LoRA fine-tuning)
 ### Diffusers
@@ -41,8 +63,9 @@ Follow instructions in [sat_demo](sat/README.md): Contains the inference code an
 pip install -r requirements.txt
 ```
-Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters.
- (36GB for inference, smaller memory and fine-tuned code are under development)
+Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
+significance of common parameters.
+(36 GB for inference; lower-memory inference and fine-tuning code are under development)
 ## CogVideoX-2B Gallery
@@ -95,16 +118,23 @@ of the **CogVideoX** open-source model.
 ### Inference
-+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the significance of common parameters.
-+ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of memory, but it will be optimized in the future.
-+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long caption, we need to convert the input text to be consistent with the training distribution using a LLM. By default, the script uses GLM4, but it can also be replaced with any other LLM such as GPT, Gemini, etc.
-+ [gradio_demo](gradio_demo.py): A simple gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos.
++ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
+  significance of common parameters.
++ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71 GB of
+  memory, but it will be optimized in the future.
++ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because
+  CogVideoX is trained on long captions, the input text needs to be rewritten with an LLM to match the training
+  distribution. By default, the script uses GLM4, but it can also be replaced with any other LLM such as GPT, Gemini,
+  etc.
++ [gradio_demo](gradio_demo.py): A simple Gradio web UI demonstrating how to use the CogVideoX-2B model to generate
+  videos.
-+ [web_demo](inference/web_demo.py): A simple streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.
++ [web_demo](inference/web_demo.py): A simple Streamlit web application demonstrating how to use the CogVideoX-2B model
+  to generate videos.
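As a companion to the Diffusers quick start and the `diffusers_demo` entry above, here is a minimal inference sketch. It assumes a recent `diffusers` release that ships `CogVideoXPipeline` and the `THUDM/CogVideoX-2b` weights on Hugging Face; the actual `inference/cli_demo.py` exposes more parameters and may use different defaults.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the CogVideoX-2B weights (assumes a diffusers version that includes CogVideoXPipeline).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")  # plain inference needs roughly 36 GB of GPU memory, per the README

# Long, detailed prompts match the training distribution better (see convert_demo below).
prompt = (
    "A panda, dressed in a small red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest, playing a miniature guitar."
)

video_frames = pipe(
    prompt=prompt,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video_frames, "output.mp4", fps=8)
```

The `convert_demo` entry describes prompt refinement: a short user prompt is expanded into the long, detailed caption style the model was trained on. The snippet below is only an illustrative sketch of that idea using the OpenAI client (the actual script defaults to GLM4); the system prompt shown here is hypothetical, not the one shipped in `inference/convert_demo.py`.

```python
from openai import OpenAI

# Hypothetical system prompt; convert_demo.py ships its own, more elaborate instructions.
SYSTEM_PROMPT = (
    "Rewrite the user's short video idea as a single long, richly detailed English caption "
    "describing subjects, actions, setting, lighting, and camera motion."
)

client = OpenAI()  # any OpenAI-compatible endpoint works; GLM4 or Gemini can be swapped in


def expand_prompt(short_prompt: str) -> str:
    """Expand a terse user prompt into a long caption closer to the training distribution."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content


print(expand_prompt("a panda playing guitar in a bamboo forest"))
```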
@@ -112,40 +142,25 @@ of the **CogVideoX** open-source model.
 ### sat
-+ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform rapid stacking and development.
++ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is recommended to
+  build on the CogVideoX model structure; researchers can use this code to iterate and develop rapidly.
 ### Tools
 This folder contains some tools for model conversion / caption generation, etc.
-+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights. 
++ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights.
 + [caption_demo](tools/caption): Caption tool, a model that understands videos and outputs them in text.
-## Project Plan
-
-- [x] Open source CogVideoX model
-  - [x] Open source 3D Causal VAE used in CogVideoX.
-  - [x] CogVideoX model inference example (CLI / Web Demo)
-  - [x] CogVideoX online experience demo (Huggingface Space)
-  - [x] CogVideoX open source model API interface example (Huggingface)
-  - [x] CogVideoX model fine-tuning example (SAT)
-  - [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
-  - [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite)
-  - [x] Release CogVideoX technical report
-
-We welcome your contributions. You can click [here](resources/contribute.md) for more information.
-
-## Model License
-
-The code in this repository is released under the [Apache 2.0 License](LICENSE).
-
-The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
-
 ## CogVideo(ICLR'23)
-The official repo for the paper: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)
+
+The official repo for the
+paper: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
+is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)
 **CogVideo is able to generate relatively high-frame-rate videos.**
-A 4-second clip of 32 frames is shown below. 
+A 4-second clip of 32 frames is shown below.
 ![High-frame-rate sample](https://raw.githubusercontent.com/THUDM/CogVideo/CogVideo/assets/appendix-sample-highframerate.png)
@@ -155,8 +170,8 @@
-The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can get hands-on practice on text-to-video generation. *The original input is in Chinese.*
-
+The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can get
+hands-on practice on text-to-video generation. *The original input is in Chinese.*
 ## Citation
@@ -175,3 +190,23 @@ The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.a
   year={2022}
 }
 ```
+
+## Open Source Project Plan
+
+- [x] Open source CogVideoX model
+  - [x] Open source 3D Causal VAE used in CogVideoX.
+  - [x] CogVideoX model inference example (CLI / Web Demo)
+  - [x] CogVideoX online experience demo (Huggingface Space)
+  - [x] CogVideoX open source model API interface example (Huggingface)
+  - [x] CogVideoX model fine-tuning example (SAT)
+  - [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
+  - [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite)
+  - [x] Release CogVideoX technical report
+
+We welcome your contributions. You can click [here](resources/contribute.md) for more information.
+
+## Model License
+
+The code in this repository is released under the [Apache 2.0 License](LICENSE).
+
+The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
diff --git a/README_zh.md b/README_zh.md
index 70419c9..a0b3c0b 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -26,6 +26,23 @@
 - 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于 Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
 **性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
+## 目录
+
+跳转到指定部分:
+
+- [快速开始](#快速开始)
+  - [SAT](#sat)
+  - [Diffusers](#Diffusers)
+- [CogVideoX-2B 视频作品](#cogvideox-2b-视频作品)
+- [CogVideoX模型介绍](#模型介绍)
+- [完整项目代码结构](#完整项目代码结构)
+  - [Inference](#inference)
+  - [SAT](#sat)
+  - [Tools](#tools)
+- [开源项目规划](#开源项目规划)
+- [模型协议](#模型协议)
+- [CogVideo(ICLR'23)模型介绍](#cogvideoiclr23)
+- [引用](#引用)
 ## 快速开始
@@ -84,7 +101,7 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
 | 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
 | 下载地址 (SAT 模型) | [SAT](./sat/README_zh.md) |
-## 项目结构
+## 完整项目代码结构
 本开源仓库将带领开发者快速上手 **CogVideoX** 开源模型的基础调用方式、微调示例。
@@ -117,24 +134,6 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
 + [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): 将 SAT 模型权重转换为 Huggingface 模型权重。
 + [caption_demo](tools/caption/README_zh.md): Caption 工具,对视频理解并用文字输出的模型。
-## 项目规划
-
-- [x] CogVideoX 模型开源
-  - [x] CogVideoX 模型推理示例 (CLI / Web Demo)
-  - [x] CogVideoX 在线体验示例 (Huggingface Space)
-  - [x] CogVideoX 开源模型API接口示例 (Huggingface)
-  - [x] CogVideoX 模型微调示例 (SAT)
-  - [ ] CogVideoX 模型微调示例 (Huggingface / SAT)
-  - [ ] CogVideoX-Pro 开源(适配 CogVideoX-2B 套件)
-  - [ ] CogVideoX 技术报告公开
-
-我们欢迎您的贡献,您可以点击[这里](resources/contribute_zh.md)查看更多信息。
-
-## 模型协议
-
-本仓库代码使用 [Apache 2.0 协议](LICENSE) 发布。
-
-本模型权重和模型实现代码根据 [CogVideoX LICENSE](MODEL_LICENSE) 许可证发布。
 ## CogVideo(ICLR'23)
 [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) 的官方repo位于[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)。
@@ -168,4 +167,23 @@ CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.amine
   journal={arXiv preprint arXiv:2205.15868},
   year={2022}
 }
-```
\ No newline at end of file
+```
+
+## 开源项目规划
+
+- [x] CogVideoX 模型开源
+  - [x] CogVideoX 模型推理示例 (CLI / Web Demo)
+  - [x] CogVideoX 在线体验示例 (Huggingface Space)
+  - [x] CogVideoX 开源模型API接口示例 (Huggingface)
+  - [x] CogVideoX 模型微调示例 (SAT)
+  - [ ] CogVideoX 模型微调示例 (Huggingface / SAT)
+  - [ ] CogVideoX-Pro 开源(适配 CogVideoX-2B 套件)
+  - [X] CogVideoX 技术报告公开
+
+我们欢迎您的贡献,您可以点击[这里](resources/contribute_zh.md)查看更多信息。
+
+## 模型协议
+
+本仓库代码使用 [Apache 2.0 协议](LICENSE) 发布。
+
+本模型权重和模型实现代码根据 [CogVideoX LICENSE](MODEL_LICENSE) 许可证发布。