# CogVideo & CogVideoX [中文阅读](./README_zh.md) [日本語で読む](./README_ja.md)
Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space
📚 View the paper and user guide
📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.
## Project Updates

- 🔥🔥 **News**: ```2025/03/24```: We have launched [CogKit](https://github.com/THUDM/CogKit), a fine-tuning and inference framework for the **CogView4** and **CogVideoX** series. This toolkit allows you to fully explore and utilize our multimodal generation models.
- 🔥 **News**: ```2025/02/28```: DDIM Inverse is now supported in `CogVideoX-5B` and `CogVideoX1.5-5B`. Check [here](inference/ddim_inversion.py).
- 🔥 **News**: ```2025/01/08```: We have updated the code for `Lora` fine-tuning based on the `diffusers` version of the model, which uses less GPU memory. For more details, please see [here](finetune/README.md).
- 🔥 **News**: ```2024/11/15```: We released the `CogVideoX1.5` model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
- 🔥 **News**: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX. The CogVideoX1.5-5B series supports 10-second videos at higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution. The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
- 🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works on a single 4090 GPU, [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory), has been released. It supports fine-tuning at multiple resolutions. Feel free to use it!
- 🔥 **News**: ```2024/10/10```: We have updated our technical report. Please click [here](https://arxiv.org/pdf/2408.06072) to view it. More training details and a demo have been added. To see the demo, click [here](https://yzy-thu.github.io/CogVideoX-demo/).
- 🔥 **News**: ```2024/10/09```: We have publicly released the [technical documentation](https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh) for CogVideoX fine-tuning on Feishu, further increasing distribution flexibility. All examples in the public documentation can be fully reproduced.
- 🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**. This model takes an image as a background input and generates a video in combination with a prompt, offering greater controllability. With this, the CogVideoX series now supports three tasks: text-to-video generation, video continuation, and image-to-video generation. You are welcome to try it online [here](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥 ```2024/9/19```: The caption model [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), used during CogVideoX training to convert video data into text descriptions, has been open-sourced. You are welcome to download and use it.
- 🔥 ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run **CogVideoX-2B** on older GPUs like the `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs like the `RTX 3060`. Please strictly follow the [requirements](requirements.txt) to update and install dependencies, and refer to [cli_demo](inference/cli_demo.py) for inference code. In addition, the open-source license for the **CogVideoX-2B** model has been changed to the **Apache 2.0 License**.
- 🔥 ```2024/8/6```: We have open-sourced **3D Causal VAE**, used for **CogVideoX-2B**, which can reconstruct videos with almost no loss.
- 🔥 ```2024/8/6```: We have open-sourced the first model of the CogVideoX series of video generation models, **CogVideoX-2B**.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced the CogVideo video generation model (now available in the `CogVideo` branch). This is the first open-source large Transformer-based text-to-video generation model. See the [ICLR'23 paper](https://arxiv.org/abs/2205.15868) for technical details.

## Table of Contents

Jump to a specific section:

- [Quick Start](#quick-start)
  - [Prompt Optimization](#prompt-optimization)
  - [SAT](#sat)
  - [Diffusers](#diffusers)
- [Gallery](#gallery)
  - [CogVideoX-5B](#cogvideox-5b)
  - [CogVideoX-2B](#cogvideox-2b)
- [Model Introduction](#model-introduction)
- [Friendly Links](#friendly-links)
- [Project Structure](#project-structure)
  - [Quick Start with Colab](#quick-start-with-colab)
  - [Inference](#inference)
  - [Finetune](#finetune)
  - [SAT](#sat-1)
  - [Tools](#tools)
- [CogVideo (ICLR'23)](#cogvideoiclr23)
- [Citation](#citation)
- [Model License](#model-license)

## Quick Start

### Prompt Optimization

Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like GLM-4 (or comparable products such as GPT-4) to optimize the prompt. This is crucial because the model is trained on long prompts, and a good prompt directly impacts the quality of the generated video. (An illustrative sketch of this step appears after the Diffusers section below.)

### SAT

**Please make sure your Python version is between 3.10 and 3.12, inclusive.**

Follow the instructions in [sat_demo](sat/README.md), which contains the inference and fine-tuning code for the SAT weights. We recommend building on the CogVideoX model structure; researchers can use this code for rapid prototyping and development.

### Diffusers

**Please make sure your Python version is between 3.10 and 3.12, inclusive.**

```
pip install -r requirements.txt
```

Then follow [diffusers_demo](inference/cli_demo.py), a more detailed walkthrough of the inference code that explains the common parameters.

For details on quantized inference, please refer to [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao/). With Diffusers and TorchAO, quantized inference is possible, yielding memory-efficient inference as well as a speedup in some cases when compiled. A full list of memory and time benchmarks with various settings on A100 and H100 GPUs has been published at [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao).
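As a minimal starting point for the Diffusers route, the sketch below runs text-to-video generation with `CogVideoXPipeline`. It assumes the `THUDM/CogVideoX-5b` weights and typical sampling settings; [cli_demo](inference/cli_demo.py) remains the authoritative, fully featured reference.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5B text-to-video checkpoint in BF16 (the precision recommended in the table below).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# CPU offloading and VAE tiling keep peak GPU memory low enough for consumer cards.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda playing a small guitar on a wooden stool in a bamboo forest, cinematic lighting."
video = pipe(
    prompt=prompt,
    num_frames=49,              # CogVideoX-5B expects 8N + 1 frames
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator(device="cpu").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```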
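For the quantized path mentioned above, a rough sketch using `torchao` weight-only INT8 quantization might look like the following. The exact recipes and measured savings live in the diffusers-torchao repository, so treat this as an illustration rather than a drop-in replacement.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Quantize the transformer weights in place to INT8; the VAE and text encoder stay in BF16.
quantize_(pipe.transformer, int8_weight_only())

pipe.enable_model_cpu_offload()
video = pipe(
    prompt="A lighthouse on a rocky cliff at sunset, waves crashing below.",
    num_frames=49,
    num_inference_steps=50,
).frames[0]

export_to_video(video, "output_int8.mp4", fps=8)
```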
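Returning to the prompt-optimization step above: [convert_demo](inference/convert_demo.py) is the reference implementation, but the idea can be sketched with any OpenAI-compatible chat endpoint. The model name, system prompt, and client setup below are illustrative placeholders rather than the exact values used in the script.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key for an OpenAI-compatible endpoint (e.g. GPT-4, or GLM-4 via a compatible gateway)

SYSTEM_PROMPT = (
    "Expand the user's short video idea into one detailed English paragraph describing "
    "the subject, motion, camera, lighting, and style for a text-to-video model."
)

def optimize_prompt(short_prompt: str) -> str:
    """Turn a terse prompt into the long, descriptive prompt style CogVideoX was trained on."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": short_prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content

print(optimize_prompt("a cat surfing a big wave at sunset"))
```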
## Gallery

### CogVideoX-5B

### CogVideoX-2B

## Model Introduction

| Model Name | CogVideoX1.5-5B (Latest) | CogVideoX1.5-5B-I2V (Latest) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|---|---|
| Release Date | November 8, 2024 | November 8, 2024 | August 6, 2024 | August 27, 2024 | September 19, 2024 |
| Video Resolution | 1360 * 768 | Min(W, H) = 768, 768 ≤ Max(W, H) ≤ 1360, Max(W, H) % 16 = 0 | 720 * 480 | 720 * 480 | 720 * 480 |
| Number of Frames | 16N + 1, N ≤ 10 (default 81) | 16N + 1, N ≤ 10 (default 81) | 8N + 1, N ≤ 6 (default 49) | 8N + 1, N ≤ 6 (default 49) | 8N + 1, N ≤ 6 (default 49) |
| Inference Precision | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
| Single GPU Memory Usage | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT FP16: 18GB; diffusers FP16: from 4GB*; diffusers INT8 (torchao): from 3.6GB* | SAT BF16: 26GB; diffusers BF16: from 5GB*; diffusers INT8 (torchao): from 4.4GB* | SAT BF16: 26GB; diffusers BF16: from 5GB*; diffusers INT8 (torchao): from 4.4GB* |
| Multi-GPU Memory Usage | BF16: 24GB* using diffusers | BF16: 24GB* using diffusers | FP16: 10GB* using diffusers | BF16: 15GB* using diffusers | BF16: 15GB* using diffusers |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~1000 seconds (5-second video); Single H100: ~550 seconds (5-second video) | Single A100: ~1000 seconds (5-second video); Single H100: ~550 seconds (5-second video) | Single A100: ~90 seconds; Single H100: ~45 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds |
| Prompt Language | English* | English* | English* | English* | English* |
| Prompt Token Limit | 224 Tokens | 224 Tokens | 226 Tokens | 226 Tokens | 226 Tokens |
| Video Length | 5 or 10 seconds | 5 or 10 seconds | 6 seconds | 6 seconds | 6 seconds |
| Frame Rate | 16 frames / second | 16 frames / second | 8 frames / second | 8 frames / second | 8 frames / second |
| Position Encoding | 3d_rope_pos_embed | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
| Download Link (Diffusers) | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
| Download Link (SAT) | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel | SAT | SAT | SAT |
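The I2V entries in the table condition on an input image (used as the opening frame) plus a text prompt. Below is a minimal sketch with the diffusers image-to-video pipeline, assuming the `THUDM/CogVideoX-5b-I2V` weights and a local input image; see [cli_demo](inference/cli_demo.py) for the full set of options.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

image = load_image("input.jpg")  # the conditioning image that becomes the video's first frame
video = pipe(
    prompt="The scene slowly comes to life as a gentle breeze moves through it.",
    image=image,
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "i2v_output.mp4", fps=8)
```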