# CogVideo & CogVideoX [中文阅读](./README_zh.md) [日本語で読む](./README_ja.md)
Experience the CogVideoX-5B model online at 🤗 Huggingface Space or 🤖 ModelScope Space
📚 View the paper and user guide
📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.
## Update and News

- 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**. This model takes an image as a background input and combines it with a prompt to generate a video, providing greater controllability. With this release, the CogVideoX series now supports three tasks: text-to-video, video extension, and image-to-video generation. Feel free to try it out [online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space). A minimal `diffusers` sketch of the image-to-video task is shown after this list.
- 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool! You can now use local open-source models + FLUX + CogVideoX to automatically generate high-quality videos. Feel free to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
- 🔥 **News**: ```2024/9/15```: CogVideoX LoRA fine-tuning weights have been exported and tested successfully with the `diffusers` library. Please check the [tutorial](sat/README.md).
- 🔥 **News**: ```2024/8/29```: Adding `pipe.enable_sequential_cpu_offload()` and `pipe.vae.enable_slicing()` to the CogVideoX-5B inference code reduces VRAM usage to `5GB`. Please check the updated [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/27```: The **CogVideoX-2B** model's open-source license has been changed to the **Apache 2.0 License**.
- 🔥 **News**: ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. We have significantly optimized the model's inference performance, greatly lowering the inference threshold: you can run **CogVideoX-2B** on older GPUs like the `GTX 1080TI`, and **CogVideoX-5B** on mid-range GPUs like the `RTX 3060`. Please make sure you update and install the dependencies according to the [requirements](requirements.txt), and refer to the [cli_demo](inference/cli_demo.py) for inference code.
- 🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following the [tutorial](tools/venhancer/README.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`, so fine-tuning no longer requires installing `SwissArmyTransformer` from source. The `Tied VAE` technique has also been applied in the `diffusers` implementation; please install the `diffusers` and `accelerate` libraries from source. Inference with CogVideoX now requires only 12GB of VRAM. The inference code needs to be modified; please check [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arXiv. Feel free to check out the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/6```: We have also open-sourced the **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct videos almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now available in the `CogVideo` branch), the **first** open-source pretrained text-to-video model. See the [ICLR'23 CogVideo paper](https://arxiv.org/abs/2205.15868) for technical details.
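As a reference for the image-to-video task mentioned above, here is a minimal sketch using the `diffusers` library. It assumes `diffusers` (version 0.30.0 or later), `torch`, and a CUDA GPU are available; the model ID, input image path, prompt, and parameter values are illustrative rather than prescriptive, and the [cli_demo](inference/cli_demo.py) remains the reference implementation.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the image-to-video pipeline (assumes the CogVideoX-5B-I2V weights on the Hugging Face Hub).
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # keep VRAM usage low on consumer GPUs

# The input image serves as the background / first frame of the generated video.
image = load_image("input.jpg")  # replace with your own image path or URL
prompt = "A little girl rides a bicycle down a tree-lined street at golden hour, detailed, realistic."

video = pipe(
    prompt=prompt,
    image=image,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```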
**More powerful models with larger parameter sizes are on the way~ Stay tuned!**

## Table of Contents

Jump to a specific section:

- [Quick Start](#Quick-Start)
    - [SAT](#sat)
    - [Diffusers](#Diffusers)
- [CogVideoX-2B Video Works](#cogvideox-2b-gallery)
- [Introduction to the CogVideoX Model](#Model-Introduction)
- [Full Project Structure](#project-structure)
    - [Inference](#inference)
    - [SAT](#sat)
    - [Tools](#tools)
- [Introduction to the CogVideo (ICLR'23) Model](#cogvideoiclr23)
- [Citations](#Citation)
- [Open Source Project Plan](#Open-Source-Project-Plan)
- [Model License](#Model-License)

## Quick Start

### Prompt Optimization

Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the prompt. This is crucial because the model is trained on long prompts, and a good prompt directly impacts the quality of the generated video.

### SAT

**Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.**

Follow the instructions in [sat_demo](sat/README.md), which contains the inference and fine-tuning code for the SAT weights. It is recommended to build on the CogVideoX model structure; innovative researchers can use this code for rapid stacking and development.

### Diffusers

**Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.**

```
pip install -r requirements.txt
```

Then follow [diffusers_demo](inference/cli_demo.py) for a more detailed explanation of the inference code, including the significance of the common parameters. A condensed sketch of the same flow is shown below.
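The following is a condensed, hedged sketch of the text-to-video inference flow with `diffusers`; it assumes `diffusers` 0.30.0 or later, `torch`, and a CUDA GPU, and the prompt and parameter values are only examples. The [cli_demo](inference/cli_demo.py) script is the authoritative reference.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the text-to-video pipeline (CogVideoX-5B in BF16; use FP16 for CogVideoX-2B).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Memory-saving options mentioned in the 2024/8/29 news item above:
# sequential CPU offload plus VAE slicing brings VRAM usage down to roughly 5GB.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()

prompt = (
    "A panda, dressed in a small red jacket and a tiny hat, sits on a wooden stool "
    "in a serene bamboo forest, playing a miniature guitar."
)

video = pipe(
    prompt=prompt,
    num_inference_steps=50,  # the table below reports inference speed at 50 steps
    guidance_scale=6.0,
    num_frames=49,           # 49 frames ≈ 6 seconds at 8 frames/second
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```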
## Gallery

### CogVideoX-5B

## Model Introduction

| Model Name | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
| --- | --- | --- | --- |
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. | CogVideoX-5B image-to-video version. |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; not supported: INT4 | BF16 (recommended), FP16, FP32, FP8*, INT8; not supported: INT4 | BF16 (recommended), FP16, FP32, FP8*, INT8; not supported: INT4 |
| Single GPU Memory Usage | SAT FP16: 18GB<br>diffusers FP16: from 4GB*<br>diffusers INT8 (torchao): from 3.6GB* | SAT BF16: 26GB<br>diffusers BF16: from 5GB*<br>diffusers INT8 (torchao): from 4.4GB* | SAT BF16: 26GB<br>diffusers BF16: from 5GB*<br>diffusers INT8 (torchao): from 4.4GB* |
| Multi-GPU Inference Memory Usage | FP16: 10GB* using diffusers | BF16: 15GB* using diffusers | BF16: 15GB* using diffusers |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds<br>Single H100: ~45 seconds | Single A100: ~180 seconds<br>Single H100: ~90 seconds | Single A100: ~180 seconds<br>Single H100: ~90 seconds |
| Fine-tuning Precision | FP16 | BF16 | BF16 |
| Fine-tuning Memory Usage | 47 GB (bs=1, LORA)<br>61 GB (bs=2, LORA)<br>62 GB (bs=1, SFT) | 63 GB (bs=1, LORA)<br>80 GB (bs=2, LORA)<br>75 GB (bs=1, SFT) | 78 GB (bs=1, LORA)<br>75 GB (bs=1, SFT, 16 GPUs) |
| Prompt Language | English* | English* | English* |
| Maximum Prompt Length | 226 Tokens | 226 Tokens | 226 Tokens |
| Video Length | 6 Seconds | 6 Seconds | 6 Seconds |
| Frame Rate | 8 Frames / Second | 8 Frames / Second | 8 Frames / Second |
| Video Resolution | 720 x 480, no support for other resolutions (including fine-tuning) | 720 x 480, no support for other resolutions (including fine-tuning) | 720 x 480, no support for other resolutions (including fine-tuning) |
| Position Encoding | 3d_sincos_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
| Download Link (Diffusers) | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel |
| Download Link (SAT) | SAT | SAT | SAT |
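The `diffusers INT8 (torchao)` memory figures in the table come from quantizing the model weights. Below is a minimal sketch of how INT8 weight-only quantization could be applied with `torchao`; it is not the project's reference implementation. It assumes a recent `torchao` with the `quantize_` / `int8_weight_only` API and `torch` 2.4 or later, and actual memory savings and compatible offload settings may differ.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import int8_weight_only, quantize_

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# INT8 weight-only quantization of the transformer, the largest component of the pipeline.
quantize_(pipe.transformer, int8_weight_only())

pipe.enable_model_cpu_offload()  # offload idle components to CPU between denoising and decoding
pipe.vae.enable_slicing()        # decode the video latents in slices to save memory

video = pipe(
    prompt="A lighthouse on a rocky cliff at sunset, waves crashing below.",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output_int8.mp4", fps=8)
```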