# CogVideo & CogVideoX

[中文阅读](./README_zh.md) | [日本語で読む](./README_ja.md)
Experience the CogVideoX-5B model online on 🤗 Hugging Face Space, 🤖 ModelScope Space, or Replicate.
📚 View the paper and user guide
📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.
## Project Updates

- 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**. This model takes an image as a background input and generates a video conditioned on it together with a prompt, offering greater controllability. With this, the CogVideoX series now supports three tasks: text-to-video generation, video continuation, and image-to-video generation. You are welcome to try it online at [Experience](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥🔥 **News**: ```2024/9/19```: The caption model [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), used in the CogVideoX training process to convert video data into text descriptions, has been open-sourced. You are welcome to download and use it.
- 🔥 ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**, and significantly optimized inference performance, greatly lowering the hardware threshold: **CogVideoX-2B** now runs on older GPUs such as the `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs such as the `RTX 3060`. Please strictly follow the [requirements](requirements.txt) to update and install dependencies, and refer to [cli_demo](inference/cli_demo.py) for inference code. In addition, the open-source license for the **CogVideoX-2B** model has been changed to the **Apache 2.0 License**.
- 🔥 ```2024/8/6```: We have open-sourced the **3D Causal VAE** used for **CogVideoX-2B**, which can reconstruct videos with almost no loss.
- 🔥 ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced the CogVideo video generation model (now available in the `CogVideo` branch). This is the first open-source large Transformer-based text-to-video generation model. See the [ICLR'23 paper](https://arxiv.org/abs/2205.15868) for technical details.

## Table of Contents

Jump to a specific section:

- [Quick Start](#Quick-Start)
  - [SAT](#sat)
  - [Diffusers](#Diffusers)
- [CogVideoX-2B Video Works](#cogvideox-2b-gallery)
- [Introduction to the CogVideoX Model](#Model-Introduction)
- [Full Project Structure](#project-structure)
  - [Inference](#inference)
  - [SAT](#sat)
  - [Tools](#tools)
- [Introduction to CogVideo (ICLR'23) Model](#cogvideoiclr23)
- [Citations](#Citation)
- [Open Source Project Plan](#Open-Source-Project-Plan)
- [Model License](#Model-License)

## Quick Start

### Prompt Optimization

Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like GLM-4 (or other comparable products, such as GPT-4) to optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly affects the quality of the generated video.

### SAT

**Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.**

Follow the instructions in [sat_demo](sat/README.md), which contains the inference code and fine-tuning code for the SAT weights. It is recommended to build on the CogVideoX model structure; innovative researchers can use this code for rapid stacking and development.

### Diffusers

**Please make sure your Python version is between 3.10 and 3.12, inclusive of both 3.10 and 3.12.**

```
pip install -r requirements.txt
```

Then follow [diffusers_demo](inference/cli_demo.py) for a more detailed walkthrough of the inference code, including the meaning of common parameters. A minimal example is sketched below.
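As a quick check that the environment works, a minimal text-to-video sketch with the `diffusers` `CogVideoXPipeline` looks roughly as follows; the prompt, seed, and memory-saving calls are illustrative and should be adapted to your GPU (see [cli_demo](inference/cli_demo.py) for the full script):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the text-to-video pipeline; use "THUDM/CogVideoX-2b" with torch.float16 for the smaller model.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Optional memory savers: offload idle submodules to CPU and decode the VAE in tiles.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda playing an acoustic guitar in a sunlit bamboo forest, cinematic lighting, detailed fur."

video = pipe(
    prompt=prompt,
    num_frames=49,              # 6 seconds at 8 fps, plus the initial frame
    num_inference_steps=50,
    guidance_scale=6.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

The "from X GB" diffusers memory figures in the model table below assume memory optimizations of this kind.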
For more details on quantized inference, please refer to [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao/). With Diffusers and TorchAO, quantized inference is also possible, which reduces memory usage and, when compiled, can also speed up inference in some cases; a short quantization sketch follows the model table below. A full list of memory and time benchmarks with various settings on A100 and H100 GPUs has been published at [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao).

## Gallery

### CogVideoX-5B

Demo videos generated by CogVideoX-5B.

## Model Introduction

| Model Name | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|
| Model Description | Entry-level model, balancing compatibility. Low cost for running and secondary development. | Larger model with higher video generation quality and better visual effects. | CogVideoX-5B image-to-video version. |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; not supported: INT4 | BF16 (recommended), FP16, FP32, FP8*, INT8; not supported: INT4 | |
| Single GPU Memory Usage | SAT FP16: 18 GB<br>diffusers FP16: from 4 GB*<br>diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB<br>diffusers BF16: from 5 GB*<br>diffusers INT8 (torchao): from 4.4 GB* | |
| Multi-GPU Inference Memory Usage | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers | |
| Inference Speed (Step = 50, FP/BF16) | Single A100: ~90 seconds<br>Single H100: ~45 seconds | Single A100: ~180 seconds<br>Single H100: ~90 seconds | |
| Fine-tuning Precision | FP16 | BF16 | |
| Fine-tuning Memory Usage | 47 GB (bs=1, LORA)<br>61 GB (bs=2, LORA)<br>62 GB (bs=1, SFT) | 63 GB (bs=1, LORA)<br>80 GB (bs=2, LORA)<br>75 GB (bs=1, SFT) | 78 GB (bs=1, LORA)<br>75 GB (bs=1, SFT, 16 GPUs) |
| Prompt Language | English* | | |
| Maximum Prompt Length | 226 Tokens | | |
| Video Length | 6 Seconds | | |
| Frame Rate | 8 Frames / Second | | |
| Video Resolution | 720 x 480, no support for other resolutions (including fine-tuning) | | |
| Position Encoding | 3d_sincos_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
| Download Link (Diffusers) | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel |
| Download Link (SAT) | SAT | | |
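The "diffusers INT8 (torchao)" figures above refer to weight-only quantization of the transformer with TorchAO before assembling the pipeline. A minimal sketch of that approach, assuming the `quantize_` / `int8_weight_only` API from `torchao.quantization` as used in [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao); the prompt and memory settings are illustrative:

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, int8_weight_only

# Load only the transformer and quantize its weights to INT8 in place.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# Build the pipeline around the quantized transformer.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.vae.enable_tiling()  # tiled VAE decoding lowers peak memory during decoding

video = pipe(
    prompt="A lighthouse on a rocky cliff at sunset, waves crashing below.",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output_int8.mp4", fps=8)
```

Benchmarks for different quantization and compilation settings on A100 and H100 GPUs are collected in the diffusers-torchao repository linked above.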