diff --git a/README.md b/README.md
index fd6a91c..5f562cc 100644
--- a/README.md
+++ b/README.md
@@ -17,16 +17,18 @@ Experience the CogVideoX-5B model online at WeChat and Discord
-📍 Visit 清影 and API Platform to experience larger-scale commercial video generation models.
+📍 Visit QingYing and API Platform to experience larger-scale commercial video generation models.
## Update and News

-- 🔥🔥 **News**: ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. At the
-  same time, **CogVideoX-2B** will be licensed under the **Apache 2.0 License**. We have significantly optimized the
-  model's
-  inference performance, greatly lowering the inference threshold. You can now run **CogVideoX-2B** on earlier GPUs like
-  the `GTX 1080TI`, and **CogVideoX-5B** on mainstream desktop GPUs like the `RTX 3060`.
+- 🔥🔥 **News**: ```2024/8/27```: The **CogVideoX-2B** model's open-source license has been changed to the **Apache 2.0
+  License**.
+- 🔥🔥 **News**: ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**.
+  We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can
+  run **CogVideoX-2B** on older GPUs like the `GTX 1080TI`, and run the **CogVideoX-5B** model on mid-range GPUs like
+  the `RTX 3060`. Please ensure you update and install the dependencies according to
+  the [requirements](requirements.txt), and refer to the [cli_demo](inference/cli_demo.py) for inference code.
- 🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated
  by CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following
@@ -155,7 +157,9 @@ To view the corresponding prompt words for the gallery, please click [here](reso

## Model Introduction

-CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information.
+CogVideoX is an open-source version of the video generation model originating
+from [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below displays the list of video generation
+models we currently offer, along with their foundational information.

Single GPU VRAM Consumption
-FP16: 18GB using SAT / 12.5GB* using diffusers INT8: 7.8GB* using diffusers
-BF16: 26GB using SAT / 20.7GB* using diffusers INT8: 11.4GB* using diffusers
+FP16: 18GB using SAT / 12.5GB* using diffusers INT8: 7.8GB* using diffusers with torchao
+BF16: 26GB using SAT / 20.7GB* using diffusers INT8: 11.4GB* using diffusers with torchao

Multi-GPU Inference VRAM Consumption

@@ -236,15 +240,26 @@ CogVideoX is an open-source version of the video generation model originating fr

**Data Explanation**

-- When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()` optimization were enabled. This solution has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100/H100**. Generally, this solution can be adapted to all devices with **NVIDIA Ampere architecture** and above. If optimization is disabled, VRAM usage will increase significantly, with peak VRAM approximately 3 times the value in the table.
+- When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()`
+  optimization were enabled. This solution has not been tested for actual VRAM/memory usage on devices other than
+  **NVIDIA A100/H100**. Generally, this solution can be adapted to all devices with **NVIDIA Ampere architecture** and
+  above. If optimization is disabled, VRAM usage will increase significantly, with peak VRAM approximately 3 times the
+  value in the table.
- When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
-- Using an INT8 model will result in reduced inference speed. This is done to accommodate GPUs with lower VRAM, allowing inference to run properly with minimal video quality loss, though the inference speed will be significantly reduced.
-- The 2B model is trained using `FP16` precision, while the 5B model is trained using `BF16` precision. It is recommended to use the precision used in model training for inference.
-- `FP8` precision must be used on `NVIDIA H100` and above devices, requiring source installation of the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages. `CUDA 12.4` is recommended.
-- Inference speed testing also used the aforementioned VRAM optimization scheme. Without VRAM optimization, inference speed increases by about 10%. Only models using `diffusers` support quantization.
+- Using an INT8 model will result in reduced inference speed. This is done to accommodate GPUs with lower VRAM, allowing
+  inference to run properly with minimal video quality loss, though the inference speed will be significantly reduced.
+- The 2B model is trained using `FP16` precision, while the 5B model is trained using `BF16` precision. It is
+  recommended to use the precision used in model training for inference.
+- [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+  used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This
+  makes it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that
+  TorchAO quantization is fully compatible with `torch.compile`, which allows for much faster inference speed. `FP8`
+  precision must be used on `NVIDIA H100` and above devices, requiring source installation of
+  the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages. `CUDA 12.4` is recommended.
+- Inference speed testing also used the aforementioned VRAM optimization scheme. Without VRAM optimization, inference
+  speed increases by about 10%. Only models using `diffusers` support quantization.
- The model only supports English input; other languages can be translated to English during large model refinements.
-
## Friendly Links

We highly welcome contributions from the community and actively contribute to the open-source community. The following
@@ -277,30 +292,23 @@ of the **CogVideoX** open-source model.
  using an LLM. The script defaults to using GLM-4, but it can be replaced with GPT, Gemini, or any other large language
  model.
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web application demonstrating how to use the
-  CogVideoX-2B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple web
+  CogVideoX-2B / 5B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple
+  web application for video generation.

```shell
cd inference
# For Linux and Windows users
-python gradio_web_demo.py # humans mode
+python gradio_web_demo.py

# For macOS with Apple Silicon users, Intel not supported, this maybe 20x slower than RTX 4090
-PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py # humans mode
+PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py
```
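For readers who want to try the diffusers path described in the data explanation above, here is a minimal sketch. It is not the repository's `cli_demo.py`; the model id `THUDM/CogVideoX-5b`, the prompt, and the sampling settings are illustrative assumptions.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 5B model in BF16, the precision it was trained in (use FP16 for the 2B model).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# VRAM optimizations referenced in the data explanation: offload idle submodules to CPU
# and decode the video latents tile by tile instead of all at once.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A panda playing a guitar in a sunlit bamboo forest",  # illustrative prompt
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

Per the notes above, dropping `enable_model_cpu_offload()` and `enable_tiling()` trades roughly 3x higher peak VRAM for about 10% faster sampling.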
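The TorchAO quantization route mentioned in the data explanation can be sketched as follows. This is an illustrative example assuming recent `diffusers` and `torchao` builds (installed from source where the notes above require it), not an excerpt from the repository.

```python
import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from transformers import T5EncoderModel
from torchao.quantization import int8_weight_only, quantize_

model_id = "THUDM/CogVideoX-5b"  # assumed Hugging Face model id

# Load each sub-module and quantize its weights to INT8 in place with torchao.
text_encoder = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, int8_weight_only())

transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, int8_weight_only())

vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, int8_weight_only())

# Rebuild the pipeline around the quantized modules and keep the usual VRAM savers on.
pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# torchao quantization stays compatible with torch.compile for extra speed (optional):
# pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
```

Optimum-quanto can be used in a similar per-module fashion; refer to its documentation for the equivalent calls.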
@@ -117,7 +120,8 @@ pip install -r requirements.txt
@@ -139,7 +143,7 @@ pip install -r requirements.txt
## Model Introduction

-CogVideoX is an open-source video generation model sharing the same origin as [QingYing](https://chatglm.cn/video?fr=osm_cogvideox).
+CogVideoX is an open-source video generation model sharing the same origin as [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo).
The table below shows basic information about the video generation models we provide.