mirror of https://github.com/THUDM/CogVideo.git
synced 2025-04-05 03:04:56 +08:00

update the release draft readme

This commit is contained in:
parent e000b43cd4
commit 6d7f6e8601

README.md (244 lines changed)
@@ -8,7 +8,7 @@
<img src=resources/logo.svg width="50%"/>
</div>
<p align="center">
🤗 Experience on <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank">🤗 Huggingface Space</a> or <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank">🤖 ModelScope Space</a>
</p>
<p align="center">
📚 Check here to view <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
@@ -22,7 +22,12 @@

## Update and News

- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos
- 🔥🔥 **News**: ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. At the
  same time, **CogVideoX-2B** will be licensed under the **Apache 2.0 License**. We have significantly optimized the
  model's inference performance, greatly lowering the inference threshold. You can now run **CogVideoX-2B** on earlier
  GPUs like the `GTX 1080TI`, and **CogVideoX-5B** on mainstream desktop GPUs like the `RTX 3060`.
- 🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos
  generated by CogVideoX, achieving higher-resolution and higher-quality video rendering. We welcome you to try it out
  by following the [tutorial](tools/venhancer/README_zh.md).
@@ -80,7 +85,6 @@ with long prompts, and a good prompt directly impacts the quality of the video generation.

Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform
rapid stacking and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)

### Diffusers
@@ -92,51 +96,154 @@ pip install -r requirements.txt

Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
significance of common parameters.
(24 GB for inference; the fine-tuning code is under development)
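For quick orientation, below is a minimal text-to-video sketch in the spirit of `cli_demo.py`. The `THUDM/CogVideoX-2b` checkpoint and the FP16 recommendation come from this README; the prompt and the exact parameter values are illustrative assumptions, not the script's authoritative settings.

```python
# Minimal sketch, assuming a diffusers release with CogVideoX support (>=0.30).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # trades speed for lower VRAM; see the memory notes below

video = pipe(
    prompt="A detailed wooden toy ship gliding over a plush blue carpet.",
    num_inference_steps=50,   # more steps, better quality, slower
    guidance_scale=6.0,       # classifier-free guidance strength
    num_frames=49,            # 6 s at 8 fps, plus one initial frame
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```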
## CogVideoX-2B Gallery
## Gallery

<div align="center">
  <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
  <p>A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.</p>
</div>

### CogVideoX-5B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

<div align="center">
  <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="80%" controls autoplay></video>
  <p>The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.</p>
</div>

### CogVideoX-2B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

<div align="center">
  <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="80%" controls autoplay></video>
  <p>A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.</p>
</div>

<div align="center">
  <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="80%" controls autoplay></video>
  <p>In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.</p>
</div>

To view the corresponding prompts for the gallery, please click [here](resources/galary_prompt.md).
## Model Introduction

CogVideoX is an open-source version of the video generation model, which is homologous
to [清影](https://chatglm.cn/video?fr=osm_cogvideox).

<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Description</td>
    <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td>
    <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8* (E4M3, E5M2), INT8; INT4 not supported</td>
    <td style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8* (E4M3, E5M2), INT8; INT4 not supported</td>
  </tr>
  <tr>
    <td style="text-align: center;">Single GPU Memory Consumption</td>
    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Speed<br>(Steps = 50)</td>
    <td style="text-align: center;">FP16: ~90 s*</td>
    <td style="text-align: center;">BF16: ~180 s*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Precision</td>
    <td style="text-align: center;"><b>FP16</b></td>
    <td style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Memory Consumption (per GPU)</td>
    <td style="text-align: center;">47 GB (bs=1, LoRA)<br>61 GB (bs=2, LoRA)<br>62 GB (bs=1, SFT)</td>
    <td style="text-align: center;">63 GB (bs=1, LoRA)<br>80 GB (bs=2, LoRA)<br>75 GB (bs=1, SFT)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
    <td colspan="2" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Length Limit</td>
    <td colspan="2" style="text-align: center;">226 Tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Length</td>
    <td colspan="2" style="text-align: center;">6 seconds</td>
  </tr>
  <tr>
    <td style="text-align: center;">Frame Rate</td>
    <td colspan="2" style="text-align: center;">8 frames per second</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Resolution</td>
    <td colspan="2" style="text-align: center;">720 * 480; other resolutions not supported (including fine-tuning)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Positional Encoding</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed</td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Links (Diffusers Model)</td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Links (SAT Model)</td>
    <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
  </tr>
</table>
The table below shows the list of video generation models we currently provide, along with related basic information:

**Data Explanation**

| Model Name                                 | CogVideoX-2B |
|--------------------------------------------|--------------|
| Prompt Language                            | English |
| Single GPU Inference (FP16)                | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) <br> 23.9GB using diffusers |
| Multi-GPU Inference (FP16)                 | 20GB minimum per GPU using diffusers |
| GPU Memory Required for Fine-tuning (bs=1) | 40GB |
| Prompt Max Length                          | 226 Tokens |
| Video Length                               | 6 seconds |
| Frames Per Second                          | 8 frames |
| Resolution                                 | 720 * 480 |
| Quantized Inference                        | Not Supported |
| Download Link (HF diffusers Model)         | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
| Download Link (SAT Model)                  | [SAT](./sat/README.md) |

+ When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()`
  optimization were enabled. Actual memory/VRAM usage has not been tested on devices other than **NVIDIA A100 / H100**.
  Generally, this approach should be compatible with all devices using the **NVIDIA Ampere architecture** and above. If
  these optimizations are disabled, memory usage increases significantly, with peak VRAM usage approximately three
  times the values shown in the table.
+ When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
+ Using the INT8 model results in slower inference. It allows GPUs with less memory to run inference without
  significant video-quality loss, at the cost of a notable reduction in speed.
+ Inference speed tests were also conducted with the above memory optimizations; without them, inference is roughly
  10% faster. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; prompts in other languages can be translated into English while being refined
  by a large language model.
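To make the notes above concrete, here is a minimal sketch of how the two memory optimizations are toggled with diffusers. The model ID and dtype follow the CogVideoX-2B rows above; everything else is an illustrative assumption rather than the repository's authoritative setup.

```python
# Sketch of the memory setup described in the notes above (assumes diffusers>=0.30).
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Single-GPU, low-VRAM path: offload idle submodules to CPU and decode the VAE in tiles.
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# Multi-GPU path (per the note above): do NOT call enable_model_cpu_offload();
# instead place each pipeline replica on its local device, e.g. pipe.to("cuda:0").
```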
## Friendly Links

@@ -157,20 +264,25 @@ of the **CogVideoX** open-source model.

### Inference

+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
  significance of common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of
  memory, but it will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because
  CogVideoX is trained on long captions, we need to convert the input text to be consistent with the training
  distribution using an LLM. By default, the script uses GLM-4, but it can also be replaced with any other LLM such as
  GPT, Gemini, etc.
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web UI demonstrating how to use the CogVideoX-2B
  model to generate videos. Same as our Huggingface Space, you can use this script to launch a web demo.
+ [cli_demo](inference/cli_demo.py): A more detailed inference code explanation, including the significance of common
  parameters. All of this is covered here.
+ [cli_demo_quantization](inference/cli_demo_quantization.py): Quantized model inference code that can run on devices
  with lower memory. You can also modify this code to support running CogVideoX models in FP8 precision.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Code for running VAE inference separately.
+ [space demo](inference/gradio_composite_demo): The same GUI code as used in the Huggingface Space, with frame
  interpolation and super-resolution tools integrated.
+ [convert_demo](inference/convert_demo.py): How to convert user input into long-form input suitable for CogVideoX.
  Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training
  data using an LLM. The script defaults to GLM-4, but it can be replaced with GPT, Gemini, or any other large
  language model. (A conceptual sketch of this step follows the shell example below.)
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web application demonstrating how to use the
  CogVideoX-2B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple web
  application for video generation.

```shell
cd inference
# For Linux and Windows users (and macOS with Intel??)
# For Linux and Windows users
python gradio_web_demo.py # humans mode

# For macOS with Apple Silicon users; Intel is not supported. This may be 20x slower than an RTX 4090.
PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py # humans mode
```
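The prompt-refinement step in convert_demo can be pictured as below. This is a hypothetical sketch, not the repository's implementation: `SYSTEM_INSTRUCTION` paraphrases the idea described in the bullet above, and `call_llm` is a placeholder for whichever client (GLM-4, GPT, Gemini, ...) you wire in.

```python
# Hypothetical sketch of the convert_demo idea: expand a short user prompt into a
# long, training-distribution-style caption before handing it to CogVideoX.
from typing import Callable

# Paraphrased instruction; the real script ships its own carefully tuned prompt.
SYSTEM_INSTRUCTION = (
    "You write video captions. Expand the user's short idea into one detailed, "
    "fluent English paragraph describing subjects, motion, lighting, and setting."
)

def refine_prompt(short_prompt: str, call_llm: Callable[[str, str], str]) -> str:
    """call_llm(system, user) -> text; plug in GLM-4, GPT, Gemini, etc."""
    return call_llm(SYSTEM_INSTRUCTION, short_prompt)

if __name__ == "__main__":
    # Stub LLM so the sketch runs standalone; replace with a real client call.
    echo = lambda system, user: f"A cinematic shot of {user} ..."
    print(refine_prompt("a girl riding a bike", echo))
```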
@@ -243,20 +355,28 @@ hands-on practice on text-to-video generation. *The original input is in Chinese

## Open Source Project Plan

- [x] Open source CogVideoX model
- [x] Open source 3D Causal VAE used in CogVideoX
- [x] CogVideoX model inference example (CLI / Web Demo)
- [x] CogVideoX online experience demo (Huggingface Space)
- [x] CogVideoX open source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite)
- [x] Release CogVideoX technical report
- [x] CogVideoX Model Open Source
- [x] CogVideoX Model Inference Example (CLI / Web Demo)
- [x] CogVideoX Online Experience Example (Huggingface Space)
- [x] CogVideoX Open Source Model API Interface Example (Huggingface)
- [x] CogVideoX Model Fine-Tuning Example (SAT)
- [ ] CogVideoX Model Fine-Tuning Example (Huggingface Diffusers)
- [x] CogVideoX-5B Open Source (Adapted to CogVideoX-2B Suite)
- [x] CogVideoX Technical Report Released
- [x] CogVideoX Technical Explanation Video
- [ ] CogVideoX Peripheral Tools
    - [x] Basic Video Super-Resolution / Frame Interpolation Suite
    - [ ] Inference Framework Adaptation
    - [ ] ComfyUI Full Ecosystem Tools

We welcome your contributions. You can click [here](resources/contribute.md) for more information.
We welcome your contributions! You can click [here](resources/contribute_zh.md) for more information.

## Model License
## License Agreement

The code in this repository is released under the [Apache 2.0 License](LICENSE).

The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under
the [Apache 2.0 License](LICENSE).

The CogVideoX-5B model (Transformers module) is released under
the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
README_ja.md (244 lines changed)
@@ -8,7 +8,7 @@
<img src=resources/logo.svg width="50%"/>
</div>
<p align="center">
🤗 Experience it on the <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
Experience the CogVideoX-5B model online on the <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank">🤗 Huggingface Space</a> or the <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank">🤖 ModelScope Space</a>
</p>
<p align="center">
📚 Check out the <a href="https://arxiv.org/abs/2408.06072" target="_blank">paper</a>
@@ -22,7 +22,10 @@

## Updates and News

- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports CogVideoX
- 🔥🔥 **News**: ```2024/8/27```: We have open-sourced **CogVideoX-5B**, a larger model in the CogVideoX series. At the
  same time, **CogVideoX-2B** is relicensed under **Apache 2.0**. We have substantially optimized the model's inference
  performance, greatly lowering the inference threshold: **CogVideoX-2B** now runs on older GPUs such as the
  `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs such as the `RTX 3060`.
- 🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos
  generated by CogVideoX, achieving higher-resolution and higher-quality rendering. Please try it by following the
  [tutorial](tools/venhancer/README_ja.md).
- 🔥 **News**: 2024/8/15: The `SwissArmyTransformer` dependency of CogVideoX has been upgraded to `0.4.12`,
@@ -71,7 +74,6 @@

Follow the instructions in [sat_demo](sat/README.md): it contains the inference and fine-tuning code for the SAT
weights. It is recommended to build improvements on the CogVideoX model structure; innovative researchers can use this
code for rapid stacking and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)

### Diffusers

@@ -80,49 +82,156 @@ pip install -r requirements.txt
```

Then see [diffusers_demo](inference/cli_demo.py): a detailed explanation of the inference code, including the meaning
of common parameters.
(24 GB for inference; the fine-tuning code is under development)

## CogVideoX-2B Gallery
## Gallery
<div align="center">
  <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
  <p>A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.</p>
</div>

### CogVideoX-5B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

<div align="center">
  <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="80%" controls autoplay></video>
  <p>The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope; dust kicks up from its tires, and the sunlight shines on the SUV, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, as if on a rugged drive through the rugged terrain. The dirt road is surrounded by steep hills and mountains, with a clear blue sky and wispy clouds above.</p>
</div>

### CogVideoX-2B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

<div align="center">
  <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="80%" controls autoplay></video>
  <p>A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall, holding a can of spray paint, spray-painting a colorful bird on the mottled wall.</p>
</div>

<div align="center">
  <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="80%" controls autoplay></video>
  <p>In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glisten with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.</p>
</div>

To view the corresponding prompts for the gallery, click [here](resources/galary_prompt.md).
## Model Introduction

CogVideoX is an open-source video generation model that shares its origins with [清影](https://chatglm.cn/video?fr=osm_cogvideox).
CogVideoX is an open-source video generation model derived from [清影](https://chatglm.cn/video?fr=osm_cogvideox).
The table below shows basic information about the video generation models we provide.

The table below lists the video generation models we currently provide, along with related basic information:
<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Description</td>
    <td style="text-align: center;">Entry-level model that emphasizes compatibility. Low cost to run and to build on.</td>
    <td style="text-align: center;">Larger model offering higher video generation quality and better visual effects.</td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8 (E4M3, E5M2), INT8; INT4 not supported</td>
    <td style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8 (E4M3, E5M2), INT8; INT4 not supported</td>
  </tr>
  <tr>
    <td style="text-align: center;">Single GPU Memory Consumption</td>
    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Speed<br>(Steps = 50)</td>
    <td style="text-align: center;">FP16: ~90 s*</td>
    <td style="text-align: center;">BF16: ~180 s*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Precision</td>
    <td style="text-align: center;"><b>FP16</b></td>
    <td style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Memory Consumption (per GPU)</td>
    <td style="text-align: center;">47 GB (bs=1, LoRA)<br>61 GB (bs=2, LoRA)<br>62 GB (bs=1, SFT)</td>
    <td style="text-align: center;">63 GB (bs=1, LoRA)<br>80 GB (bs=2, LoRA)<br>75 GB (bs=1, SFT)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
    <td colspan="2" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Length Limit</td>
    <td colspan="2" style="text-align: center;">226 tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Length</td>
    <td colspan="2" style="text-align: center;">6 seconds</td>
  </tr>
  <tr>
    <td style="text-align: center;">Frame Rate</td>
    <td colspan="2" style="text-align: center;">8 frames per second</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Resolution</td>
    <td colspan="2" style="text-align: center;">720 * 480; other resolutions not supported (including fine-tuning)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Positional Encoding</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed</td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Links (Diffusers Model)</td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Links (SAT Model)</td>
    <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
  </tr>
</table>
| Model Name                                 | CogVideoX-2B |
|--------------------------------------------|--------------|
| Prompt Language                            | English |
| Single GPU Inference (FP16)                | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) <br> 23.9GB using diffusers |
| Multi-GPU Inference (FP16)                 | 20GB minimum per GPU using diffusers |
| GPU Memory Required for Fine-tuning (bs=1) | 40GB |
| Prompt Max Length                          | 226 tokens |
| Video Length                               | 6 seconds |
| Frame Rate                                 | 8 frames |
| Resolution                                 | 720 * 480 |
| Quantized Inference                        | Not supported |
| Download Link (HF diffusers Model)         | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
| Download Link (SAT Model)                  | [SAT](./sat/README.md) |

**Data Explanation**

+ Tests with the diffusers library enabled the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()`
  optimization. Actual memory/VRAM usage has not been tested on devices other than **NVIDIA A100 / H100**. In general,
  the approach works on all devices with the **NVIDIA Ampere architecture** or newer. Disabling the optimizations
  raises memory usage substantially, with peak VRAM roughly three times the values in the table.
+ Multi-GPU inference requires disabling the `enable_model_cpu_offload()` optimization.
+ Using the INT8 model lowers inference speed; this allows GPUs with less memory to run inference normally while
  keeping video-quality loss small, at a large cost in speed.
+ Inference-speed tests were also run with the memory optimizations above; without them, inference is about 10%
  faster. Only the `diffusers` version of the model supports quantization.
+ The model supports English input only; other languages can be translated into English by a large language model.
## Friendly Links

@@ -132,15 +241,17 @@ CogVideoX is an open-source video generation model that shares its origins with 清影

  A powerful and comprehensive distributed inference framework that lets you deploy your own model or the latest open-source models with one click.
+ [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSys provides easy-to-use, high-performance video generation infrastructure, continuously integrating the latest models and techniques.

## Project Structure

This open-source repository guides you to quickly get started with the basic usage and fine-tuning examples of the **CogVideoX** open-source model.

### Inference

+ [diffusers_demo](inference/cli_demo.py): A detailed explanation of the inference code, including the meaning of common parameters.
+ [cli_demo](inference/cli_demo.py): A detailed explanation of the inference code, including the meaning of common parameters.
+ [cli_demo_quantization](inference/cli_demo_quantization.py): Quantized-model inference code that can run on low-memory devices; it can also be modified to run CogVideoX models in FP8 precision.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code alone currently requires 71 GB of memory, but it will be optimized in the future.
+ [space demo](inference/gradio_composite_demo): The same GUI code as the Huggingface Space, with frame interpolation and super-resolution tools built in.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long captions, the input text must be aligned with the training distribution via an LLM. GLM-4 is used by default, but it can be replaced with GPT, Gemini, or any other LLM.
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web application demonstrating how to generate videos with the CogVideoX-2B

@@ -148,7 +259,7 @@

```shell
cd inference
# For Linux and Windows users (and macOS with Intel??)
# For Linux and Windows users
python gradio_web_demo.py # humans mode

# For macOS with Apple Silicon users; Intel is not supported. This may be 20x slower than an RTX 4090.
PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py # humans mode
```

@@ -178,26 +289,6 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py # humans mode

+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights into Huggingface model weights.
+ [caption_demo](tools/caption): A captioning tool: a model that understands videos and describes them in text.
## Project Plan

- [x] Open-source the CogVideoX model
- [x] Open-source the 3D Causal VAE used in CogVideoX
- [x] CogVideoX model inference examples (CLI / web demo)
- [x] CogVideoX online experience demo (Huggingface Space)
- [x] CogVideoX open-source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open-source CogVideoX-Pro (adapted to the CogVideoX-2B suite)
- [x] Release the CogVideoX technical report

We welcome your contributions. Click [here](resources/contribute.md) for more information.

## Model License

The code in this repository is released under the [Apache 2.0 License](LICENSE).

The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).

## CogVideo (ICLR'23)

Official repository of the paper: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)

@@ -238,19 +329,28 @@ The CogVideo demo is available at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo)

## Open Source Project Plan

- [x] Open-source the CogVideoX model
- [x] CogVideoX model inference examples (CLI / web demo)
- [x] CogVideoX online experience example (Huggingface Space)
- [x] CogVideoX open-source API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open-source CogVideoX-Pro (adapted to the CogVideoX-2B suite)
- [x] Release the CogVideoX technical report
- [x] Open-source the CogVideoX model
- [x] CogVideoX model inference examples (CLI / web demo)
- [x] CogVideoX online experience example (Huggingface Space)
- [x] CogVideoX open-source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface Diffusers)
- [x] CogVideoX-5B open-sourced (adapted to the CogVideoX-2B suite)
- [x] CogVideoX technical report released
- [x] CogVideoX technical explanation video
- [ ] CogVideoX peripheral tools
    - [x] Basic video super-resolution / frame interpolation suite
    - [ ] Inference framework adaptation
    - [ ] ComfyUI full ecosystem tools

We welcome everyone's contributions. See [here](resources/contribute_zh.md) for details.
We look forward to your contributions! Click [here](resources/contribute_zh.md) for details.

## Model License
## License Agreement

The code in this repository is released under the [Apache 2.0 License](LICENSE).

The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the
[Apache 2.0 License](LICENSE).

The CogVideoX-5B model (Transformers module) is released under the
[CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
README_zh.md (200 lines changed)
@@ -9,7 +9,7 @@
<img src=resources/logo.svg width="50%"/>
</div>
<p align="center">
🤗 Experience the video generation model on the <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
Experience the CogVideoX-5B model online on the <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank">🤗 Huggingface Space</a> or the <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank">🤖 ModelScope Space</a>
</p>
<p align="center">
📚 View the <a href="https://arxiv.org/abs/2408.06072" target="_blank">paper</a>
@@ -23,7 +23,10 @@

## Project Updates

- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by CogVideoX
- 🔥🔥 **News**: ```2024/8/27```: We have open-sourced **CogVideoX-5B**, a larger model in the CogVideoX series. At the
  same time, **CogVideoX-2B** is relicensed under the **Apache 2.0 License**. We have substantially optimized inference
  performance, greatly lowering the inference threshold: **CogVideoX-2B** runs on earlier cards such as the
  `GTX 1080TI`, and **CogVideoX-5B** on mainstream desktop cards such as the `RTX 3060`.
- 🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos
  generated by CogVideoX, enabling higher-resolution, higher-quality video rendering. You are welcome to try it by
  following the [tutorial](tools/venhancer/README_zh.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency of CogVideoX has been upgraded to `0.4.12`;
  fine-tuning no longer requires installing `SwissArmyTransformer` from source. Meanwhile, the `Tied VAE` technique has been applied to `diffusers`
@@ -60,15 +63,14 @@

### Prompt Optimization

Before running the model, please see [here](inference/convert_demo.py) for how we use GLM-4 (or a comparable product,
such as GPT-4) to optimize the prompt with a large model. This matters: the model is trained on long prompts, so a good
prompt directly affects the quality of the generated video.

### SAT

See [sat_demo](sat/README.md) under the sat folder: it contains the inference and fine-tuning code for the SAT weights.
It is recommended to base improvements to the CogVideoX model structure on this code; researchers can use it for faster
iteration and development.
(18 GB for inference, 40 GB for LoRA fine-tuning)

### Diffusers
@@ -76,49 +78,145 @@
pip install -r requirements.txt
```

See [diffusers_demo](inference/cli_demo.py): a more detailed walkthrough of the inference code, covering the key parameters. (24 GB for inference; the fine-tuning code is under development)
See [diffusers_demo](inference/cli_demo.py): a more detailed walkthrough of the inference code, covering the key parameters.

## CogVideoX-2B Video Gallery
## Video Gallery
<div align="center">
  <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
  <p>A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.</p>
</div>

### CogVideoX-5B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video></td>
  </tr>
  <tr>
    <td><video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

<div align="center">
  <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="80%" controls autoplay></video>
  <p>The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.</p>
</div>

### CogVideoX-2B

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td><video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video></td>
    <td><video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video></td>
  </tr>
</table>

<div align="center">
  <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="80%" controls autoplay></video>
  <p>A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.</p>
</div>

<div align="center">
  <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="80%" controls autoplay></video>
  <p>In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.</p>
</div>

To view the corresponding prompts for the gallery, click [here](resources/galary_prompt.md).
## Model Introduction

CogVideoX is the open-source version of the video generation model that shares its origins with [清影](https://chatglm.cn/video?fr=osm_cogvideox).
The table below shows basic information about the video generation models we provide:

The table below lists the video generation models we currently provide, along with related basic information:
<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B (this repository)</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Description</td>
    <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost to run and to build on.</td>
    <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8* (E4M3, E5M2), INT8; INT4 not supported</td>
    <td style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8* (E4M3, E5M2), INT8; INT4 not supported</td>
  </tr>
  <tr>
    <td style="text-align: center;">Single GPU Memory Consumption</td>
    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Speed<br>(Steps = 50)</td>
    <td style="text-align: center;">FP16: ~90 s*</td>
    <td style="text-align: center;">BF16: ~180 s*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Precision</td>
    <td style="text-align: center;"><b>FP16</b></td>
    <td style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Memory Consumption (per GPU)</td>
    <td style="text-align: center;">47 GB (bs=1, LoRA)<br>61 GB (bs=2, LoRA)<br>62 GB (bs=1, SFT)</td>
    <td style="text-align: center;">63 GB (bs=1, LoRA)<br>80 GB (bs=2, LoRA)<br>75 GB (bs=1, SFT)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
    <td colspan="2" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Length Limit</td>
    <td colspan="2" style="text-align: center;">226 Tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Length</td>
    <td colspan="2" style="text-align: center;">6 seconds</td>
  </tr>
  <tr>
    <td style="text-align: center;">Frame Rate</td>
    <td colspan="2" style="text-align: center;">8 frames per second</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Resolution</td>
    <td colspan="2" style="text-align: center;">720 * 480; other resolutions not supported (including fine-tuning)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Positional Encoding</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed</td>
  </tr>
</table>
| Model Name                                | CogVideoX-2B |
|-------------------------------------------|--------------|
| Prompt Language                           | English |
| Single GPU Inference (FP-16) Memory Usage | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) <br> 23.9GB using diffusers |
| Multi-GPU Inference (FP-16) Memory Usage  | 20GB minimum per GPU using diffusers |
| Fine-Tuning Memory Usage (bs=1)           | 42GB |
| Prompt Max Length                         | 226 Tokens |
| Video Length                              | 6 seconds |
| Frame Rate (per second)                   | 8 frames |
| Video Resolution                          | 720 * 480 |
| Quantized Inference                       | Not supported |
| Download Link (Diffusers Model)           | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
| Download Link (SAT Model)                 | [SAT](./sat/README_zh.md) |

**Data Explanation**

+ When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()`
  optimization were enabled. Actual VRAM/memory usage was not tested on devices other than **NVIDIA A100 / H100**;
  generally, the scheme works on all devices with the **NVIDIA Ampere architecture** or newer. With the optimizations
  disabled, VRAM usage multiplies, with peak VRAM roughly three times the values in the table.
+ Multi-GPU inference requires disabling the `enable_model_cpu_offload()` optimization.
+ Using the INT8 model slows inference; this lets lower-VRAM cards run inference normally with little video-quality
  loss, at a substantial cost in speed.
+ The inference-speed tests also used the memory optimizations above; without them, inference is about 10% faster.
  Only the `diffusers` version of the model supports quantization.
+ The model supports English input only; other languages can be translated into English during LLM prompt refinement.
## Friendly Links

@@ -133,16 +231,19 @@ CogVideoX is the open-source version of the video generation model that shares its origins with 清影

### inference

+ [diffusers_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code; the meaning of common parameters is covered here.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Standalone VAE inference code; currently requires 71 GB of VRAM, to be optimized in the future.
+ [cli_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code; the meaning of common parameters is covered here.
+ [cli_demo_quantization](inference/cli_demo_quantization.py): Quantized-model inference code that can run on devices with less VRAM; it can also be modified to support running CogVideoX at FP8 and other precisions. Note that FP8 has only passed basic testing and requires installing `torch-nightly` and `torchao` from source; it is not recommended for production environments.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Standalone VAE inference code.
+ [space demo](inference/gradio_composite_demo): The same GUI code as the Huggingface Space, with frame interpolation and super-resolution tools built in.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a long-form input suitable for CogVideoX. Because CogVideoX is trained on long texts, the input text distribution must be converted by an LLM into long text consistent with training. The script uses GLM-4 by default; it can be replaced with GPT, Gemini, or any other large language model.
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web application demonstrating how to generate videos with the CogVideoX-2B model. Similar to our Huggingface Space, you can use this script to run a simple web application for video generation.

```shell
cd inference
# For Linux and Windows users (and macOS with Intel??)
# For Linux and Windows users
python gradio_web_demo.py # humans mode

# For macOS with Apple Silicon users; Intel is not supported. This may be 20x slower than an RTX 4090.
PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py # humans mode
```
@@ -216,9 +317,14 @@ The CogVideo demo site is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo)

- [x] CogVideoX online experience example (Huggingface Space)
- [x] CogVideoX open-source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open-source CogVideoX-Pro (adapted to the CogVideoX-2B suite)
- [ ] CogVideoX model fine-tuning example (Huggingface Diffusers)
- [x] CogVideoX-5B open-sourced (adapted to the CogVideoX-2B suite)
- [x] CogVideoX technical report released
- [x] CogVideoX technical explanation video
- [ ] CogVideoX peripheral tools
    - [x] Basic video super-resolution / frame interpolation suite
    - [ ] Inference framework adaptation
    - [ ] ComfyUI full ecosystem tools

We welcome your contributions; click [here](resources/contribute_zh.md) for more information.

@@ -226,4 +332,8 @@

The code in this repository is released under the [Apache 2.0 License](LICENSE).

The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the
[Apache 2.0 License](LICENSE).

The CogVideoX-5B model (Transformers module) is released under the
[CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
inference/cli_demo.py

@@ -12,18 +12,18 @@ Run the script:

import argparse
import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler, CogVideoXDPMScheduler
from diffusers.utils import export_to_video


def generate_video(
        prompt: str,
        model_path: str,
        output_path: str = "./output.mp4",
        num_inference_steps: int = 50,
        guidance_scale: float = 6.0,
        num_videos_per_prompt: int = 1,
        dtype: torch.dtype = torch.bfloat16,
    prompt: str,
    model_path: str,
    output_path: str = "./output.mp4",
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    num_videos_per_prompt: int = 1,
    dtype: torch.dtype = torch.bfloat16,
):
    """
    Generates a video based on the given prompt and saves it to the specified path.

@@ -47,10 +47,12 @@ def generate_video(

    # 2. Set the scheduler.
    # Can be changed to `CogVideoXDPMScheduler` or `CogVideoXDDIMScheduler`.
    # We recommend using `CogVideoXDDIMScheduler` for better results.
    pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
    # We recommend using `CogVideoXDDIMScheduler` for CogVideoX-2B and `CogVideoXDPMScheduler` for CogVideoX-5B.
    # pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

    # 3. Enable CPU offload for the model and enable VAE tiling.
    # Turn these off if you have multiple GPUs or enough GPU memory (such as an H100); inference will then take less time.
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()

@@ -63,7 +65,8 @@ def generate_video(
        num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
        num_inference_steps=num_inference_steps,  # Number of inference steps
        num_frames=49,  # Number of frames to generate; changed to 49 for diffusers version `0.31.0` and after.
        guidance_scale=guidance_scale,  # Guidance scale for classifier-free guidance
        use_dynamic_cfg=True,  # This is used for the DPM scheduler; for the DDIM scheduler, it should be False
        guidance_scale=guidance_scale,  # Guidance scale for classifier-free guidance; can be set to 7 for the DPM scheduler
        generator=torch.Generator().manual_seed(42),  # Set the seed for reproducibility
    ).frames[0]
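Putting the pieces of this diff together, the newly recommended CogVideoX-5B setup looks roughly like the sketch below. The scheduler choice, `use_dynamic_cfg=True`, and the guidance value of 7 follow the comments in the diff above; the rest is an illustrative assumption, not a verified benchmark configuration.

```python
# Sketch combining the changes above: DPM scheduler + dynamic CFG for CogVideoX-5B.
import torch
from diffusers import CogVideoXPipeline, CogVideoXDPMScheduler

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
pipe.to("cuda")  # or pipe.enable_model_cpu_offload() on smaller GPUs

video = pipe(
    prompt="A girl riding a bike.",
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=7.0,     # the diff suggests ~7 for the DPM scheduler
    use_dynamic_cfg=True,   # DPM only; set False with CogVideoXDDIMScheduler
).frames[0]
```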
@ -1,36 +1,56 @@
|
||||
"""
|
||||
This script demonstrates how to generate a video from a text prompt using CogVideoX with 🤗Huggingface Diffusers Pipeline.
|
||||
This script demonstrates how to generate a video from a text prompt using CogVideoX with quantization.
|
||||
|
||||
Note:
|
||||
This script requires the `diffusers>=0.30.1` and `torchao>=0.4.0` library to be installed.
|
||||
|
||||
Run the script:
|
||||
$ python cli_demo.py --prompt "A girl ridding a bike." --model_path THUDM/CogVideoX-2b
|
||||
Must install the `torchao`,`torch`,`diffusers`,`accelerate` library FROM SOURCE to use the quantization feature.
|
||||
Only NVIDIA GPUs like H100 or higher are supported om FP-8 quantization.
|
||||
|
||||
ALL quantization schemes must using with NVIDIA GPUs.
|
||||
|
||||
# Run the script:
|
||||
|
||||
python cli_demo_quantization.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-2b --quantization_scheme fp8 --dtype float16
|
||||
python cli_demo_quantization.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-5b --quantization_scheme fp8 --dtype bfloat16
|
||||
|
||||
In this script, we have only provided the script for testing and inference in INT8 for the entire process
|
||||
(including T5 Encoder, CogVideoX Transformer, VAE).
|
||||
You can use other functionalities provided by torchao to convert to other precisions.
|
||||
Please note that INT4 is not supported.
|
||||
"""
|
||||
import argparse
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import torch
|
||||
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
|
||||
import torch._dynamo
|
||||
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline, CogVideoXDPMScheduler
|
||||
from diffusers.utils import export_to_video
|
||||
from transformers import T5EncoderModel
|
||||
|
||||
# Make sure to install torchao>=0.4.0
|
||||
from torchao.quantization import quantize_, int8_weight_only
|
||||
from torchao.float8.inference import ActivationCasting, QuantConfig, quantize_to_float8
|
||||
|
||||
os.environ["TORCH_LOGS"] = "+dynamo,output_code,graph_breaks,recompiles"
|
||||
torch._dynamo.config.suppress_errors = True
|
||||
torch.set_float32_matmul_precision("high")
|
||||
torch._inductor.config.conv_1x1_as_mm = True
|
||||
torch._inductor.config.coordinate_descent_tuning = True
|
||||
torch._inductor.config.epilogue_fusion = False
|
||||
torch._inductor.config.coordinate_descent_check_all_directions = True
|
||||
|
||||
|
||||
def quantize_model(part, quantization_scheme):
|
||||
if quantization_scheme == "int8":
|
||||
quantize_(part, int8_weight_only())
|
||||
elif quantization_scheme == "fp8":
|
||||
quantize_to_float8(part, QuantConfig(ActivationCasting.DYNAMIC))
|
||||
return part
|
||||
|
||||
|
||||
def generate_video(
    prompt: str,
    model_path: str,
    output_path: str = "./output.mp4",
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    num_videos_per_prompt: int = 1,
    dtype: torch.dtype = torch.bfloat16,
    prompt: str,
    model_path: str,
    output_path: str = "./output.mp4",
    num_inference_steps: int = 50,
    guidance_scale: float = 6.0,
    num_videos_per_prompt: int = 1,
    quantization_scheme: str = "fp8",
    dtype: torch.dtype = torch.bfloat16,
):
    """
    Generates a video based on the given prompt and saves it to the specified path.
@ -42,24 +62,28 @@ def generate_video(
    - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
    - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
    - num_videos_per_prompt (int): Number of videos to generate per prompt.
    - quantization_scheme (str): The quantization scheme to use ('int8', 'fp8').
    - dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
    """

    text_encoder = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder", torch_dtype=dtype)
    quantize_(text_encoder, int8_weight_only())
    transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer",
                                                              torch_dtype=dtype)
    quantize_(transformer, int8_weight_only())
    text_encoder = quantize_model(part=text_encoder, quantization_scheme=quantization_scheme)
    transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer", torch_dtype=dtype)
    transformer = quantize_model(part=transformer, quantization_scheme=quantization_scheme)
    vae = AutoencoderKLCogVideoX.from_pretrained(model_path, subfolder="vae", torch_dtype=dtype)
    quantize_(vae, int8_weight_only())
    vae = quantize_model(part=vae, quantization_scheme=quantization_scheme)
    pipe = CogVideoXPipeline.from_pretrained(
        model_path,
        text_encoder=text_encoder,
        transformer=transformer,
        vae=vae,
        torch_dtype=dtype,
    )
    ).to("cuda")
    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

    # Using torch.compile will run faster; the first inference will take ~30 min to compile.
    # pipe.transformer.to(memory_format=torch.channels_last)
    # For FP8, remove pipe.enable_model_cpu_offload()
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()
    video = pipe(
@ -67,8 +91,9 @@ def generate_video(
        num_videos_per_prompt=num_videos_per_prompt,
        num_inference_steps=num_inference_steps,
        num_frames=49,
        use_dynamic_cfg=True,  # This is used for the DPM scheduler; for the DDIM scheduler, it should be False
        guidance_scale=guidance_scale,
        generator=torch.Generator().manual_seed(42),
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).frames[0]

    export_to_video(video, output_path, fps=8)
@ -89,7 +114,14 @@ if __name__ == "__main__":
    parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
    parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
    parser.add_argument(
        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16' or 'bfloat16')"
        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16', 'bfloat16')"
    )
    parser.add_argument(
        "--quantization_scheme",
        type=str,
        default="fp8",  # the original default "bf16" is not among the valid choices and would silently skip quantization
        choices=["int8", "fp8"],
        help="The quantization scheme to use (int8, fp8)",
    )

    args = parser.parse_args()
@ -101,5 +133,6 @@ if __name__ == "__main__":
        num_inference_steps=args.num_inference_steps,
        guidance_scale=args.guidance_scale,
        num_videos_per_prompt=args.num_videos_per_prompt,
        quantization_scheme=args.quantization_scheme,
        dtype=dtype,
    )
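The commented-out lines above hint at `torch.compile`. Here is a sketch of how one might enable it, as an assumption based on those hints rather than part of the shipped script; CPU offload should be removed first, since modules that shuttle between devices defeat compilation:

```python
# Sketch: compile the transformer for faster repeated inference.
# The first call pays the compilation cost (~30 min per the comment above).
pipe.to("cuda")  # keep everything resident on the GPU instead of enable_model_cpu_offload()
pipe.transformer.to(memory_format=torch.channels_last)
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
```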
@ -1,3 +1,11 @@
"""
This is the main file for the Gradio web demo. It uses the CogVideoX-5B model to generate videos.
Set the environment variable OPENAI_API_KEY to use the OpenAI API to enhance the prompt.

Usage:
OPENAI_API_KEY=your_openai_api_key OPENAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_web_demo.py
"""

import math
import os
import random
@ -6,7 +14,7 @@ import time

import gradio as gr
import torch
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler,CogVideoXDPMScheduler
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler, CogVideoXDPMScheduler
from datetime import datetime, timedelta

from diffusers.image_processor import VaeImageProcessor
@ -98,14 +106,14 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:


def infer(
    prompt: str,
    num_inference_steps: int,
    guidance_scale: float,
    seed: int = -1,
    # progress=gr.Progress(track_tqdm=True),
    prompt: str,
    num_inference_steps: int,
    guidance_scale: float,
    seed: int = -1,
    progress=gr.Progress(track_tqdm=True),
):
    if seed == -1:
        seed = random.randint(0, 2 ** 8 - 1)
        seed = random.randint(0, 2**8 - 1)
    video_pt = pipe(
        prompt=prompt,
        num_videos_per_prompt=1,
@ -172,10 +180,6 @@ with gr.Blocks() as demo:
    )
    enhance_button = gr.Button("✨ Enhance Prompt(Optional)")

    gr.Markdown(
        "<span style='color:red; font-weight:bold;'>For the CogVideoX-5B model, 50 steps will take approximately 120 seconds.</span>"
    )

    with gr.Group():
        with gr.Column():
            with gr.Row():
@ -262,20 +266,13 @@ with gr.Blocks() as demo:
    </table>
    """)

    def generate(prompt,
                 seed_value,
                 scale_status,
                 rife_status,
                 progress=gr.Progress(track_tqdm=True)
                 ):
    def generate(prompt, seed_value, scale_status, rife_status, progress=gr.Progress(track_tqdm=True)):
        latents, seed = infer(
            prompt,
            num_inference_steps=50,  # NOT Changed
            guidance_scale=7.0,  # NOT Changed
            seed=seed_value,
            #progress=progress,
            # progress=progress,
        )
        if scale_status:
            latents = utils.upscale_batch_and_concatenate(upscale_model, latents, device)
@ -300,11 +297,9 @@ with gr.Blocks() as demo:

        return video_path, video_update, gif_update, seed_update

    def enhance_prompt_func(prompt):
        return convert_prompt(prompt, retry_times=1)

    generate_button.click(
        generate,
        inputs=[prompt, seed_param, enable_scale, enable_rife],
@ -1,3 +1,11 @@
"""
This is the main file for the Gradio web demo. It uses the CogVideoX-2B model to generate videos.
Set the environment variable OPENAI_API_KEY to use the OpenAI API to enhance the prompt.

Usage:
OPENAI_API_KEY=your_openai_api_key OPENAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_web_demo.py
"""

import os
import threading
import time
@ -151,14 +159,17 @@ with gr.Blocks() as demo:

    with gr.Row():
        gr.Markdown(
            "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one.")
            "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one."
        )
        enhance_button = gr.Button("✨ Enhance Prompt(Optional)")

    with gr.Column():
        gr.Markdown("**Optional Parameters** (default values are recommended)<br>"
                    "Increasing the number of inference steps will produce more detailed videos, but it will slow down the process.<br>"
                    "50 steps are recommended for most cases.<br>"
                    "For the 5B model, 50 steps will take approximately 350 seconds.")
        gr.Markdown(
            "**Optional Parameters** (default values are recommended)<br>"
            "Increasing the number of inference steps will produce more detailed videos, but it will slow down the process.<br>"
            "50 steps are recommended for most cases.<br>"
            "For the 5B model, 50 steps will take approximately 350 seconds."
        )
        with gr.Row():
            num_inference_steps = gr.Number(label="Inference Steps", value=50)
            guidance_scale = gr.Number(label="Guidance Scale", value=6.0)
@ -206,7 +217,6 @@ with gr.Blocks() as demo:
    </table>
    """)

    def generate(prompt, num_inference_steps, guidance_scale, model_choice, progress=gr.Progress(track_tqdm=True)):
        tensor = infer(prompt, num_inference_steps, guidance_scale, progress=progress)
        video_path = save_video(tensor)
@ -216,22 +226,16 @@ with gr.Blocks() as demo:

        return video_path, video_update, gif_update

    def enhance_prompt_func(prompt):
        return convert_prompt(prompt, retry_times=1)

    generate_button.click(
        generate,
        inputs=[prompt, num_inference_steps, guidance_scale],
        outputs=[video_output, download_video_button, download_gif_button]
        outputs=[video_output, download_video_button, download_gif_button],
    )

    enhance_button.click(
        enhance_prompt_func,
        inputs=[prompt],
        outputs=[prompt]
    )
    enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt])

if __name__ == "__main__":
    demo.launch()
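Both web demos read `OPENAI_API_KEY` and route prompts through a `convert_prompt` helper before generation. As a rough sketch only, here is what such a helper could look like with the official `openai` Python client; the body is hypothetical, the model name and system prompt are assumptions, and the repository's actual implementation may differ:

```python
import os

from openai import OpenAI


def convert_prompt(prompt: str, retry_times: int = 3) -> str:
    # Hypothetical sketch: polish a short user prompt into the long,
    # detailed form that CogVideoX prefers, retrying on empty replies.
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"], base_url=os.environ.get("OPENAI_BASE_URL"))
    for _ in range(retry_times):
        response = client.chat.completions.create(
            model="glm-4",  # assumption: any chat model reachable at OPENAI_BASE_URL works here
            messages=[
                {"role": "system", "content": "Rewrite the user's video idea as one detailed English video prompt."},
                {"role": "user", "content": prompt},
            ],
        )
        text = response.choices[0].message.content
        if text:
            return text.strip()
    return prompt  # fall back to the original prompt if enhancement fails
```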
@ -19,8 +19,9 @@ pip install -r requirements.txt

### 2. Download the model weights

First, go to the SAT mirror to download the dependencies.
### 2. Download model weights

First, go to the SAT mirror to download the model weights. For the CogVideoX-2B model, please download as follows:
```shell
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
@ -31,13 +32,21 @@ wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```

Then unzip; the model structure should look like this:
For the CogVideoX-5B model, please download as follows (the VAE files are the same):
```shell
mkdir CogVideoX-5b-sat
cd CogVideoX-5b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
```
Then, go to [Tsinghua Cloud Disk](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) to download the model weights and unzip them.
After arranging the files, the complete structure of the two models should be as follows:

```
.
├── transformer
│ ├── 1000
│ ├── 1000 (or 1)
│ │ └── mp_rank_00_model_states.pt
│ └── latest
└── vae
@ -71,8 +80,6 @@ loading it into Deepspeed in Finetune.
0 directories, 8 files
```

Here is the English translation of the provided text:

### 3. Modify the file in `configs/cogvideox_2b.yaml`.

```yaml
@ -19,7 +19,7 @@ pip install -r requirements.txt

### 2. Download the model weights

First, go to the SAT mirror and download the dependencies.
First, go to the SAT mirror and download the model weights. For the CogVideoX-2B model, please download as follows:

```shell
mkdir CogVideoX-2b-sat
@ -32,12 +32,23 @@ mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```

Then unzip; the model structure will look like this:
For the CogVideoX-5B model, please download as follows (the VAE files are the same):

```shell
mkdir CogVideoX-5b-sat
cd CogVideoX-5b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
```

Next, go to [Tsinghua Cloud Disk](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) to download the model and unzip it.
After arranging the files, the complete structure of the two models should be as follows:

```
.
├── transformer
│ ├── 1000
│ ├── 1000 (or 1)
│ │ └── mp_rank_00_model_states.pt
│ └── latest
└── vae
@ -18,8 +18,7 @@ pip install -r requirements.txt

### 2. Download the model weights

First, go to the SAT mirror to download the dependencies.

First, go to the SAT mirror to download the model weights. For the CogVideoX-2B model, please download as follows:
```shell
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
@ -30,13 +29,21 @@ wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```

Then unzip the files; the model structure should be as follows:
For the CogVideoX-5B model, please download as follows (the VAE files are the same):
```shell
mkdir CogVideoX-5b-sat
cd CogVideoX-5b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
```
Then, go to [Tsinghua Cloud Disk](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) to download our model and unzip it.
After arranging the files, the complete structure of the two models should be as follows:

```
.
├── transformer
│ ├── 1000
│ ├── 1000 (or 1)
│ │ └── mp_rank_00_model_states.pt
│ └── latest
└── vae