Mirror of https://github.com/THUDM/CogVideo.git (synced 2025-04-06 03:57:56 +08:00)
Commit 2fdc59c3ce
.gitignore (vendored) — 3 changed lines

@@ -7,4 +7,5 @@ logs/
 *.DS_Store
 .idea
 output*
 test*
+venv
README.md — 29 changed lines

@@ -22,7 +22,8 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac
 
 ## Project Updates
 
-- 🔥🔥 News: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.
+- 🔥🔥 **News**: ```2024/11/15```: We released the `CogVideoX1.5` model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
+- 🔥 **News**: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.
 The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution.
 The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
 - 🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
@@ -43,11 +44,11 @@ The SAT code has already been updated, while the diffusers version is still unde
 model [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), used in the training process of
 CogVideoX to convert video data into text descriptions, has been open-sourced. Welcome to download and use it.
 - 🔥 ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. We have
-significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run *
-*CogVideoX-2B** on older GPUs like `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs like `RTX 3060`. Please strictly
+significantly optimized the model's inference performance, greatly lowering the inference threshold.
+You can run **CogVideoX-2B** on older GPUs like `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs like `RTX 3060`. Please strictly
 follow the [requirements](requirements.txt) to update and install dependencies, and refer
-to [cli_demo](inference/cli_demo.py) for inference code. Additionally, the open-source license for the **CogVideoX-2B
-** model has been changed to the **Apache 2.0 License**.
+to [cli_demo](inference/cli_demo.py) for inference code. Additionally, the open-source license for
+the **CogVideoX-2B** model has been changed to the **Apache 2.0 License**.
 - 🔥 ```2024/8/6```: We have open-sourced **3D Causal VAE**, used for **CogVideoX-2B**, which can reconstruct videos with
 almost no loss.
 - 🔥 ```2024/8/6```: We have open-sourced the first model of the CogVideoX series video generation models, **CogVideoX-2B
@@ -193,19 +194,19 @@ models we currently offer, along with their foundational information.
   </tr>
   <tr>
     <td style="text-align: center;">Inference Precision</td>
-    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
+    <td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, Not supported: INT4</td>
     <td style="text-align: center;"><b>FP16*(Recommended)</b>, BF16, FP32, FP8*, INT8, Not supported: INT4</td>
     <td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, Not supported: INT4</td>
   </tr>
   <tr>
     <td style="text-align: center;">Single GPU Memory Usage<br></td>
-    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB <br></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16: from 10GB*</b><br><b>diffusers INT8(torchao): from 7GB*</b></td>
     <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB minimum* </b><br><b>diffusers INT8 (torchao): 3.6GB minimum*</b></td>
     <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16 : 5GB minimum* </b><br><b>diffusers INT8 (torchao): 4.4GB minimum* </b></td>
   </tr>
   <tr>
     <td style="text-align: center;">Multi-GPU Memory Usage</td>
-    <td colspan="2" style="text-align: center;"><b>Not Supported</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b><br></td>
     <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
     <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
   </tr>
@@ -243,7 +244,8 @@ models we currently offer, along with their foundational information.
   </tr>
   <tr>
     <td style="text-align: center;">Download Link (Diffusers)</td>
-    <td colspan="2" style="text-align: center;"> Coming Soon </td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
@@ -278,15 +280,12 @@ pipe.vae.enable_tiling()
 used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This
 allows the model to run on free T4 Colabs or GPUs with smaller memory! Also, note that TorchAO quantization is fully
 compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision must be used on
-devices with NVIDIA H100 and above, requiring source installation of `torch`, `torchao`, `diffusers`, and `accelerate`
-Python packages. CUDA 12.4 is recommended.
+devices with NVIDIA H100 and above, requiring source installation of `torch`, `torchao` Python packages. CUDA 12.4 is recommended.
 + The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed
 increases by about 10%. Only the `diffusers` version of the model supports quantization.
 + The model only supports English input; other languages can be translated into English for use via large model
 refinement.
-+ The memory usage of model fine-tuning is tested in an `8 * H100` environment, and the program automatically
-uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number or more GPUs must be used
-for fine-tuning.
 
 ## Friendly Links
 
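The TorchAO path referenced in the notes above can be exercised with a short diffusers sketch. This is illustrative only and not part of the commit: it assumes recent `diffusers` and `torchao` releases, and uses `int8_weight_only`, which, unlike FP8, does not require an H100-class GPU.

```python
# Illustrative sketch (not from this commit): INT8 weight-only quantization of
# CogVideoX-5B with torchao, combined with the memory optimizations named above.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Quantize the heavy modules in place; the text encoder and transformer dominate memory.
quantize_(pipe.text_encoder, int8_weight_only())
quantize_(pipe.transformer, int8_weight_only())
quantize_(pipe.vae, int8_weight_only())

pipe.enable_model_cpu_offload()  # keep only the active module on the GPU
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Per the README note, TorchAO quantization composes with torch.compile for extra speed:
# pipe.transformer = torch.compile(pipe.transformer)

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest.",  # placeholder prompt
    num_frames=49,            # CogVideoX1.0 default: 6 seconds at 8 fps
    guidance_scale=6.0,
    use_dynamic_cfg=True,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```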
@@ -319,6 +318,8 @@ works have already been adapted for CogVideoX, and we invite everyone to use the
 + [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth Studio is a diffusion engine. It has
 restructured the architecture, including text encoders, UNet, VAE, etc., enhancing computational performance while
 maintaining compatibility with open-source community models. The framework has been adapted for CogVideoX.
++ [CogVideoX-Controlnet](https://github.com/TheDenk/cogvideox-controlnet): A simple ControlNet module code that includes the CogVideoX model.
++ [VideoTuna](https://github.com/VideoVerses/VideoTuna): VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, text-to-image generation.
 
 ## Project Structure
 
README_ja.md — 20 changed lines

@@ -22,7 +22,8 @@
 
 ## 更新とニュース
 
-- 🔥🔥 ニュース: ```2024/11/08```: `CogVideoX1.5` モデルをリリースしました。CogVideoX1.5 は CogVideoX オープンソースモデルのアップグレードバージョンです。
+- 🔥🔥 **ニュース**: ```2024/11/15```: `CogVideoX1.5`モデルのdiffusersバージョンをリリースしました。わずかなパラメータ調整で以前のコードをそのまま利用可能です。
+- 🔥 **ニュース**: ```2024/11/08```: `CogVideoX1.5` モデルをリリースしました。CogVideoX1.5 は CogVideoX オープンソースモデルのアップグレードバージョンです。
 CogVideoX1.5-5B シリーズモデルは、10秒 長の動画とより高い解像度をサポートしており、`CogVideoX1.5-5B-I2V` は任意の解像度での動画生成に対応しています。
 SAT コードはすでに更新されており、`diffusers` バージョンは現在適応中です。
 SAT バージョンのコードは [こちら](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) からダウンロードできます。
@@ -185,19 +186,19 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の
   </tr>
   <tr>
     <td style="text-align: center;">推論精度</td>
-    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
+    <td colspan="2" style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32,FP8*,INT8,INT4非対応</td>
     <td style="text-align: center;"><b>FP16*(推奨)</b>, BF16, FP32,FP8*,INT8,INT4非対応</td>
     <td colspan="2" style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32,FP8*,INT8,INT4非対応</td>
   </tr>
   <tr>
     <td style="text-align: center;">単一GPUメモリ消費量<br></td>
-    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB <br></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16:10GBから*</b><br><b>diffusers INT8(torchao):7GBから*</b></td>
     <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB以上* </b><br><b>diffusers INT8(torchao): 3.6GB以上*</b></td>
     <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16 : 5GB以上* </b><br><b>diffusers INT8(torchao): 4.4GB以上* </b></td>
   </tr>
   <tr>
     <td style="text-align: center;">複数GPU推論メモリ消費量</td>
-    <td colspan="2" style="text-align: center;"><b>非対応</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b><br></td>
     <td style="text-align: center;"><b>FP16: 10GB* diffusers使用</b><br></td>
     <td colspan="2" style="text-align: center;"><b>BF16: 15GB* diffusers使用</b><br></td>
   </tr>
@@ -235,7 +236,8 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の
   </tr>
   <tr>
     <td style="text-align: center;">ダウンロードリンク (Diffusers)</td>
-    <td colspan="2" style="text-align: center;"> 近日公開 </td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
@@ -268,13 +270,11 @@ pipe.vae.enable_tiling()
 は、CogVideoXのメモリ要件を削減するためにテキストエンコーダ、トランスフォーマ、およびVAEモジュールを量子化するために使用できます。これにより、無料のT4
 Colabやより少ないメモリのGPUでモデルを実行することが可能になります。同様に重要なのは、TorchAOの量子化は`torch.compile`
 と完全に互換性があり、推論速度を大幅に向上させることができる点です。`NVIDIA H100`およびそれ以上のデバイスでは`FP8`
-精度を使用する必要があります。これには、`torch`、`torchao`、`diffusers`、`accelerate`
-Pythonパッケージのソースコードからのインストールが必要です。`CUDA 12.4`の使用をお勧めします。
+精度を使用する必要があります。これには、`torch`、`torchao` Pythonパッケージのソースコードからのインストールが必要です。`CUDA 12.4`の使用をお勧めします。
 + 推論速度テストも同様に、上記のメモリ最適化方法を使用しています。メモリ最適化を使用しない場合、推論速度は約10%向上します。
 `diffusers`バージョンのモデルのみが量子化をサポートしています。
 + モデルは英語入力のみをサポートしており、他の言語は大規模モデルの改善を通じて英語に翻訳できます。
-+ モデルのファインチューニングに使用されるメモリは`8 * H100`環境でテストされています。プログラムは自動的に`Zero 2`
-最適化を使用しています。表に具体的なGPU数が記載されている場合、ファインチューニングにはその数以上のGPUが必要です。
 
 ## 友好的リンク
 
@@ -299,6 +299,8 @@ pipe.vae.enable_tiling()
 キーフレーム補間生成において、より大きな柔軟性を提供することを目的とした、CogVideoX構造を基にした修正版のパイプライン。
 + [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth
 Studioは、拡散エンジンです。テキストエンコーダー、UNet、VAEなどを含むアーキテクチャを再構築し、オープンソースコミュニティモデルとの互換性を維持しつつ、計算性能を向上させました。このフレームワークはCogVideoXに適応しています。
++ [CogVideoX-Controlnet](https://github.com/TheDenk/cogvideox-controlnet): CogVideoXモデルを含むシンプルなControlNetモジュールのコード。
++ [VideoTuna](https://github.com/VideoVerses/VideoTuna): VideoTuna は、テキストからビデオ、画像からビデオ、テキストから画像生成のための複数のAIビデオ生成モデルを統合した最初のリポジトリです。
 
 ## プロジェクト構造
 
README_zh.md — 22 changed lines

@@ -22,7 +22,8 @@
 
 ## 项目更新
 
-- 🔥🔥 **News**: ```2024/11/08```: 我们发布 `CogVideoX1.5` 模型。CogVideoX1.5 是 CogVideoX 开源模型的升级版本。
+- 🔥🔥 **News**: ```2024/11/15```: 我们发布 `CogVideoX1.5` 模型的diffusers版本,仅需调整部分参数仅可沿用之前的代码。
+- 🔥 **News**: ```2024/11/08```: 我们发布 `CogVideoX1.5` 模型。CogVideoX1.5 是 CogVideoX 开源模型的升级版本。
 CogVideoX1.5-5B 系列模型支持 **10秒** 长度的视频和更高的分辨率,其中 `CogVideoX1.5-5B-I2V` 支持 **任意分辨率** 的视频生成,SAT代码已经更新。`diffusers`版本还在适配中。SAT版本代码前往 [这里](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) 下载。
 - 🔥**News**: ```2024/10/13```: 成本更低,单卡4090可微调 `CogVideoX-5B`
 的微调框架[cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory)已经推出,多种分辨率微调,欢迎使用。
@@ -39,8 +40,7 @@ CogVideoX1.5-5B 系列模型支持 **10秒** 长度的视频和更高的分辨
 - 🔥 ```2024/8/27```: 我们开源 CogVideoX 系列更大的模型 **CogVideoX-5B**
 。我们大幅度优化了模型的推理性能,推理门槛大幅降低,您可以在 `GTX 1080TI` 等早期显卡运行 **CogVideoX-2B**,在 `RTX 3060`
 等桌面端甜品卡运行 **CogVideoX-5B** 模型。 请严格按照[要求](requirements.txt)
-更新安装依赖,推理代码请查看 [cli_demo](inference/cli_demo.py)。同时,**CogVideoX-2B** 模型开源协议已经修改为**Apache 2.0
-协议**。
+更新安装依赖,推理代码请查看 [cli_demo](inference/cli_demo.py)。同时,**CogVideoX-2B** 模型开源协议已经修改为**Apache 2.0 协议**。
 - 🔥 ```2024/8/6```: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。
 - 🔥 ```2024/8/6```: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
 - 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于
@@ -176,19 +176,19 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
   </tr>
   <tr>
     <td style="text-align: center;">推理精度</td>
-    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
+    <td colspan="2" style="text-align: center;"><b>BF16(推荐)</b>, FP16, FP32,FP8*,INT8,不支持INT4</td>
     <td style="text-align: center;"><b>FP16*(推荐)</b>, BF16, FP32,FP8*,INT8,不支持INT4</td>
     <td colspan="2" style="text-align: center;"><b>BF16(推荐)</b>, FP16, FP32,FP8*,INT8,不支持INT4</td>
   </tr>
   <tr>
     <td style="text-align: center;">单GPU显存消耗<br></td>
-    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB <br></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16 : 10GB起* </b><br><b>diffusers INT8(torchao): 7G起* </b></td>
     <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB起* </b><br><b>diffusers INT8(torchao): 3.6G起*</b></td>
     <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16 : 5GB起* </b><br><b>diffusers INT8(torchao): 4.4G起* </b></td>
   </tr>
   <tr>
     <td style="text-align: center;">多GPU推理显存消耗</td>
-    <td colspan="2" style="text-align: center;"><b>不支持</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b><br></td>
     <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
     <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
   </tr>
@@ -226,7 +226,8 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
   </tr>
   <tr>
     <td style="text-align: center;">下载链接 (Diffusers)</td>
-    <td colspan="2" style="text-align: center;"> 即将推出 </td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
@@ -256,11 +257,9 @@ pipe.vae.enable_tiling()
 + [PytorchAO](https://github.com/pytorch/ao) 和 [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
 可以用于量化文本编码器、Transformer 和 VAE 模块,以降低 CogVideoX 的内存需求。这使得在免费的 T4 Colab 或更小显存的 GPU
 上运行模型成为可能!同样值得注意的是,TorchAO 量化完全兼容 `torch.compile`,这可以显著提高推理速度。在 `NVIDIA H100`
-及以上设备上必须使用 `FP8` 精度,这需要源码安装 `torch`、`torchao`、`diffusers` 和 `accelerate` Python
-包。建议使用 `CUDA 12.4`。
+及以上设备上必须使用 `FP8` 精度,这需要源码安装 `torch`、`torchao` Python 包。建议使用 `CUDA 12.4`。
 + 推理速度测试同样采用了上述显存优化方案,不采用显存优化的情况下,推理速度提升约10%。 只有`diffusers`版本模型支持量化。
 + 模型仅支持英语输入,其他语言可以通过大模型润色时翻译为英语。
-+ 模型微调所占用的显存是在 `8 * H100` 环境下进行测试,程序已经自动使用`Zero 2` 优化。表格中若有标注具体GPU数量则必须使用大于等于该数量的GPU进行微调。
 
 ## 友情链接
 
@@ -280,6 +279,9 @@ pipe.vae.enable_tiling()
 + [CogVideoX-Interpolation](https://github.com/feizc/CogvideX-Interpolation): 基于 CogVideoX 结构修改的管道,旨在为关键帧插值生成提供更大的灵活性。
 + [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth 工作室是一款扩散引擎。重构了架构,包括文本编码器、UNet、VAE
 等,在保持与开源社区模型兼容性的同时,提升了计算性能。该框架已经适配 CogVideoX。
++ [CogVideoX-Controlnet](https://github.com/TheDenk/cogvideox-controlnet): 一个包含 CogvideoX 模型的简单 Controlnet 模块的代码。
++ [VideoTuna](https://github.com/VideoVerses/VideoTuna):VideoTuna 是首个集成多种 AI 视频生成模型的仓库,支持文本转视频、图像转视频、文本转图像生成。
+
 
 ## 完整项目代码结构
 
inference/cli_demo.py

@@ -3,15 +3,15 @@ This script demonstrates how to generate a video using the CogVideoX model with
 The script supports different types of video generation, including text-to-video (t2v), image-to-video (i2v),
 and video-to-video (v2v), depending on the input data and different weight.
 
-- text-to-video: THUDM/CogVideoX-5b or THUDM/CogVideoX-2b
-- video-to-video: THUDM/CogVideoX-5b or THUDM/CogVideoX-2b
-- image-to-video: THUDM/CogVideoX-5b-I2V
+- text-to-video: THUDM/CogVideoX-5b, THUDM/CogVideoX-2b or THUDM/CogVideoX1.5-5b
+- video-to-video: THUDM/CogVideoX-5b, THUDM/CogVideoX-2b or THUDM/CogVideoX1.5-5b
+- image-to-video: THUDM/CogVideoX-5b-I2V or THUDM/CogVideoX1.5-5b-I2V
 
 Running the Script:
 To run the script, use the following command with appropriate arguments:
 
 ```bash
-$ python cli_demo.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-5b --generate_type "t2v"
+$ python cli_demo.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX1.5-5b --generate_type "t2v"
 ```
 
 Additional options are available to specify the model path, guidance scale, number of inference steps, video generation type, and output paths.
@@ -23,7 +23,6 @@ from typing import Literal
 import torch
 from diffusers import (
     CogVideoXPipeline,
-    CogVideoXDDIMScheduler,
     CogVideoXDPMScheduler,
     CogVideoXImageToVideoPipeline,
     CogVideoXVideoToVideoPipeline,
@@ -37,6 +36,9 @@ def generate_video(
     model_path: str,
     lora_path: str = None,
     lora_rank: int = 128,
+    num_frames: int = 81,
+    width: int = 1360,
+    height: int = 768,
     output_path: str = "./output.mp4",
     image_or_video_path: str = "",
     num_inference_steps: int = 50,
@@ -45,6 +47,7 @@
     dtype: torch.dtype = torch.bfloat16,
     generate_type: str = Literal["t2v", "i2v", "v2v"],  # i2v: image to video, v2v: video to video
     seed: int = 42,
+    fps: int = 8,
 ):
     """
     Generates a video based on the given prompt and saves it to the specified path.
@@ -56,11 +59,15 @@
     - lora_rank (int): The rank of the LoRA weights.
     - output_path (str): The path where the generated video will be saved.
     - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
+    - num_frames (int): Number of frames to generate. CogVideoX1.0 generates 49 frames for 6 seconds at 8 fps, while CogVideoX1.5 produces either 81 or 161 frames, corresponding to 5 seconds or 10 seconds at 16 fps.
+    - width (int): The width of the generated video, applicable only for CogVideoX1.5-5B-I2V
+    - height (int): The height of the generated video, applicable only for CogVideoX1.5-5B-I2V
     - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
     - num_videos_per_prompt (int): Number of videos to generate per prompt.
     - dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
     - generate_type (str): The type of video generation (e.g., 't2v', 'i2v', 'v2v').
     - seed (int): The seed for reproducibility.
+    - fps (int): The frames per second for the generated video.
     """
 
     # 1. Load the pre-trained CogVideoX pipeline with the specified precision (bfloat16).
@@ -97,23 +104,23 @@ def generate_video(
     # and enable to("cuda")
 
     # pipe.to("cuda")
 
     pipe.enable_sequential_cpu_offload()
 
     pipe.vae.enable_slicing()
     pipe.vae.enable_tiling()
 
     # 4. Generate the video frames based on the prompt.
     # `num_frames` is the Number of frames to generate.
-    # This is the default value for 6 seconds video and 8 fps and will plus 1 frame for the first frame and 49 frames.
     if generate_type == "i2v":
         video_generate = pipe(
+            height=height,
+            width=width,
             prompt=prompt,
-            image=image,  # The path of the image to be used as the background of the video
+            image=image,
+            # The path of the image, the resolution of video will be the same as the image for CogVideoX1.5-5B-I2V, otherwise it will be 720 * 480
             num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
             num_inference_steps=num_inference_steps,  # Number of inference steps
-            num_frames=49,  # Number of frames to generate,changed to 49 for diffusers version `0.30.3` and after.
-            use_dynamic_cfg=True,  # This id used for DPM Sechduler, for DDIM scheduler, it should be False
+            num_frames=num_frames,  # Number of frames to generate
+            use_dynamic_cfg=True,  # This id used for DPM scheduler, for DDIM scheduler, it should be False
             guidance_scale=guidance_scale,
             generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
         ).frames[0]
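For reference, a minimal sketch of the image-to-video path this hunk parameterizes, written against the public diffusers API rather than this script (not part of the commit; the input image path is a placeholder, the 1360x768 / 81-frame values mirror the new defaults, and CogVideoX1.5 support requires a diffusers version that includes it):

```python
# Illustrative sketch (not from this commit): CogVideoX1.5-5B-I2V via diffusers,
# passing the resolution and frame count that the updated demo now exposes.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = load_image("input.jpg")  # placeholder image path
video = pipe(
    prompt="A girl riding a bike.",
    image=image,
    height=768,
    width=1360,
    num_frames=81,
    num_inference_steps=50,
    use_dynamic_cfg=True,
    guidance_scale=6.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```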
@@ -122,7 +129,7 @@ def generate_video(
             prompt=prompt,
             num_videos_per_prompt=num_videos_per_prompt,
             num_inference_steps=num_inference_steps,
-            num_frames=49,
+            num_frames=num_frames,
             use_dynamic_cfg=True,
             guidance_scale=guidance_scale,
             generator=torch.Generator().manual_seed(seed),
@@ -133,13 +140,12 @@ def generate_video(
             video=video,  # The path of the video to be used as the background of the video
             num_videos_per_prompt=num_videos_per_prompt,
             num_inference_steps=num_inference_steps,
-            # num_frames=49,
+            num_frames=num_frames,
             use_dynamic_cfg=True,
             guidance_scale=guidance_scale,
             generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
         ).frames[0]
-    # 5. Export the generated frames to a video file. fps must be 8 for original video.
-    export_to_video(video_generate, output_path, fps=8)
+    export_to_video(video_generate, output_path, fps=fps)
 
 
 if __name__ == "__main__":
@@ -152,24 +158,20 @@ if __name__ == "__main__":
         help="The path of the image to be used as the background of the video",
     )
     parser.add_argument(
-        "--model_path", type=str, default="THUDM/CogVideoX-5b", help="The path of the pre-trained model to be used"
+        "--model_path", type=str, default="THUDM/CogVideoX1.5-5B", help="Path of the pre-trained model use"
     )
     parser.add_argument("--lora_path", type=str, default=None, help="The path of the LoRA weights to be used")
     parser.add_argument("--lora_rank", type=int, default=128, help="The rank of the LoRA weights")
-    parser.add_argument(
-        "--output_path", type=str, default="./output.mp4", help="The path where the generated video will be saved"
-    )
+    parser.add_argument("--output_path", type=str, default="./output.mp4", help="The path save generated video")
     parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
-    parser.add_argument(
-        "--num_inference_steps", type=int, default=50, help="Number of steps for the inference process"
-    )
+    parser.add_argument("--num_inference_steps", type=int, default=50, help="Inference steps")
+    parser.add_argument("--num_frames", type=int, default=81, help="Number of steps for the inference process")
+    parser.add_argument("--width", type=int, default=1360, help="Number of steps for the inference process")
+    parser.add_argument("--height", type=int, default=768, help="Number of steps for the inference process")
+    parser.add_argument("--fps", type=int, default=16, help="Number of steps for the inference process")
     parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
-    parser.add_argument(
-        "--generate_type", type=str, default="t2v", help="The type of video generation (e.g., 't2v', 'i2v', 'v2v')"
-    )
-    parser.add_argument(
-        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16' or 'bfloat16')"
-    )
+    parser.add_argument("--generate_type", type=str, default="t2v", help="The type of video generation")
+    parser.add_argument("--dtype", type=str, default="bfloat16", help="The data type for computation")
     parser.add_argument("--seed", type=int, default=42, help="The seed for reproducibility")
@@ -180,6 +182,9 @@ if __name__ == "__main__":
         lora_path=args.lora_path,
         lora_rank=args.lora_rank,
         output_path=args.output_path,
+        num_frames=args.num_frames,
+        width=args.width,
+        height=args.height,
         image_or_video_path=args.image_or_video_path,
         num_inference_steps=args.num_inference_steps,
         guidance_scale=args.guidance_scale,
@@ -187,4 +192,5 @@ if __name__ == "__main__":
         dtype=dtype,
         generate_type=args.generate_type,
         seed=args.seed,
+        fps=args.fps,
     )
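Taken together, the new arguments map directly onto `generate_video`. A hypothetical call with the CogVideoX1.5 defaults, assuming the updated script is importable (for example, run from the `inference/` directory); the prompt and output path are placeholders:

```python
# Illustrative sketch (not from this commit): driving the updated demo function
# directly with the CogVideoX1.5 defaults introduced above.
import torch
from cli_demo import generate_video  # assumes inference/ is on PYTHONPATH

generate_video(
    prompt="A girl riding a bike.",       # placeholder prompt
    model_path="THUDM/CogVideoX1.5-5B",   # new default model
    num_frames=81,                        # 5 seconds at 16 fps for CogVideoX1.5
    width=1360,
    height=768,                           # width/height only affect the I2V model
    output_path="./output.mp4",
    num_inference_steps=50,
    guidance_scale=6.0,
    generate_type="t2v",
    dtype=torch.bfloat16,
    seed=42,
    fps=16,
)
```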
Quantized inference demo script

@@ -3,7 +3,7 @@ This script demonstrates how to generate a video from a text prompt using CogVid
 
 Note:
 
-Must install the `torchao`,`torch`,`diffusers`,`accelerate` library FROM SOURCE to use the quantization feature.
+Must install the `torchao`,`torch` library FROM SOURCE to use the quantization feature.
 Only NVIDIA GPUs like H100 or higher are supported om FP-8 quantization.
 
 ALL quantization schemes must use with NVIDIA GPUs.
@@ -51,6 +51,9 @@ def generate_video(
     num_videos_per_prompt: int = 1,
     quantization_scheme: str = "fp8",
     dtype: torch.dtype = torch.bfloat16,
+    num_frames: int = 81,
+    fps: int = 8,
+    seed: int = 42,
 ):
     """
     Generates a video based on the given prompt and saves it to the specified path.
@@ -65,7 +68,6 @@ def generate_video(
     - quantization_scheme (str): The quantization scheme to use ('int8', 'fp8').
     - dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
     """
-
     text_encoder = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder", torch_dtype=dtype)
     text_encoder = quantize_model(part=text_encoder, quantization_scheme=quantization_scheme)
     transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer", torch_dtype=dtype)
@@ -80,54 +82,38 @@ def generate_video(
         torch_dtype=dtype,
     )
     pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
 
-    # Using with compile will run faster. First time infer will cost ~30min to compile.
-    # pipe.transformer.to(memory_format=torch.channels_last)
 
-    # for FP8 should remove pipe.enable_model_cpu_offload()
     pipe.enable_model_cpu_offload()
 
-    # This is not for FP8 and INT8 and should remove this line
-    # pipe.enable_sequential_cpu_offload()
     pipe.vae.enable_slicing()
     pipe.vae.enable_tiling()
 
     video = pipe(
         prompt=prompt,
         num_videos_per_prompt=num_videos_per_prompt,
         num_inference_steps=num_inference_steps,
-        num_frames=49,
+        num_frames=num_frames,
         use_dynamic_cfg=True,
         guidance_scale=guidance_scale,
-        generator=torch.Generator(device="cuda").manual_seed(42),
+        generator=torch.Generator(device="cuda").manual_seed(seed),
     ).frames[0]
 
-    export_to_video(video, output_path, fps=8)
+    export_to_video(video, output_path, fps=fps)
 
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Generate a video from a text prompt using CogVideoX")
     parser.add_argument("--prompt", type=str, required=True, help="The description of the video to be generated")
+    parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX-5b", help="Path of the pre-trained model")
+    parser.add_argument("--output_path", type=str, default="./output.mp4", help="Path to save generated video")
+    parser.add_argument("--num_inference_steps", type=int, default=50, help="Inference steps")
+    parser.add_argument("--guidance_scale", type=float, default=6.0, help="Classifier-free guidance scale")
+    parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Videos to generate per prompt")
+    parser.add_argument("--dtype", type=str, default="bfloat16", help="Data type (e.g., 'float16', 'bfloat16')")
     parser.add_argument(
-        "--model_path", type=str, default="THUDM/CogVideoX-5b", help="The path of the pre-trained model to be used"
-    )
-    parser.add_argument(
-        "--output_path", type=str, default="./output.mp4", help="The path where the generated video will be saved"
-    )
-    parser.add_argument(
-        "--num_inference_steps", type=int, default=50, help="Number of steps for the inference process"
-    )
-    parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
-    parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
-    parser.add_argument(
-        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16', 'bfloat16')"
-    )
-    parser.add_argument(
-        "--quantization_scheme",
-        type=str,
-        default="bf16",
-        choices=["int8", "fp8"],
-        help="The quantization scheme to use (int8, fp8)",
+        "--quantization_scheme", type=str, default="fp8", choices=["int8", "fp8"], help="Quantization scheme"
     )
+    parser.add_argument("--num_frames", type=int, default=81, help="Number of frames in the video")
+    parser.add_argument("--fps", type=int, default=16, help="Frames per second for output video")
+    parser.add_argument("--seed", type=int, default=42, help="Random seed for reproducibility")
 
     args = parser.parse_args()
     dtype = torch.float16 if args.dtype == "float16" else torch.bfloat16
@@ -140,4 +126,7 @@ if __name__ == "__main__":
         num_videos_per_prompt=args.num_videos_per_prompt,
         quantization_scheme=args.quantization_scheme,
         dtype=dtype,
+        num_frames=args.num_frames,
+        fps=args.fps,
+        seed=args.seed,
     )
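A hypothetical invocation of this quantization demo's `generate_video` with the newly exposed parameters; the import path is assumed (the script's filename is not shown in this view), and `int8` is chosen because, per the note above, FP8 needs an H100-class GPU:

```python
# Illustrative sketch (not from this commit): calling the quantization demo's
# generate_video with the newly added num_frames/fps/seed parameters.
import torch
from cli_demo_quantization import generate_video  # hypothetical module name

generate_video(
    prompt="A panda playing a guitar in a bamboo forest.",  # placeholder prompt
    model_path="THUDM/CogVideoX-5b",
    output_path="./output.mp4",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_videos_per_prompt=1,
    quantization_scheme="int8",  # FP8 requires an H100-or-newer GPU per the note above
    dtype=torch.bfloat16,
    num_frames=49,               # CogVideoX1.0-style length; 81 is the new CLI default
    fps=8,
    seed=42,
)
```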
requirements.txt

@@ -1,14 +1,14 @@
 diffusers>=0.31.0
-accelerate>=1.0.1
-transformers>=4.46.1
+accelerate>=1.1.1
+transformers>=4.46.2
 numpy==1.26.0
 torch>=2.5.0
 torchvision>=0.20.0
 sentencepiece>=0.2.0
 SwissArmyTransformer>=0.4.12
-gradio>=5.4.0
+gradio>=5.5.0
 imageio>=2.35.1
 imageio-ffmpeg>=0.5.1
-openai>=1.53.0
+openai>=1.54.0
 moviepy>=1.0.3
 scikit-video>=1.1.11
SAT-to-diffusers weight conversion script

@@ -92,6 +92,8 @@ TRANSFORMER_KEYS_RENAME_DICT = {
     "post_attn1_layernorm": "norm2.norm",
     "time_embed.0": "time_embedding.linear_1",
     "time_embed.2": "time_embedding.linear_2",
+    "ofs_embed.0": "ofs_embedding.linear_1",
+    "ofs_embed.2": "ofs_embedding.linear_2",
     "mixins.patch_embed": "patch_embed",
     "mixins.final_layer.norm_final": "norm_out.norm",
     "mixins.final_layer.linear": "proj_out",
@@ -146,12 +148,13 @@ def update_state_dict_inplace(state_dict: Dict[str, Any], old_key: str, new_key:
 
 
 def convert_transformer(
     ckpt_path: str,
     num_layers: int,
     num_attention_heads: int,
     use_rotary_positional_embeddings: bool,
     i2v: bool,
     dtype: torch.dtype,
+    init_kwargs: Dict[str, Any],
 ):
     PREFIX_KEY = "model.diffusion_model."
 
@@ -161,11 +164,13 @@ def convert_transformer(
         num_layers=num_layers,
         num_attention_heads=num_attention_heads,
         use_rotary_positional_embeddings=use_rotary_positional_embeddings,
-        use_learned_positional_embeddings=i2v,
+        ofs_embed_dim=512 if (i2v and init_kwargs["patch_size_t"] is not None) else None,  # CogVideoX1.5-5B-I2V
+        use_learned_positional_embeddings=i2v and init_kwargs["patch_size_t"] is None,  # CogVideoX-5B-I2V
+        **init_kwargs,
     ).to(dtype=dtype)
 
     for key in list(original_state_dict.keys()):
-        new_key = key[len(PREFIX_KEY):]
+        new_key = key[len(PREFIX_KEY) :]
         for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
             new_key = new_key.replace(replace_key, rename_key)
         update_state_dict_inplace(original_state_dict, key, new_key)
@@ -175,13 +180,18 @@ def convert_transformer(
             if special_key not in key:
                 continue
             handler_fn_inplace(key, original_state_dict)
 
     transformer.load_state_dict(original_state_dict, strict=True)
     return transformer
 
 
-def convert_vae(ckpt_path: str, scaling_factor: float, dtype: torch.dtype):
+def convert_vae(ckpt_path: str, scaling_factor: float, version: str, dtype: torch.dtype):
+    init_kwargs = {"scaling_factor": scaling_factor}
+    if version == "1.5":
+        init_kwargs.update({"invert_scale_latents": True})
+
     original_state_dict = get_state_dict(torch.load(ckpt_path, map_location="cpu", mmap=True))
-    vae = AutoencoderKLCogVideoX(scaling_factor=scaling_factor).to(dtype=dtype)
+    vae = AutoencoderKLCogVideoX(**init_kwargs).to(dtype=dtype)
 
     for key in list(original_state_dict.keys()):
         new_key = key[:]
@@ -199,6 +209,34 @@ def convert_vae(ckpt_path: str, scaling_factor: float, dtype: torch.dtype):
     return vae
 
 
+def get_transformer_init_kwargs(version: str):
+    if version == "1.0":
+        vae_scale_factor_spatial = 8
+        init_kwargs = {
+            "patch_size": 2,
+            "patch_size_t": None,
+            "patch_bias": True,
+            "sample_height": 480 // vae_scale_factor_spatial,
+            "sample_width": 720 // vae_scale_factor_spatial,
+            "sample_frames": 49,
+        }
+
+    elif version == "1.5":
+        vae_scale_factor_spatial = 8
+        init_kwargs = {
+            "patch_size": 2,
+            "patch_size_t": 2,
+            "patch_bias": False,
+            "sample_height": 768 // vae_scale_factor_spatial,
+            "sample_width": 1360 // vae_scale_factor_spatial,
+            "sample_frames": 81,
+        }
+    else:
+        raise ValueError("Unsupported version of CogVideoX.")
+
+    return init_kwargs
+
+
 def get_args():
     parser = argparse.ArgumentParser()
     parser.add_argument(
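A hypothetical driver showing how the new version switch feeds the converters added or extended in this file; the checkpoint paths are placeholders, and the snippet assumes it runs in the same module as these helpers:

```python
# Illustrative sketch (not from this commit): wiring the CogVideoX1.5 defaults
# through the converters above. Paths are placeholders; 42 layers / 48 heads
# follow the 5B comments in get_args() below.
import torch

init_kwargs = get_transformer_init_kwargs("1.5")  # patch_size_t=2, 81 sample frames, 1360x768 grid
transformer = convert_transformer(
    "transformer.pt",   # placeholder checkpoint path
    42,                 # num_layers for the 5B model
    48,                 # num_attention_heads for the 5B model
    True,               # use_rotary_positional_embeddings
    True,               # i2v: ofs embedding is created because patch_size_t is not None
    torch.bfloat16,
    init_kwargs,
)
# scaling_factor uses the get_args() default shown below; adjust per model.
vae = convert_vae("vae.pt", scaling_factor=1.15258426, version="1.5", dtype=torch.float32)
```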
@@ -214,6 +252,12 @@ def get_args():
     parser.add_argument(
         "--text_encoder_cache_dir", type=str, default=None, help="Path to text encoder cache directory"
     )
+    parser.add_argument(
+        "--typecast_text_encoder",
+        action="store_true",
+        default=False,
+        help="Whether or not to apply fp16/bf16 precision to text_encoder",
+    )
     # For CogVideoX-2B, num_layers is 30. For 5B, it is 42
     parser.add_argument("--num_layers", type=int, default=30, help="Number of transformer blocks")
     # For CogVideoX-2B, num_attention_heads is 30. For 5B, it is 48
@@ -226,7 +270,18 @@ def get_args():
     parser.add_argument("--scaling_factor", type=float, default=1.15258426, help="Scaling factor in the VAE")
     # For CogVideoX-2B, snr_shift_scale is 3.0. For 5B, it is 1.0
     parser.add_argument("--snr_shift_scale", type=float, default=3.0, help="Scaling factor in the VAE")
-    parser.add_argument("--i2v", action="store_true", default=False, help="Whether to save the model weights in fp16")
+    parser.add_argument(
+        "--i2v",
+        action="store_true",
+        default=False,
+        help="Whether the model to be converted is the Image-to-Video version of CogVideoX.",
+    )
+    parser.add_argument(
+        "--version",
+        choices=["1.0", "1.5"],
+        default="1.0",
+        help="Which version of CogVideoX to use for initializing default modeling parameters.",
+    )
     return parser.parse_args()
 
 
@@ -242,6 +297,7 @@ if __name__ == "__main__":
     dtype = torch.float16 if args.fp16 else torch.bfloat16 if args.bf16 else torch.float32
 
     if args.transformer_ckpt_path is not None:
+        init_kwargs = get_transformer_init_kwargs(args.version)
         transformer = convert_transformer(
             args.transformer_ckpt_path,
             args.num_layers,
@@ -249,14 +305,19 @@ if __name__ == "__main__":
             args.use_rotary_positional_embeddings,
             args.i2v,
             dtype,
+            init_kwargs,
         )
     if args.vae_ckpt_path is not None:
-        vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, dtype)
+        # Keep VAE in float32 for better quality
+        vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, args.version, torch.float32)
 
-    text_encoder_id = "/share/official_pretrains/hf_home/t5-v1_1-xxl"
+    text_encoder_id = "google/t5-v1_1-xxl"
     tokenizer = T5Tokenizer.from_pretrained(text_encoder_id, model_max_length=TOKENIZER_MAX_LENGTH)
     text_encoder = T5EncoderModel.from_pretrained(text_encoder_id, cache_dir=args.text_encoder_cache_dir)
 
+    if args.typecast_text_encoder:
+        text_encoder = text_encoder.to(dtype=dtype)
+
     # Apparently, the conversion does not work anymore without this :shrug:
     for param in text_encoder.parameters():
         param.data = param.data.contiguous()
@@ -288,11 +349,6 @@ if __name__ == "__main__":
         scheduler=scheduler,
     )
 
-    if args.fp16:
-        pipe = pipe.to(dtype=torch.float16)
-    if args.bf16:
-        pipe = pipe.to(dtype=torch.bfloat16)
-
     # We don't use variant here because the model must be run in fp16 (2B) or bf16 (5B). It would be weird
     # for users to specify variant when the default is not fp32 and they want to run with the correct default (which
     # is either fp16/bf16 here).