Mirror of https://github.com/THUDM/CogVideo.git (synced 2025-04-05 19:41:59 +08:00), commit 2fdc59c3ce.

.gitignore (vendored, 3 changed lines)

@@ -7,4 +7,5 @@ logs/
*.DS_Store
.idea
output*
test*
venv
README.md (29 changed lines)

@@ -22,7 +22,8 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac

## Project Updates

- 🔥🔥 News: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.
- 🔥🔥 **News**: ```2024/11/15```: We released the `CogVideoX1.5` model in the diffusers version. Only minor parameter adjustments are needed to continue using previous code.
- 🔥 News: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.
  The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution.
  The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
- 🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
@@ -43,11 +44,11 @@ The SAT code has already been updated, while the diffusers version is still unde
  model [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), used in the training process of
  CogVideoX to convert video data into text descriptions, has been open-sourced. Welcome to download and use it.
- 🔥 ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. We have
  significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run *
  *CogVideoX-2B** on older GPUs like `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs like `RTX 3060`. Please strictly
  significantly optimized the model's inference performance, greatly lowering the inference threshold.
  You can run **CogVideoX-2B** on older GPUs like `GTX 1080TI`, and **CogVideoX-5B** on desktop GPUs like `RTX 3060`. Please strictly
  follow the [requirements](requirements.txt) to update and install dependencies, and refer
  to [cli_demo](inference/cli_demo.py) for inference code. Additionally, the open-source license for the **CogVideoX-2B
  ** model has been changed to the **Apache 2.0 License**.
  to [cli_demo](inference/cli_demo.py) for inference code. Additionally, the open-source license for
  the **CogVideoX-2B** model has been changed to the **Apache 2.0 License**.
- 🔥 ```2024/8/6```: We have open-sourced **3D Causal VAE**, used for **CogVideoX-2B**, which can reconstruct videos with
  almost no loss.
- 🔥 ```2024/8/6```: We have open-sourced the first model of the CogVideoX series video generation models, **CogVideoX-2B
@@ -193,19 +194,19 @@ models we currently offer, along with their foundational information.
</tr>
<tr>
<td style="text-align: center;">Inference Precision</td>
<td colspan="2" style="text-align: center;"><b>BF16</b></td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, Not supported: INT4</td>
<td style="text-align: center;"><b>FP16*(Recommended)</b>, BF16, FP32, FP8*, INT8, Not supported: INT4</td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, Not supported: INT4</td>
</tr>
<tr>
<td style="text-align: center;">Single GPU Memory Usage<br></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB <br></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16: from 10GB*</b><br><b>diffusers INT8(torchao): from 7GB*</b></td>
<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB minimum*</b><br><b>diffusers INT8 (torchao): 3.6GB minimum*</b></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: 5GB minimum*</b><br><b>diffusers INT8 (torchao): 4.4GB minimum*</b></td>
</tr>
<tr>
<td style="text-align: center;">Multi-GPU Memory Usage</td>
<td colspan="2" style="text-align: center;"><b>Not Supported</b><br></td>
<td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b><br></td>
<td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
<td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
</tr>
@@ -243,7 +244,8 @@ models we currently offer, along with their foundational information.
</tr>
<tr>
<td style="text-align: center;">Download Link (Diffusers)</td>
<td colspan="2" style="text-align: center;"> Coming Soon </td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
@@ -278,15 +280,12 @@ pipe.vae.enable_tiling()
  used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This
  allows the model to run on free T4 Colabs or GPUs with smaller memory! Also, note that TorchAO quantization is fully
  compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision must be used on
  devices with NVIDIA H100 and above, requiring source installation of `torch`, `torchao`, `diffusers`, and `accelerate`
  Python packages. CUDA 12.4 is recommended.
  devices with NVIDIA H100 and above, requiring source installation of `torch`, `torchao` Python packages. CUDA 12.4 is recommended.
+ The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference speed
  increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; other languages can be translated into English for use via large model
  refinement.
+ The memory usage of model fine-tuning is tested in an `8 * H100` environment, and the program automatically
  uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number or more GPUs must be used
  for fine-tuning.
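
The notes above describe the memory-saving path at a high level; a minimal, hedged sketch of what it can look like in code follows. This is an illustration only, not code from the repository: the model id, the weight-only INT8 choice, and the prompt are placeholders, and the `torchao`/`diffusers` calls assume recent releases of both libraries.

```python
# Hedged sketch: quantize the text encoder, transformer, and VAE with TorchAO,
# then rely on sequential CPU offload plus VAE slicing/tiling, as the notes
# above describe. Model id, prompt, and INT8 choice are illustrative only.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Weight-only INT8 quantization of the three large modules, applied in place.
quantize_(pipe.text_encoder, int8_weight_only())
quantize_(pipe.transformer, int8_weight_only())
quantize_(pipe.vae, int8_weight_only())

# The memory optimizations referenced in the table above.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# TorchAO quantization is stated above to be compatible with torch.compile;
# enabling it trades a long first-run compile for faster later inference:
# pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

video = pipe(prompt="A golden retriever running on a beach.", num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```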

## Friendly Links

@@ -319,6 +318,8 @@ works have already been adapted for CogVideoX, and we invite everyone to use the
+ [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth Studio is a diffusion engine. It has
  restructured the architecture, including text encoders, UNet, VAE, etc., enhancing computational performance while
  maintaining compatibility with open-source community models. The framework has been adapted for CogVideoX.
+ [CogVideoX-Controlnet](https://github.com/TheDenk/cogvideox-controlnet): A simple ControlNet module code that includes the CogVideoX model.
+ [VideoTuna](https://github.com/VideoVerses/VideoTuna): VideoTuna is the first repo that integrates multiple AI video generation models for text-to-video, image-to-video, text-to-image generation.

## Project Structure
README_ja.md (20 changed lines)

@@ -22,7 +22,8 @@

## Updates and News

- 🔥🔥 News: ```2024/11/08```: We have released the `CogVideoX1.5` model. CogVideoX1.5 is an upgraded version of the open-source CogVideoX model.
- 🔥🔥 **News**: ```2024/11/15```: We have released the diffusers version of the `CogVideoX1.5` model. Previous code can be reused with only minor parameter adjustments.
- 🔥 **News**: ```2024/11/08```: We have released the `CogVideoX1.5` model. CogVideoX1.5 is an upgraded version of the open-source CogVideoX model.
  The CogVideoX1.5-5B series supports 10-second videos at higher resolution, and `CogVideoX1.5-5B-I2V` supports video generation at any resolution.
  The SAT code has already been updated, while the `diffusers` version is still being adapted.
  The SAT version of the code can be downloaded [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
@@ -185,19 +186,19 @@ CogVideoX shares its origin with [清影 (QingYing)](https://chatglm.cn/video?fr=osm_cogvideox)
</tr>
<tr>
<td style="text-align: center;">Inference Precision</td>
<td colspan="2" style="text-align: center;"><b>BF16</b></td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, INT4 not supported</td>
<td style="text-align: center;"><b>FP16* (Recommended)</b>, BF16, FP32, FP8*, INT8, INT4 not supported</td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, INT4 not supported</td>
</tr>
<tr>
<td style="text-align: center;">Single GPU Memory Usage<br></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB <br></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16: from 10GB*</b><br><b>diffusers INT8 (torchao): from 7GB*</b></td>
<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB minimum*</b><br><b>diffusers INT8 (torchao): 3.6GB minimum*</b></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: 5GB minimum*</b><br><b>diffusers INT8 (torchao): 4.4GB minimum*</b></td>
</tr>
<tr>
<td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
<td colspan="2" style="text-align: center;"><b>Not Supported</b><br></td>
<td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b><br></td>
<td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
<td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
</tr>
@@ -235,7 +236,8 @@ CogVideoX shares its origin with [清影 (QingYing)](https://chatglm.cn/video?fr=osm_cogvideox)
</tr>
<tr>
<td style="text-align: center;">Download Link (Diffusers)</td>
<td colspan="2" style="text-align: center;"> Coming Soon </td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
@@ -268,13 +270,11 @@ pipe.vae.enable_tiling()
  can be used to quantize the text encoder, transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes it
  possible to run the model on a free T4 Colab or on GPUs with less memory. Equally important, TorchAO quantization is fully
  compatible with `torch.compile`, which can significantly improve inference speed. On `NVIDIA H100` and newer devices, `FP8`
  precision must be used; this requires installing the `torch`, `torchao`, `diffusers`, and `accelerate`
  Python packages from source. `CUDA 12.4` is recommended.
  precision must be used; this requires installing the `torch` and `torchao` Python packages from source. `CUDA 12.4` is recommended.
+ The inference speed tests likewise used the memory optimization scheme above. Without memory optimization, inference speed increases by about 10%.
  Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; other languages can be translated into English via large-model refinement.
+ Fine-tuning memory usage was tested in an `8 * H100` environment, and the program automatically uses `Zero 2`
  optimization. If a specific number of GPUs is listed in the table, at least that many GPUs are required for fine-tuning.

## Friendly Links

@@ -299,6 +299,8 @@ pipe.vae.enable_tiling()
  A modified pipeline based on the CogVideoX structure, aimed at providing greater flexibility for keyframe-interpolation generation.
+ [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth
  Studio is a diffusion engine. It has restructured the architecture, including the text encoders, UNet, and VAE, improving computational performance while maintaining compatibility with open-source community models. The framework has been adapted for CogVideoX.
+ [CogVideoX-Controlnet](https://github.com/TheDenk/cogvideox-controlnet): A simple ControlNet module code that includes the CogVideoX model.
+ [VideoTuna](https://github.com/VideoVerses/VideoTuna): VideoTuna is the first repository to integrate multiple AI video generation models for text-to-video, image-to-video, and text-to-image generation.

## Project Structure
README_zh.md (22 changed lines)

@@ -22,7 +22,8 @@

## Project Updates

- 🔥🔥 **News**: ```2024/11/08```: We have released the `CogVideoX1.5` model. CogVideoX1.5 is an upgraded version of the open-source CogVideoX model.
- 🔥🔥 **News**: ```2024/11/15```: We have released the diffusers version of the `CogVideoX1.5` model; previous code can be reused with only minor parameter adjustments.
- 🔥 **News**: ```2024/11/08```: We have released the `CogVideoX1.5` model. CogVideoX1.5 is an upgraded version of the open-source CogVideoX model.
  The CogVideoX1.5-5B series supports **10-second** videos and higher resolution, and `CogVideoX1.5-5B-I2V` supports video generation at **any resolution**. The SAT code has already been updated, while the `diffusers` version is still being adapted. Download the SAT version of the code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
- 🔥 **News**: ```2024/10/13```: [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory), a lower-cost fine-tuning framework that can fine-tune `CogVideoX-5B` on a single 4090 GPU,
  has been released, with support for fine-tuning at multiple resolutions. Feel free to use it.
@@ -39,8 +40,7 @@ The CogVideoX1.5-5B series supports **10-second** videos and higher resolution
- 🔥 ```2024/8/27```: We have open-sourced **CogVideoX-5B**, a larger model in the CogVideoX series.
  We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can run **CogVideoX-2B** on early GPUs such as the `GTX 1080TI`, and **CogVideoX-5B**
  on mainstream desktop cards such as the `RTX 3060`. Please strictly follow the [requirements](requirements.txt)
  to update and install dependencies; see [cli_demo](inference/cli_demo.py) for inference code. Meanwhile, the open-source license of the **CogVideoX-2B** model has been changed to the **Apache 2.0 License**.
- 🔥 ```2024/8/6```: We have open-sourced **3D Causal VAE**, used for **CogVideoX-2B**, which can reconstruct videos almost losslessly.
- 🔥 ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
- 🌱 **Source**: ```2022/5/19```: We open-sourced the CogVideo video generation model (now available in the `CogVideo` branch), the first open-source video generation model based on
@@ -176,19 +176,19 @@ CogVideoX is the open-source model sharing the same origin as [清影 (QingYing)](https://chatglm.cn/video?fr=osm_cogvideox)
</tr>
<tr>
<td style="text-align: center;">Inference Precision</td>
<td colspan="2" style="text-align: center;"><b>BF16</b></td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, INT4 not supported</td>
<td style="text-align: center;"><b>FP16* (Recommended)</b>, BF16, FP32, FP8*, INT8, INT4 not supported</td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, INT4 not supported</td>
</tr>
<tr>
<td style="text-align: center;">Single GPU Memory Usage<br></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB <br></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16: from 10GB*</b><br><b>diffusers INT8 (torchao): from 7GB*</b></td>
<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB*</b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB*</b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
</tr>
<tr>
<td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
<td colspan="2" style="text-align: center;"><b>Not Supported</b><br></td>
<td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b><br></td>
<td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
<td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
</tr>
@@ -226,7 +226,8 @@ CogVideoX is the open-source model sharing the same origin as [清影 (QingYing)](https://chatglm.cn/video?fr=osm_cogvideox)
</tr>
<tr>
<td style="text-align: center;">Download Link (Diffusers)</td>
<td colspan="2" style="text-align: center;"> Coming Soon </td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
@@ -256,11 +257,9 @@ pipe.vae.enable_tiling()
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
  can be used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements, making it possible to run the model on a free T4 Colab or on GPUs with less VRAM.
  Also worth noting, TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. On `NVIDIA H100`
  and above, `FP8` precision must be used, which requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python
  packages from source. `CUDA 12.4` is recommended.
  and above, `FP8` precision must be used, which requires installing the `torch` and `torchao` Python packages from source. `CUDA 12.4` is recommended.
+ The inference speed tests also used the memory optimization scheme above; without memory optimization, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; other languages can be translated into English during refinement by a large model.
+ Fine-tuning memory usage was tested in an `8 * H100` environment; the program automatically uses `Zero 2` optimization. If a specific number of GPUs is noted in the table, at least that many GPUs must be used for fine-tuning.

## Friendly Links

@@ -280,6 +279,9 @@ pipe.vae.enable_tiling()
+ [CogVideoX-Interpolation](https://github.com/feizc/CogvideX-Interpolation): A pipeline modified from the CogVideoX structure, aimed at providing greater flexibility for keyframe-interpolation generation.
+ [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth Studio is a diffusion engine. It has restructured the architecture, including the text encoders, UNet, and VAE,
  improving computational performance while maintaining compatibility with open-source community models. The framework has been adapted for CogVideoX.
+ [CogVideoX-Controlnet](https://github.com/TheDenk/cogvideox-controlnet): A simple ControlNet module code that includes the CogVideoX model.
+ [VideoTuna](https://github.com/VideoVerses/VideoTuna): VideoTuna is the first repository to integrate multiple AI video generation models, supporting text-to-video, image-to-video, and text-to-image generation.

## Full Project Code Structure
inference/cli_demo.py

@@ -3,15 +3,15 @@ This script demonstrates how to generate a video using the CogVideoX model with
The script supports different types of video generation, including text-to-video (t2v), image-to-video (i2v),
and video-to-video (v2v), depending on the input data and different weight.

- text-to-video: THUDM/CogVideoX-5b or THUDM/CogVideoX-2b
- video-to-video: THUDM/CogVideoX-5b or THUDM/CogVideoX-2b
- image-to-video: THUDM/CogVideoX-5b-I2V
- text-to-video: THUDM/CogVideoX-5b, THUDM/CogVideoX-2b or THUDM/CogVideoX1.5-5b
- video-to-video: THUDM/CogVideoX-5b, THUDM/CogVideoX-2b or THUDM/CogVideoX1.5-5b
- image-to-video: THUDM/CogVideoX-5b-I2V or THUDM/CogVideoX1.5-5b-I2V

Running the Script:
To run the script, use the following command with appropriate arguments:

```bash
$ python cli_demo.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-5b --generate_type "t2v"
$ python cli_demo.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX1.5-5b --generate_type "t2v"
```

Additional options are available to specify the model path, guidance scale, number of inference steps, video generation type, and output paths.
@@ -23,7 +23,6 @@ from typing import Literal
import torch
from diffusers import (
    CogVideoXPipeline,
    CogVideoXDDIMScheduler,
    CogVideoXDPMScheduler,
    CogVideoXImageToVideoPipeline,
    CogVideoXVideoToVideoPipeline,
@@ -37,6 +36,9 @@ def generate_video(
    model_path: str,
    lora_path: str = None,
    lora_rank: int = 128,
    num_frames: int = 81,
    width: int = 1360,
    height: int = 768,
    output_path: str = "./output.mp4",
    image_or_video_path: str = "",
    num_inference_steps: int = 50,
@@ -45,6 +47,7 @@
    dtype: torch.dtype = torch.bfloat16,
    generate_type: str = Literal["t2v", "i2v", "v2v"],  # i2v: image to video, v2v: video to video
    seed: int = 42,
    fps: int = 8,
):
    """
    Generates a video based on the given prompt and saves it to the specified path.
@@ -56,11 +59,15 @@
    - lora_rank (int): The rank of the LoRA weights.
    - output_path (str): The path where the generated video will be saved.
    - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
    - num_frames (int): Number of frames to generate. CogVideoX1.0 generates 49 frames for 6 seconds at 8 fps, while CogVideoX1.5 produces either 81 or 161 frames, corresponding to 5 seconds or 10 seconds at 16 fps.
    - width (int): The width of the generated video, applicable only for CogVideoX1.5-5B-I2V
    - height (int): The height of the generated video, applicable only for CogVideoX1.5-5B-I2V
    - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
    - num_videos_per_prompt (int): Number of videos to generate per prompt.
    - dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
    - generate_type (str): The type of video generation (e.g., 't2v', 'i2v', 'v2v').
    - seed (int): The seed for reproducibility.
    - fps (int): The frames per second for the generated video.
    """

    # 1. Load the pre-trained CogVideoX pipeline with the specified precision (bfloat16).
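
As an aside for readers of this diff: the docstring above pins down the CogVideoX1.5 defaults (1360×768, 81 frames, 16 fps). A hedged sketch of calling this script's `generate_video()` from Python is shown below; it assumes `inference/cli_demo.py` is importable as `cli_demo`, and the prompt and paths are placeholders.

```python
# Hedged sketch of invoking this script's generate_video() programmatically.
# Assumes inference/cli_demo.py is on the import path; values mirror the
# CogVideoX1.5 defaults documented above and are illustrative only.
import torch
from cli_demo import generate_video

generate_video(
    prompt="A girl riding a bike.",
    model_path="THUDM/CogVideoX1.5-5B",
    num_frames=81,          # 81 frames ~= 5 s at 16 fps; 161 frames ~= 10 s
    width=1360,
    height=768,
    output_path="./output.mp4",
    num_inference_steps=50,
    dtype=torch.bfloat16,
    generate_type="t2v",
    seed=42,
    fps=16,
)
```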
@@ -97,23 +104,23 @@ def generate_video(
    # and enable to("cuda")

    # pipe.to("cuda")

    pipe.enable_sequential_cpu_offload()

    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    # 4. Generate the video frames based on the prompt.
    # `num_frames` is the Number of frames to generate.
    # This is the default value for 6 seconds video and 8 fps and will plus 1 frame for the first frame and 49 frames.
    if generate_type == "i2v":
        video_generate = pipe(
            height=height,
            width=width,
            prompt=prompt,
            image=image,  # The path of the image to be used as the background of the video
            image=image,
            # The path of the image, the resolution of video will be the same as the image for CogVideoX1.5-5B-I2V, otherwise it will be 720 * 480
            num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
            num_inference_steps=num_inference_steps,  # Number of inference steps
            num_frames=49,  # Number of frames to generate,changed to 49 for diffusers version `0.30.3` and after.
            use_dynamic_cfg=True,  # This id used for DPM Sechduler, for DDIM scheduler, it should be False
            num_frames=num_frames,  # Number of frames to generate
            use_dynamic_cfg=True,  # This is used for the DPM scheduler, for DDIM scheduler, it should be False
            guidance_scale=guidance_scale,
            generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
        ).frames[0]
@@ -122,7 +129,7 @@ def generate_video(
            prompt=prompt,
            num_videos_per_prompt=num_videos_per_prompt,
            num_inference_steps=num_inference_steps,
            num_frames=49,
            num_frames=num_frames,
            use_dynamic_cfg=True,
            guidance_scale=guidance_scale,
            generator=torch.Generator().manual_seed(seed),
@@ -133,13 +140,12 @@ def generate_video(
            video=video,  # The path of the video to be used as the background of the video
            num_videos_per_prompt=num_videos_per_prompt,
            num_inference_steps=num_inference_steps,
            # num_frames=49,
            num_frames=num_frames,
            use_dynamic_cfg=True,
            guidance_scale=guidance_scale,
            generator=torch.Generator().manual_seed(seed),  # Set the seed for reproducibility
        ).frames[0]
    # 5. Export the generated frames to a video file. fps must be 8 for original video.
    export_to_video(video_generate, output_path, fps=8)
    export_to_video(video_generate, output_path, fps=fps)


if __name__ == "__main__":
@@ -152,24 +158,20 @@ if __name__ == "__main__":
        help="The path of the image to be used as the background of the video",
    )
    parser.add_argument(
        "--model_path", type=str, default="THUDM/CogVideoX-5b", help="The path of the pre-trained model to be used"
        "--model_path", type=str, default="THUDM/CogVideoX1.5-5B", help="Path of the pre-trained model to use"
    )
    parser.add_argument("--lora_path", type=str, default=None, help="The path of the LoRA weights to be used")
    parser.add_argument("--lora_rank", type=int, default=128, help="The rank of the LoRA weights")
    parser.add_argument(
        "--output_path", type=str, default="./output.mp4", help="The path where the generated video will be saved"
    )
    parser.add_argument("--output_path", type=str, default="./output.mp4", help="Path to save the generated video")
    parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
    parser.add_argument(
        "--num_inference_steps", type=int, default=50, help="Number of steps for the inference process"
    )
    parser.add_argument("--num_inference_steps", type=int, default=50, help="Inference steps")
    parser.add_argument("--num_frames", type=int, default=81, help="Number of frames to generate")
    parser.add_argument("--width", type=int, default=1360, help="Width of the generated video")
    parser.add_argument("--height", type=int, default=768, help="Height of the generated video")
    parser.add_argument("--fps", type=int, default=16, help="Frames per second of the generated video")
    parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
    parser.add_argument(
        "--generate_type", type=str, default="t2v", help="The type of video generation (e.g., 't2v', 'i2v', 'v2v')"
    )
    parser.add_argument(
        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16' or 'bfloat16')"
    )
    parser.add_argument("--generate_type", type=str, default="t2v", help="The type of video generation")
    parser.add_argument("--dtype", type=str, default="bfloat16", help="The data type for computation")
    parser.add_argument("--seed", type=int, default=42, help="The seed for reproducibility")

    args = parser.parse_args()
@@ -180,6 +182,9 @@ if __name__ == "__main__":
        lora_path=args.lora_path,
        lora_rank=args.lora_rank,
        output_path=args.output_path,
        num_frames=args.num_frames,
        width=args.width,
        height=args.height,
        image_or_video_path=args.image_or_video_path,
        num_inference_steps=args.num_inference_steps,
        guidance_scale=args.guidance_scale,
@@ -187,4 +192,5 @@ if __name__ == "__main__":
        dtype=dtype,
        generate_type=args.generate_type,
        seed=args.seed,
        fps=args.fps,
    )
inference/cli_demo_quantization.py

@@ -3,7 +3,7 @@ This script demonstrates how to generate a video from a text prompt using CogVid

Note:

Must install the `torchao`, `torch`, `diffusers`, `accelerate` libraries FROM SOURCE to use the quantization feature.
Must install the `torchao`, `torch` libraries FROM SOURCE to use the quantization feature.
Only NVIDIA GPUs like H100 or higher are supported for FP-8 quantization.

ALL quantization schemes must be used with NVIDIA GPUs.
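
The function body below calls a `quantize_model()` helper that is not part of this hunk. As a hedged sketch only (the repository's actual helper may differ), such a helper could be built on TorchAO roughly as follows; `float8_weight_only()` is assumed to exist in recent TorchAO releases and, per the note above, FP8 weights need an H100-class GPU.

```python
# Hedged sketch of a quantize_model() helper of the kind used below; the real
# implementation in the repository may differ. int8_weight_only() is a standard
# TorchAO config; float8_weight_only() is assumed available in recent TorchAO.
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only, float8_weight_only


def quantize_model(part: nn.Module, quantization_scheme: str) -> nn.Module:
    """Quantize a pipeline component in place and return it."""
    if quantization_scheme == "int8":
        quantize_(part, int8_weight_only())
    elif quantization_scheme == "fp8":
        quantize_(part, float8_weight_only())
    return part
```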
@@ -51,6 +51,9 @@ def generate_video(
    num_videos_per_prompt: int = 1,
    quantization_scheme: str = "fp8",
    dtype: torch.dtype = torch.bfloat16,
    num_frames: int = 81,
    fps: int = 8,
    seed: int = 42,
):
    """
    Generates a video based on the given prompt and saves it to the specified path.
@@ -65,7 +68,6 @@ def generate_video(
    - quantization_scheme (str): The quantization scheme to use ('int8', 'fp8').
    - dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
    """

    text_encoder = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder", torch_dtype=dtype)
    text_encoder = quantize_model(part=text_encoder, quantization_scheme=quantization_scheme)
    transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer", torch_dtype=dtype)
@@ -80,54 +82,38 @@ def generate_video(
        torch_dtype=dtype,
    )
    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

    # Using with compile will run faster. First time infer will cost ~30min to compile.
    # pipe.transformer.to(memory_format=torch.channels_last)

    # for FP8 should remove pipe.enable_model_cpu_offload()
    pipe.enable_model_cpu_offload()

    # This is not for FP8 and INT8 and should remove this line
    # pipe.enable_sequential_cpu_offload()
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()

    video = pipe(
        prompt=prompt,
        num_videos_per_prompt=num_videos_per_prompt,
        num_inference_steps=num_inference_steps,
        num_frames=49,
        num_frames=num_frames,
        use_dynamic_cfg=True,
        guidance_scale=guidance_scale,
        generator=torch.Generator(device="cuda").manual_seed(42),
        generator=torch.Generator(device="cuda").manual_seed(seed),
    ).frames[0]

    export_to_video(video, output_path, fps=8)
    export_to_video(video, output_path, fps=fps)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate a video from a text prompt using CogVideoX")
    parser.add_argument("--prompt", type=str, required=True, help="The description of the video to be generated")
    parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX-5b", help="Path of the pre-trained model")
    parser.add_argument("--output_path", type=str, default="./output.mp4", help="Path to save generated video")
    parser.add_argument("--num_inference_steps", type=int, default=50, help="Inference steps")
    parser.add_argument("--guidance_scale", type=float, default=6.0, help="Classifier-free guidance scale")
    parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Videos to generate per prompt")
    parser.add_argument("--dtype", type=str, default="bfloat16", help="Data type (e.g., 'float16', 'bfloat16')")
    parser.add_argument(
        "--model_path", type=str, default="THUDM/CogVideoX-5b", help="The path of the pre-trained model to be used"
    )
    parser.add_argument(
        "--output_path", type=str, default="./output.mp4", help="The path where the generated video will be saved"
    )
    parser.add_argument(
        "--num_inference_steps", type=int, default=50, help="Number of steps for the inference process"
    )
    parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
    parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
    parser.add_argument(
        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16', 'bfloat16')"
    )
    parser.add_argument(
        "--quantization_scheme",
        type=str,
        default="bf16",
        choices=["int8", "fp8"],
        help="The quantization scheme to use (int8, fp8)",
        "--quantization_scheme", type=str, default="fp8", choices=["int8", "fp8"], help="Quantization scheme"
    )
    parser.add_argument("--num_frames", type=int, default=81, help="Number of frames in the video")
    parser.add_argument("--fps", type=int, default=16, help="Frames per second for output video")
    parser.add_argument("--seed", type=int, default=42, help="Random seed for reproducibility")

    args = parser.parse_args()
    dtype = torch.float16 if args.dtype == "float16" else torch.bfloat16
@@ -140,4 +126,7 @@ if __name__ == "__main__":
        num_videos_per_prompt=args.num_videos_per_prompt,
        quantization_scheme=args.quantization_scheme,
        dtype=dtype,
        num_frames=args.num_frames,
        fps=args.fps,
        seed=args.seed,
    )
requirements.txt

@@ -1,14 +1,14 @@
diffusers>=0.31.0
accelerate>=1.0.1
transformers>=4.46.1
accelerate>=1.1.1
transformers>=4.46.2
numpy==1.26.0
torch>=2.5.0
torchvision>=0.20.0
sentencepiece>=0.2.0
SwissArmyTransformer>=0.4.12
gradio>=5.4.0
gradio>=5.5.0
imageio>=2.35.1
imageio-ffmpeg>=0.5.1
openai>=1.53.0
openai>=1.54.0
moviepy>=1.0.3
scikit-video>=1.1.11
tools/convert_weight_sat2hf.py

@@ -92,6 +92,8 @@ TRANSFORMER_KEYS_RENAME_DICT = {
    "post_attn1_layernorm": "norm2.norm",
    "time_embed.0": "time_embedding.linear_1",
    "time_embed.2": "time_embedding.linear_2",
    "ofs_embed.0": "ofs_embedding.linear_1",
    "ofs_embed.2": "ofs_embedding.linear_2",
    "mixins.patch_embed": "patch_embed",
    "mixins.final_layer.norm_final": "norm_out.norm",
    "mixins.final_layer.linear": "proj_out",
@@ -146,12 +148,13 @@ def update_state_dict_inplace(state_dict: Dict[str, Any], old_key: str, new_key:


def convert_transformer(
    ckpt_path: str,
    num_layers: int,
    num_attention_heads: int,
    use_rotary_positional_embeddings: bool,
    i2v: bool,
    dtype: torch.dtype,
    ckpt_path: str,
    num_layers: int,
    num_attention_heads: int,
    use_rotary_positional_embeddings: bool,
    i2v: bool,
    dtype: torch.dtype,
    init_kwargs: Dict[str, Any],
):
    PREFIX_KEY = "model.diffusion_model."

@@ -161,11 +164,13 @@ def convert_transformer(
        num_layers=num_layers,
        num_attention_heads=num_attention_heads,
        use_rotary_positional_embeddings=use_rotary_positional_embeddings,
        use_learned_positional_embeddings=i2v,
        ofs_embed_dim=512 if (i2v and init_kwargs["patch_size_t"] is not None) else None,  # CogVideoX1.5-5B-I2V
        use_learned_positional_embeddings=i2v and init_kwargs["patch_size_t"] is None,  # CogVideoX-5B-I2V
        **init_kwargs,
    ).to(dtype=dtype)

    for key in list(original_state_dict.keys()):
        new_key = key[len(PREFIX_KEY):]
        new_key = key[len(PREFIX_KEY) :]
        for replace_key, rename_key in TRANSFORMER_KEYS_RENAME_DICT.items():
            new_key = new_key.replace(replace_key, rename_key)
        update_state_dict_inplace(original_state_dict, key, new_key)
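
The renaming loop above relies on `update_state_dict_inplace()`, whose body is not shown in this diff (only its signature appears in the hunk header). A minimal sketch of such a helper follows, under the assumption that it simply moves the tensor from the old key to the new key; the repository's actual implementation may differ.

```python
# Hedged sketch of the update_state_dict_inplace() helper referenced in the
# hunk header above (assumed behaviour: move the value from old_key to
# new_key without copying). The real implementation may differ.
from typing import Any, Dict


def update_state_dict_inplace(state_dict: Dict[str, Any], old_key: str, new_key: str) -> None:
    state_dict[new_key] = state_dict.pop(old_key)
```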
@@ -175,13 +180,18 @@ def convert_transformer(
        if special_key not in key:
            continue
        handler_fn_inplace(key, original_state_dict)

    transformer.load_state_dict(original_state_dict, strict=True)
    return transformer


def convert_vae(ckpt_path: str, scaling_factor: float, dtype: torch.dtype):
def convert_vae(ckpt_path: str, scaling_factor: float, version: str, dtype: torch.dtype):
    init_kwargs = {"scaling_factor": scaling_factor}
    if version == "1.5":
        init_kwargs.update({"invert_scale_latents": True})

    original_state_dict = get_state_dict(torch.load(ckpt_path, map_location="cpu", mmap=True))
    vae = AutoencoderKLCogVideoX(scaling_factor=scaling_factor).to(dtype=dtype)
    vae = AutoencoderKLCogVideoX(**init_kwargs).to(dtype=dtype)

    for key in list(original_state_dict.keys()):
        new_key = key[:]
@@ -199,6 +209,34 @@ def convert_vae(ckpt_path: str, scaling_factor: float, dtype: torch.dtype):
    return vae


def get_transformer_init_kwargs(version: str):
    if version == "1.0":
        vae_scale_factor_spatial = 8
        init_kwargs = {
            "patch_size": 2,
            "patch_size_t": None,
            "patch_bias": True,
            "sample_height": 480 // vae_scale_factor_spatial,
            "sample_width": 720 // vae_scale_factor_spatial,
            "sample_frames": 49,
        }

    elif version == "1.5":
        vae_scale_factor_spatial = 8
        init_kwargs = {
            "patch_size": 2,
            "patch_size_t": 2,
            "patch_bias": False,
            "sample_height": 768 // vae_scale_factor_spatial,
            "sample_width": 1360 // vae_scale_factor_spatial,
            "sample_frames": 81,
        }
    else:
        raise ValueError("Unsupported version of CogVideoX.")

    return init_kwargs


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
@@ -214,6 +252,12 @@ def get_args():
    parser.add_argument(
        "--text_encoder_cache_dir", type=str, default=None, help="Path to text encoder cache directory"
    )
    parser.add_argument(
        "--typecast_text_encoder",
        action="store_true",
        default=False,
        help="Whether or not to apply fp16/bf16 precision to text_encoder",
    )
    # For CogVideoX-2B, num_layers is 30. For 5B, it is 42
    parser.add_argument("--num_layers", type=int, default=30, help="Number of transformer blocks")
    # For CogVideoX-2B, num_attention_heads is 30. For 5B, it is 48
@@ -226,7 +270,18 @@ def get_args():
    parser.add_argument("--scaling_factor", type=float, default=1.15258426, help="Scaling factor in the VAE")
    # For CogVideoX-2B, snr_shift_scale is 3.0. For 5B, it is 1.0
    parser.add_argument("--snr_shift_scale", type=float, default=3.0, help="Scaling factor in the VAE")
    parser.add_argument("--i2v", action="store_true", default=False, help="Whether to save the model weights in fp16")
    parser.add_argument(
        "--i2v",
        action="store_true",
        default=False,
        help="Whether the model to be converted is the Image-to-Video version of CogVideoX.",
    )
    parser.add_argument(
        "--version",
        choices=["1.0", "1.5"],
        default="1.0",
        help="Which version of CogVideoX to use for initializing default modeling parameters.",
    )
    return parser.parse_args()

@@ -242,6 +297,7 @@ if __name__ == "__main__":
    dtype = torch.float16 if args.fp16 else torch.bfloat16 if args.bf16 else torch.float32

    if args.transformer_ckpt_path is not None:
        init_kwargs = get_transformer_init_kwargs(args.version)
        transformer = convert_transformer(
            args.transformer_ckpt_path,
            args.num_layers,
@@ -249,14 +305,19 @@ if __name__ == "__main__":
            args.use_rotary_positional_embeddings,
            args.i2v,
            dtype,
            init_kwargs,
        )
    if args.vae_ckpt_path is not None:
        vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, dtype)
        # Keep VAE in float32 for better quality
        vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, args.version, torch.float32)

    text_encoder_id = "/share/official_pretrains/hf_home/t5-v1_1-xxl"
    text_encoder_id = "google/t5-v1_1-xxl"
    tokenizer = T5Tokenizer.from_pretrained(text_encoder_id, model_max_length=TOKENIZER_MAX_LENGTH)
    text_encoder = T5EncoderModel.from_pretrained(text_encoder_id, cache_dir=args.text_encoder_cache_dir)

    if args.typecast_text_encoder:
        text_encoder = text_encoder.to(dtype=dtype)

    # Apparently, the conversion does not work anymore without this :shrug:
    for param in text_encoder.parameters():
        param.data = param.data.contiguous()
@@ -288,11 +349,6 @@ if __name__ == "__main__":
        scheduler=scheduler,
    )

    if args.fp16:
        pipe = pipe.to(dtype=torch.float16)
    if args.bf16:
        pipe = pipe.to(dtype=torch.bfloat16)

    # We don't use variant here because the model must be run in fp16 (2B) or bf16 (5B). It would be weird
    # for users to specify variant when the default is not fp32 and they want to run with the correct default (which
    # is either fp16/bf16 here).