update readme with gpu memory cost

2026-07-01 21:28:34 +08:00 · 2024-08-29 20:07:49 +08:00 · 2024-08-29 20:07:49 +08:00 · 29642ade50
commit 29642ade50
parent f9a49ad88f
7 changed files with 112 additions and 61 deletions
--- a/README.md
+++ b/README.md
@@ -22,9 +22,12 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac

 ## Update and News

- 🔥🔥 **News**: ```2024/8/27```: The **CogVideoX-2B** model's open-source license has been changed to the **Apache 2.0
+- 🔥🔥 **News**: ```2024/8/29```: By adding `pipe.enable_sequential_cpu_offload()` and `pipe.vae.enable_slicing()` to the
+  inference code of CogVideoX-5B, VRAM usage can be reduced to `5GB`. Please check the
+  updated [cli_demo](inference/cli_demo.py).
+- 🔥 **News**: ```2024/8/27```: The **CogVideoX-2B** model's open-source license has been changed to the **Apache 2.0
  License**.
- 🔥🔥 **News**: ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**.
+- 🔥 **News**: ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**.
  We have significantly optimized the model's inference performance, greatly lowering the inference threshold. You can
  run **CogVideoX-2B** on older GPUs like the `GTX 1080TI`, and run the **CogVideoX-5B** model on mid-range GPUs like
  the `RTX 3060`. Please ensure you update and install the dependencies according to
@@ -178,9 +181,9 @@ models we currently offer, along with their foundational information.
    <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
  </tr>
  <tr>
-    <td style="text-align: center;">Single GPU VRAM Consumption</td>
-    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
-    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
+    <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: starting from 4GB*</b><br><b>diffusers INT8(torchao): starting from 3.6GB*</b></td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: starting from 5GB*</b><br><b>diffusers INT8(torchao): starting from 4.4GB*</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
@@ -240,25 +243,35 @@ models we currently offer, along with their foundational information.

 **Data Explanation**

- When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()`
-  optimization were enabled. This solution has not been tested for actual VRAM/memory usage on devices other than *
-  *NVIDIA A100/H100**. Generally, this solution can be adapted to all devices with **NVIDIA Ampere architecture** and
-  above. If optimization is disabled, VRAM usage will increase significantly, with peak VRAM approximately 3 times the
-  value in the table.
- When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
- Using an INT8 model will result in reduced inference speed. This is done to accommodate GPUs with lower VRAM, allowing
-  inference to run properly with minimal video quality loss, though the inference speed will be significantly reduced.
- The 2B model is trained using `FP16` precision, while the 5B model is trained using `BF16` precision. It is
-  recommended to use the precision used in model training for inference.
- [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
-  used to quantize the Text Encoder, Transformer and VAE modules to lower the memory requirement of CogVideoX. This
-  makes it possible to run the model on free-tier T4 Colab or smaller VRAM GPUs as well! It is also worth noting that
-  TorchAO quantization is fully compatible with `torch.compile`, which allows for much faster inference speed. `FP8`
-  precision must be used on `NVIDIA H100` and above devices, requiring source installation of
-  the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages. `CUDA 12.4` is recommended.
- Inference speed testing also used the aforementioned VRAM optimization scheme. Without VRAM optimization, inference
-  speed increases by about 10%. Only models using `diffusers` support quantization.
- The model only supports English input; other languages can be translated to English during large model refinements.
+ When testing using the `diffusers` library, all optimizations provided by the `diffusers` library were enabled. This
+  solution has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100 / H100**. Generally,
+  this solution can be adapted to all devices with **NVIDIA Ampere architecture** and above. If the optimizations are
+  disabled, VRAM usage will increase significantly, with peak VRAM usage about 3 times the value shown in the
+  table. However, speed will increase by 3-4 times. You can selectively disable some optimizations, including:
+
+```
+pipe.enable_model_cpu_offload()
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+``` 
+
+ When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models will significantly reduce inference speed. The trade-off lets GPUs with lower VRAM run inference
+  normally with minimal video quality loss.
+ The 2B model is trained with `FP16` precision, and the 5B model is trained with `BF16` precision. We recommend using
+  the precision the model was trained with for inference.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+  used to quantize the text encoder, Transformer, and VAE modules to reduce CogVideoX's memory requirements. This makes
+  it possible to run the model on a free T4 Colab or GPUs with smaller VRAM! It is also worth noting that TorchAO
+  quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8`
+  precision must be used on devices with `NVIDIA H100` or above, which requires installing
+  the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
+ The inference speed test also used the above VRAM optimization scheme. Without VRAM optimization, inference speed
+  increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; other languages can be translated into English during refinement by a large
+  model.
+

 ## Friendly Links

@@ -321,6 +334,8 @@ This folder contains some tools for model conversion / caption generation, etc.

 + [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights.
 + [caption_demo](tools/caption): Caption tool, a model that understands videos and outputs them in text.
+ [AutoDL Mirror](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): A one-click deployment mirror of the
+  Hugging Face Space, provided by community members.

 ## CogVideo(ICLR'23)

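Reviewer note: the four optimizations listed under **Data Explanation** in the README.md change above can be applied in one place. The following is a minimal sketch, not the project's own code: `apply_vram_optimizations` and `load_low_vram_pipeline` are hypothetical helper names, and the sketch assumes that a single offload mode is enough (the demo scripts in this commit call both `enable_model_cpu_offload()` and `enable_sequential_cpu_offload()`).

```python
from typing import Any


def apply_vram_optimizations(
    pipe: Any,
    offload: str = "sequential",
    vae_slicing: bool = True,
    vae_tiling: bool = True,
) -> Any:
    """Enable the README's memory optimizations on a diffusers-style pipeline.

    offload: "sequential", "model", or "none". Use "none" for multi-GPU
    inference, where the README says CPU offload must be disabled.
    (Hypothetical helper; the names and modes are assumptions of this sketch.)
    """
    if offload == "sequential":
        pipe.enable_sequential_cpu_offload()
    elif offload == "model":
        pipe.enable_model_cpu_offload()
    elif offload != "none":
        raise ValueError(f"unknown offload mode: {offload!r}")

    if vae_slicing:
        pipe.vae.enable_slicing()
    if vae_tiling:
        pipe.vae.enable_tiling()
    return pipe


def load_low_vram_pipeline(model_id: str = "THUDM/CogVideoX-5b") -> Any:
    """Load CogVideoX with the low-VRAM settings (hypothetical helper)."""
    # Imported lazily so apply_vram_optimizations works without torch/diffusers.
    import torch
    from diffusers import CogVideoXPipeline

    # No .to("cuda") here: the offload hooks move modules to the GPU as needed.
    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return apply_vram_optimizations(pipe, offload="sequential")
```

Passing `offload="none"` leaves only the VAE options enabled, which matches the README's requirement that CPU offload be disabled for multi-GPU inference.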
--- a/README_ja.md
+++ b/README_ja.md
@@ -22,9 +22,12 @@

 ## 更新とニュース

- 🔥🔥 **ニュース**: ```2024/8/27```: **CogVideoX-2B** モデルのオープンソースライセンスが **Apache 2.0 ライセンス**
+- 🔥🔥 **ニュース**: ```2024/8/29```: `pipe.enable_sequential_cpu_offload()` と `pipe.vae.enable_slicing()`
+  をCogVideoX-5Bの推論コードに追加することで、VRAM使用量を`5GB`
+  まで削減できます。更新された[cli_demo](inference/cli_demo.py)をご覧ください。
+- 🔥**ニュース**: ```2024/8/27```: **CogVideoX-2B** モデルのオープンソースライセンスが **Apache 2.0 ライセンス**
  に変更されました。
- 🔥🔥 **ニュース**: ```2024/8/27```: CogVideoX シリーズのより大きなモデル **CogVideoX-5B** をオープンソース化しました。
+- 🔥**ニュース**: ```2024/8/27```: CogVideoX シリーズのより大きなモデル **CogVideoX-5B** をオープンソース化しました。
  モデルの推論性能を大幅に最適化し、推論のハードルを大幅に下げました。`GTX 1080TI` などの旧型GPUで **CogVideoX-2B**
  を、`RTX 3060` などのミドル
 - 🔥**ニュース**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) は CogVideoX
@@ -163,9 +166,9 @@ CogVideoXは[清影](https://chatglm.cn/video?lang=en?fr=osm_cogvideo) 同源の
    <td style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32, FP8*(E4M3, E5M2), INT8, INT4は非対応</td>
  </tr>
  <tr>
-    <td style="text-align: center;">単一GPUのメモリ消費量</td>
-    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
-    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
+    <td style="text-align: center;">シングルGPU VRAM消費量<br></td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: 4GBから*</b><br><b>diffusers INT8(torchao): 3.6GBから*</b></td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: 5GBから*</b><br><b>diffusers INT8(torchao): 4.4GBから*</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">複数GPUの推論メモリ消費量</td>
@@ -225,25 +228,31 @@ CogVideoXは[清影](https://chatglm.cn/video?lang=en?fr=osm_cogvideo) 同源の

 **データ解説**

-+ diffusersライブラリを使用したテストでは、`enable_model_cpu_offload()`オプションと`pipe.vae.enable_tiling()`
-  最適化が有効になっています。この手法は、**NVIDIA A100 / H100**以外のデバイスでの実際のメモリ/メモリ消費量についてはテストされていません。通常、この手法はすべての
-  **NVIDIA Ampereアーキテクチャ**以上のデバイスに適合します。最適化を無効にすると、メモリ消費量が倍増し、ピークメモリは表の3倍程度になります。
-+ 複数GPUで推論する際は、`enable_model_cpu_offload()`最適化を無効にする必要があります。
-+ INT8モデルを使用すると推論速度が低下します。これは、メモリが少ないGPUで正常に推論を行い、動画品質の損失を最小限に抑えるためです。そのため、推論速度が大幅に低下します。
- [PytorchAO](https://github.com/pytorch/ao) と [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) は、Text
-  Encoder、Transformer、VAE モジュールを量子化して CogVideoX のメモリ要件を下げるために使用できます。これにより、無料の T4
-  Colab や小さな VRAM GPU でもモデルを実行できるようになります！また、TorchAO の量子化は `torch.compile`
+ `diffusers` ライブラリを使用してテストする際、`diffusers` ライブラリに付属するすべての最適化を有効にしました。このソリューションは、
+  **NVIDIA A100 / H100** 以外のデバイスでの実際のVRAM/メモリ使用量についてはテストされていません。一般的に、このソリューションは
+  **NVIDIA Ampereアーキテクチャ**
+  以上のすべてのデバイスに適応できます。最適化を無効にすると、VRAM使用量が大幅に増加し、表の約3倍のピークVRAMを使用しますが、速度は3-4倍向上します。以下の最適化の一部を選択的に無効にすることができます:
+
+```
+pipe.enable_model_cpu_offload()
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+```
+
+ マルチGPU推論を行う際には、`enable_model_cpu_offload()` の最適化を無効にする必要があります。
+ INT8モデルを使用すると推論速度が低下します。これは、ビデオ品質の損失を最小限に抑えながら、VRAMが少ないGPUでも正常に推論できるようにするためですが、推論速度は大幅に低下します。
+ 2Bモデルは `FP16` 精度でトレーニングされ、5Bモデルは `BF16` 精度でトレーニングされています。推論には、モデルがトレーニングされた精度を使用することをお勧めします。
+ [PytorchAO](https://github.com/pytorch/ao) および [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
+  は、CogVideoXのメモリ要件を削減するためにテキストエンコーダー、トランスフォーマー、およびVAEモジュールを量子化するために使用できます。これにより、無料のT4
+  ColabまたはVRAMが少ないGPUでモデルを実行できるようになります。また、TorchAO量子化は `torch.compile`
  と完全に互換性があり、推論速度を大幅に向上させることができます。`NVIDIA H100` 以上のデバイスでは `FP8`
-  精度が必須であり、`torch`、`torchao`、`diffusers`、`accelerate` の Python
-  パッケージのソースインストールが必要です。`CUDA 12.4` の使用を推奨します。
-+ `FP8`精度は`NVIDIA H100`以上のデバイスでのみ使用でき、`torch`、`torchao`、`diffusers`、`accelerate`
-  のPythonパッケージをソースコードからインストールする必要があります。`CUDA 12.4`の使用を推奨します。
+  精度を使用する必要があり、これには `torch`、`torchao`、`diffusers`、および `accelerate` Python
+  パッケージをソースからインストールする必要があります。`CUDA 12.4` が推奨されます。
 +
-
-推論速度のテストも上記のメモリ最適化手法を使用して行いました。メモリ最適化を行わない場合、推論速度が約10％向上します。量子化をサポートするのは`diffusers`
+推論速度テストでも、上記のVRAM最適化スキームを使用しました。VRAMの最適化を行わない場合、推論速度は約10％向上します。量子化をサポートするのは `diffusers`
 バージョンのモデルのみです。
-
-+ モデルは英語入力のみをサポートしており、他の言語は大モデルでのポストプロセスで英語に翻訳する必要があります。
+ モデルは英語入力のみをサポートしており、他の言語は大規模なモデルでのリファイン時に英語に翻訳できます。

 ## 友好的リンク

--- a/README_zh.md
+++ b/README_zh.md
@@ -23,8 +23,10 @@

 ## 项目更新

- 🔥🔥 **News**: ```2024/8/27```: **CogVideoX-2B** 模型开源协议已经修改为**Apache 2.0 协议**。
- 🔥🔥 **News**: ```2024/8/27```:  我们开源 CogVideoX 系列更大的模型 **CogVideoX-5B**
+- 🔥🔥 **News**: ```2024/8/29```: 使用`pipe.enable_sequential_cpu_offload()` 和 `pipe.vae.enable_slicing()`
+  加入到CogVideoX-5B的推理代码中，可以将显存占用下降至`5GB`，请查看[cli_demo](inference/cli_demo.py)的更新。
+- 🔥 **News**: ```2024/8/27```: **CogVideoX-2B** 模型开源协议已经修改为**Apache 2.0 协议**。
+- 🔥 **News**: ```2024/8/27```:  我们开源 CogVideoX 系列更大的模型 **CogVideoX-5B**
  。我们大幅度优化了模型的推理性能，推理门槛大幅降低，您可以在 `GTX 1080TI` 等早期显卡运行 **CogVideoX-2B**，在 `RTX 3060`
  等桌面端甜品卡运行 **CogVideoX-5B** 模型。 请严格按照[要求](requirements.txt)
  更新安装依赖，推理代码请查看 [cli_demo](inference/cli_demo.py)。
@@ -157,13 +159,13 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
  </tr>
  <tr>
    <td style="text-align: center;">推理精度</td>
-    <td style="text-align: center;"><b>FP16*(推荐)</b>, BF16, FP32，FP8*(E4M3，E5M2)，INT8，不支持INT4</td>
-    <td style="text-align: center;"><b>BF16(推荐)</b>, FP16, FP32，FP8*(E4M3，E5M2)，INT8，不支持INT4</td>
+    <td style="text-align: center;"><b>FP16*(推荐)</b>, BF16, FP32，FP8*，INT8，不支持INT4</td>
+    <td style="text-align: center;"><b>BF16(推荐)</b>, FP16, FP32，FP8*，INT8，不支持INT4</td>
  </tr>
  <tr>
    <td style="text-align: center;">单GPU显存消耗<br></td>
-    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers with torchao</b></td>
-    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers with torchao</b></td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers BF16: 4GB起* </b><br><b>diffusers INT8(torchao): 3.6GB起*</b></td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16 : 5GB起* </b><br><b>diffusers INT8(torchao): 4.4GB起* </b></td>
  </tr>
  <tr>
    <td style="text-align: center;">多GPU推理显存消耗</td>
@@ -223,13 +225,19 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源

 **数据解释**

-+ 使用 diffusers 库进行测试时，启用了 `enable_model_cpu_offload()` 选项 和 `pipe.vae.enable_tiling()` 优化，该方案未测试在非
-  **NVIDIA A100 / H100** 外的设备上的实际显存 / 内存占用。通常，该方案可以适配于所有 **NVIDIA 安培架构**
-  以上的设备。若关闭优化，显存占用会成倍增加，峰值显存约为表格的3倍。
+ 使用 diffusers 库进行测试时，启用了全部`diffusers`库自带的优化，该方案未测试在非**NVIDIA A100 / H100** 外的设备上的实际显存 / 内存占用。通常，该方案可以适配于所有 **NVIDIA 安培架构**
+以上的设备。若关闭优化，显存占用会成倍增加，峰值显存约为表格的3倍。但速度提升3-4倍左右。你可以选择性的关闭部分优化，这些优化包括:
+```
+pipe.enable_model_cpu_offload()
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+```
+
 + 多GPU推理时，需要关闭 `enable_model_cpu_offload()` 优化。
 + 使用 INT8 模型会导致推理速度降低，此举是为了满足显存较低的显卡能正常推理并保持较少的视频质量损失，推理速度大幅降低。
 + 2B 模型采用 `FP16` 精度训练， 5B模型采用 `BF16` 精度训练。我们推荐使用模型训练的精度进行推理。
- [PytorchAO](https://github.com/pytorch/ao) 和 [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
+ [PytorchAO](https://github.com/pytorch/ao) 和 [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
  可以用于量化文本编码器、Transformer 和 VAE 模块，以降低 CogVideoX 的内存需求。这使得在免费的 T4 Colab 或更小显存的 GPU
  上运行模型成为可能！同样值得注意的是，TorchAO 量化完全兼容 `torch.compile`，这可以显著提高推理速度。在 `NVIDIA H100`
  及以上设备上必须使用 `FP8` 精度，这需要源码安装 `torch`、`torchao`、`diffusers` 和 `accelerate` Python
@@ -243,6 +251,8 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源

 + [Xorbits Inference](https://github.com/xorbitsai/inference): 性能强大且功能全面的分布式推理框架，轻松一键部署你自己的模型或内置的前沿开源模型。
 + [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSys 提供了易用且高性能的视频生成基础设施，支持完整的管道，并持续集成最新的模型和技术。
+ [AutoDL镜像](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): 由社区成员提供的一键部署Huggingface
+  Space镜像。

 ## 完整项目代码结构

--- a/inference/cli_demo_quantization.py
+++ b/inference/cli_demo_quantization.py
@@ -78,7 +78,7 @@ def generate_video(
        transformer=transformer,
        vae=vae,
        torch_dtype=dtype,
-    ).to("cuda")
+    )
    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

    # Compiling makes inference faster, but the first run takes ~30 min to compile.
--- a/inference/gradio_composite_demo/app.py
+++ b/inference/gradio_composite_demo/app.py
@@ -32,8 +32,17 @@ snapshot_download(repo_id="AlexWortega/RIFE", local_dir="model_rife")
 pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to(device)
 pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

-pipe.transformer.to(memory_format=torch.channels_last)
-pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
+# Optional VRAM optimizations; they can be removed if enough GPU memory is available
+
+pipe.enable_model_cpu_offload()
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()
+
+# Compile
+
+# pipe.transformer.to(memory_format=torch.channels_last)
+# pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

 os.makedirs("./output", exist_ok=True)
 os.makedirs("./gradio_tmp", exist_ok=True)
--- a/inference/gradio_web_demo.py
+++ b/inference/gradio_web_demo.py
@@ -18,9 +18,14 @@ from datetime import datetime, timedelta
 from openai import OpenAI
 import moviepy.editor as mp

-dtype = torch.float16
-device = "cuda" if torch.cuda.is_available() else "cpu"
-pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype).to(device)
+dtype = torch.bfloat16
+device = "cuda"  # The CPU-offload calls below need a CUDA GPU
+
+pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=dtype).to(device)
+pipe.enable_model_cpu_offload()
+pipe.enable_sequential_cpu_offload()
+pipe.vae.enable_slicing()
+pipe.vae.enable_tiling()

 os.makedirs("./output", exist_ok=True)
 os.makedirs("./gradio_tmp", exist_ok=True)
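Reviewer note: the hunk above switches the demo from `THUDM/CogVideoX-2b` with `torch.float16` to `THUDM/CogVideoX-5b` with `torch.bfloat16`, which follows the README's advice to run inference in the precision each model was trained with. A minimal sketch of that mapping follows. `TRAINING_PRECISION` and `recommended_dtype_name` are hypothetical names used only for illustration, and the table covers only the two model IDs that appear in this diff.

```python
# Training precision per model, as stated in the README's Data Explanation:
# the 2B model is trained in FP16, the 5B model in BF16.
TRAINING_PRECISION = {
    "THUDM/CogVideoX-2b": "float16",
    "THUDM/CogVideoX-5b": "bfloat16",
}


def recommended_dtype_name(model_id: str) -> str:
    """Return the name of the torch dtype recommended for inference.

    Hypothetical helper: it only knows the two model IDs used in this diff
    and raises ValueError for anything else rather than guessing.
    """
    try:
        return TRAINING_PRECISION[model_id]
    except KeyError:
        raise ValueError(f"no recommended precision known for {model_id!r}") from None
```

With `torch` installed, the result can be turned into a dtype via `getattr(torch, recommended_dtype_name(model_id))` and passed as `torch_dtype` to `CogVideoXPipeline.from_pretrained`.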
--- a/inference/streamlit_web_demo.py
+++ b/inference/streamlit_web_demo.py
@@ -79,6 +79,9 @@ def generate_video(
        dtype=dtype,
    )
    pipe.enable_model_cpu_offload()
+    pipe.enable_sequential_cpu_offload()
+    pipe.vae.enable_slicing()
+    pipe.vae.enable_tiling()
    # Generate video
    video = pipe(
        num_inference_steps=num_inference_steps,