update final data and link

2025-06-13 19:19:15 +08:00 · 2024-08-27 19:12:17 +08:00 · f047689759
commit f047689759
parent 46703ef7a8
3 changed files with 74 additions and 69 deletions
--- a/README.md
+++ b/README.md
@@ -155,11 +155,13 @@ To view the corresponding prompt words for the gallery, please click [here](reso

 ## Model Introduction

-<table  style="border-collapse: collapse; width: 100%;">
+CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?fr=osm_cogvideo). The table below displays the list of video generation models we currently offer, along with their foundational information.
+
+<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogVideoX-2B</th>
-    <th style="text-align: center;">CogVideoX-5B</th>
+    <th style="text-align: center;">CogVideoX-5B (This Repository)</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Description</td>
@@ -168,33 +170,33 @@ To view the corresponding prompt words for the gallery, please click [here](reso
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
-    <td style="text-align: center;"><b>FP16*(Recommended)</b>, BF16, FP32, FP8*(E4M3, E5M2), INT8, INT4 not supported</td>
-    <td style="text-align: center;"><b>BF16(Recommended)</b>, FP16, FP32, FP8*(E4M3, E5M2), INT8, INT4 not supported</td>
+    <td style="text-align: center;"><b>FP16* (Recommended)</b>, BF16, FP32, FP8*, INT8, no support for INT4</td>
+    <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
  </tr>
  <tr>
-    <td style="text-align: center;">Single GPU Memory Consumption<br></td>
+    <td style="text-align: center;">Single GPU VRAM Consumption</td>
    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
  </tr>
  <tr>
-    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
-    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
-    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
+    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
  </tr>
  <tr>
-    <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
-    <td style="text-align: center;">FP16: ~90* s</td>
-    <td style="text-align: center;">BF16: ~180* s</td>
+    <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
+    <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
+    <td style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
  </tr>
  <tr>
-    <td style="text-align: center;">Fine-Tuning Precision</td>
+    <td style="text-align: center;">Fine-tuning Precision</td>
    <td style="text-align: center;"><b>FP16</b></td>
    <td style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
-    <td style="text-align: center;">Fine-Tuning Memory Consumption (per GPU)</td>
+    <td style="text-align: center;">Fine-tuning VRAM Consumption (per GPU)</td>
    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
@@ -206,46 +208,42 @@ To view the corresponding prompt words for the gallery, please click [here](reso
  </tr>
  <tr>
    <td style="text-align: center;">Video Length</td>
-    <td colspan="2" style="text-align: center;">6 seconds</td>
+    <td colspan="2" style="text-align: center;">6 Seconds</td>
  </tr>
  <tr>
    <td style="text-align: center;">Frame Rate</td>
-    <td colspan="2" style="text-align: center;">8 frames per second</td>
+    <td colspan="2" style="text-align: center;">8 Frames per Second</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Resolution</td>
-    <td colspan="2" style="text-align: center;">720 * 480, other resolutions not supported (including fine-tuning)</td>
+    <td colspan="2" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Positional Encoding</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
-    <td style="text-align: center;">3d_rope_pos_embed<br></td>
+    <td style="text-align: center;">3d_rope_pos_embed</td>
  </tr>
  <tr>
-    <td style="text-align: center;">Download Links (Diffusers Model)</td>
+    <td style="text-align: center;">Download Page (Diffusers)</td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  </tr>
  <tr>
-    <td style="text-align: center;">Download Links (SAT Model)</td>
-    <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+    <td style="text-align: center;">Download Page (SAT)</td>
+    <td colspan="2" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
  </tr>
 </table>

 **Data Explanation**

-+ When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()`
-  optimization were enabled. This setup has not been tested for actual memory/VRAM usage on devices other than **NVIDIA
-  A100 / H100**. Generally, this approach should be compatible with all devices using the **NVIDIA Ampere architecture**
-  and above. If these optimizations are disabled, memory usage will increase significantly, with peak VRAM usage
-  approximately three times higher than the values shown in the table.
-+ When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
-+ Using the INT8 model will result in slower inference speeds. This is done to ensure that inference can be performed on
-  GPUs with lower memory without significant video quality loss, albeit with a notable reduction in speed.
-+ Inference speed tests were also conducted with the above memory optimizations. Without memory optimization, inference
-  speed increases by approximately 10%. Only the `diffusers` version of the model supports quantization.
-+ The model only supports English input; other languages can be translated into English when refined through large
-  language models.
+- When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. This setup has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100/H100**. Generally, it should work on all devices with the **NVIDIA Ampere architecture** or newer. If these optimizations are disabled, VRAM usage increases significantly, with peak VRAM approximately 3 times the value in the table.
+- When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization needs to be disabled.
+- Using an INT8 model significantly reduces inference speed. This trade-off lets GPUs with lower VRAM run inference properly with minimal loss of video quality.
+- The 2B model was trained in `FP16` precision, while the 5B model was trained in `BF16` precision. We recommend running inference in the same precision that was used for training.
+- `FP8` precision requires `NVIDIA H100` or newer devices, and the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages must be installed from source. `CUDA 12.4` is recommended.
+- Inference speed was also measured with the VRAM optimizations described above. Without them, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
+- The model only supports English input; prompts in other languages can be translated into English when they are refined with a large language model.
+

 ## Friendly Links

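The "Data Explanation" notes in the README.md hunk above can be sketched in Python. This is a minimal illustration, not code from the repository: `ModelSpec`, `SPECS`, `estimated_peak_vram_gb`, and `load_pipeline` are hypothetical names, the numbers are copied from the table, and the 3x factor is the README's own approximation. `load_pipeline` assumes a `diffusers` release that ships `CogVideoXPipeline` and imports it lazily, so the rest of the block runs without a GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    repo_id: str         # Hugging Face repo listed in the download row
    precision: str       # recommended inference precision (torch dtype name)
    vram_gb: float       # single-GPU VRAM with diffusers, optimizations enabled
    int8_vram_gb: float  # same setup with the INT8 model


# Values copied from the table in the diff; the dict itself is illustrative.
SPECS = {
    "2b": ModelSpec("THUDM/CogVideoX-2b", "float16", 12.5, 7.8),
    "5b": ModelSpec("THUDM/CogVideoX-5b", "bfloat16", 20.7, 11.4),
}


def estimated_peak_vram_gb(size: str, optimized: bool = True, int8: bool = False) -> float:
    """Rough peak VRAM estimate based only on the README's figures."""
    spec = SPECS[size]
    base = spec.int8_vram_gb if int8 else spec.vram_gb
    # The README says peak VRAM is roughly 3x the table value without optimizations.
    return base if optimized else base * 3.0


def load_pipeline(size: str, multi_gpu: bool = False):
    """Load a pipeline with the optimizations used for the table's measurements."""
    import torch  # imported lazily so the helpers above work without torch
    from diffusers import CogVideoXPipeline

    spec = SPECS[size]
    pipe = CogVideoXPipeline.from_pretrained(
        spec.repo_id, torch_dtype=getattr(torch, spec.precision)
    )
    if not multi_gpu:
        # Must stay disabled for multi-GPU inference, per the README;
        # device placement in that case is left to the caller.
        pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()
    return pipe
```

For example, `estimated_peak_vram_gb("2b", optimized=False)` gives 37.5, which suggests why the optimizations matter on GPUs with less than 40 GB of VRAM.
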
--- a/README_ja.md
+++ b/README_ja.md
@@ -139,34 +139,34 @@ pip install -r requirements.txt

 ## モデル紹介

-CogVideoXは [清影](https://chatglm.cn/video?fr=osm_cogvideox) に由来するオープンソース版のビデオ生成モデルです。
-以下の表は、提供しているビデオ生成モデルに関する基本情報を示しています。
+CogVideoXは[清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源のオープンソース版動画生成モデルです。
+以下の表は、提供されている動画生成モデルに関する基本情報を示しています。

 <table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">モデル名</th>
    <th style="text-align: center;">CogVideoX-2B</th>
-    <th style="text-align: center;">CogVideoX-5B</th>
+    <th style="text-align: center;">CogVideoX-5B (本リポジトリ)</th>
  </tr>
  <tr>
    <td style="text-align: center;">モデル紹介</td>
-    <td style="text-align: center;">入門レベルのモデルで、互換性を重視しています。運用や二次開発のコストが低いです。</td>
-    <td style="text-align: center;">より高いビデオ生成品質と優れた視覚効果を提供する大型モデル。</td>
+    <td style="text-align: center;">入門モデルで、互換性を重視。運用および二次開発のコストが低い。</td>
+    <td style="text-align: center;">動画生成品質が高く、視覚効果がより優れた大型モデル。</td>
  </tr>
  <tr>
    <td style="text-align: center;">推論精度</td>
-    <td style="text-align: center;"><b>FP16*(推奨)</b>, BF16, FP32, FP8(E4M3, E5M2), INT8, INT4はサポートされていません</td>
-    <td style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32, FP8(E4M3, E5M2), INT8, INT4はサポートされていません</td>
+    <td style="text-align: center;"><b>FP16*(推奨)</b>, BF16, FP32, FP8*(E4M3, E5M2), INT8, INT4は非対応</td>
+    <td style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32, FP8*(E4M3, E5M2), INT8, INT4は非対応</td>
  </tr>
  <tr>
-    <td style="text-align: center;">単一GPUメモリ消費量<br></td>
+    <td style="text-align: center;">単一GPUのメモリ消費量</td>
    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
  </tr>
  <tr>
-    <td style="text-align: center;">マルチGPU推論メモリ消費量</td>
-    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
-    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td style="text-align: center;">複数GPUの推論メモリ消費量</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
+    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">推論速度<br>(Step = 50)</td>
@@ -179,59 +179,55 @@ CogVideoXは [清影](https://chatglm.cn/video?fr=osm_cogvideox) に由来する
    <td style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
-    <td style="text-align: center;">微調整メモリ消費量(各GPU)</td>
+    <td style="text-align: center;">微調整時のメモリ消費量 (1GPUあたり)</td>
    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)</td>
  </tr>
  <tr>
    <td style="text-align: center;">プロンプト言語</td>
    <td colspan="2" style="text-align: center;">英語*</td>
  </tr>
  <tr>
-    <td style="text-align: center;">プロンプト長さ制限</td>
-    <td colspan="2" style="text-align: center;">226 トークン</td>
+    <td style="text-align: center;">プロンプトの長さ上限</td>
+    <td colspan="2" style="text-align: center;">226トークン</td>
  </tr>
  <tr>
-    <td style="text-align: center;">ビデオ長さ</td>
-    <td colspan="2" style="text-align: center;">6 秒</td>
+    <td style="text-align: center;">動画の長さ</td>
+    <td colspan="2" style="text-align: center;">6秒</td>
  </tr>
  <tr>
    <td style="text-align: center;">フレームレート</td>
-    <td colspan="2" style="text-align: center;">8 フレーム/秒</td>
+    <td colspan="2" style="text-align: center;">8フレーム/秒</td>
  </tr>
  <tr>
-    <td style="text-align: center;">ビデオ解像度</td>
-    <td colspan="2" style="text-align: center;">720 * 480、他の解像度はサポートされていません（微調整を含む）</td>
+    <td style="text-align: center;">動画の解像度</td>
+    <td colspan="2" style="text-align: center;">720 * 480、他の解像度はサポートされていません（微調整も含む）</td>
  </tr>
-    <tr>
-    <td style="text-align: center;">位置エンコーディング</td>
+  <tr>
+    <td style="text-align: center;">位置エンコード</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
-    <td style="text-align: center;">3d_rope_pos_embed<br></td>
+    <td style="text-align: center;">3d_rope_pos_embed</td>
  </tr>
  <tr>
-    <td style="text-align: center;">ダウンロードリンク (Diffusers モデル)</td>
+    <td style="text-align: center;">ダウンロードリンク (Diffusers)</td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  </tr>
  <tr>
-    <td style="text-align: center;">ダウンロードリンク (SAT モデル)</td>
+    <td style="text-align: center;">ダウンロードリンク (SAT)</td>
    <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
  </tr>
 </table>

 **データ解説**

-+ diffusers ライブラリを使用してテストを行った際に、`enable_model_cpu_offload()` オプションと `pipe.vae.enable_tiling()`
-  最適化が有効になっていました。このセットアップは **NVIDIA A100 / H100** 以外のデバイスでの実際のメモリ/VRAM
-  使用量についてはテストされていません。通常、このアプローチは **NVIDIA Ampere アーキテクチャ**
-  以上のすべてのデバイスに適しています。これらの最適化を無効にすると、メモリ使用量が大幅に増加し、表に示されている値の約3倍になります。
-+ マルチGPU推論を行う際には、`enable_model_cpu_offload()` 最適化を無効にする必要があります。
-+ INT8 モデルを使用すると推論速度が低下しますが、これは、メモリの少ないGPUでも正常に推論できるようにし、ビデオ品質の損失を最小限に抑えるためです。推論速度は大幅に低下します。
-
-推論速度テストも上記のメモリ最適化を使用して実施されました。メモリ最適化を使用しない場合、推論速度は約10％向上します。量子化をサポートしているのは `diffusers`
-バージョンのモデルのみです。
-
-+ モデルは英語入力のみをサポートしており、他の言語は大規模な言語モデルを通じて英語に翻訳することで対応できます。
++ diffusersライブラリを使用したテストでは、`enable_model_cpu_offload()`オプションと`pipe.vae.enable_tiling()`最適化が有効になっています。この手法は、**NVIDIA A100 / H100**以外のデバイスでの実際のメモリ/メモリ消費量についてはテストされていません。通常、この手法はすべての**NVIDIA Ampereアーキテクチャ**以上のデバイスに適合します。最適化を無効にすると、メモリ消費量が倍増し、ピークメモリは表の3倍程度になります。
++ 複数GPUで推論する際は、`enable_model_cpu_offload()`最適化を無効にする必要があります。
++ INT8モデルを使用すると推論速度が低下します。これは、メモリが少ないGPUで正常に推論を行い、動画品質の損失を最小限に抑えるためです。そのため、推論速度が大幅に低下します。
++ 2Bモデルは`FP16`精度でトレーニングされ、5Bモデルは`BF16`精度でトレーニングされています。推奨される精度で推論を行うことをお勧めします。
++ `FP8`精度は`NVIDIA H100`以上のデバイスでのみ使用でき、`torch`、`torchao`、`diffusers`、`accelerate`のPythonパッケージをソースコードからインストールする必要があります。`CUDA 12.4`の使用を推奨します。
++ 推論速度のテストも上記のメモリ最適化手法を使用して行いました。メモリ最適化を行わない場合、推論速度が約10％向上します。量子化をサポートするのは`diffusers`バージョンのモデルのみです。
++ モデルは英語入力のみをサポートしており、他の言語は大モデルでのポストプロセスで英語に翻訳する必要があります。

 ## 友好的リンク

--- a/README_zh.md
+++ b/README_zh.md
@@ -206,6 +206,15 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed<br></td>
  </tr>
+  <tr>
+    <td style="text-align: center;">下载链接 (Diffusers)</td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">下载链接 (SAT)</td>
+    <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+  </tr>
 </table>

 **数据解释**
@@ -215,6 +224,8 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
  以上的设备。若关闭优化，显存占用会成倍增加，峰值显存约为表格的3倍。
 + 多GPU推理时，需要关闭 `enable_model_cpu_offload()` 优化。
 + 使用 INT8 模型会导致推理速度降低，此举是为了满足显存较低的显卡能正常推理并保持较少的视频质量损失，推理速度大幅降低。
++ 2B 模型采用 `FP16` 精度训练， 5B模型采用 `BF16` 精度训练。我们推荐使用模型训练的精度进行推理。
++ `FP8` 精度必须在`NVIDIA H100` 及以上的设备上使用，需要源代码安装`torch`,`torchao`,`diffusers`,`accelerate` python包，推荐使用 `CUDA 12.4`。
 + 推理速度测试同样采用了上述显存优化方案，不采用显存优化的情况下，推理速度提升约10%。 只有`diffusers`版本模型支持量化。
 + 模型仅支持英语输入，其他语言可以通过大模型润色时翻译为英语。
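
The `FP8` note added in this hunk (H100 and above only) could be checked at runtime before selecting a precision. The sketch below is an assumption-laden illustration: `fp8_allowed` and `current_device_fp8_allowed` are hypothetical helpers, and mapping "H100 and above" to CUDA compute capability 9.0 or higher is an interpretation of the README, not something it states.

```python
# CUDA compute capability of the NVIDIA H100 (Hopper).
H100_CAPABILITY = (9, 0)


def fp8_allowed(capability) -> bool:
    """True if a (major, minor) compute capability meets the README's FP8 rule.

    Assumption: "H100 and above" is read as capability >= 9.0.
    """
    return tuple(capability) >= H100_CAPABILITY


def current_device_fp8_allowed() -> bool:
    """Check the active CUDA device; returns False when torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return fp8_allowed(torch.cuda.get_device_capability())
```

An A100 reports capability `(8, 0)`, so `fp8_allowed((8, 0))` is false and a caller would fall back to `FP16`, `BF16`, or INT8 as listed in the table.
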