From 6d7f6e860196381b5dbd5653e156671390cc3a68 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Tue, 27 Aug 2024 16:05:03 +0800
Subject: [PATCH 1/4] update the release draft readme

---
 README.md                              | 244 ++++++++++++++++++-------
 README_ja.md                           | 244 +++++++++++++++++--------
 README_zh.md                           | 200 +++++++++++++++-----
 inference/cli_demo.py                  |  25 +--
 inference/cli_demo_quantization.py     |  89 ++++++---
 inference/gradio_composite_demo/app.py |  39 ++--
 inference/gradio_web_demo.py           |  32 ++--
 sat/README.md                          |  19 +-
 sat/README_ja.md                       |  17 +-
 sat/README_zh.md                       |  17 +-
 10 files changed, 658 insertions(+), 268 deletions(-)
diff --git a/README.md b/README.md
index 218ff67..0fb23ee 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@
 <img src=resources/logo.svg width="50%"/>
 </div>
 <p align="center">
-🤗 Experience on <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
+Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> or <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a>
 </p>
 <p align="center">
 📚 Check here to view <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
@@ -22,7 +22,12 @@
 
 ## Update and News
 
-- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos
+- 🔥🔥 **News**: ```2024/8/27```: We have open-sourced a larger model in the CogVideoX series, **CogVideoX-5B**. At the
+  same time, **CogVideoX-2B** will be licensed under the **Apache 2.0 License**. We have significantly optimized the
+  model's inference performance, greatly lowering the hardware requirements for inference. You can now run
+  **CogVideoX-2B** on older GPUs like the `GTX 1080TI`, and **CogVideoX-5B** on mainstream desktop GPUs like
+  the `RTX 3060`.
+- 🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos
   generated by
   CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following
   the [tutorial](tools/venhancer/README_zh.md).
@@ -80,7 +85,6 @@ with long prompts, and a good prompt directly impacts the quality of the video g
 Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
 recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform
 rapid stacking and development.
-(18 GB for inference, 40GB for lora finetune)
 
 ### Diffusers
 
@@ -92,51 +96,154 @@ pip install -r requirements.txt
 
 Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
 significance of common parameters.
-(24GB for inference,fine-tuned code are under development)
 
-## CogVideoX-2B Gallery
+## Gallery
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
-  <p>A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.</p>
-</div>
+### CogVideoX-5B
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video>
+      </td>
+  </tr>
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video>
+      </td>
+  </tr>
+</table>
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="80%" controls autoplay></video>
-  <p>The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.</p>
-</div>
+### CogVideoX-2B 
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video>
+      </td>
+  </tr>
+</table>
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="80%" controls autoplay></video>
-  <p>A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.</p>
-</div>
-
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="80%" controls autoplay></video>
-  <p>In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.</p>
-</div>
+To view the prompts used for the gallery videos, click [here](resources/galary_prompt.md).
 
 ## Model Introduction
 
-CogVideoX is an open-source version of the video generation model, which is homologous
-to [清影](https://chatglm.cn/video?fr=osm_cogvideox).
+<table  style="border-collapse: collapse; width: 100%;">
+  <tr>
+    <th style="text-align: center;">Model Name</th>
+    <th style="text-align: center;">CogVideoX-2B</th>
+    <th style="text-align: center;">CogVideoX-5B</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Model Description</td>
+    <td style="text-align: center;">Entry-level model that prioritizes compatibility, with low cost for running and secondary development.</td>
+    <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Inference Precision</td>
+    <td style="text-align: center;"><b>FP16*(Recommended)</b>, BF16, FP32, FP8*(E4M3, E5M2), INT8; INT4 is not supported</td>
+    <td style="text-align: center;"><b>BF16(Recommended)</b>, FP16, FP32, FP8*(E4M3, E5M2), INT8; INT4 is not supported</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Single GPU Memory Consumption<br></td>
+    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
+    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
+    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
+    <td style="text-align: center;">FP16: ~90* s</td>
+    <td style="text-align: center;">BF16: ~180* s</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Fine-Tuning Precision</td>
+    <td style="text-align: center;"><b>FP16</b></td>
+    <td style="text-align: center;"><b>BF16</b></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Fine-Tuning Memory Consumption (per GPU)</td>
+    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
+    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Prompt Language</td>
+    <td colspan="2" style="text-align: center;">English*</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Prompt Length Limit</td>
+    <td colspan="2" style="text-align: center;">226 Tokens</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Video Length</td>
+    <td colspan="2" style="text-align: center;">6 seconds</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Frame Rate</td>
+    <td colspan="2" style="text-align: center;">8 frames per second</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Video Resolution</td>
+    <td colspan="2" style="text-align: center;">720 * 480, other resolutions not supported (including fine-tuning)</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Positional Encoding</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_rope_pos_embed<br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Download Links (Diffusers Model)</td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Download Links (SAT Model)</td>
+    <td colspan="2" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
+  </tr>
+</table>
 
-The table below shows the list of video generation models we currently provide,
-along with related basic information:
+**Data Explanation**
 
-| Model Name                                | CogVideoX-2B                                                                                                                                                                                        | 
-|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Prompt Language                           | English                                                                                                                                                                                             | 
-| Single GPU  Inference (FP16)              | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)   <br>  23.9GB using diffusers                                                                                                      | 
-| Multi GPUs Inference (FP16)               | 20GB minimum per GPU using diffusers                                                                                                                                                                |
-| GPU Memory Required for Fine-tuning(bs=1) | 40GB                                                                                                                                                                                                |
-| Prompt Max  Length                        | 226 Tokens                                                                                                                                                                                          |
-| Video Length                              | 6 seconds                                                                                                                                                                                           | 
-| Frames Per Second                         | 8 frames                                                                                                                                                                                            | 
-| Resolution                                | 720 * 480                                                                                                                                                                                           |
-| Quantized Inference                       | Not Supported                                                                                                                                                                                       |          
-| Download Link (HF diffusers Model)        | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B)   [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)   [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
-| Download Link (SAT Model)                 | [SAT](./sat/README.md)                                                                                                                                                                              |
++ When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()`
+  optimization were enabled. This setup has not been tested for actual memory/VRAM usage on devices other than **NVIDIA
+  A100 / H100**. Generally, this approach should be compatible with all devices using the **NVIDIA Ampere architecture**
+  and above. If these optimizations are disabled, memory usage will increase significantly, with peak VRAM usage
+  approximately three times higher than the values shown in the table.
++ When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
++ Using the INT8 model results in slower inference. This trade-off allows inference on GPUs with less memory
+  without significant loss of video quality, at the cost of a notable reduction in speed.
++ Inference speed tests were also conducted with the above memory optimizations enabled. Without memory optimization,
+  inference is approximately 10% faster. Only the `diffusers` version of the model supports quantization.
++ The model only supports English input; prompts in other languages can be translated into English while being refined
+  by a large language model.
 
 ## Friendly Links
 
@@ -157,20 +264,25 @@ of the **CogVideoX** open-source model.
 
 ### Inference
 
-+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
-  significance of common parameters.
-+ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of
-  memory, but it will be optimized in the future.
-+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because
-  CogVideoX is trained on long caption, we need to convert the input text to be consistent with the training
-  distribution using a LLM. By default, the script uses GLM4, but it can also be replaced with any other LLM such as
-  GPT, Gemini, etc.
-+ [gradio_web_demo](inference/gradio_web_demo.py): A simple gradio web UI demonstrating how to use the CogVideoX-2B
-  model to generate videos. Same as Our Huggingface Space, you can use this script to launch a web demo.
++ [cli_demo](inference/cli_demo.py): A more detailed explanation of the inference code, including the significance of
+  common parameters.
++ [cli_demo_quantization](inference/cli_demo_quantization.py):
+  Quantized model inference code that can run on devices with lower memory. You can also modify this code to support
+  running CogVideoX models in FP8 precision.
++ [diffusers_vae_demo](inference/cli_vae_demo.py): Code for running VAE inference separately.
++ [space demo](inference/gradio_composite_demo): The same GUI code as used in the Huggingface Space, with frame
+  interpolation and super-resolution tools integrated.
++ [convert_demo](inference/convert_demo.py): How to convert user input into long-form input suitable for CogVideoX.
+  Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data
+  using an LLM. The script defaults to using GLM4, but it can be replaced with GPT, Gemini, or any other large language
+  model.
++ [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web application demonstrating how to use the
+  CogVideoX-2B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple web
+  application for video generation.
 
 ```shell
 cd inference
-# For Linux and Windows users (and macOS with Intel??)
+# For Linux and Windows users
 python gradio_web_demo.py # humans mode
 
 # For macOS with Apple Silicon users, Intel not supported, this maybe 20x slower than RTX 4090
@@ -243,20 +355,28 @@ hands-on practice on text-to-video generation. *The original input is in Chinese
 
 ## Open Source Project Plan
 
-- [x] Open source CogVideoX model
-    - [x] Open source 3D Causal VAE used in CogVideoX.
-    - [x] CogVideoX model inference example (CLI / Web Demo)
-    - [x] CogVideoX online experience demo (Huggingface Space)
-    - [x] CogVideoX open source model API interface example (Huggingface)
-    - [x] CogVideoX model fine-tuning example (SAT)
-    - [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
-    - [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite)
-    - [x] Release CogVideoX technical report
+- [x] CogVideoX Model Open Source
+    - [x] CogVideoX Model Inference Example (CLI / Web Demo)
+    - [x] CogVideoX Online Experience Example (Huggingface Space)
+    - [x] CogVideoX Open Source Model API Interface Example (Huggingface)
+    - [x] CogVideoX Model Fine-Tuning Example (SAT)
+    - [ ] CogVideoX Model Fine-Tuning Example (Huggingface Diffusers)
+    - [x] CogVideoX-5B Open Source (Adapted to CogVideoX-2B Suite)
+    - [x] CogVideoX Technical Report Released
+    - [x] CogVideoX Technical Explanation Video
+- [ ] CogVideoX Peripheral Tools
+    - [x] Basic Video Super-Resolution / Frame Interpolation Suite
+    - [ ] Inference Framework Adaptation
+    - [ ] ComfyUI Full Ecosystem Tools
 
-We welcome your contributions. You can click [here](resources/contribute.md) for more information.
+We welcome your contributions! You can click [here](resources/contribute.md) for more information.
 
-## Model License
+## License Agreement
 
 The code in this repository is released under the [Apache 2.0 License](LICENSE).
 
-The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
+The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under
+the [Apache 2.0 License](LICENSE).
+
+The CogVideoX-5B model (Transformers module) is released under
+the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
diff --git a/README_ja.md b/README_ja.md
index 22e229f..1565de0 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -8,7 +8,7 @@
 <img src=resources/logo.svg width="50%"/>
 </div>
 <p align="center">
-🤗 <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a> で体験
+<a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> または <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a> で CogVideoX-5B モデルをオンラインで体験してください
 </p>
 <p align="center">
 📚 <a href="https://arxiv.org/abs/2408.06072" target="_blank">論文</a> をチェック
@@ -22,7 +22,10 @@
 
 ## 更新とニュース
 
-- 🔥🔥 **ニュース**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) は CogVideoX
+- 🔥🔥 **ニュース**: ```2024/8/27```: CogVideoXシリーズのより大きなモデル**CogVideoX-5B**をオープンソース化しました。同時に、
+  **CogVideoX-2B**は **Apache 2.0** ライセンスに変更されます。モデルの推論性能を大幅に最適化し、推論のハードルを大きく下げました。これにより、
+  **CogVideoX-2B**は `GTX 1080TI` などの古いGPUで、**CogVideoX-5B**は `RTX 3060` などのデスクトップ向けGPUで実行できます。
+- 🔥**ニュース**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) は CogVideoX
   が生成したビデオの強化をサポートしました。より高い解像度とより高品質なビデオレンダリングを実現します。[チュートリアル](tools/venhancer/README_ja.md)
   に従って、ぜひお試しください。
 - 🔥**ニュース**: 2024/8/15: CogVideoX の依存関係である`SwissArmyTransformer`の依存が`0.4.12`
@@ -71,7 +74,6 @@
 
 [sat_demo](sat/README.md) の指示に従ってください:
 SATウェイトの推論コードと微調整コードが含まれています。CogVideoXモデル構造に基づいて改善することをお勧めします。革新的な研究者は、このコードを使用して迅速なスタッキングと開発を行うことができます。
-(推論には18GB、lora微調整には40GBが必要です)
 
 ### Diffusers
 
@@ -80,49 +82,156 @@ pip install -r requirements.txt
 ```
 
 次に [diffusers_demo](inference/cli_demo.py) を参照してください: 推論コードの詳細な説明が含まれており、一般的なパラメータの意味についても言及しています。
-(推論には24GBが必要で、微調整コードは開発中です)
 
-## CogVideoX-2B ギャラリー
+## Gallery
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
-  <p>詳細に彫刻されたマストと帆を持つ木製の玩具船が、海の波を模倣した豪華な青いカーペットの上を滑らかに進んでいます。船体は濃い茶色に塗られ、小さな窓が付いています。カーペットは柔らかく、テクスチャーがあり、海洋の広がりを連想させる完璧な背景を提供します。船の周りにはさまざまな他の玩具や子供のアイテムがあり、遊び心のある環境を示唆しています。このシーンは、子供時代の無邪気さと想像力を捉えており、玩具船の旅は室内の幻想的な設定での無限の冒険を象徴しています。</p>
-</div>
+### CogVideoX-5B
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video>
+      </td>
+  </tr>
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video>
+      </td>
+  </tr>
+</table>
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="80%" controls autoplay></video>
-  <p>カメラは、黒いルーフラックを備えた白いビンテージSUVの後ろを追いかけ、急な山道をスピードアップして進みます。タイヤからほこりが舞い上がり、日光がSUVに当たり、暖かい輝きを放ちます。山道は緩やかに曲がり、他の車両は見当たりません。道の両側には赤杉の木が立ち並び、緑のパッチが点在しています。車は後ろから見て、険しい地形を楽々と進んでいるように見えます。山道自体は急な丘と山に囲まれ、上空には青い空と薄い雲が広がっています。</p>
-</div>
+### CogVideoX-2B 
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video>
+      </td>
+  </tr>
+</table>
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="80%" controls autoplay></video>
-  <p>色とりどりのバンダナを巻いた、擦り切れたデニムジャケットを着たストリートアーティストが、広大なコンクリートの壁の前に立ち、スプレーペイント缶を持ち、斑点のある壁にカラフルな鳥をスプレーペイントしています。</p>
-</div>
-
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="80%" controls autoplay></video>
-  <p>戦争で荒廃した都市の背景に、廃墟と崩れた壁が破壊の物語を語る中、若い少女の感動的なクローズアップがフレームに収められています。彼女の顔は灰で汚れており、周囲の混乱を静かに物語っています。彼女の目は悲しみと回復力の混じった輝きを放ち、紛争の荒廃によって無垢を失った世界の生の感情を捉えています。</p>
-</div>
+ギャラリーの対応するプロンプトワードを表示するには、[こちら](resources/galary_prompt.md)をクリックしてください
 
 ## モデル紹介
 
-CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源のオープンソース版ビデオ生成モデルです。
+CogVideoXは [清影](https://chatglm.cn/video?fr=osm_cogvideox) に由来するオープンソース版のビデオ生成モデルです。
+以下の表は、提供しているビデオ生成モデルに関する基本情報を示しています。
 
-以下の表は、現在提供しているビデオ生成モデルのリストと関連する基本情報を示しています:
+<table style="border-collapse: collapse; width: 100%;">
+  <tr>
+    <th style="text-align: center;">モデル名</th>
+    <th style="text-align: center;">CogVideoX-2B</th>
+    <th style="text-align: center;">CogVideoX-5B</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">モデル紹介</td>
+    <td style="text-align: center;">入門レベルのモデルで、互換性を重視しています。運用や二次開発のコストが低いです。</td>
+    <td style="text-align: center;">より高いビデオ生成品質と優れた視覚効果を提供する大型モデル。</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">推論精度</td>
+    <td style="text-align: center;"><b>FP16*(推奨)</b>, BF16, FP32, FP8*(E4M3, E5M2), INT8に対応、INT4はサポートされていません</td>
+    <td style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32, FP8*(E4M3, E5M2), INT8に対応、INT4はサポートされていません</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">単一GPUメモリ消費量<br></td>
+    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
+    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">マルチGPU推論メモリ消費量</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
+    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">推論速度<br>(Step = 50)</td>
+    <td style="text-align: center;">FP16: ~90* s</td>
+    <td style="text-align: center;">BF16: ~180* s</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">微調整精度</td>
+    <td style="text-align: center;"><b>FP16</b></td>
+    <td style="text-align: center;"><b>BF16</b></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">微調整メモリ消費量(各GPU)</td>
+    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
+    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">プロンプト言語</td>
+    <td colspan="2" style="text-align: center;">英語*</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">プロンプト長さ制限</td>
+    <td colspan="2" style="text-align: center;">226 トークン</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">ビデオ長さ</td>
+    <td colspan="2" style="text-align: center;">6 秒</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">フレームレート</td>
+    <td colspan="2" style="text-align: center;">8 フレーム/秒</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">ビデオ解像度</td>
+    <td colspan="2" style="text-align: center;">720 * 480、他の解像度はサポートされていません（微調整を含む）</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">位置エンコーディング</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_rope_pos_embed<br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">ダウンロードリンク (Diffusers モデル)</td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">ダウンロードリンク (SAT モデル)</td>
+    <td colspan="2" style="text-align: center;"><a href="./sat/README_ja.md">SAT</a></td>
+  </tr>
+</table>
 
-| モデル名                         | CogVideoX-2B                                                                                                                                                                                        | 
-|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| プロンプト言語                      | 英語                                                                                                                                                                                                  | 
-| 単一GPU推論 (FP16)               | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)   <br>  23.9GB using diffusers                                                                                                      | 
-| 複数GPU推論 (FP16)               | 20GB minimum per GPU using diffusers                                                                                                                                                                |
-| 微調整に必要なGPUメモリ(bs=1)          | 40GB                                                                                                                                                                                                |
-| プロンプトの最大長                    | 226 トークン                                                                                                                                                                                            |
-| ビデオの長さ                       | 6秒                                                                                                                                                                                                  | 
-| フレームレート                      | 8フレーム                                                                                                                                                                                               | 
-| 解像度                          | 720 * 480                                                                                                                                                                                           |
-| 量子化推論                        | サポートされていません                                                                                                                                                                                         |          
-| ダウンロードリンク (HF diffusers モデル) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B)   [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)   [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
-| ダウンロードリンク (SAT モデル)          | [SAT](./sat/README.md)                                                                                                                                                                              |
+**データ解説**
+
++ diffusers ライブラリを使用してテストを行った際に、`enable_model_cpu_offload()` オプションと `pipe.vae.enable_tiling()`
+  最適化が有効になっていました。このセットアップは **NVIDIA A100 / H100** 以外のデバイスでの実際のメモリ/VRAM
+  使用量についてはテストされていません。通常、このアプローチは **NVIDIA Ampere アーキテクチャ**
+  以上のすべてのデバイスに適しています。これらの最適化を無効にすると、メモリ使用量が大幅に増加し、表に示されている値の約3倍になります。
++ マルチGPU推論を行う際には、`enable_model_cpu_offload()` 最適化を無効にする必要があります。
++ INT8 モデルを使用すると推論速度は大幅に低下しますが、これはメモリの少ないGPUでも正常に推論できるようにし、ビデオ品質の損失を最小限に抑えるためです。
+
+推論速度テストも上記のメモリ最適化を使用して実施されました。メモリ最適化を使用しない場合、推論速度は約10％向上します。量子化をサポートしているのは `diffusers`
+バージョンのモデルのみです。
+
++ モデルは英語入力のみをサポートしており、他の言語は大規模な言語モデルを通じて英語に翻訳することで対応できます。
 
 ## 友好的リンク
 
@@ -132,15 +241,17 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の
   強力で包括的な分散推論フレームワークであり、ワンクリックで独自のモデルや最新のオープンソースモデルを簡単にデプロイできます。
 + [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSysは、使いやすく高性能なビデオ生成インフラを提供し、最新のモデルや技術を継続的に統合しています。
 
-
 ## プロジェクト構造
 
 このオープンソースリポジトリは、**CogVideoX** オープンソースモデルの基本的な使用方法と微調整の例を迅速に開始するためのガイドです。
 
 ### 推論
 
-+ [diffusers_demo](inference/cli_demo.py): 推論コードの詳細な説明が含まれており、一般的なパラメータの意味についても言及しています。
++ [cli_demo](inference/cli_demo.py): 推論コードの詳細な説明が含まれており、一般的なパラメータの意味についても言及しています。
++ [cli_demo_quantization](inference/cli_demo_quantization.py):
+  量子化モデル推論コードで、低メモリのデバイスでも実行可能です。また、このコードを変更して、FP8 精度の CogVideoX モデルの実行をサポートすることもできます。
 + [diffusers_vae_demo](inference/cli_vae_demo.py): VAE推論コードの実行には現在71GBのメモリが必要ですが、将来的には最適化される予定です。
++ [space demo](inference/gradio_composite_demo): Huggingface Spaceと同じGUIコードで、フレーム補間や超解像ツールが組み込まれています。
 + [convert_demo](inference/convert_demo.py):
   ユーザー入力をCogVideoXに適した形式に変換する方法。CogVideoXは長いキャプションでトレーニングされているため、入力テキストをLLMを使用してトレーニング分布と一致させる必要があります。デフォルトではGLM4を使用しますが、GPT、Geminiなどの他のLLMに置き換えることもできます。
 + [gradio_web_demo](inference/gradio_web_demo.py): CogVideoX-2B モデルを使用して動画を生成する方法を示す、シンプルな
@@ -148,7 +259,7 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の
 
 ```shell
 cd inference
-# For Linux and Windows users (and macOS with Intel??)
+# For Linux and Windows users
 python gradio_web_demo.py # humans mode
 
 # For macOS with Apple Silicon users, Intel not supported, this maybe 20x slower than RTX 4090
@@ -178,26 +289,6 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py # humans mode
 + [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): SATモデルのウェイトをHuggingfaceモデルのウェイトに変換します。
 + [caption_demo](tools/caption): キャプションツール、ビデオを理解し、テキストで出力するモデル。
 
-## プロジェクト計画
-
-- [x] CogVideoXモデルのオープンソース化
-    - [x] CogVideoXで使用される3D Causal VAEのオープンソース化
-    - [x] CogVideoXモデルの推論例 (CLI / Webデモ)
-    - [x] CogVideoXオンライン体験デモ (Huggingface Space)
-    - [x] CogVideoXオープンソースモデルAPIインターフェースの例 (Huggingface)
-    - [x] CogVideoXモデルの微調整例 (SAT)
-    - [ ] CogVideoXモデルの微調整例 (Huggingface / SAT)
-    - [ ] CogVideoX-Proのオープンソース化 (CogVideoX-2Bスイートに適応)
-    - [x] CogVideoX技術レポートの公開
-
-私たちはあなたの貢献を歓迎します。詳細については[こちら](resources/contribute.md)をクリックしてください。
-
-## モデルライセンス
-
-このリポジトリのコードは [Apache 2.0 ライセンス](LICENSE) の下で公開されています。
-
-モデルのウェイトと実装コードは [CogVideoX LICENSE](MODEL_LICENSE) の下で公開されています。
-
 ## CogVideo(ICLR'23)
 
 論文の公式リポジトリ: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
@@ -238,19 +329,28 @@ CogVideoのデモは [https://models.aminer.cn/cogvideo](https://models.aminer.c
 
 ## オープンソースプロジェクト計画
 
-- [x] CogVideoX モデルのオープンソース化
-    - [x] CogVideoX モデル推論サンプル (CLI / Web デモ)
-    - [x] CogVideoX オンライン体験サンプル (Huggingface Space)
-    - [x] CogVideoX オープンソースAPIインターフェースサンプル (Huggingface)
-    - [x] CogVideoX モデルの微調整サンプル (SAT)
-    - [ ] CogVideoX モデルの微調整サンプル (Huggingface / SAT)
-    - [ ] CogVideoX-Pro オープンソース化 (CogVideoX-2B スイートに対応)
-    - [X] CogVideoX 技術レポート公開
+- [x] CogVideoX モデルオープンソース化
+    - [x] CogVideoX モデル推論例 (CLI / Web デモ)
+    - [x] CogVideoX オンライン体験例 (Huggingface Space)
+    - [x] CogVideoX オープンソースモデルAPIインターフェース例 (Huggingface)
+    - [x] CogVideoX モデル微調整例 (SAT)
+    - [ ] CogVideoX モデル微調整例 (Huggingface Diffusers)
+    - [X] CogVideoX-5B オープンソース化 (CogVideoX-2B スイートに適応)
+    - [X] CogVideoX 技術報告公開
+    - [X] CogVideoX 技術解説ビデオ
+- [ ] CogVideoX 周辺ツール
+    - [X] 基本的なビデオ超解像 / フレーム補間スイート
+    - [ ] 推論フレームワーク適応
+    - [ ] ComfyUI 完全エコシステムツール
 
-私たちは皆さんの貢献を歓迎しています。詳しくは[こちら](resources/contribute_zh.md)をご覧ください。
+あなたの貢献をお待ちしています！詳細は[こちら](resources/contribute_zh.md)をクリックしてください。
 
-## モデルライセンス
+## ライセンス契約
 
-本リポジトリのコードは [Apache 2.0 ライセンス](LICENSE) の下で公開されています。
+このリポジトリのコードは [Apache 2.0 License](LICENSE) の下で公開されています。
 
-本モデルのウェイトと実装コードは [CogVideoX LICENSE](MODEL_LICENSE) ライセンスに基づいて公開されています。
\ No newline at end of file
+CogVideoX-2B モデル (対応するTransformersモジュールやVAEモジュールを含む) は
+[Apache 2.0 License](LICENSE) の下で公開されています。
+
+CogVideoX-5B モデル (Transformersモジュール) は
+[CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) の下で公開されています。
\ No newline at end of file
diff --git a/README_zh.md b/README_zh.md
index e448e37..336883d 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -9,7 +9,7 @@
 <img src=resources/logo.svg width="50%"/>
 </div>
 <p align="center">
-🤗 在 <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a> 体验视频生成模型
+在 <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> 或 <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a> 在线体验 CogVideoX-5B 模型
 </p>
 <p align="center">
 📚 查看 <a href="https://arxiv.org/abs/2408.06072" target="_blank">论文</a>
@@ -23,7 +23,10 @@
 
 ## 项目更新
 
-- 🔥🔥**News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) 已经支持对 CogVideoX
+- 🔥🔥 **News**: ```2024/8/27```:  我们开源 CogVideoX 系列更大的模型 **CogVideoX-5B**。同时 **CogVideoX-2B** 将修改为
+  **Apache 2.0 协议**。我们大幅度优化了模型的推理性能，推理门槛大幅降低，您可以在 `GTX 1080TI` 等早期显卡运行 **CogVideoX-2B**
+  ，在 `RTX 3060`等桌面端甜品卡运行 **CogVideoX-5B** 模型。
+- 🔥**News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) 已经支持对 CogVideoX
   生成的视频进行增强，实现更高分辨率，更高质量的视频渲染。欢迎大家按照[教程](tools/venhancer/README_zh.md)体验使用。
 - 🔥**News**: ```2024/8/15```: CogVideoX 依赖中`SwissArmyTransformer`依赖升级到`0.4.12`,
   微调不再需要从源代码安装`SwissArmyTransformer`。同时，`Tied VAE` 技术已经被应用到 `diffusers`
@@ -60,15 +63,14 @@
 
 ### 提示词优化
 
-在开始运行模型之前，请参考[这里](inference/convert_demo.py) 查看我们是怎么使用GLM-4(或者同级别的其他产品，例如GPT-4)
+在开始运行模型之前，请参考 [这里](inference/convert_demo.py) 查看我们是怎么使用GLM-4(或者同级别的其他产品，例如GPT-4)
 大模型对模型进行优化的，这很重要，
 由于模型是在长提示词下训练的，一个好的提示词直接影响了视频生成的质量。
 
 ### SAT
 
-查看sat文件夹下的[sat_demo](sat/README.md)：包含了 SAT 权重的推理代码和微调代码，推荐基于此代码进行 CogVideoX
+查看sat文件夹下的 [sat_demo](sat/README.md)：包含了 SAT 权重的推理代码和微调代码，推荐基于此代码进行 CogVideoX
 模型结构的改进，研究者使用该代码可以更好的进行快速的迭代和开发。
-(18 GB 推理, 40GB lora微调)
 
 ### Diffusers
 
@@ -76,49 +78,145 @@
 pip install -r requirements.txt
 ```
 
-查看[diffusers_demo](inference/cli_demo.py)：包含对推理代码更详细的解释，包括各种关键的参数。（24GB 推理，微调代码正在开发）
+查看[diffusers_demo](inference/cli_demo.py)：包含对推理代码更详细的解释，包括各种关键的参数。
 
-## CogVideoX-2B 视频作品
+## 视频作品
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
-  <p>A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.</p>
-</div>
+### CogVideoX-5B
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video>
+      </td>
+       <td>
+          <video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video>
+     </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video>
+     </td>
+  </tr>
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video>
+      </td>
+       <td>
+          <video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video>
+     </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video>
+     </td>
+  </tr>
+</table>
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="80%" controls autoplay></video>
-  <p>The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.</p>
-</div>
+### CogVideoX-2B 
+<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
+  <tr>
+      <td>
+          <video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video>
+      </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video>
+      </td>
+       <td>
+          <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video>
+     </td>
+      <td>
+          <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video>
+     </td>
+  </tr>
+</table>
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="80%" controls autoplay></video>
-  <p>A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.</p>
-</div>
 
-<div align="center">
-  <video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="80%" controls autoplay></video>
-  <p>In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.</p>
-</div>
+查看画廊的对应提示词，请点击[这里](resources/galary_prompt.md)
 
 ## 模型介绍
 
 CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源版本视频生成模型。
+下表展示我们提供的视频生成模型相关基础信息:
 
-下表展示目前我们提供的视频生成模型列表，以及相关基础信息:
+<table  style="border-collapse: collapse; width: 100%;">
+  <tr>
+    <th style="text-align: center;">模型名</th>
+    <th style="text-align: center;">CogVideoX-2B</th>
+    <th style="text-align: center;">CogVideoX-5B (本仓库)</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">模型介绍</td>
+    <td style="text-align: center;">入门级模型，兼顾兼容性。运行、二次开发成本低。</td>
+    <td style="text-align: center;">视频生成质量更高，视觉效果更好的更大尺寸模型。</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">推理精度</td>
+    <td style="text-align: center;"><b>FP16*(推荐)</b>, BF16, FP32，FP8*(E4M3，E5M2)，INT8，不支持INT4</td>
+    <td style="text-align: center;"><b>BF16(推荐)</b>, FP16, FP32，FP8*(E4M3，E5M2)，INT8，不支持INT4</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">单GPU显存消耗<br></td>
+    <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
+    <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">多GPU推理显存消耗</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
+    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">推理速度<br>(Step = 50)</td>
+    <td style="text-align: center;">FP16: ~90* s</td>
+    <td style="text-align: center;">BF16: ~180* s</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">微调精度</td>
+    <td style="text-align: center;"><b>FP16</b></td>
+    <td style="text-align: center;"><b>BF16</b></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">微调显存消耗(每卡)</td>
+    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
+    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">提示词语言</td>
+    <td colspan="2" style="text-align: center;">English*</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">提示词长度上限</td>
+    <td colspan="2" style="text-align: center;">226 Tokens</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">视频长度</td>
+    <td colspan="2" style="text-align: center;">6 秒</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">帧率</td>
+    <td colspan="2" style="text-align: center;">8 帧 / 秒 </td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">视频分辨率</td>
+    <td colspan="2" style="text-align: center;">720 * 480，不支持其他分辨率(含微调)</td>
+  </tr>
+    <tr>
+    <td style="text-align: center;">位置编码</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_rope_pos_embed<br></td>
+  </tr>
+</table>
 
-| 模型名                 | CogVideoX-2B                                                                                                                    | 
-|---------------------|---------------------------------------------------------------------------------------------------------------------------------|
-| 提示词语言               | English                                                                                                                         | 
-| 单GPU推理 (FP-16) 显存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)   <br>  23.9GB using diffusers                                  | 
-| 多GPU推理 (FP-16) 显存消耗 | 20GB minimum per GPU using diffusers                                                                                            |                                                                                                            
-| 微调显存消耗 (bs=1)       | 42GB                                                                                                                            |
-| 提示词长度上限             | 226 Tokens                                                                                                                      |
-| 视频长度                | 6 seconds                                                                                                                       | 
-| 帧率（每秒）              | 8 frames                                                                                                                        | 
-| 视频分辨率               | 720 * 480                                                                                                                       |
-| 量化推理                | 不支持                                                                                                                             |          
-| 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B)  [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
-| 下载地址 (SAT 模型)       | [SAT](./sat/README_zh.md)                                                                                                       |
+**数据解释**
+
++ 使用 diffusers 库进行测试时，启用了 `enable_model_cpu_offload()` 选项和 `pipe.vae.enable_tiling()` 优化，该方案未在
+  **NVIDIA A100 / H100** 以外的设备上测试实际显存 / 内存占用。通常，该方案可以适配于所有 **NVIDIA 安培架构**
+  以上的设备。若关闭优化，显存占用会成倍增加，峰值显存约为表格的3倍。
++ 多GPU推理时，需要关闭 `enable_model_cpu_offload()` 优化。
++ 使用 INT8 模型会导致推理速度大幅降低，此举是为了让显存较低的显卡也能正常推理，并尽量减少视频质量损失。
++ 推理速度测试同样采用了上述显存优化方案，不采用显存优化的情况下，推理速度提升约10%。 只有`diffusers`版本模型支持量化。
++ 模型仅支持英语输入，其他语言可以通过大模型润色时翻译为英语。
 
 ## 友情链接
 
@@ -133,16 +231,19 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
 
 ### inference
 
-+ [diffusers_demo](inference/cli_demo.py): 更详细的推理代码讲解，常见参数的意义，在这里都会提及。
-+ [diffusers_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码，目前需要71GB显存，将来会优化。
++ [cli_demo](inference/cli_demo.py): 更详细的推理代码讲解，常见参数的意义，在这里都会提及。
++ [cli_demo_quantization](inference/cli_demo_quantization.py):
+  量化模型推理代码，可以在显存较低的设备上运行，也可以基于此代码修改，以支持运行FP8等精度的CogVideoX模型。请注意，FP8 仅通过了测试，且必须从源代码安装 `torch-nightly`、`torchao`，不建议在生产环境中使用。
++ [diffusers_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码。
++ [space demo](inference/gradio_composite_demo): Huggingface Space同款的 GUI 代码，植入了插帧，超分工具。
 + [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合
   CogVideoX的长输入。因为CogVideoX是在长文本上训练的，所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4，也可以替换为GPT、Gemini等任意大语言模型。
-+ [gradio_web_demo](inference/gradio_web_demo.py): 一个简单的gradio网页应用，展示如何使用 CogVideoX-2B 模型生成视频。
-  与我们的 Huggingface Space 类似，你可以使用此脚本运行一个简单的网页应用，用于生成视频。
++ [gradio_web_demo](inference/gradio_web_demo.py): 一个简单的gradio网页应用，展示如何使用 CogVideoX-2B 模型生成视频。 与我们的
+  Huggingface Space 类似，你可以使用此脚本运行一个简单的网页应用，用于生成视频。
 
 ```shell
 cd inference
-# For Linux and Windows users (and macOS with Intel??)
+# For Linux and Windows users
 python gradio_web_demo.py # humans mode
 
 # For macOS with Apple Silicon users, Intel not supported, this maybe 20x slower than RTX 4090
@@ -216,9 +317,14 @@ CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.amine
     - [x] CogVideoX 在线体验示例 (Huggingface Space)
     - [x] CogVideoX 开源模型API接口示例 (Huggingface)
     - [x] CogVideoX 模型微调示例 (SAT)
-    - [ ] CogVideoX 模型微调示例 (Huggingface / SAT)
-    - [ ] CogVideoX-Pro 开源(适配 CogVideoX-2B 套件)
+    - [ ] CogVideoX 模型微调示例 (Huggingface Diffusers)
+    - [X] CogVideoX-5B 开源 (适配 CogVideoX-2B 套件)
     - [X] CogVideoX 技术报告公开
+    - [X] CogVideoX 技术讲解视频
+- [ ] CogVideoX 周边工具
+    - [X] 视频超分 / 插帧基础套件
+    - [ ] 推理框架适配
+    - [ ] ComfyUI 完整生态工具
 
 我们欢迎您的贡献，您可以点击[这里](resources/contribute_zh.md)查看更多信息。
 
@@ -226,4 +332,8 @@ CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.amine
 
 本仓库代码使用 [Apache 2.0 协议](LICENSE) 发布。
 
-本模型权重和模型实现代码根据 [CogVideoX LICENSE](MODEL_LICENSE) 许可证发布。
+CogVideoX-2B 模型 (包括其对应的Transformers模块，VAE模块) 根据 [Apache 2.0 协议](LICENSE) 许可证发布。
+
+CogVideoX-5B 模型 (Transformers 模块)
+根据 [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE)
+许可证发布。
\ No newline at end of file
diff --git a/inference/cli_demo.py b/inference/cli_demo.py
index f65c60d..73c6186 100644
--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -12,18 +12,18 @@ Run the script:
 
 import argparse
 import torch
-from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler
+from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler, CogVideoXDPMScheduler
 from diffusers.utils import export_to_video
 
 
 def generate_video(
-        prompt: str,
-        model_path: str,
-        output_path: str = "./output.mp4",
-        num_inference_steps: int = 50,
-        guidance_scale: float = 6.0,
-        num_videos_per_prompt: int = 1,
-        dtype: torch.dtype = torch.bfloat16,
+    prompt: str,
+    model_path: str,
+    output_path: str = "./output.mp4",
+    num_inference_steps: int = 50,
+    guidance_scale: float = 6.0,
+    num_videos_per_prompt: int = 1,
+    dtype: torch.dtype = torch.bfloat16,
 ):
     """
     Generates a video based on the given prompt and saves it to the specified path.
@@ -47,10 +47,12 @@ def generate_video(
 
     # 2. Set Scheduler.
     # Can be changed to `CogVideoXDPMScheduler` or `CogVideoXDDIMScheduler`.
-    # We recommend using `CogVideoXDDIMScheduler` for better results.
-    pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+    # We recommend using `CogVideoXDDIMScheduler` for CogVideoX-2B and `CogVideoXDPMScheduler` for CogVideoX-5B.
+    # pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
 
     # 3. Enable CPU offload for the model, enable tiling.
+    # Turn this off if you have multiple GPUs or enough GPU memory (e.g. an H100); inference will then take less time.
     pipe.enable_model_cpu_offload()
     pipe.vae.enable_tiling()
 
@@ -63,7 +65,8 @@ def generate_video(
         num_videos_per_prompt=num_videos_per_prompt,  # Number of videos to generate per prompt
         num_inference_steps=num_inference_steps,  # Number of inference steps
         num_frames=49,  # Number of frames to generate，changed to 49 for diffusers version `0.31.0` and after.
-        guidance_scale=guidance_scale,  # Guidance scale for classifier-free guidance
+        use_dynamic_cfg=True,  # This is used for the DPM scheduler; for the DDIM scheduler it should be False
+        guidance_scale=guidance_scale,  # Guidance scale for classifier-free guidance; can be set to 7 for the DPM scheduler
         generator=torch.Generator().manual_seed(42),  # Set the seed for reproducibility
     ).frames[0]
 
diff --git a/inference/cli_demo_quantization.py b/inference/cli_demo_quantization.py
index 7fb6b1a..d49d340 100644
--- a/inference/cli_demo_quantization.py
+++ b/inference/cli_demo_quantization.py
@@ -1,36 +1,56 @@
 """
-This script demonstrates how to generate a video from a text prompt using CogVideoX with 🤗Huggingface Diffusers Pipeline.
+This script demonstrates how to generate a video from a text prompt using CogVideoX with quantization.
 
 Note:
-    This script requires the `diffusers>=0.30.1` and `torchao>=0.4.0` library to be installed.
 
-Run the script:
-    $ python cli_demo.py --prompt "A girl ridding a bike." --model_path THUDM/CogVideoX-2b
+Install `torchao`, `torch`, `diffusers` and `accelerate` FROM SOURCE to use the quantization feature.
+FP8 quantization is only supported on NVIDIA GPUs such as the H100 or newer.
+
+All quantization schemes require an NVIDIA GPU.
+
+# Run the script:
+
+python cli_demo_quantization.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-2b --quantization_scheme fp8 --dtype float16
+python cli_demo_quantization.py --prompt "A girl riding a bike." --model_path THUDM/CogVideoX-5b --quantization_scheme fp8 --dtype bfloat16
 
-In this script, we have only provided the script for testing and inference in INT8 for the entire process
-(including T5 Encoder, CogVideoX Transformer, VAE).
-You can use other functionalities provided by torchao to convert to other precisions.
-Please note that INT4 is not supported.
 """
-import argparse
 
+import argparse
+import os
 import torch
-from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
+import torch._dynamo
+from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline, CogVideoXDPMScheduler
 from diffusers.utils import export_to_video
 from transformers import T5EncoderModel
-
-# Make sure to install torchao>=0.4.0
 from torchao.quantization import quantize_, int8_weight_only
+from torchao.float8.inference import ActivationCasting, QuantConfig, quantize_to_float8
+
+os.environ["TORCH_LOGS"] = "+dynamo,output_code,graph_breaks,recompiles"
+torch._dynamo.config.suppress_errors = True
+torch.set_float32_matmul_precision("high")
+torch._inductor.config.conv_1x1_as_mm = True
+torch._inductor.config.coordinate_descent_tuning = True
+torch._inductor.config.epilogue_fusion = False
+torch._inductor.config.coordinate_descent_check_all_directions = True
+
+
+def quantize_model(part, quantization_scheme):
+    if quantization_scheme == "int8":
+        quantize_(part, int8_weight_only())
+    elif quantization_scheme == "fp8":
+        quantize_to_float8(part, QuantConfig(ActivationCasting.DYNAMIC))
+    return part
 
 
 def generate_video(
-        prompt: str,
-        model_path: str,
-        output_path: str = "./output.mp4",
-        num_inference_steps: int = 50,
-        guidance_scale: float = 6.0,
-        num_videos_per_prompt: int = 1,
-        dtype: torch.dtype = torch.bfloat16,
+    prompt: str,
+    model_path: str,
+    output_path: str = "./output.mp4",
+    num_inference_steps: int = 50,
+    guidance_scale: float = 6.0,
+    num_videos_per_prompt: int = 1,
+    quantization_scheme: str = "fp8",
+    dtype: torch.dtype = torch.bfloat16,
 ):
     """
     Generates a video based on the given prompt and saves it to the specified path.
@@ -42,24 +62,28 @@ def generate_video(
     - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
     - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
     - num_videos_per_prompt (int): Number of videos to generate per prompt.
+    - quantization_scheme (str): The quantization scheme to use ('int8', 'fp8').
     - dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
-
     """
 
     text_encoder = T5EncoderModel.from_pretrained(model_path, subfolder="text_encoder", torch_dtype=dtype)
-    quantize_(text_encoder, int8_weight_only())
-    transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer",
-                                                              torch_dtype=dtype)
-    quantize_(transformer, int8_weight_only())
+    text_encoder = quantize_model(part=text_encoder, quantization_scheme=quantization_scheme)
+    transformer = CogVideoXTransformer3DModel.from_pretrained(model_path, subfolder="transformer", torch_dtype=dtype)
+    transformer = quantize_model(part=transformer, quantization_scheme=quantization_scheme)
     vae = AutoencoderKLCogVideoX.from_pretrained(model_path, subfolder="vae", torch_dtype=dtype)
-    quantize_(vae, int8_weight_only())
+    vae = quantize_model(part=vae, quantization_scheme=quantization_scheme)
     pipe = CogVideoXPipeline.from_pretrained(
         model_path,
         text_encoder=text_encoder,
         transformer=transformer,
         vae=vae,
         torch_dtype=dtype,
-    )
+    ).to("cuda")
+    pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
+
+    # Compiling the model makes inference faster, but the first run takes ~30 min to compile.
+    # pipe.transformer.to(memory_format=torch.channels_last)
+    # For FP8, remove the pipe.enable_model_cpu_offload() call below.
     pipe.enable_model_cpu_offload()
     pipe.vae.enable_tiling()
     video = pipe(
@@ -67,8 +91,9 @@ def generate_video(
         num_videos_per_prompt=num_videos_per_prompt,
         num_inference_steps=num_inference_steps,
         num_frames=49,
+        use_dynamic_cfg=True,  # This is used for the DPM scheduler; for the DDIM scheduler it should be False
         guidance_scale=guidance_scale,
-        generator=torch.Generator().manual_seed(42),
+        generator=torch.Generator(device="cuda").manual_seed(42),
     ).frames[0]
 
     export_to_video(video, output_path, fps=8)
@@ -89,7 +114,14 @@ if __name__ == "__main__":
     parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
     parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
     parser.add_argument(
-        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16' or 'bfloat16')"
+        "--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16', 'bfloat16')"
+    )
+    parser.add_argument(
+        "--quantization_scheme",
+        type=str,
+        default="bf16",
+        choices=["int8", "fp8"],
+        help="The quantization scheme to use (int8, fp8); the default 'bf16' leaves the model unquantized",
     )
 
     args = parser.parse_args()
@@ -101,5 +133,6 @@ if __name__ == "__main__":
         num_inference_steps=args.num_inference_steps,
         guidance_scale=args.guidance_scale,
         num_videos_per_prompt=args.num_videos_per_prompt,
+        quantization_scheme=args.quantization_scheme,
         dtype=dtype,
     )
diff --git a/inference/gradio_composite_demo/app.py b/inference/gradio_composite_demo/app.py
index 4a4f59a..b3114aa 100644
--- a/inference/gradio_composite_demo/app.py
+++ b/inference/gradio_composite_demo/app.py
@@ -1,3 +1,11 @@
+"""
+This is the main file for the Gradio web demo. It uses the CogVideoX-5B model to generate videos.
+Set the environment variable OPENAI_API_KEY to use the OpenAI API to enhance the prompt.
+
+Usage:
+    OPENAI_API_KEY=your_openai_api_key OPENAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_composite_demo/app.py
+"""
+
 import math
 import os
 import random
@@ -6,7 +14,7 @@ import time
 
 import gradio as gr
 import torch
-from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler,CogVideoXDPMScheduler
+from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler, CogVideoXDPMScheduler
 from datetime import datetime, timedelta
 
 from diffusers.image_processor import VaeImageProcessor
@@ -98,14 +106,14 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
 
 
 def infer(
-        prompt: str,
-        num_inference_steps: int,
-        guidance_scale: float,
-        seed: int = -1,
-        #progress=gr.Progress(track_tqdm=True),
+    prompt: str,
+    num_inference_steps: int,
+    guidance_scale: float,
+    seed: int = -1,
+    progress=gr.Progress(track_tqdm=True),
 ):
     if seed == -1:
-        seed = random.randint(0, 2 ** 8 - 1)
+        seed = random.randint(0, 2**8 - 1)
     video_pt = pipe(
         prompt=prompt,
         num_videos_per_prompt=1,
@@ -172,10 +180,6 @@ with gr.Blocks() as demo:
                 )
                 enhance_button = gr.Button("✨ Enhance Prompt(Optional)")
 
-            gr.Markdown(
-                "<span style='color:red; font-weight:bold;'>For the CogVideoX-5B model, 50 steps will take approximately 120 seconds.</span>"
-            )
-
             with gr.Group():
                 with gr.Column():
                     with gr.Row():
@@ -262,20 +266,13 @@ with gr.Blocks() as demo:
     </table>
         """)
 
-
-    def generate(prompt,
-                 seed_value,
-                 scale_status,
-                 rife_status, 
-                 progress=gr.Progress(track_tqdm=True)
-                ):
-
+    def generate(prompt, seed_value, scale_status, rife_status, progress=gr.Progress(track_tqdm=True)):
         latents, seed = infer(
             prompt,
             num_inference_steps=50,  # NOT Changed
             guidance_scale=7.0,  # NOT Changed
             seed=seed_value,
-            #progress=progress,
+            # progress=progress,
         )
         if scale_status:
             latents = utils.upscale_batch_and_concatenate(upscale_model, latents, device)
@@ -300,11 +297,9 @@ with gr.Blocks() as demo:
 
         return video_path, video_update, gif_update, seed_update
 
-
     def enhance_prompt_func(prompt):
         return convert_prompt(prompt, retry_times=1)
 
-
     generate_button.click(
         generate,
         inputs=[prompt, seed_param, enable_scale, enable_rife],
diff --git a/inference/gradio_web_demo.py b/inference/gradio_web_demo.py
index 25075da..8204a8f 100644
--- a/inference/gradio_web_demo.py
+++ b/inference/gradio_web_demo.py
@@ -1,3 +1,11 @@
+"""
+This is the main file for the Gradio web demo. It uses the CogVideoX-2B model to generate videos.
+Set the environment variable OPENAI_API_KEY to use the OpenAI API to enhance the prompt.
+
+Usage:
+    OPENAI_API_KEY=your_openai_api_key OPENAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_web_demo.py
+"""
+
 import os
 import threading
 import time
@@ -151,14 +159,17 @@ with gr.Blocks() as demo:
 
             with gr.Row():
                 gr.Markdown(
-                    "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one.")
+                    "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one."
+                )
                 enhance_button = gr.Button("✨ Enhance Prompt(Optional)")
 
             with gr.Column():
-                gr.Markdown("**Optional Parameters** (default values are recommended)<br>"
-                            "Increasing the number of inference steps will produce more detailed videos, but it will slow down the process.<br>"
-                            "50 steps are recommended for most cases.<br>"
-                            "For the 5B model, 50 steps will take approximately 350 seconds.")
+                gr.Markdown(
+                    "**Optional Parameters** (default values are recommended)<br>"
+                    "Increasing the number of inference steps will produce more detailed videos, but it will slow down the process.<br>"
+                    "50 steps are recommended for most cases.<br>"
+                    "For the 5B model, 50 steps will take approximately 350 seconds."
+                )
                 with gr.Row():
                     num_inference_steps = gr.Number(label="Inference Steps", value=50)
                     guidance_scale = gr.Number(label="Guidance Scale", value=6.0)
@@ -206,7 +217,6 @@ with gr.Blocks() as demo:
     </table>
     """)
 
-
     def generate(prompt, num_inference_steps, guidance_scale, model_choice, progress=gr.Progress(track_tqdm=True)):
         tensor = infer(prompt, num_inference_steps, guidance_scale, progress=progress)
         video_path = save_video(tensor)
@@ -216,22 +226,16 @@ with gr.Blocks() as demo:
 
         return video_path, video_update, gif_update
 
-
     def enhance_prompt_func(prompt):
         return convert_prompt(prompt, retry_times=1)
 
-
     generate_button.click(
         generate,
         inputs=[prompt, num_inference_steps, guidance_scale],
-        outputs=[video_output, download_video_button, download_gif_button]
+        outputs=[video_output, download_video_button, download_gif_button],
     )
 
-    enhance_button.click(
-        enhance_prompt_func,
-        inputs=[prompt],
-        outputs=[prompt]
-    )
+    enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt])
 
 if __name__ == "__main__":
     demo.launch()
diff --git a/sat/README.md b/sat/README.md
index e70b488..df49e30 100644
--- a/sat/README.md
+++ b/sat/README.md
@@ -19,8 +19,9 @@ pip install -r requirements.txt
 
 ### 2. Download the model weights
 
-First, go to the SAT mirror to download the dependencies.
+### 2. Download model weights
 
+First, go to the SAT mirror to download the model weights. For the CogVideoX-2B model, please download as follows:
 ```shell
 mkdir CogVideoX-2b-sat
 cd CogVideoX-2b-sat
@@ -31,13 +32,21 @@ wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
 mv 'index.html?dl=1' transformer.zip
 unzip transformer.zip
 ```
-
-Then unzip, the model structure should look like this:
+For the CogVideoX-5B model, please download as follows (VAE files are the same):
+```shell
+mkdir CogVideoX-5b-sat
+cd CogVideoX-5b-sat
+wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
+mv 'index.html?dl=1' vae.zip
+unzip vae.zip
+```
+Then go to [Tsinghua Cloud Disk](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) to download the model and unzip it.
+Once the files are in place, the complete structure for both models should look like this:
 
 ```
 .
 ├── transformer
-│   ├── 1000
+│   ├── 1000 (or 1)
 │   │   └── mp_rank_00_model_states.pt
 │   └── latest
 └── vae
@@ -71,8 +80,6 @@ loading it into Deepspeed in Finetune.
 0 directories, 8 files
 ```
 
-Here is the English translation of the provided text:
-
 ### 3. Modify the file in `configs/cogvideox_2b.yaml`.
 
 ```yaml
diff --git a/sat/README_ja.md b/sat/README_ja.md
index 3867cfa..0e2ce34 100644
--- a/sat/README_ja.md
+++ b/sat/README_ja.md
@@ -19,7 +19,7 @@ pip install -r requirements.txt
 
 ### 2. モデルウェイトをダウンロードします
 
-まず、SAT ミラーにアクセスして依存関係をダウンロードします。
+まず、SAT ミラーに移動してモデルの重みをダウンロードします。 CogVideoX-2B モデルの場合は、次のようにダウンロードしてください。
 
 ```shell
 mkdir CogVideoX-2b-sat
@@ -32,12 +32,23 @@ mv 'index.html?dl=1' transformer.zip
 unzip transformer.zip
 ```
 
-次に解凍し、モデル構造は次のようになります：
+CogVideoX-5B モデルの場合は、次のようにダウンロードしてください (VAE ファイルは同じです)。
+
+```shell
+mkdir CogVideoX-5b-sat
+cd CogVideoX-5b-sat
+wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
+mv 'index.html?dl=1' vae.zip
+unzip vae.zip
+```
+
+次に、[Tsinghua Cloud Disk](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) に移動してモデルをダウンロードし、解凍する必要があります。
+整理すると、2 つのモデルの完全なモデル構造は次のようになります：
 
 ```
 .
 ├── transformer
-│   ├── 1000
+│   ├── 1000 (or 1)
 │   │   └── mp_rank_00_model_states.pt
 │   └── latest
 └── vae
diff --git a/sat/README_zh.md b/sat/README_zh.md
index 6fc1c16..a5df855 100644
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@@ -18,8 +18,7 @@ pip install -r requirements.txt
 
 ### 2. 下载模型权重
 
-首先，前往 SAT 镜像下载依赖。
-
+首先，前往 SAT 镜像下载模型权重。对于 CogVideoX-2B 模型，请按照如下方式下载:
 ```shell
 mkdir CogVideoX-2b-sat
 cd CogVideoX-2b-sat
@@ -30,13 +29,21 @@ wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
 mv 'index.html?dl=1' transformer.zip
 unzip transformer.zip
 ```
-
-然后，解压文件，模型结构应该如下
+对于 CogVideoX-5B 模型，请按照如下方式下载(VAE文件相同):
+```shell
+mkdir CogVideoX-5b-sat
+cd CogVideoX-5b-sat
+wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
+mv 'index.html?dl=1' vae.zip
+unzip vae.zip
+```
+然后，您需要前往[清华云盘](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)下载我们的模型，并进行解压。
+整理之后， 两个模型的完整模型结构应该如下:
 
 ```
 .
 ├── transformer
-│   ├── 1000
+│   ├── 1000 (or 1)
 │   │   └── mp_rank_00_model_states.pt
 │   └── latest
 └── vae

From 46703ef7a8b8f9597244028940b967d30b2d6ba9 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Tue, 27 Aug 2024 16:32:05 +0800
Subject: [PATCH 2/4] user guide

---
 README.md    | 8 +++++---
 README_ja.md | 4 ++--
 README_zh.md | 4 ++--
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 0fb23ee..8833c61 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@
 Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> or <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a>
 </p>
 <p align="center">
-📚 Check here to view <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
+📚 View the <a href="https://arxiv.org/abs/2408.06072" target="_blank">paper</a> and <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh" target="_blank">user guide</a>
 </p>
 <p align="center">
     👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a> 
@@ -100,6 +100,7 @@ significance of common parameters.
 ## Gallery
 
 ### CogVideoX-5B
+
 <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
   <tr>
       <td>
@@ -131,7 +132,8 @@ significance of common parameters.
   </tr>
 </table>
 
-### CogVideoX-2B 
+### CogVideoX-2B
+
 <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
   <tr>
       <td>
@@ -274,7 +276,7 @@ of the **CogVideoX** open-source model.
   interpolation and super-resolution tools integrated.
 + [convert_demo](inference/convert_demo.py): How to convert user input into long-form input suitable for CogVideoX.
   Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data
-  using an LLM. The script defaults to using GLM4, but it can be replaced with GPT, Gemini, or any other large language
+  using an LLM. The script defaults to using GLM-4, but it can be replaced with GPT, Gemini, or any other large language
   model.
 + [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web application demonstrating how to use the
   CogVideoX-2B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple web
diff --git a/README_ja.md b/README_ja.md
index 1565de0..aea1e2a 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -11,7 +11,7 @@
 <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> または <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a> で CogVideoX-5B モデルをオンラインで体験してください
 </p>
 <p align="center">
-📚 <a href="https://arxiv.org/abs/2408.06072" target="_blank">論文</a> をチェック
+📚 <a href="https://arxiv.org/abs/2408.06072" target="_blank">論文</a>と<a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh" target="_blank">使用ドキュメント</a>を表示します。
 </p>
 <p align="center">
     👋 <a href="resources/WECHAT.md" target="_blank">WeChat</a> と <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a> に参加
@@ -253,7 +253,7 @@ CogVideoXは [清影](https://chatglm.cn/video?fr=osm_cogvideox) に由来する
 + [diffusers_vae_demo](inference/cli_vae_demo.py): VAE推論コードの実行には現在71GBのメモリが必要ですが、将来的には最適化される予定です。
 + [space demo](inference/gradio_composite_demo): Huggingface Spaceと同じGUIコードで、フレーム補間や超解像ツールが組み込まれています。
 + [convert_demo](inference/convert_demo.py):
-  ユーザー入力をCogVideoXに適した形式に変換する方法。CogVideoXは長いキャプションでトレーニングされているため、入力テキストをLLMを使用してトレーニング分布と一致させる必要があります。デフォルトではGLM4を使用しますが、GPT、Geminiなどの他のLLMに置き換えることもできます。
+  ユーザー入力をCogVideoXに適した形式に変換する方法。CogVideoXは長いキャプションでトレーニングされているため、入力テキストをLLMを使用してトレーニング分布と一致させる必要があります。デフォルトではGLM-4を使用しますが、GPT、Geminiなどの他のLLMに置き換えることもできます。
 + [gradio_web_demo](inference/gradio_web_demo.py): CogVideoX-2B モデルを使用して動画を生成する方法を示す、シンプルな
   Gradio Web UI デモです。私たちの Huggingface Space と同様に、このスクリプトを使用して Web デモを起動することができます。
 
diff --git a/README_zh.md b/README_zh.md
index 336883d..6814f9f 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -12,7 +12,7 @@
 在 <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> 🤗 Huggingface Space</a> 或 <a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> 🤖 ModelScope Space</a> 在线体验 CogVideoX-5B 模型
 </p>
 <p align="center">
-📚 查看 <a href="https://arxiv.org/abs/2408.06072" target="_blank">论文</a>
+📚 查看 <a href="https://arxiv.org/abs/2408.06072" target="_blank">论文</a> 和 <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh" target="_blank">使用文档</a>
 </p>
 <p align="center">
     👋 加入我们的 <a href="resources/WECHAT.md" target="_blank">微信</a> 和  <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a> 
@@ -237,7 +237,7 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
 + [diffusers_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码。
 + [space demo](inference/gradio_composite_demo): Huggingface Space同款的 GUI 代码，植入了插帧，超分工具。
 + [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合
-  CogVideoX的长输入。因为CogVideoX是在长文本上训练的，所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4，也可以替换为GPT、Gemini等任意大语言模型。
+  CogVideoX的长输入。因为CogVideoX是在长文本上训练的，所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM-4，也可以替换为GPT、Gemini等任意大语言模型。
 + [gradio_web_demo](inference/gradio_web_demo.py): 一个简单的gradio网页应用，展示如何使用 CogVideoX-2B 模型生成视频。 与我们的
   Huggingface Space 类似，你可以使用此脚本运行一个简单的网页应用，用于生成视频。
 

From f04768975941c25ea4570e658aaa7c0d73a7aedd Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Tue, 27 Aug 2024 19:12:17 +0800
Subject: [PATCH 3/4] update final data and link

---
 README.md    | 64 ++++++++++++++++++++++++-------------------------
 README_ja.md | 68 +++++++++++++++++++++++++---------------------------
 README_zh.md | 11 +++++++++
 3 files changed, 74 insertions(+), 69 deletions(-)
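
Note for readers of the updated table in this patch: the single-GPU diffusers VRAM figures and the recommended precisions can be read as a simple lookup. The sketch below is illustrative only. The name `pick_precision` and the dictionary layout are assumptions. The numbers are copied from the README table, which was measured with `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` enabled on NVIDIA A100/H100.

```python
from typing import Optional

# Single-GPU VRAM (GB) with diffusers, copied from the README table in this patch.
# Order matters: the training precision (recommended for inference) is listed first.
VRAM_GB = {
    "CogVideoX-2B": [("FP16", 12.5), ("INT8", 7.8)],
    "CogVideoX-5B": [("BF16", 20.7), ("INT8", 11.4)],
}


def pick_precision(model: str, available_gb: float) -> Optional[str]:
    """Return the first listed precision whose table figure fits, else None."""
    for precision, needed_gb in VRAM_GB[model]:
        if available_gb >= needed_gb:
            return precision
    return None
```

For example, `pick_precision("CogVideoX-5B", 16)` falls back to `"INT8"`, which the README describes as noticeably slower. The figures are the README's measurements rather than guarantees for other hardware.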

diff --git a/README.md b/README.md
index 8833c61..5189e2b 100644
--- a/README.md
+++ b/README.md
@@ -155,11 +155,13 @@ To view the corresponding prompt words for the gallery, please click [here](reso
 
 ## Model Introduction
 
-<table  style="border-collapse: collapse; width: 100%;">
+CogVideoX is an open-source version of the video generation model originating from [QingYing](https://chatglm.cn/video?fr=osm_cogvideo). The table below lists the video generation models we currently offer, along with their basic information.
+
+<table style="border-collapse: collapse; width: 100%;">
   <tr>
     <th style="text-align: center;">Model Name</th>
     <th style="text-align: center;">CogVideoX-2B</th>
-    <th style="text-align: center;">CogVideoX-5B</th>
+    <th style="text-align: center;">CogVideoX-5B (This Repository)</th>
   </tr>
   <tr>
     <td style="text-align: center;">Model Description</td>
@@ -168,33 +170,33 @@ To view the corresponding prompt words for the gallery, please click [here](reso
   </tr>
   <tr>
     <td style="text-align: center;">Inference Precision</td>
-    <td style="text-align: center;"><b>FP16*(Recommended)</b>, BF16, FP32, FP8*(E4M3, E5M2), INT8, INT4 not supported</td>
-    <td style="text-align: center;"><b>BF16(Recommended)</b>, FP16, FP32, FP8*(E4M3, E5M2), INT8, INT4 not supported</td>
+    <td style="text-align: center;"><b>FP16* (Recommended)</b>, BF16, FP32, FP8*, INT8, no support for INT4</td>
+    <td style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, no support for INT4</td>
   </tr>
   <tr>
-    <td style="text-align: center;">Single GPU Memory Consumption<br></td>
+    <td style="text-align: center;">Single GPU VRAM Consumption</td>
     <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
     <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
   </tr>
   <tr>
-    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
-    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
-    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
+    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
   </tr>
   <tr>
-    <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
-    <td style="text-align: center;">FP16: ~90* s</td>
-    <td style="text-align: center;">BF16: ~180* s</td>
+    <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
+    <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
+    <td style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
   </tr>
   <tr>
-    <td style="text-align: center;">Fine-Tuning Precision</td>
+    <td style="text-align: center;">Fine-tuning Precision</td>
     <td style="text-align: center;"><b>FP16</b></td>
     <td style="text-align: center;"><b>BF16</b></td>
   </tr>
   <tr>
-    <td style="text-align: center;">Fine-Tuning Memory Consumption (per GPU)</td>
+    <td style="text-align: center;">Fine-tuning VRAM Consumption (per GPU)</td>
     <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)</td>
   </tr>
   <tr>
     <td style="text-align: center;">Prompt Language</td>
@@ -206,46 +208,42 @@ To view the corresponding prompt words for the gallery, please click [here](reso
   </tr>
   <tr>
     <td style="text-align: center;">Video Length</td>
-    <td colspan="2" style="text-align: center;">6 seconds</td>
+    <td colspan="2" style="text-align: center;">6 Seconds</td>
   </tr>
   <tr>
     <td style="text-align: center;">Frame Rate</td>
-    <td colspan="2" style="text-align: center;">8 frames per second</td>
+    <td colspan="2" style="text-align: center;">8 Frames per Second</td>
   </tr>
   <tr>
     <td style="text-align: center;">Video Resolution</td>
-    <td colspan="2" style="text-align: center;">720 * 480, other resolutions not supported (including fine-tuning)</td>
+    <td colspan="2" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
   </tr>
   <tr>
     <td style="text-align: center;">Positional Encoding</td>
     <td style="text-align: center;">3d_sincos_pos_embed</td>
-    <td style="text-align: center;">3d_rope_pos_embed<br></td>
+    <td style="text-align: center;">3d_rope_pos_embed</td>
   </tr>
   <tr>
-    <td style="text-align: center;">Download Links (Diffusers Model)</td>
+    <td style="text-align: center;">Download Page (Diffusers)</td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
   </tr>
   <tr>
-    <td style="text-align: center;">Download Links (SAT Model)</td>
-    <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+    <td style="text-align: center;">Download Page (SAT)</td>
+    <td colspan="2" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
   </tr>
 </table>
 
 **Data Explanation**
 
-+ When testing with the diffusers library, the `enable_model_cpu_offload()` option and `pipe.vae.enable_tiling()`
-  optimization were enabled. This setup has not been tested for actual memory/VRAM usage on devices other than **NVIDIA
-  A100 / H100**. Generally, this approach should be compatible with all devices using the **NVIDIA Ampere architecture**
-  and above. If these optimizations are disabled, memory usage will increase significantly, with peak VRAM usage
-  approximately three times higher than the values shown in the table.
-+ When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
-+ Using the INT8 model will result in slower inference speeds. This is done to ensure that inference can be performed on
-  GPUs with lower memory without significant video quality loss, albeit with a notable reduction in speed.
-+ Inference speed tests were also conducted with the above memory optimizations. Without memory optimization, inference
-  speed increases by approximately 10%. Only the `diffusers` version of the model supports quantization.
-+ The model only supports English input; other languages can be translated into English when refined through large
-  language models.
+- When testing with the diffusers library, the `enable_model_cpu_offload()` option and the `pipe.vae.enable_tiling()` optimization were enabled. This setup has not been tested for actual VRAM/memory usage on devices other than **NVIDIA A100/H100**. Generally, it should work on all devices with the **NVIDIA Ampere architecture** and above. If these optimizations are disabled, VRAM usage increases significantly, with peak VRAM approximately 3 times the values in the table.
+- When performing multi-GPU inference, the `enable_model_cpu_offload()` optimization must be disabled.
+- Using the INT8 model significantly reduces inference speed. This trade-off lets GPUs with less VRAM run inference with minimal loss of video quality.
+- The 2B model is trained using `FP16` precision, while the 5B model is trained using `BF16` precision. We recommend running inference at the precision used in training.
+- `FP8` precision requires `NVIDIA H100` or newer devices and a source installation of the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages. `CUDA 12.4` is recommended.
+- Inference speed was also measured with the VRAM optimizations above enabled. Without them, inference is about 10% faster. Only the `diffusers` version of the model supports quantization.
+- The model only supports English input; prompts in other languages can be translated into English when they are refined by a large language model.
+
 
 ## Friendly Links
 
diff --git a/README_ja.md b/README_ja.md
index aea1e2a..9072e43 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -139,34 +139,34 @@ pip install -r requirements.txt
 
 ## モデル紹介
 
-CogVideoXは [清影](https://chatglm.cn/video?fr=osm_cogvideox) に由来するオープンソース版のビデオ生成モデルです。
-以下の表は、提供しているビデオ生成モデルに関する基本情報を示しています。
+CogVideoXは[清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源のオープンソース版動画生成モデルです。
+以下の表は、提供されている動画生成モデルに関する基本情報を示しています。
 
 <table style="border-collapse: collapse; width: 100%;">
   <tr>
     <th style="text-align: center;">モデル名</th>
     <th style="text-align: center;">CogVideoX-2B</th>
-    <th style="text-align: center;">CogVideoX-5B</th>
+    <th style="text-align: center;">CogVideoX-5B (本リポジトリ)</th>
   </tr>
   <tr>
     <td style="text-align: center;">モデル紹介</td>
-    <td style="text-align: center;">入門レベルのモデルで、互換性を重視しています。運用や二次開発のコストが低いです。</td>
-    <td style="text-align: center;">より高いビデオ生成品質と優れた視覚効果を提供する大型モデル。</td>
+    <td style="text-align: center;">入門モデルで、互換性を重視。運用および二次開発のコストが低い。</td>
+    <td style="text-align: center;">動画生成品質が高く、視覚効果がより優れた大型モデル。</td>
   </tr>
   <tr>
     <td style="text-align: center;">推論精度</td>
-    <td style="text-align: center;"><b>FP16*(推奨)</b>, BF16, FP32, FP8(E4M3, E5M2), INT8, INT4はサポートされていません</td>
-    <td style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32, FP8(E4M3, E5M2), INT8, INT4はサポートされていません</td>
+    <td style="text-align: center;"><b>FP16*(推奨)</b>, BF16, FP32, FP8*(E4M3, E5M2), INT8, INT4は非対応</td>
+    <td style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32, FP8*(E4M3, E5M2), INT8, INT4は非対応</td>
   </tr>
   <tr>
-    <td style="text-align: center;">単一GPUメモリ消費量<br></td>
+    <td style="text-align: center;">単一GPUのメモリ消費量</td>
     <td style="text-align: center;">FP16: 18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>12.5GB* using diffusers</b><br><b>INT8: 7.8GB* using diffusers</b></td>
     <td style="text-align: center;">BF16: 26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> / <b>20.7GB* using diffusers</b><br><b>INT8: 11.4GB* using diffusers</b></td>
   </tr>
   <tr>
-    <td style="text-align: center;">マルチGPU推論メモリ消費量</td>
-    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
-    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td style="text-align: center;">複数GPUの推論メモリ消費量</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
+    <td style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
   </tr>
   <tr>
     <td style="text-align: center;">推論速度<br>(Step = 50)</td>
@@ -179,59 +179,55 @@ CogVideoXは [清影](https://chatglm.cn/video?fr=osm_cogvideox) に由来する
     <td style="text-align: center;"><b>BF16</b></td>
   </tr>
   <tr>
-    <td style="text-align: center;">微調整メモリ消費量(各GPU)</td>
+    <td style="text-align: center;">微調整時のメモリ消費量 (1GPUあたり)</td>
     <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
+    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)</td>
   </tr>
   <tr>
     <td style="text-align: center;">プロンプト言語</td>
     <td colspan="2" style="text-align: center;">英語*</td>
   </tr>
   <tr>
-    <td style="text-align: center;">プロンプト長さ制限</td>
-    <td colspan="2" style="text-align: center;">226 トークン</td>
+    <td style="text-align: center;">プロンプトの長さ上限</td>
+    <td colspan="2" style="text-align: center;">226トークン</td>
   </tr>
   <tr>
-    <td style="text-align: center;">ビデオ長さ</td>
-    <td colspan="2" style="text-align: center;">6 秒</td>
+    <td style="text-align: center;">動画の長さ</td>
+    <td colspan="2" style="text-align: center;">6秒</td>
   </tr>
   <tr>
     <td style="text-align: center;">フレームレート</td>
-    <td colspan="2" style="text-align: center;">8 フレーム/秒</td>
+    <td colspan="2" style="text-align: center;">8フレーム/秒</td>
   </tr>
   <tr>
-    <td style="text-align: center;">ビデオ解像度</td>
-    <td colspan="2" style="text-align: center;">720 * 480、他の解像度はサポートされていません（微調整を含む）</td>
+    <td style="text-align: center;">動画の解像度</td>
+    <td colspan="2" style="text-align: center;">720 * 480、他の解像度はサポートされていません（微調整も含む）</td>
   </tr>
-    <tr>
-    <td style="text-align: center;">位置エンコーディング</td>
+  <tr>
+    <td style="text-align: center;">位置エンコード</td>
     <td style="text-align: center;">3d_sincos_pos_embed</td>
-    <td style="text-align: center;">3d_rope_pos_embed<br></td>
+    <td style="text-align: center;">3d_rope_pos_embed</td>
   </tr>
   <tr>
-    <td style="text-align: center;">ダウンロードリンク (Diffusers モデル)</td>
+    <td style="text-align: center;">ダウンロードリンク (Diffusers)</td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
   </tr>
   <tr>
-    <td style="text-align: center;">ダウンロードリンク (SAT モデル)</td>
+    <td style="text-align: center;">ダウンロードリンク (SAT)</td>
     <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
   </tr>
 </table>
 
 **データ解説**
 
-+ diffusers ライブラリを使用してテストを行った際に、`enable_model_cpu_offload()` オプションと `pipe.vae.enable_tiling()`
-  最適化が有効になっていました。このセットアップは **NVIDIA A100 / H100** 以外のデバイスでの実際のメモリ/VRAM
-  使用量についてはテストされていません。通常、このアプローチは **NVIDIA Ampere アーキテクチャ**
-  以上のすべてのデバイスに適しています。これらの最適化を無効にすると、メモリ使用量が大幅に増加し、表に示されている値の約3倍になります。
-+ マルチGPU推論を行う際には、`enable_model_cpu_offload()` 最適化を無効にする必要があります。
-+ INT8 モデルを使用すると推論速度が低下しますが、これは、メモリの少ないGPUでも正常に推論できるようにし、ビデオ品質の損失を最小限に抑えるためです。推論速度は大幅に低下します。
-
-推論速度テストも上記のメモリ最適化を使用して実施されました。メモリ最適化を使用しない場合、推論速度は約10％向上します。量子化をサポートしているのは `diffusers`
-バージョンのモデルのみです。
-
-+ モデルは英語入力のみをサポートしており、他の言語は大規模な言語モデルを通じて英語に翻訳することで対応できます。
++ diffusersライブラリを使用したテストでは、`enable_model_cpu_offload()`オプションと`pipe.vae.enable_tiling()`最適化が有効になっています。この手法は、**NVIDIA A100 / H100**以外のデバイスでの実際のVRAM/メモリ消費量についてはテストされていません。通常、この手法はすべての**NVIDIA Ampereアーキテクチャ**以上のデバイスに適合します。最適化を無効にすると、VRAM消費量が大幅に増加し、ピークVRAMは表の3倍程度になります。
++ 複数GPUで推論する際は、`enable_model_cpu_offload()`最適化を無効にする必要があります。
++ INT8モデルを使用すると推論速度が低下します。これは、メモリが少ないGPUで正常に推論を行い、動画品質の損失を最小限に抑えるためです。そのため、推論速度が大幅に低下します。
++ 2Bモデルは`FP16`精度でトレーニングされ、5Bモデルは`BF16`精度でトレーニングされています。推奨される精度で推論を行うことをお勧めします。
++ `FP8`精度は`NVIDIA H100`以上のデバイスでのみ使用でき、`torch`、`torchao`、`diffusers`、`accelerate`のPythonパッケージをソースコードからインストールする必要があります。`CUDA 12.4`の使用を推奨します。
++ 推論速度のテストも上記のメモリ最適化手法を使用して行いました。メモリ最適化を行わない場合、推論速度が約10％向上します。量子化をサポートするのは`diffusers`バージョンのモデルのみです。
++ モデルは英語入力のみをサポートしており、他の言語は大モデルでのポストプロセスで英語に翻訳する必要があります。
 
 ## 友好的リンク
 
diff --git a/README_zh.md b/README_zh.md
index 6814f9f..f2b9626 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -206,6 +206,15 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
     <td style="text-align: center;">3d_sincos_pos_embed</td>
     <td style="text-align: center;">3d_rope_pos_embed<br></td>
   </tr>
+  <tr>
+    <td style="text-align: center;">下载链接 (Diffusers)</td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
+    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">下载链接 (SAT)</td>
+    <td colspan="2" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+  </tr>
 </table>
 
 **数据解释**
@@ -215,6 +224,8 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
   以上的设备。若关闭优化，显存占用会成倍增加，峰值显存约为表格的3倍。
 + 多GPU推理时，需要关闭 `enable_model_cpu_offload()` 优化。
 + 使用 INT8 模型会导致推理速度降低，此举是为了满足显存较低的显卡能正常推理并保持较少的视频质量损失，推理速度大幅降低。
++ 2B 模型采用 `FP16` 精度训练， 5B模型采用 `BF16` 精度训练。我们推荐使用模型训练的精度进行推理。
++ `FP8` 精度必须在`NVIDIA H100` 及以上的设备上使用，需要源代码安装`torch`,`torchao`,`diffusers`,`accelerate` python包，推荐使用 `CUDA 12.4`。
 + 推理速度测试同样采用了上述显存优化方案，不采用显存优化的情况下，推理速度提升约10%。 只有`diffusers`版本模型支持量化。
 + 模型仅支持英语输入，其他语言可以通过大模型润色时翻译为英语。
 

From f57945811d237064fd354527736636898ef8f8e9 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Tue, 27 Aug 2024 19:13:33 +0800
Subject: [PATCH 4/4] update readme

---
 README.md    | 2 +-
 README_ja.md | 2 +-
 README_zh.md | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 5189e2b..fd6a91c 100644
--- a/README.md
+++ b/README.md
@@ -161,7 +161,7 @@ CogVideoX is an open-source version of the video generation model originating fr
   <tr>
     <th style="text-align: center;">Model Name</th>
     <th style="text-align: center;">CogVideoX-2B</th>
-    <th style="text-align: center;">CogVideoX-5B (This Repository)</th>
+    <th style="text-align: center;">CogVideoX-5B</th>
   </tr>
   <tr>
     <td style="text-align: center;">Model Description</td>
diff --git a/README_ja.md b/README_ja.md
index 9072e43..c9ef84f 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -146,7 +146,7 @@ CogVideoXは[清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源のオー
   <tr>
     <th style="text-align: center;">モデル名</th>
     <th style="text-align: center;">CogVideoX-2B</th>
-    <th style="text-align: center;">CogVideoX-5B (本リポジトリ)</th>
+    <th style="text-align: center;">CogVideoX-5B</th>
   </tr>
   <tr>
     <td style="text-align: center;">モデル紹介</td>
diff --git a/README_zh.md b/README_zh.md
index f2b9626..addeb2d 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -144,7 +144,7 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
   <tr>
     <th style="text-align: center;">模型名</th>
     <th style="text-align: center;">CogVideoX-2B</th>
-    <th style="text-align: center;">CogVideoX-5B (本仓库)</th>
+    <th style="text-align: center;">CogVideoX-5B</th>
   </tr>
   <tr>
     <td style="text-align: center;">模型介绍</td>

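<table>
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Description</td>
    <td style="text-align: center;">Entry-level model that balances compatibility with low cost for running and secondary development.</td>
    <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td style="text-align: center;">FP16 (recommended), BF16, FP32, FP8* (E4M3, E5M2), INT8; INT4 is not supported</td>
    <td style="text-align: center;">BF16 (recommended), FP16, FP32, FP8* (E4M3, E5M2), INT8; INT4 is not supported</td>
  </tr>
  <tr>
    <td style="text-align: center;">Single GPU Memory Consumption</td>
    <td style="text-align: center;">FP16: 18 GB using SAT / 12.5 GB* using diffusers<br>INT8: 7.8 GB* using diffusers</td>
    <td style="text-align: center;">BF16: 26 GB using SAT / 20.7 GB* using diffusers<br>INT8: 11.4 GB* using diffusers</td>
  </tr>
  <tr>
    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
    <td style="text-align: center;">FP16: 10 GB* using diffusers</td>
    <td style="text-align: center;">BF16: 15 GB* using diffusers</td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Speed (Step = 50)</td>
    <td style="text-align: center;">FP16: ~90 s*</td>
    <td style="text-align: center;">BF16: ~180 s*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Precision</td>
    <td style="text-align: center;">FP16</td>
    <td style="text-align: center;">BF16</td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Memory Consumption (per GPU)</td>
    <td style="text-align: center;">47 GB (bs=1, LoRA)<br>61 GB (bs=2, LoRA)<br>62 GB (bs=1, SFT)</td>
    <td style="text-align: center;">63 GB (bs=1, LoRA)<br>80 GB (bs=2, LoRA)<br>75 GB (bs=1, SFT)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
    <td colspan="2" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Length Limit</td>
    <td colspan="2" style="text-align: center;">226 tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Length</td>
    <td colspan="2" style="text-align: center;">6 seconds</td>
  </tr>
  <tr>
    <td style="text-align: center;">Frame Rate</td>
    <td colspan="2" style="text-align: center;">8 frames per second</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Resolution</td>
    <td colspan="2" style="text-align: center;">720 * 480; other resolutions are not supported (including fine-tuning)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Positional Encoding</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed</td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Links (Diffusers Model)</td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Links (SAT Model)</td>
    <td colspan="2" style="text-align: center;">SAT</td>
  </tr>
</table>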
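
The single-GPU diffusers figures in the table above can be turned into a quick lookup when deciding which model and precision to try on a given card. The sketch below is only illustrative: the dictionary and the `configs_that_fit` helper are hypothetical names, not part of the repository. The numbers are copied from the table, which was measured on NVIDIA A100 / H100 with the memory optimizations enabled, so real headroom on other devices may differ.

```python
# Single-GPU inference memory in GB, copied from the table above
# (diffusers figures, measured with the memory optimizations enabled).
SINGLE_GPU_MEMORY_GB = {
    "CogVideoX-2B": {"FP16": 12.5, "INT8": 7.8},
    "CogVideoX-5B": {"BF16": 20.7, "INT8": 11.4},
}


def configs_that_fit(gpu_memory_gb: float) -> list[tuple[str, str]]:
    """Return (model, precision) pairs whose tabulated memory fits the budget.

    Hypothetical helper: it compares against the table values only and
    leaves no safety margin for other processes on the GPU.
    """
    return [
        (model, precision)
        for model, per_precision in SINGLE_GPU_MEMORY_GB.items()
        for precision, needed_gb in per_precision.items()
        if needed_gb <= gpu_memory_gb
    ]


if __name__ == "__main__":
    for budget in (8, 12, 24):
        print(budget, "GB ->", configs_that_fit(budget))
```

Per the table notes, the INT8 options trade a large drop in inference speed for the lower memory footprint, so they are best treated as a fallback when the recommended precision does not fit.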
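
The README notes describe testing with `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` enabled and recommend running each model at its training precision (FP16 for 2B, BF16 for 5B). A minimal sketch of that setup follows, assuming a `diffusers` release that ships `CogVideoXPipeline`. The `RECOMMENDED_DTYPE` mapping and the `generate_video` wrapper are hypothetical names introduced for illustration, and the repository's own `inference/cli_demo.py` may differ in detail.

```python
# Recommended inference precision per model, as stated in the README notes:
# the 2B model was trained in FP16 and the 5B model in BF16.
RECOMMENDED_DTYPE = {
    "THUDM/CogVideoX-2b": "float16",
    "THUDM/CogVideoX-5b": "bfloat16",
}


def recommended_dtype_name(model_id: str) -> str:
    """Return the name of the torch dtype recommended for the given model."""
    return RECOMMENDED_DTYPE[model_id]


def generate_video(model_id: str, prompt: str, output_path: str = "output.mp4") -> str:
    """Hypothetical wrapper: single-GPU inference with both memory optimizations."""
    # Heavy imports are kept local so the helpers above work without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    dtype = getattr(torch, recommended_dtype_name(model_id))
    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=dtype)

    # The memory figures in the README assume both optimizations are on.
    # For multi-GPU inference the notes say CPU offload must be disabled.
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()

    # 50 steps matches the "Step = 50" setting used for the speed figures.
    frames = pipe(prompt=prompt, num_inference_steps=50).frames[0]
    export_to_video(frames, output_path, fps=8)  # 8 fps, per the table
    return output_path
```

Calling `generate_video("THUDM/CogVideoX-5b", "<an English prompt>")` downloads the weights on first use. The notes report that disabling the two optimizations speeds inference up by about 10% but raises peak memory to roughly three times the tabulated values.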