Merge pull request #107 from THUDM/CogVideoX_dev

VAE demo rewrite
2025-04-06 03:57:56 +08:00 · 2024-08-09 21:03:10 +08:00 · 2024-08-09 21:03:10 +08:00 · a68f5ab16c
commit a68f5ab16c
parent 69930d8463 18c1669a8e
9 changed files with 218 additions and 137 deletions
--- a/README.md
+++ b/README.md
@@ -21,9 +21,11 @@

 ## Update and News

-- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed
+- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be
+  performed
  on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
-- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct
+- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can
+  reconstruct
  the video almost losslessly.
 - 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**，the first model in the CogVideoX series of video
  generation models.
@@ -55,9 +57,9 @@ Jump to a specific section:

 ### Prompt Optimization

-Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use the GLM-4 model to
-optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly affects
-the quality of the generated video.
+Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like
+GLM-4 (or other comparable products, such as GPT-4) to optimize the prompt. This is crucial because the model is trained
+with long prompts, and a good prompt directly impacts the quality of the video generation.

 ### SAT

@@ -124,6 +126,15 @@ along with related basic information:
 | Download Link (HF diffusers Model)        | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B)   [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)   [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
 | Download Link (SAT Model)                 | [SAT](./sat/README.md)                                                                                                                                                                              |

+## Friendly Links
+
+We highly welcome contributions from the community and actively contribute to the open-source community. The following
+works have already been adapted for CogVideoX, and we invite everyone to use them:
+
++ [Xorbits Inference](https://github.com/xorbitsai/inference): A powerful and comprehensive distributed inference
+  framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one
+  click.
+
 ## Project Structure

 This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples
--- a/README_ja.md
+++ b/README_ja.md
@@ -21,10 +21,13 @@

 ## 更新とニュース

-- 🔥 **ニュース**: ```2024/8/7```: CogVideoX は `diffusers` バージョン 0.30.0 に統合されました。単一の 3090 GPU で推論を実行できます。詳細については [コード](inference/cli_demo.py) を参照してください。
+- 🔥 **ニュース**: ```2024/8/7```: CogVideoX は `diffusers` バージョン 0.30.0 に統合されました。単一の 3090 GPU
+  で推論を実行できます。詳細については [コード](inference/cli_demo.py) を参照してください。
 - 🔥 **ニュース**: ```2024/8/6```: **CogVideoX-2B** で使用される **3D Causal VAE** もオープンソース化しました。これにより、ビデオをほぼ無損失で再構築できます。
 - 🔥 **ニュース**: ```2024/8/6```: **CogVideoX-2B**、CogVideoXシリーズのビデオ生成モデルの最初のモデルをオープンソース化しました。
-- 🌱 **ソース**: ```2022/5/19```: **CogVideo** (現在 `CogVideo` ブランチで確認できます) をオープンソース化しました。これは、最初のオープンソースの事前学習済みテキストからビデオ生成モデルであり、技術的な詳細については [ICLR'23 CogVideo 論文](https://arxiv.org/abs/2205.15868) をご覧ください。
+- 🌱 **ソース**: ```2022/5/19```: **CogVideo** (現在 `CogVideo` ブランチで確認できます)
+  をオープンソース化しました。これは、最初のオープンソースの事前学習済みテキストからビデオ生成モデルであり、技術的な詳細については [ICLR'23 CogVideo 論文](https://arxiv.org/abs/2205.15868)
+  をご覧ください。

 **より強力なモデルが、より大きなパラメータサイズで登場予定です。お楽しみに！**

@@ -50,11 +53,13 @@

 ### プロンプトの最適化

-モデルを実行する前に、[このガイド](inference/convert_demo.py) を参照して、GLM-4 モデルを使用してプロンプトを最適化する方法を確認してください。これは重要です。モデルは長いプロンプトでトレーニングされているため、良いプロンプトは生成されるビデオの品質に直接影響します。
+モデルを実行する前に、[こちら](inference/convert_demo.py)
+を参考にして、GLM-4（または同等の製品、例えばGPT-4）の大規模モデルを使用してどのようにプロンプトを最適化するかをご確認ください。これは非常に重要です。モデルは長いプロンプトでトレーニングされているため、良いプロンプトがビデオ生成の品質に直接影響を与えます。

 ### SAT

-[sat_demo](sat/README.md) の指示に従ってください: SATウェイトの推論コードと微調整コードが含まれています。CogVideoXモデル構造に基づいて改善することをお勧めします。革新的な研究者は、このコードを使用して迅速なスタッキングと開発を行うことができます。
+[sat_demo](sat/README.md) の指示に従ってください:
+SATウェイトの推論コードと微調整コードが含まれています。CogVideoXモデル構造に基づいて改善することをお勧めします。革新的な研究者は、このコードを使用して迅速なスタッキングと開発を行うことができます。
 (推論には18GB、lora微調整には40GBが必要です)

 ### Diffusers
@@ -94,19 +99,26 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の

 以下の表は、現在提供しているビデオ生成モデルのリストと関連する基本情報を示しています:

-| モデル名                                | CogVideoX-2B                                                                                                                                                                                        | 
-|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| プロンプト言語                           | 英語                                                                                                                                                                                             | 
-| 単一GPU推論 (FP16)              | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)   <br>  23.9GB using diffusers                                                                                                      | 
+| モデル名                         | CogVideoX-2B                                                                                                                                                                                        | 
+|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| プロンプト言語                      | 英語                                                                                                                                                                                                  | 
+| 単一GPU推論 (FP16)               | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)   <br>  23.9GB using diffusers                                                                                                      | 
 | 複数GPU推論 (FP16)               | 20GB minimum per GPU using diffusers                                                                                                                                                                |
-| 微調整に必要なGPUメモリ(bs=1) | 40GB                                                                                                                                                                                                |
-| プロンプトの最大長                        | 226 トークン                                                                                                                                                                                          |
-| ビデオの長さ                              | 6秒                                                                                                                                                                                           | 
-| フレームレート                         | 8フレーム                                                                                                                                                                                            | 
-| 解像度                                | 720 * 480                                                                                                                                                                                           |
-| 量子化推論                       | サポートされていません                                                                                                                                                                                       |          
-| ダウンロードリンク (HF diffusers モデル)        | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B)   [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)   [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
-| ダウンロードリンク (SAT モデル)                 | [SAT](./sat/README.md)                                                                                                                                                                              |
+| 微調整に必要なGPUメモリ(bs=1)          | 40GB                                                                                                                                                                                                |
+| プロンプトの最大長                    | 226 トークン                                                                                                                                                                                            |
+| ビデオの長さ                       | 6秒                                                                                                                                                                                                  | 
+| フレームレート                      | 8フレーム                                                                                                                                                                                               | 
+| 解像度                          | 720 * 480                                                                                                                                                                                           |
+| 量子化推論                        | サポートされていません                                                                                                                                                                                         |          
+| ダウンロードリンク (HF diffusers モデル) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B)   [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)   [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
+| ダウンロードリンク (SAT モデル)          | [SAT](./sat/README.md)                                                                                                                                                                              |
+
+## 友好的リンク
+
+コミュニティからの貢献を大歓迎し、私たちもオープンソースコミュニティに積極的に貢献しています。以下の作品はすでにCogVideoXに対応しており、ぜひご利用ください：
+
++ [Xorbits Inference](https://github.com/xorbitsai/inference):
+  強力で包括的な分散推論フレームワークであり、ワンクリックで独自のモデルや最新のオープンソースモデルを簡単にデプロイできます。

 ## プロジェクト構造

@@ -116,14 +128,17 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の

 + [diffusers_demo](inference/cli_demo.py): 推論コードの詳細な説明が含まれており、一般的なパラメータの意味についても言及しています。
 + [diffusers_vae_demo](inference/cli_vae_demo.py): VAE推論コードの実行には現在71GBのメモリが必要ですが、将来的には最適化される予定です。
-+ [convert_demo](inference/convert_demo.py): ユーザー入力をCogVideoXに適した形式に変換する方法。CogVideoXは長いキャプションでトレーニングされているため、入力テキストをLLMを使用してトレーニング分布と一致させる必要があります。デフォルトではGLM4を使用しますが、GPT、Geminiなどの他のLLMに置き換えることもできます。
-+ [gradio_web_demo](inference/gradio_web_demo.py): CogVideoX-2Bモデルを使用してビデオを生成する方法を示すシンプルなgradio Web UI。
++ [convert_demo](inference/convert_demo.py):
+  ユーザー入力をCogVideoXに適した形式に変換する方法。CogVideoXは長いキャプションでトレーニングされているため、入力テキストをLLMを使用してトレーニング分布と一致させる必要があります。デフォルトではGLM4を使用しますが、GPT、Geminiなどの他のLLMに置き換えることもできます。
++ [gradio_web_demo](inference/gradio_web_demo.py): CogVideoX-2Bモデルを使用してビデオを生成する方法を示すシンプルなgradio
+  Web UI。

 <div style="text-align: center;">
    <img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
 </div>

-+ [streamlit_web_demo](inference/streamlit_web_demo.py): CogVideoX-2Bモデルを使用してビデオを生成する方法を示すシンプルなstreamlit Webアプリケーション。
++ [streamlit_web_demo](inference/streamlit_web_demo.py): CogVideoX-2Bモデルを使用してビデオを生成する方法を示すシンプルなstreamlit
+  Webアプリケーション。

 <div style="text-align: center;">
    <img src="resources/web_demo.png" style="width: 100%; height: auto;" />
@@ -131,13 +146,14 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の

 ### sat

-+ [sat_demo](sat/README.md): SATウェイトの推論コードと微調整コードが含まれています。CogVideoXモデル構造に基づいて改善することをお勧めします。革新的な研究者は、このコードを使用して迅速なスタッキングと開発を行うことができます。
++ [sat_demo](sat/README.md):
+  SATウェイトの推論コードと微調整コードが含まれています。CogVideoXモデル構造に基づいて改善することをお勧めします。革新的な研究者は、このコードを使用して迅速なスタッキングと開発を行うことができます。

 ### ツール

 このフォルダには、モデル変換/キャプション生成などのツールが含まれています。

-+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): SATモデルのウェイトをHuggingfaceモデルのウェイトに変換します。 
++ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): SATモデルのウェイトをHuggingfaceモデルのウェイトに変換します。
 + [caption_demo](tools/caption): キャプションツール、ビデオを理解し、テキストで出力するモデル。

 ## プロジェクト計画
@@ -161,7 +177,9 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の
 モデルのウェイトと実装コードは [CogVideoX LICENSE](MODEL_LICENSE) の下で公開されています。

 ## CogVideo(ICLR'23)
-論文の公式リポジトリ: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) は [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo) にあります。
+
+論文の公式リポジトリ: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
+は [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo) にあります。

 **CogVideoは比較的高フレームレートのビデオを生成することができます。**
 32フレームの4秒間のクリップが以下に示されています。
@@ -174,8 +192,8 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の
 </div>


-CogVideoのデモは [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/) で体験できます。*元の入力は中国語です。*
-
+CogVideoのデモは [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/) で体験できます。
+*元の入力は中国語です。*

 ## 引用

--- a/README_zh.md
+++ b/README_zh.md
@@ -22,7 +22,8 @@

 ## 项目更新

-- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers` 0.30.0版本，单张3090可以推理，详情请见[代码](inference/cli_demo.py)。
+- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers`
+  0.30.0版本，单张3090可以推理，详情请见[代码](inference/cli_demo.py)。
 - 🔥 **News**: ```2024/8/6```: 我们开源 **3D Causal VAE**，用于 **CogVideoX-2B**，可以几乎无损地重构视频。
 - 🔥 **News**: ```2024/8/6```: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
 - 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型（现在你可以在 `CogVideo` 分支中看到），这是首个开源的基于
@@ -51,8 +52,8 @@

 ### 提示词优化

-在开始运行模型之前，请参考[这里](inference/convert_demo.py) 查看我们是怎么使用GLM-4大模型对模型进行优化的，这很重要，
-由于模型是在长提示词下训练的，一额好的直接影响了视频生成的质量。
+在开始运行模型之前，请参考[这里](inference/convert_demo.py) 查看我们是怎么使用GLM-4(或者同级别的其他产品，例如GPT-4)大模型对提示词进行优化的，这很重要，
+由于模型是在长提示词下训练的，一个好的提示词直接影响了视频生成的质量。

 ### SAT

@@ -96,19 +97,25 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源

 下表展示目前我们提供的视频生成模型列表，以及相关基础信息:

-| 模型名                 | CogVideoX-2B                                                                                                                  | 
-|---------------------|-------------------------------------------------------------------------------------------------------------------------------|
-| 提示词语言               | English                                                                                                                       | 
-| 单GPU推理 (FP-16) 显存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)   <br>  23.9GB using diffusers                         | 
-| 多GPU推理 (FP-16) 显存消耗 | 20GB minimum per GPU using diffusers                                                                                          |                                                                                                            
-| 微调显存消耗 (bs=1)       | 42GB                                                                                                                          |
-| 提示词长度上限             | 226 Tokens                                                                                                                    |
-| 视频长度                | 6 seconds                                                                                                                     | 
-| 帧率（每秒）              | 8 frames                                                                                                                      | 
-| 视频分辨率               | 720 * 480                                                                                                                     |
-| 量化推理                | 不支持                                                                                                                           |          
+| 模型名                 | CogVideoX-2B                                                                                                                    | 
+|---------------------|---------------------------------------------------------------------------------------------------------------------------------|
+| 提示词语言               | English                                                                                                                         | 
+| 单GPU推理 (FP-16) 显存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)   <br>  23.9GB using diffusers                                  | 
+| 多GPU推理 (FP-16) 显存消耗 | 20GB minimum per GPU using diffusers                                                                                            |                                                                                                            
+| 微调显存消耗 (bs=1)       | 42GB                                                                                                                            |
+| 提示词长度上限             | 226 Tokens                                                                                                                      |
+| 视频长度                | 6 seconds                                                                                                                       | 
+| 帧率（每秒）              | 8 frames                                                                                                                        | 
+| 视频分辨率               | 720 * 480                                                                                                                       |
+| 量化推理                | 不支持                                                                                                                             |          
 | 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B)  [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
-| 下载地址 (SAT 模型)       | [SAT](./sat/README_zh.md)                                                                                                     |
+| 下载地址 (SAT 模型)       | [SAT](./sat/README_zh.md)                                                                                                       |
+
+## 友情链接
+
+我们非常欢迎来自社区的贡献，并积极的贡献开源社区。以下作品已经对CogVideoX进行了适配，欢迎大家使用:
+
++ [Xorbits Inference](https://github.com/xorbitsai/inference): 性能强大且功能全面的分布式推理框架，轻松一键部署你自己的模型或内置的前沿开源模型。

 ## 完整项目代码结构

--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -22,7 +22,7 @@ from diffusers import CogVideoXPipeline


 def export_to_video_imageio(
-        video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
+    video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
 ) -> str:
    """
    Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX)
@@ -38,14 +38,14 @@ def export_to_video_imageio(


 def generate_video(
-        prompt: str,
-        model_path: str,
-        output_path: str = "./output.mp4",
-        num_inference_steps: int = 50,
-        guidance_scale: float = 6.0,
-        num_videos_per_prompt: int = 1,
-        device: str = "cuda",
-        dtype: torch.dtype = torch.float16,
+    prompt: str,
+    model_path: str,
+    output_path: str = "./output.mp4",
+    num_inference_steps: int = 50,
+    guidance_scale: float = 6.0,
+    num_videos_per_prompt: int = 1,
+    device: str = "cuda",
+    dtype: torch.dtype = torch.float16,
 ):
    """
    Generates a video based on the given prompt and saves it to the specified path.
--- a/inference/cli_vae_demo.py
+++ b/inference/cli_vae_demo.py
@@ -1,14 +1,24 @@
 """
-This script demonstrates how to encode video frames using a pre-trained CogVideoX model with 🤗 Huggingface Diffusers.
+This script is designed to demonstrate how to use the CogVideoX-2b VAE model for video encoding and decoding.
+It allows you to encode a video into a latent representation, decode it back into a video, or perform both operations sequentially.
+Before running the script, make sure to clone the CogVideoX Hugging Face model repository and set the `{your local diffusers path}` argument to the path of the cloned repository.

-Note:
-    This script requires the `diffusers>=0.30.0` library to be installed.
-    If the video appears “completely green” and cannot be viewed, please switch to a different player to watch it. This is a normal phenomenon.
-    Cost 71GB of GPU memory for encoding a 6s video at 720p resolution.
+Command 1: Encoding Video
+Encodes the video located at ../resources/videos/1.mp4 using the CogVideoX-2b VAE model.
+Memory Usage: ~34GB of GPU memory for encoding.
+If you do not have enough GPU memory, we provide a pre-encoded tensor file (encoded.pt) in the resources folder and you can still run the decoding command.
+$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --video_path ../resources/videos/1.mp4 --mode encode

-Run the script:
-    $ python cli_demo.py --model_path THUDM/CogVideoX-2b --video_path path/to/video.mp4 --output_path path/to/output
+Command 2: Decoding Video

+Decodes the latent representation stored in encoded.pt back into a video.
+Memory Usage: ~19GB of GPU memory for decoding.
+$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --encoded_path ./encoded.pt --mode decode
+
+Command 3: Encoding and Decoding Video
+Encodes the video located at ../resources/videos/1.mp4 and then immediately decodes it.
+Memory Usage: 34GB for encoding + 19GB for decoding (sequentially).
+$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --video_path ../resources/videos/1.mp4 --mode both
 """

 import argparse
@@ -19,7 +29,7 @@ from diffusers import AutoencoderKLCogVideoX
 from torchvision import transforms


-def vae_demo(model_path, video_path, dtype, device):
+def encode_video(model_path, video_path, dtype, device):
    """
    Loads a pre-trained AutoencoderKLCogVideoX model and encodes the video frames.

@@ -32,50 +42,58 @@ def vae_demo(model_path, video_path, dtype, device):
    Returns:
    - torch.Tensor: The encoded video frames.
    """
-    # Load the pre-trained model
    model = AutoencoderKLCogVideoX.from_pretrained(model_path, torch_dtype=dtype).to(device)
-
-    # Load video frames
    video_reader = imageio.get_reader(video_path, "ffmpeg")
-    frames = []
-    for frame in video_reader:
-        frames.append(frame)
+
+    frames = [transforms.ToTensor()(frame) for frame in video_reader]
    video_reader.close()

-    # Transform frames to Tensor
-    transform = transforms.Compose(
-        [
-            transforms.ToTensor(),
-        ]
-    )
-    frames_tensor = torch.stack([transform(frame) for frame in frames]).to(device)
+    frames_tensor = torch.stack(frames).to(device).permute(1, 0, 2, 3).unsqueeze(0).to(dtype)

-    # Add batch dimension and reshape to [1, 3, 49, 480, 720]
-    frames_tensor = frames_tensor.permute(1, 0, 2, 3).unsqueeze(0).to(dtype).to(device)
-
-    # Run the model with Encoder and Decoder
    with torch.no_grad():
-        output = model(frames_tensor)
+        encoded_frames = model.encode(frames_tensor)[0].sample()
+    return encoded_frames

-    return output
+
+def decode_video(model_path, encoded_tensor_path, dtype, device):
+    """
+    Loads a pre-trained AutoencoderKLCogVideoX model and decodes the encoded video frames.
+
+    Parameters:
+    - model_path (str): The path to the pre-trained model.
+    - encoded_tensor_path (str): The path to the encoded tensor file.
+    - dtype (torch.dtype): The data type for computation.
+    - device (str): The device to use for computation (e.g., "cuda" or "cpu").
+
+    Returns:
+    - torch.Tensor: The decoded video frames.
+    """
+    model = AutoencoderKLCogVideoX.from_pretrained(model_path, torch_dtype=dtype).to(device)
+    encoded_frames = torch.load(encoded_tensor_path, weights_only=True).to(device).to(dtype)
+    with torch.no_grad():
+        decoded_frames = []
+        for i in range(6):  # 6 seconds
+            start_frame, end_frame = (0, 3) if i == 0 else (2 * i + 1, 2 * i + 3)
+            current_frames = model.decode(encoded_frames[:, :, start_frame:end_frame]).sample
+            decoded_frames.append(current_frames)
+        model.clear_fake_context_parallel_cache()
+
+        decoded_frames = torch.cat(decoded_frames, dim=2)
+    return decoded_frames


 def save_video(tensor, output_path):
    """
-    Saves the encoded video frames to a video file.
+    Saves the video frames to a video file.

    Parameters:
-    - tensor (torch.Tensor): The encoded video frames.
+    - tensor (torch.Tensor): The video frames tensor.
    - output_path (str): The path to save the output video.
    """
-    # Remove batch dimension and permute back to [49, 480, 720, 3]
    frames = tensor[0].squeeze(0).permute(1, 2, 3, 0).cpu().numpy()
+    frames = np.clip(frames, 0, 1) * 255
+    frames = frames.astype(np.uint8)

-    # Clip values to [0, 1] and convert to uint8
-    frames = np.clip(frames, 0, 1)
-    frames = (frames * 255).astype(np.uint8)
-
-    # Save frames to video
    writer = imageio.get_writer(output_path + "/output.mp4", fps=30)
    for frame in frames:
        writer.append_data(frame)
@@ -83,10 +101,14 @@ def save_video(tensor, output_path):


 if __name__ == "__main__":
-    parser = argparse.ArgumentParser(description="Convert a CogVideoX model to Diffusers")
+    parser = argparse.ArgumentParser(description="CogVideoX encode/decode demo")
    parser.add_argument("--model_path", type=str, required=True, help="The path to the CogVideoX model")
-    parser.add_argument("--video_path", type=str, required=True, help="The path to the video file")
-    parser.add_argument("--output_path", type=str, default="./", help="The path to save the output video")
+    parser.add_argument("--video_path", type=str, help="The path to the video file (for encoding)")
+    parser.add_argument("--encoded_path", type=str, help="The path to the encoded tensor file (for decoding)")
+    parser.add_argument("--output_path", type=str, default=".", help="The path to save the output file")
+    parser.add_argument(
+        "--mode", type=str, choices=["encode", "decode", "both"], required=True, help="Mode: encode, decode, or both"
+    )
    parser.add_argument(
        "--dtype", type=str, default="float16", help="The data type for computation (e.g., 'float16' or 'float32')"
    )
@@ -95,9 +117,21 @@ if __name__ == "__main__":
    )
    args = parser.parse_args()

-    # Set device and dtype
    device = torch.device(args.device)
    dtype = torch.float16 if args.dtype == "float16" else torch.float32

-    output = vae_demo(args.model_path, args.video_path, dtype, device)
-    save_video(output, args.output_path)
+    if args.mode == "encode":
+        assert args.video_path, "Video path must be provided for encoding."
+        encoded_output = encode_video(args.model_path, args.video_path, dtype, device)
+        torch.save(encoded_output, args.output_path + "/encoded.pt")
+        print(f"Finished encoding the video to a tensor and saved it to {args.output_path}/encoded.pt")
+    elif args.mode == "decode":
+        assert args.encoded_path, "Encoded tensor path must be provided for decoding."
+        decoded_output = decode_video(args.model_path, args.encoded_path, dtype, device)
+        save_video(decoded_output, args.output_path)
+        print(f"Finished decoding the video and saved it to a file at {args.output_path}/output.mp4")
+    elif args.mode == "both":
+        assert args.video_path, "Video path must be provided for encoding."
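+        encoded_output = encode_video(args.model_path, args.video_path, dtype, device)
+        # Persist the latents first: decode_video() loads them from this file.
+        torch.save(encoded_output, args.output_path + "/encoded.pt")
+        decoded_output = decode_video(args.model_path, args.output_path + "/encoded.pt", dtype, device)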
+        save_video(decoded_output, args.output_path)
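
Note on the chunked decode in `decode_video`: the loop decodes the latent tensor in six temporal slices instead of one pass. The sketch below reproduces only the index arithmetic from that loop in plain Python (no `torch`), to show that the slices tile the time axis without gaps or overlaps. The helper names `latent_chunks` and `check_contiguous` are hypothetical and not part of the PR; the total of 13 latent frames is simply what the loop bounds imply.

```python
from typing import List, Tuple


def latent_chunks(num_chunks: int = 6) -> List[Tuple[int, int]]:
    """Return the (start, end) slices used along the latent time axis.

    Mirrors the expression in decode_video: the first chunk covers three
    latent frames, every later chunk covers two.
    """
    chunks = []
    for i in range(num_chunks):
        start, end = (0, 3) if i == 0 else (2 * i + 1, 2 * i + 3)
        chunks.append((start, end))
    return chunks


def check_contiguous(chunks: List[Tuple[int, int]]) -> int:
    """Check that the slices tile [0, total) exactly and return total."""
    expected_start = 0
    for start, end in chunks:
        if start != expected_start:
            raise ValueError(f"gap or overlap at latent frame {start}")
        expected_start = end
    return expected_start


if __name__ == "__main__":
    chunks = latent_chunks()
    print(chunks)
    print("latent frames covered:", check_contiguous(chunks))
```

Because the slice count is hard-coded to six, an `encoded.pt` produced from a clip of a different length would be decoded only partially or would yield empty slices; deriving the chunk count from `encoded_frames.shape[2]` would be one way to lift that restriction.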
--- a/inference/encoded.pt
+++ b/inference/encoded.pt
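
Note on argument checking in `inference/cli_vae_demo.py`: the new `--mode` flag makes `--video_path` and `--encoded_path` conditionally required, and the script enforces this with `assert`. Assertions are skipped when Python runs with `-O`, so a possible alternative is to report the problem through `parser.error`, which prints the usage text and exits with status 2. This is a minimal sketch covering only the mode-dependent flags; `build_parser` and `parse_args` are hypothetical helper names, and `--model_path`, `--dtype`, and `--device` are left out for brevity.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the mode-dependent flags from the demo."""
    parser = argparse.ArgumentParser(description="CogVideoX encode/decode demo")
    parser.add_argument("--video_path", type=str, help="Video file (for encoding)")
    parser.add_argument("--encoded_path", type=str, help="Encoded tensor file (for decoding)")
    parser.add_argument("--mode", type=str, choices=["encode", "decode", "both"], required=True)
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv and enforce the per-mode requirements."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Encoding (alone or as the first half of "both") needs an input video.
    if args.mode in ("encode", "both") and not args.video_path:
        parser.error("--video_path is required when --mode is 'encode' or 'both'")
    # Decoding alone needs a previously saved latent tensor.
    if args.mode == "decode" and not args.encoded_path:
        parser.error("--encoded_path is required when --mode is 'decode'")
    return args


if __name__ == "__main__":
    print(parse_args())
```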
--- a/inference/gradio_web_demo.py
+++ b/inference/gradio_web_demo.py
@@ -34,7 +34,7 @@ Video descriptions must have the same num of words as examples below. Extra word


 def export_to_video_imageio(
-        video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
+    video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
 ) -> str:
    """
    Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX)
@@ -62,20 +62,34 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": sys_prompt},
-                {"role": "user",
-                 "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "a girl is on the beach"'},
-                {"role": "assistant",
-                 "content": "A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat, white blouse, light blue jeans, and chic boots, against a backdrop of soft sky and sea. Moments later, she is seen mid-twirl, arms exuberant, with the lighting suggesting dawn or dusk. Then, she runs along the beach, her attire complemented by an off-white scarf and black ankle boots, the tranquil sea behind her. Finally, she holds a paper airplane, her pose reflecting joy and freedom, with the ocean's gentle waves and the sky's soft pastel hues enhancing the serene ambiance."},
-                {"role": "user",
-                 "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "A man jogging on a football field"'},
-                {"role": "assistant",
-                 "content": "A determined man in athletic attire, including a blue long-sleeve shirt, black shorts, and blue socks, jogs around a snow-covered soccer field, showcasing his solitary exercise in a quiet, overcast setting. His long dreadlocks, focused expression, and the serene winter backdrop highlight his dedication to fitness. As he moves, his attire, consisting of a blue sports sweatshirt, black athletic pants, gloves, and sneakers, grips the snowy ground. He is seen running past a chain-link fence enclosing the playground area, with a basketball hoop and children's slide, suggesting a moment of solitary exercise amidst the empty field."},
-                {"role": "user",
-                 "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : " A woman is dancing, HD footage, close-up"'},
-                {"role": "assistant",
-                 "content": "A young woman with her hair in an updo and wearing a teal hoodie stands against a light backdrop, initially looking over her shoulder with a contemplative expression. She then confidently makes a subtle dance move, suggesting rhythm and movement. Next, she appears poised and focused, looking directly at the camera. Her expression shifts to one of introspection as she gazes downward slightly. Finally, she dances with confidence, her left hand over her heart, symbolizing a poignant moment, all while dressed in the same teal hoodie against a plain, light-colored background."},
-                {"role": "user",
-                 "content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"'},
+                {
+                    "role": "user",
+                    "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "a girl is on the beach"',
+                },
+                {
+                    "role": "assistant",
+                    "content": "A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat, white blouse, light blue jeans, and chic boots, against a backdrop of soft sky and sea. Moments later, she is seen mid-twirl, arms exuberant, with the lighting suggesting dawn or dusk. Then, she runs along the beach, her attire complemented by an off-white scarf and black ankle boots, the tranquil sea behind her. Finally, she holds a paper airplane, her pose reflecting joy and freedom, with the ocean's gentle waves and the sky's soft pastel hues enhancing the serene ambiance.",
+                },
+                {
+                    "role": "user",
+                    "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "A man jogging on a football field"',
+                },
+                {
+                    "role": "assistant",
+                    "content": "A determined man in athletic attire, including a blue long-sleeve shirt, black shorts, and blue socks, jogs around a snow-covered soccer field, showcasing his solitary exercise in a quiet, overcast setting. His long dreadlocks, focused expression, and the serene winter backdrop highlight his dedication to fitness. As he moves, his attire, consisting of a blue sports sweatshirt, black athletic pants, gloves, and sneakers, grips the snowy ground. He is seen running past a chain-link fence enclosing the playground area, with a basketball hoop and children's slide, suggesting a moment of solitary exercise amidst the empty field.",
+                },
+                {
+                    "role": "user",
+                    "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : " A woman is dancing, HD footage, close-up"',
+                },
+                {
+                    "role": "assistant",
+                    "content": "A young woman with her hair in an updo and wearing a teal hoodie stands against a light backdrop, initially looking over her shoulder with a contemplative expression. She then confidently makes a subtle dance move, suggesting rhythm and movement. Next, she appears poised and focused, looking directly at the camera. Her expression shifts to one of introspection as she gazes downward slightly. Finally, she dances with confidence, her left hand over her heart, symbolizing a poignant moment, all while dressed in the same teal hoodie against a plain, light-colored background.",
+                },
+                {
+                    "role": "user",
+                    "content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"',
+                },
            ],
            model="glm-4-0520",
            temperature=0.01,
@ -88,12 +102,7 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
    return prompt


-def infer(
-        prompt: str,
-        num_inference_steps: int,
-        guidance_scale: float,
-        progress=gr.Progress(track_tqdm=True)
-):
+def infer(prompt: str, num_inference_steps: int, guidance_scale: float, progress=gr.Progress(track_tqdm=True)):
    torch.cuda.empty_cache()

    prompt_embeds, _ = pipe.encode_prompt(
@ -113,7 +122,6 @@ def infer(
        negative_prompt_embeds=torch.zeros_like(prompt_embeds),
    ).frames[0]

-
    return video


@ -124,11 +132,12 @@ def save_video(tensor):
    export_to_video_imageio(tensor[1:], video_path)
    return video_path

+
 def convert_to_gif(video_path):
    clip = mp.VideoFileClip(video_path)
    clip = clip.set_fps(8)
    clip = clip.resize(height=240)
-    gif_path = video_path.replace('.mp4', '.gif')
+    gif_path = video_path.replace(".mp4", ".gif")
    clip.write_gif(gif_path, fps=8)
    return gif_path

@ -137,7 +146,7 @@ def delete_old_files():
    while True:
        now = datetime.now()
        cutoff = now - timedelta(minutes=10)
-        output_dir = './output'
+        output_dir = "./output"
        for filename in os.listdir(output_dir):
            file_path = os.path.join(output_dir, filename)
            if os.path.isfile(file_path):
@ -169,13 +178,16 @@ with gr.Blocks() as demo:
            prompt = gr.Textbox(label="Prompt (Less than 200 Words)", placeholder="Enter your prompt here", lines=5)
            with gr.Row():
                gr.Markdown(
-                    "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one.")
+                    "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one."
+                )
                enhance_button = gr.Button("✨ Enhance Prompt(Optional)")

            with gr.Column():
-                gr.Markdown("**Optional Parameters** (default values are recommended)<br>"
-                            "Turn Inference Steps larger if you want more detailed video, but it will be slower.<br>"
-                            "50 steps are recommended for most cases. will cause 120 seconds for inference.<br>")
+                gr.Markdown(
+                    "**Optional Parameters** (default values are recommended)<br>"
+                    "Turn Inference Steps larger if you want more detailed video, but it will be slower.<br>"
+                    "50 steps are recommended for most cases. will cause 120 seconds for inference.<br>"
+                )
                with gr.Row():
                    num_inference_steps = gr.Number(label="Inference Steps", value=50)
                    guidance_scale = gr.Number(label="Guidance Scale", value=6.0)
@ -222,7 +234,6 @@ with gr.Blocks() as demo:
        </table>
    """)

-
    def generate(prompt, num_inference_steps, guidance_scale, progress=gr.Progress(track_tqdm=True)):
        tensor = infer(prompt, num_inference_steps, guidance_scale, progress=progress)
        video_path = save_video(tensor)
@ -232,22 +243,16 @@ with gr.Blocks() as demo:

        return video_path, video_update, gif_update

-
    def enhance_prompt_func(prompt):
        return convert_prompt(prompt, retry_times=1)

-
    generate_button.click(
        generate,
        inputs=[prompt, num_inference_steps, guidance_scale],
-        outputs=[video_output, download_video_button, download_gif_button]
+        outputs=[video_output, download_video_button, download_gif_button],
    )

-    enhance_button.click(
-        enhance_prompt_func,
-        inputs=[prompt],
-        outputs=[prompt]
-    )
+    enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt])

 if __name__ == "__main__":
    demo.launch(server_name="127.0.0.1", server_port=7870, share=True)
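
The `delete_old_files` hunk above shows only the start of the cleanup loop: a 10-minute cutoff and a scan of `./output`. A minimal sketch of a single cleanup pass follows, assuming files are aged by their modification time. The helper name `remove_stale_files` and its parameters are hypothetical and are not part of this commit.

```python
import os
from datetime import datetime, timedelta
from typing import List, Optional


def remove_stale_files(output_dir: str, max_age: timedelta, now: Optional[datetime] = None) -> List[str]:
    """Delete regular files in output_dir older than max_age and return their paths.

    Hypothetical helper: the age check uses the file modification time,
    which is an assumption (the diff does not show how age is measured).
    """
    now = now or datetime.now()
    cutoff = now - max_age
    removed = []
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        # Skip sub-directories and anything that is not a regular file.
        if not os.path.isfile(file_path):
            continue
        modified = datetime.fromtimestamp(os.path.getmtime(file_path))
        if modified < cutoff:
            os.remove(file_path)
            removed.append(file_path)
    return removed
```

Calling this from a background loop with a sleep between passes would give the periodic behaviour the hunk suggests; the sleep interval is not visible in this diff.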
--- a/sat/README_ja.md
+++ b/sat/README_ja.md
@ -1,6 +1,7 @@
 # SAT CogVideoX-2B

-This folder contains inference code that uses [SAT](https://github.com/THUDM/SwissArmyTransformer) weights, along with fine-tuning code for SAT weights.
+This folder contains inference code that uses [SAT](https://github.com/THUDM/SwissArmyTransformer) weights, along with fine-tuning code for SAT
+weights.

 This code is the framework the team used to train the model. It has few comments and needs to be studied carefully.

@ -86,7 +87,9 @@ first_stage_config:
    ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE model path
 ```

-+ If you store multiple prompts in a txt file, refer to `configs/test.txt` and modify it, writing one prompt per line. If you are unsure how to write prompts, you can first use [this code](../inference/convert_demo.py) to call an LLM for refinement.
+ If you store multiple prompts in a txt file, refer to `configs/test.txt`
+  and modify it, writing one prompt per line. If you are unsure how to write prompts, you can first use [this code](../inference/convert_demo.py)
+  to call an LLM for refinement.
 + To use the command line as input, change it as follows.

 ```yaml
@ -113,7 +116,8 @@ bash inference.sh

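 ### Environment Preparation

-Currently, SAT must be installed from source, which is required for fine-tuning to work properly. This issue will be resolved in a future stable release.
+Please note that, for now, SAT must be installed from source in order to fine-tune properly.
+This is because it relies on the latest features, which have not yet been released in the pip package. We plan to resolve this in a future stable release.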

 ```
 git clone https://github.com/THUDM/SwissArmyTransformer.git
@ -143,7 +147,9 @@ pip install -e .

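 ### Modifying the Configuration File

-Two methods are supported: `Lora` and full-parameter fine-tuning. Both fine-tuning methods apply only to the `transformer` part. The `VAE` part is left unchanged, and `T5` is used only as an encoder.
+Two methods are supported: `Lora` and
+full-parameter fine-tuning. Both fine-tuning methods apply only to the `transformer`
+part. The `VAE` part is left unchanged, and `T5` is used only as an encoder.

 Modify `configs/cogvideox_2b_sft.yaml` (for full-parameter fine-tuning) as follows.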

@ -190,7 +196,8 @@ model:
 1. Run the inference code to start fine-tuning.

 ```shell
-bash finetune.sh
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
 ```

 ### Converting to Huggingface Diffusers-Compatible Weights
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@ -112,7 +112,6 @@ bash inference.sh

 ### 准备环境

-
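 Please note that SAT currently needs to be installed from source for fine-tuning to work properly.
 This is because you need features supported by the latest code, which has not yet been released in the pip package.
 We will resolve this issue in a future stable release.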