update cogvideox1.5

2025-04-05 19:41:59 +08:00 · 2024-11-07 23:43:46 +08:00
commit 806a7f609f
parent 0ae12e3ea3
9 changed files with 762 additions and 305 deletions
--- a/README.md
+++ b/README.md
@@ -22,7 +22,10 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac

 ## Project Updates

-- 🔥🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
+- 🔥🔥 News: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.
+The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution. 
+The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
+- 🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
  4090 GPU, [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory), has been released. It supports
  fine-tuning with multiple resolutions. Feel free to use it!
 - 🔥 **News**: ```2024/10/10```: We have updated our technical report. Please
@@ -68,7 +71,6 @@ Jump to a specific section:
    - [Tools](#tools)
 - [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23)
 - [Citations](#Citation)
-- [Open Source Project Plan](#Open-Source-Project-Plan)
 - [Model License](#Model-License)

 ## Quick Start
@@ -172,79 +174,85 @@ models we currently offer, along with their foundational information.
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B</th>
    <th style="text-align: center;">CogVideoX-5B-I2V</th>
+    <th style="text-align: center;">CogVideoX1.5-5B</th>
+    <th style="text-align: center;">CogVideoX1.5-5B-I2V</th>
  </tr>
  <tr>
-    <td style="text-align: center;">Model Description</td>
-    <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td>
-    <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
-    <td style="text-align: center;">CogVideoX-5B image-to-video version.</td>
+    <td style="text-align: center;">Release Date</td>
+    <th style="text-align: center;">August 6, 2024</th>
+    <th style="text-align: center;">August 27, 2024</th>
+    <th style="text-align: center;">September 19, 2024</th>
+    <th style="text-align: center;">November 8, 2024</th>
+    <th style="text-align: center;">November 8, 2024</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Video Resolution</td>
+    <td colspan="3" style="text-align: center;">720 * 480</td>
+    <td colspan="1" style="text-align: center;">1360 * 768</td>
+    <td colspan="1" style="text-align: center;">256 <= W <=1360<br>256 <= H <=768<br> W,H % 16 == 0</td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td style="text-align: center;"><b>FP16*(recommended)</b>, BF16, FP32, FP8*, INT8, not supported: INT4</td>
-    <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
+    <td colspan="2" style="text-align: center;"><b>BF16(recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
+    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
-    <td style="text-align: center;">Single GPU Memory Usage<br></td>
-    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB* </b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
-    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB* </b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
+    <td style="text-align: center;">Single GPU Memory Usage</td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB<br><b>diffusers FP16: from 4GB*</b><br><b>diffusers INT8(torchao): from 3.6GB*</b></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB<br><b>diffusers BF16 : from 5GB*</b><br><b>diffusers INT8(torchao): from 4.4GB*</b></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB<br></td>
  </tr>
  <tr>
-    <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
+    <td style="text-align: center;">Multi-GPU Memory Usage</td>
    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
    <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>Not supported</b><br></td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
    <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
    <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">Fine-tuning Precision</td>
-    <td style="text-align: center;"><b>FP16</b></td>
-    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">Fine-tuning Memory Usage</td>
-    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
-    <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td>
+    <td colspan="2" style="text-align: center;">Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
-    <td colspan="3" style="text-align: center;">English*</td>
+    <td colspan="5" style="text-align: center;">English*</td>
  </tr>
  <tr>
-    <td style="text-align: center;">Maximum Prompt Length</td>
+    <td style="text-align: center;">Prompt Token Limit</td>
    <td colspan="3" style="text-align: center;">226 Tokens</td>
+    <td colspan="2" style="text-align: center;">224 Tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Length</td>
-    <td colspan="3" style="text-align: center;">6 Seconds</td>
+    <td colspan="3" style="text-align: center;">6 seconds</td>
+    <td colspan="2" style="text-align: center;">5 or 10 seconds</td>
  </tr>
  <tr>
    <td style="text-align: center;">Frame Rate</td>
-    <td colspan="3" style="text-align: center;">8 Frames / Second</td>
+    <td colspan="3" style="text-align: center;">8 frames / second</td>
+    <td colspan="2" style="text-align: center;">16 frames / second</td>
  </tr>
  <tr>
-    <td style="text-align: center;">Video Resolution</td>
-    <td colspan="3" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
-  </tr>
-    <tr>
-    <td style="text-align: center;">Position Encoding</td>
+    <td style="text-align: center;">Positional Encoding</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Link (Diffusers)</td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
+    <td colspan="2" style="text-align: center;"> Coming Soon </td>
  </tr>
  <tr>
    <td style="text-align: center;">Download Link (SAT)</td>
-    <td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
+    <td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+    <td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td>
  </tr>
 </table>
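
The resolution limits listed for CogVideoX1.5-5B-I2V in the table above (256 <= W <= 1360, 256 <= H <= 768, both divisible by 16) can be checked before launching a run. The following is a minimal sketch under that reading of the table; the function name `is_valid_i2v_resolution` is hypothetical and is not part of the repository.

```python
def is_valid_i2v_resolution(width: int, height: int) -> bool:
    """Check a (width, height) pair against the limits stated in the table.

    Assumption: the limits are exactly those in the README table
    (256 <= W <= 1360, 256 <= H <= 768, W and H multiples of 16).
    """
    return (
        256 <= width <= 1360
        and 256 <= height <= 768
        and width % 16 == 0
        and height % 16 == 0
    )


if __name__ == "__main__":
    # Print the verdict for a few candidate sizes.
    for size in [(1360, 768), (720, 480), (1000, 600), (1376, 768)]:
        print(size, is_valid_i2v_resolution(*size))
```

A check like this only mirrors the documented limits; the model code itself may enforce additional constraints.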

@@ -422,7 +430,7 @@ hands-on practice on text-to-video generation. *The original input is in Chinese

 We welcome your contributions! You can click [here](resources/contribute.md) for more information.

-## License Agreement
+## Model-License

 The code in this repository is released under the [Apache 2.0 License](LICENSE).

--- a/README_ja.md
+++ b/README_ja.md
@@ -1,6 +1,6 @@
 # CogVideo & CogVideoX

-[Read this in English](./README_zh.md)
+[Read this in English](./README.md)

 [中文阅读](./README_zh.md)

@@ -22,9 +22,14 @@

 ## 更新とニュース

-- 🔥🔥 **ニュース**: ```2024/10/13```: コスト削減のため、単一の4090 GPUで`CogVideoX-5B`
+- 🔥🔥 ニュース: ```2024/11/08```: `CogVideoX1.5` モデルをリリースしました。CogVideoX1.5 は CogVideoX オープンソースモデルのアップグレードバージョンです。
+CogVideoX1.5-5B シリーズモデルは、10秒 長の動画とより高い解像度をサポートしており、`CogVideoX1.5-5B-I2V` は任意の解像度での動画生成に対応しています。
+SAT コードはすでに更新されており、`diffusers` バージョンは現在適応中です。
+SAT バージョンのコードは [こちら](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) からダウンロードできます。
+- 🔥 **ニュース**: ```2024/10/13```: コスト削減のため、単一の4090 GPUで`CogVideoX-5B`
  を微調整できるフレームワーク [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory)
-  がリリースされました。複数の解像度での微調整に対応しています。ぜひご利用ください！- 🔥**ニュース**: ```2024/10/10```:
+  がリリースされました。複数の解像度での微調整に対応しています。ぜひご利用ください！
+- 🔥**ニュース**: ```2024/10/10```:
  技術報告書を更新し、より詳細なトレーニング情報とデモを追加しました。
 - 🔥 **ニュース**: ```2024/10/10```: 技術報告書を更新しました。[こちら](https://arxiv.org/pdf/2408.06072)
  をクリックしてご覧ください。さらにトレーニングの詳細とデモを追加しました。デモを見るには[こちら](https://yzy-thu.github.io/CogVideoX-demo/)
@@ -34,7 +39,7 @@
 - 🔥**ニュース**: ```2024/9/19```: CogVideoXシリーズの画像生成ビデオモデル **CogVideoX-5B-I2V**
  をオープンソース化しました。このモデルは、画像を背景入力として使用し、プロンプトワードと組み合わせてビデオを生成することができ、より高い制御性を提供します。これにより、CogVideoXシリーズのモデルは、テキストからビデオ生成、ビデオの継続、画像からビデオ生成の3つのタスクをサポートするようになりました。オンラインでの[体験](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space)
  をお楽しみください。
-- 🔥🔥 **ニュース**: ```2024/9/19```:
+- 🔥 **ニュース**: ```2024/9/19```:
  CogVideoXのトレーニングプロセスでビデオデータをテキスト記述に変換するために使用されるキャプションモデル [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption)
  をオープンソース化しました。ダウンロードしてご利用ください。
 - 🔥 ```2024/8/27```: CogVideoXシリーズのより大きなモデル **CogVideoX-5B**
@@ -63,11 +68,10 @@
 - [プロジェクト構造](#プロジェクト構造)
    - [推論](#推論)
    - [sat](#sat)
-    - [ツール](#ツール)
-- [プロジェクト計画](#プロジェクト計画)
-- [モデルライセンス](#モデルライセンス)
+    - [ツール](#ツール)=
 - [CogVideo(ICLR'23)モデル紹介](#CogVideoICLR23)
 - [引用](#引用)
+- [ライセンス契約](#ライセンス契約)

 ## クイックスタート

@@ -156,79 +160,91 @@ pip install -r requirements.txt
 CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源のオープンソース版ビデオ生成モデルです。
 以下の表に、提供しているビデオ生成モデルの基本情報を示します:

-<table  style="border-collapse: collapse; width: 100%;">
+<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">モデル名</th>
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B</th>
-    <th style="text-align: center;">CogVideoX-5B-I2V </th>
+    <th style="text-align: center;">CogVideoX-5B-I2V</th>
+    <th style="text-align: center;">CogVideoX1.5-5B</th>
+    <th style="text-align: center;">CogVideoX1.5-5B-I2V</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">リリース日</td>
+    <th style="text-align: center;">2024年8月6日</th>
+    <th style="text-align: center;">2024年8月27日</th>
+    <th style="text-align: center;">2024年9月19日</th>
+    <th style="text-align: center;">2024年11月8日</th>
+    <th style="text-align: center;">2024年11月8日</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">ビデオ解像度</td>
+    <td colspan="3" style="text-align: center;">720 * 480</td>
+    <td colspan="1" style="text-align: center;">1360 * 768</td>
+    <td colspan="1" style="text-align: center;">256 <= W <=1360<br>256 <= H <=768<br> W,H % 16 == 0</td>
  </tr>
  <tr>
    <td style="text-align: center;">推論精度</td>
    <td style="text-align: center;"><b>FP16*(推奨)</b>, BF16, FP32, FP8*, INT8, INT4は非対応</td>
    <td colspan="2" style="text-align: center;"><b>BF16(推奨)</b>, FP16, FP32, FP8*, INT8, INT4は非対応</td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">単一GPUのメモリ消費<br></td>
-    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GBから* </b><br><b>diffusers INT8(torchao): 3.6GBから*</b></td>
-    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16 : 5GBから* </b><br><b>diffusers INT8(torchao): 4.4GBから* </b></td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">マルチGPUのメモリ消費</td>
-    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
-    <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">推論速度<br>(ステップ = 50, FP/BF16)</td>
-    <td style="text-align: center;">単一A100: 約90秒<br>単一H100: 約45秒</td>
-    <td colspan="2" style="text-align: center;">単一A100: 約180秒<br>単一H100: 約90秒</td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">ファインチューニング精度</td>
-    <td style="text-align: center;"><b>FP16</b></td>
    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
-    <td style="text-align: center;">ファインチューニング時のメモリ消費</td>
-    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
-    <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td>
+    <td style="text-align: center;">シングルGPUメモリ消費</td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB<br><b>diffusers FP16: 4GBから*</b><br><b>diffusers INT8(torchao): 3.6GBから*</b></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB<br><b>diffusers BF16: 5GBから*</b><br><b>diffusers INT8(torchao): 4.4GBから*</b></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB<br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">マルチGPUメモリ消費</td>
+    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>サポートなし</b><br></td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">推論速度<br>(ステップ数 = 50, FP/BF16)</td>
+    <td style="text-align: center;">単一A100: 約90秒<br>単一H100: 約45秒</td>
+    <td colspan="2" style="text-align: center;">単一A100: 約180秒<br>単一H100: 約90秒</td>
+    <td colspan="2" style="text-align: center;">単一A100: 約1000秒(5秒動画)<br>単一H100: 約550秒(5秒動画)</td>
  </tr>
  <tr>
    <td style="text-align: center;">プロンプト言語</td>
-    <td colspan="3" style="text-align: center;">英語*</td>
+    <td colspan="5" style="text-align: center;">英語*</td>
  </tr>
  <tr>
-    <td style="text-align: center;">プロンプトの最大トークン数</td>
+    <td style="text-align: center;">プロンプトトークン制限</td>
    <td colspan="3" style="text-align: center;">226トークン</td>
+    <td colspan="2" style="text-align: center;">224トークン</td>
  </tr>
  <tr>
    <td style="text-align: center;">ビデオの長さ</td>
    <td colspan="3" style="text-align: center;">6秒</td>
+    <td colspan="2" style="text-align: center;">5秒または10秒</td>
  </tr>
  <tr>
    <td style="text-align: center;">フレームレート</td>
-    <td colspan="3" style="text-align: center;">8フレーム/秒</td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">ビデオ解像度</td>
-    <td colspan="3" style="text-align: center;">720 * 480、他の解像度は非対応(ファインチューニング含む)</td>
+    <td colspan="3" style="text-align: center;">8 フレーム / 秒</td>
+    <td colspan="2" style="text-align: center;">16 フレーム / 秒</td>
  </tr>
  <tr>
    <td style="text-align: center;">位置エンコーディング</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
  </tr>
  <tr>
    <td style="text-align: center;">ダウンロードリンク (Diffusers)</td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
+    <td colspan="2" style="text-align: center;">近日公開</td>
  </tr>
  <tr>
    <td style="text-align: center;">ダウンロードリンク (SAT)</td>
-    <td colspan="3" style="text-align: center;"><a href="./sat/README_ja.md">SAT</a></td>
+    <td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+    <td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td>
  </tr>
 </table>

--- a/README_zh.md
+++ b/README_zh.md
@@ -1,10 +1,9 @@
 # CogVideo & CogVideoX

-[Read this in English](./README_zh.md)
+[Read this in English](./README.md)

 [日本語で読む](./README_ja.md)

-
 <div align="center">
 <img src=resources/logo.svg width="50%"/>
 </div>
@@ -23,7 +22,9 @@

 ## 项目更新

-- 🔥🔥 **News**: ```2024/10/13```: 成本更低，单卡4090可微调`CogVideoX-5B`
+- 🔥🔥 **News**: ```2024/11/08```: 我们发布 `CogVideoX1.5` 模型。CogVideoX1.5 是 CogVideoX 开源模型的升级版本。 
+CogVideoX1.5-5B 系列模型支持 **10秒** 长度的视频和更高的分辨率，其中 `CogVideoX1.5-5B-I2V` 支持 **任意分辨率** 的视频生成，SAT代码已经更新。`diffusers`版本还在适配中。SAT版本代码前往 [这里](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) 下载。
+- 🔥**News**: ```2024/10/13```: 成本更低，单卡4090可微调 `CogVideoX-5B`
  的微调框架[cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory)已经推出，多种分辨率微调，欢迎使用。
 - 🔥 **News**: ```2024/10/10```: 我们更新了我们的技术报告,请点击 [这里](https://arxiv.org/pdf/2408.06072)
  查看，附上了更多的训练细节和demo，关于demo，点击[这里](https://yzy-thu.github.io/CogVideoX-demo/) 查看。
@@ -58,10 +59,9 @@
    - [Inference](#inference)
    - [SAT](#sat)
    - [Tools](#tools)
-- [开源项目规划](#开源项目规划)
-- [模型协议](#模型协议)
 - [CogVideo(ICLR'23)模型介绍](#cogvideoiclr23)
 - [引用](#引用)
+- [模型协议](#模型协议)

 ## 快速开始

@@ -157,62 +157,72 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B</th>
    <th style="text-align: center;">CogVideoX-5B-I2V </th>
+    <th style="text-align: center;">CogVideoX1.5-5B</th>
+    <th style="text-align: center;">CogVideoX1.5-5B-I2V</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">发布时间</td>
+    <th style="text-align: center;">2024年8月6日</th>
+    <th style="text-align: center;">2024年8月27日</th>
+    <th style="text-align: center;">2024年9月19日</th>
+    <th style="text-align: center;">2024年11月8日</th>
+    <th style="text-align: center;">2024年11月8日</th>
+  </tr>
+  <tr>
+    <td style="text-align: center;">视频分辨率</td>
+    <td colspan="3" style="text-align: center;">720 * 480</td>
+    <td colspan="1" style="text-align: center;">1360 * 768</td>
+    <td colspan="1" style="text-align: center;">256 <= W <=1360<br> 256 <= H <=768<br>  W,H % 16 == 0</td>
  </tr>
  <tr>
    <td style="text-align: center;">推理精度</td>
    <td style="text-align: center;"><b>FP16*(推荐)</b>, BF16, FP32，FP8*，INT8，不支持INT4</td>
    <td colspan="2" style="text-align: center;"><b>BF16(推荐)</b>, FP16, FP32，FP8*，INT8，不支持INT4</td>
+    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
  </tr>
  <tr>
    <td style="text-align: center;">单GPU显存消耗<br></td>
    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB起* </b><br><b>diffusers INT8(torchao): 3.6G起*</b></td>
    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16 : 5GB起* </b><br><b>diffusers INT8(torchao): 4.4G起* </b></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB <br></td>
  </tr>
  <tr>
    <td style="text-align: center;">多GPU推理显存消耗</td>
    <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
    <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>Not support</b><br></td>
  </tr>
  <tr>
    <td style="text-align: center;">推理速度<br>(Step = 50, FP/BF16)</td>
    <td style="text-align: center;">单卡A100: ~90秒<br>单卡H100: ~45秒</td>
    <td colspan="2" style="text-align: center;">单卡A100: ~180秒<br>单卡H100: ~90秒</td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">微调精度</td>
-    <td style="text-align: center;"><b>FP16</b></td>
-    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">微调显存消耗</td>
-    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
-    <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td>
+    <td colspan="2" style="text-align: center;">单卡A100: ~1000秒(5秒视频)<br>单卡H100: ~550秒(5秒视频)</td>
  </tr>
  <tr>
    <td style="text-align: center;">提示词语言</td>
-    <td colspan="3" style="text-align: center;">English*</td>
+    <td colspan="5" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">提示词长度上限</td>
    <td colspan="3" style="text-align: center;">226 Tokens</td>
+    <td colspan="2" style="text-align: center;">224 Tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">视频长度</td>
    <td colspan="3" style="text-align: center;">6 秒</td>
+    <td colspan="2" style="text-align: center;">5 秒 或 10 秒</td>
  </tr>
  <tr>
    <td style="text-align: center;">帧率</td>
    <td colspan="3" style="text-align: center;">8 帧 / 秒 </td>
+    <td colspan="2" style="text-align: center;">16 帧 / 秒 </td>
  </tr>
  <tr>
-    <td style="text-align: center;">视频分辨率</td>
-    <td colspan="3" style="text-align: center;">720 * 480，不支持其他分辨率(含微调)</td>
-  </tr>
-    <tr>
    <td style="text-align: center;">位置编码</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
-   <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
  </tr>
  <tr>
@@ -220,10 +230,13 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
    <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
+    <td colspan="2" style="text-align: center;"> 即将推出 </td>
  </tr>
  <tr>
    <td style="text-align: center;">下载链接 (SAT)</td>
    <td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+    <td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td>
+
  </tr>
 </table>

--- a/sat/README.md
+++ b/sat/README.md
@@ -1,29 +1,39 @@
-# SAT CogVideoX-2B
+# SAT CogVideoX

-[中文阅读](./README_zh.md)
+[Read this in English.](./README_zh.md)

 [日本語で読む](./README_ja.md)

-This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
-fine-tuning code for SAT weights.
+This folder contains inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights, along with fine-tuning code for SAT weights. 

-This code is the framework used by the team to train the model. It has few comments and requires careful study.
+This code framework was used by our team during model training. There are few comments, so careful study is required.

 ## Inference Model

-### 1. Ensure that you have correctly installed the dependencies required by this folder.
+### 1. Make sure you have installed all dependencies in this folder

-```shell
+```
 pip install -r requirements.txt
 ```

-### 2. Download the model weights
+### 2. Download the Model Weights

-### 2. Download model weights
+First, download the model weights from the SAT mirror.

-First, go to the SAT mirror to download the model weights. For the CogVideoX-2B model, please download as follows:
+#### CogVideoX1.5 Model

-```shell
+```
+git lfs install
+git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
+```
+
+This command downloads three models: Transformers, VAE, and T5 Encoder.
+
+#### CogVideoX Model
+
+For the CogVideoX-2B model, download as follows:
+
+```
 mkdir CogVideoX-2b-sat
 cd CogVideoX-2b-sat
 wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
@@ -34,13 +44,12 @@ mv 'index.html?dl=1' transformer.zip
 unzip transformer.zip
 ```

-For the CogVideoX-5B model, please download the `transformers` file as follows link:
-(VAE files are the same as 2B)
+Download the `transformers` file for the CogVideoX-5B model (the VAE file is the same as for 2B):

 + [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)
 + [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list)

-Next, you need to format the model files as follows:
+Arrange the model files in the following structure:

 ```
 .
@@ -52,20 +61,24 @@ Next, you need to format the model files as follows:
    └── 3d-vae.pt
 ```

-Due to large size of model weight file, using `git lfs` is recommended. Installation of `git lfs` can be
-found [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)
+Since model weight files are large, it’s recommended to use `git lfs`.  
+See [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing) for `git lfs` installation.

-Next, clone the T5 model, which is not used for training and fine-tuning, but must be used.
-> T5 model is available on [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) as well.
+```
+git lfs install
+```

-```shell
-git clone https://huggingface.co/THUDM/CogVideoX-2b.git
+Next, clone the T5 model, which is used as an encoder and doesn’t require training or fine-tuning.
+> You may also use the model file location on [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b).
+
+```
+git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Huggingface
+# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from Modelscope
 mkdir t5-v1_1-xxl
 mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
 ```

-By following the above approach, you will obtain a safetensor format T5 file. Ensure that there are no errors when
-loading it into Deepspeed in Finetune.
+This will yield a safetensor format T5 file that can be loaded without error during Deepspeed fine-tuning.

 ```
 ├── added_tokens.json
@@ -80,11 +93,11 @@ loading it into Deepspeed in Finetune.
 0 directories, 8 files
 ```

-### 3. Modify the file in `configs/cogvideox_2b.yaml`.
+### 3. Modify `configs/cogvideox_*.yaml` file.

 ```yaml
 model:
-  scale_factor: 1.15258426
+  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  log_keys:
    - txt
@@ -160,14 +173,14 @@ model:
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
-            model_dir: "t5-v1_1-xxl" # Absolute path to the CogVideoX-2b/t5-v1_1-xxl weights folder
+            model_dir: "t5-v1_1-xxl" # absolute path to CogVideoX-2b/t5-v1_1-xxl weight folder
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
-      ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # Absolute path to the CogVideoX-2b-sat/vae/3d-vae.pt folder
+      ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # absolute path to CogVideoX-2b-sat/vae/3d-vae.pt file
      ignore_keys: [ 'loss' ]

      loss_config:
@@ -239,48 +252,46 @@ model:
          num_steps: 50
 ```

-### 4. Modify the file in `configs/inference.yaml`.
+### 4. Modify `configs/inference.yaml` file.

 ```yaml
 args:
  latent_channels: 16
  mode: inference
-  load: "{absolute_path/to/your}/transformer" # Absolute path to the CogVideoX-2b-sat/transformer folder
+  load: "{absolute_path/to/your}/transformer" # Absolute path to CogVideoX-2b-sat/transformer folder
  # load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter

  batch_size: 1
-  input_type: txt # You can choose txt for pure text input, or change to cli for command line input
-  input_file: configs/test.txt # Pure text file, which can be edited
-  sampling_num_frames: 13  # Must be 13, 11 or 9
+  input_type: txt # You can choose "txt" for plain text input or change to "cli" for command-line input
+  input_file: configs/test.txt # Plain text file, can be edited
+  sampling_num_frames: 13  # For CogVideoX1.5-5B it must be 42 or 22. For CogVideoX-5B / 2B, it must be 13, 11, or 9.
  sampling_fps: 8
  fp16: True # For CogVideoX-2B
-  #  bf16: True # For CogVideoX-5B
+  # bf16: True # For CogVideoX-5B
  output_dir: outputs/
  force_inference: True
 ```
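
The comment on `sampling_num_frames` lists the accepted values per model. A minimal sketch of checking that constraint before launching inference follows. The function name and the lookup table are illustrative assumptions and are not part of the repository; only the allowed values come from the config comment above.

```python
# Allowed values copied from the `sampling_num_frames` comment in configs/inference.yaml.
# The table layout and function name are hypothetical helpers, not repository code.
ALLOWED_SAMPLING_NUM_FRAMES = {
    "CogVideoX1.5-5B": {42, 22},
    "CogVideoX-5B": {13, 11, 9},
    "CogVideoX-2B": {13, 11, 9},
}


def check_sampling_num_frames(model_name: str, sampling_num_frames: int) -> None:
    """Raise ValueError if the frame count is not listed for the given model."""
    allowed = ALLOWED_SAMPLING_NUM_FRAMES.get(model_name)
    if allowed is None:
        raise ValueError(f"unknown model name: {model_name!r}")
    if sampling_num_frames not in allowed:
        raise ValueError(
            f"sampling_num_frames={sampling_num_frames} is not valid for "
            f"{model_name}; expected one of {sorted(allowed)}"
        )
```

Calling `check_sampling_num_frames("CogVideoX-2B", 13)` returns silently, while passing `13` for `CogVideoX1.5-5B` raises a `ValueError`.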

-+ Modify `configs/test.txt` if multiple prompts is required, in which each line makes a prompt.
-+ For better prompt formatting, refer to [convert_demo.py](../inference/convert_demo.py), for which you should set the
-  OPENAI_API_KEY as your environmental variable.
-+ Modify `input_type` in `configs/inference.yaml` if use command line as prompt iuput.
+ If using a text file to save multiple prompts, modify `configs/test.txt` as needed. One prompt per line. If you are unsure how to write prompts, use [this code](../inference/convert_demo.py) to call an LLM for refinement.
+ To use command-line input, modify:

-```yaml
+```
 input_type: cli
 ```

-This allows input from the command line as prompts.
+This allows you to enter prompts from the command line.

-Change `output_dir` if you wish to modify the address of the output video
+To modify the output video location, change:

-```yaml
+```
 output_dir: outputs/
 ```

-It is saved by default in the `.outputs/` folder.
+Videos are saved to the `outputs/` folder by default.

-### 5. Run the inference code to perform inference.
+### 5. Run the Inference Code

-```shell
+```
 bash inference.sh
 ```

@ -288,95 +299,91 @@ bash inference.sh

 ### Preparing the Dataset

-The dataset format should be as follows:
+The dataset should be structured as follows:

 ```
 .
 ├── labels
-│   ├── 1.txt
-│   ├── 2.txt
-│   ├── ...
+│   ├── 1.txt
+│   ├── 2.txt
+│   ├── ...
 └── videos
    ├── 1.mp4
    ├── 2.mp4
    ├── ...
 ```

-Each text file shares the same name as its corresponding video, serving as the label for that video. Videos and labels
-should be matched one-to-one. Generally, a single video should not be associated with multiple labels.
+Each txt file should have the same name as the corresponding video file and contain the label for that video. The videos and labels should correspond one-to-one. Generally, avoid using one video with multiple labels.

-For style fine-tuning, please prepare at least 50 videos and labels with similar styles to ensure proper fitting.
+For style fine-tuning, prepare at least 50 videos and labels with a similar style to facilitate fitting.
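
The one-to-one pairing between videos and labels can be checked before training. The sketch below assumes the `labels/*.txt` and `videos/*.mp4` layout shown above; the function name is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_unpaired(dataset_root: str) -> tuple[set[str], set[str]]:
    """Return (videos without a label, labels without a video), by file stem."""
    root = Path(dataset_root)
    label_stems = {p.stem for p in (root / "labels").glob("*.txt")}
    video_stems = {p.stem for p in (root / "videos").glob("*.mp4")}
    # A stem present on only one side breaks the one-to-one correspondence.
    return video_stems - label_stems, label_stems - video_stems
```

Both returned sets should be empty for a well-formed dataset. For style fine-tuning, the number of matched pairs can additionally be compared against the suggested minimum of 50.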

-### Modifying Configuration Files
+### Modifying the Configuration File

-We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Please note that both methods only fine-tune
-the `transformer` part and do not modify the `VAE` section. `T5` is used solely as an Encoder. Please modify
-the `configs/sft.yaml` (for full-parameter fine-tuning) file as follows:
+We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Note that both methods only fine-tune the `transformer` part. The `VAE` part is not modified, and `T5` is only used as an encoder.
+Modify `configs/sft.yaml` (full-parameter fine-tuning) as follows:

-```
-  # checkpoint_activations: True ## Using gradient checkpointing (Both checkpoint_activations in the config file need to be set to True)
+```yaml
+  # checkpoint_activations: True ## using gradient checkpointing (both `checkpoint_activations` in the config file need to be set to True)
  model_parallel_size: 1 # Model parallel size
-  experiment_name: lora-disney  # Experiment name (do not modify)
-  mode: finetune # Mode (do not modify)
-  load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path
-  no_load_rng: True # Whether to load random seed
+  experiment_name: lora-disney  # Experiment name (do not change)
+  mode: finetune # Mode (do not change)
+  load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to Transformer model
+  no_load_rng: True # Do not load the random number generator state from the checkpoint
  train_iters: 1000 # Training iterations
  eval_iters: 1 # Evaluation iterations
  eval_interval: 100    # Evaluation interval
  eval_batch_size: 1  # Evaluation batch size
  save: ckpts # Model save path 
-  save_interval: 100 # Model save interval
+  save_interval: 100 # Save interval
  log_interval: 20 # Log output interval
  train_data: [ "your train data path" ]
-  valid_data: [ "your val data path" ] # Training and validation datasets can be the same
-  split: 1,0,0 # Training, validation, and test set ratio
-  num_workers: 8 # Number of worker threads for data loader
-  force_train: True # Allow missing keys when loading checkpoint (T5 and VAE are loaded separately)
-  only_log_video_latents: True # Avoid memory overhead caused by VAE decode
+  valid_data: [ "your val data path" ] # Training and validation sets can be the same
+  split: 1,0,0 # Proportion for training, validation, and test sets
+  num_workers: 8 # Number of data loader workers
+  force_train: True # Allow missing keys when loading checkpoint (T5 and VAE loaded separately)
+  only_log_video_latents: True # Avoid memory usage from VAE decoding
  deepspeed:
    bf16:
-      enabled: False # For CogVideoX-2B set to False and for CogVideoX-5B set to True
+      enabled: False # Set to False for CogVideoX-2B and True for CogVideoX-5B
    fp16:
-      enabled: True  # For CogVideoX-2B set to True and for CogVideoX-5B set to False
+      enabled: True  # Set to True for CogVideoX-2B and False for CogVideoX-5B
 ```
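
The `bf16` and `fp16` flags in the `deepspeed` block are mutually exclusive and depend on the model. The sketch below encodes the rule from the comments above for the two models they mention. The function name is a hypothetical helper, and other models are rejected because the config comments do not say which precision they use.

```python
def deepspeed_precision(model_name: str) -> dict:
    """Return the bf16/fp16 flags for the `deepspeed` section of sft.yaml."""
    if model_name == "CogVideoX-2B":
        use_bf16 = False  # CogVideoX-2B uses fp16
    elif model_name == "CogVideoX-5B":
        use_bf16 = True   # CogVideoX-5B uses bf16
    else:
        raise ValueError(f"precision not documented for {model_name!r}")
    return {"bf16": {"enabled": use_bf16}, "fp16": {"enabled": not use_bf16}}
```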

-If you wish to use Lora fine-tuning, you also need to modify the `cogvideox_<model_parameters>_lora` file:
+To use Lora fine-tuning, you also need to modify the `cogvideox_<model parameters>_lora` file:

-Here, take `CogVideoX-2B` as a reference:
+Here's an example using `CogVideoX-2B`:

 ```
 model:
-  scale_factor: 1.15258426
+  scale_factor: 1.55258426
  disable_first_stage_autocast: true
-  not_trainable_prefixes: [ 'all' ] ## Uncomment
+  not_trainable_prefixes: [ 'all' ] ## Uncomment to unlock
  log_keys:
-    - txt'
+    - txt

-  lora_config: ## Uncomment
+  lora_config: ## Uncomment to unlock
    target: sat.model.finetune.lora2.LoraMixin
    params:
      r: 256
 ```

-### Modifying Run Scripts
+### Modify the Run Script

-Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` to select the configuration file. Below are two examples:
+Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` and select the config file. Below are two examples:

-1. If you want to use the `CogVideoX-2B` model and the `Lora` method, you need to modify `finetune_single_gpu.sh`
-   or `finetune_multi_gpus.sh`:
+1. If you want to use the `CogVideoX-2B` model with `Lora`, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:

 ```
 run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
 ```

-2. If you want to use the `CogVideoX-2B` model and the `full-parameter fine-tuning` method, you need to
-   modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`:
+2. If you want to use the `CogVideoX-2B` model with full fine-tuning, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:

 ```
 run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
 ```
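
The two `run_cmd` examples differ only in the model config passed to `--base`. A minimal sketch of assembling the same command programmatically follows. The function name and its parameters are illustrative assumptions; the flags and file names are the ones shown in the examples above, with a fixed integer seed standing in for the shell's `$RANDOM`.

```python
import shlex


def build_finetune_command(model_config: str, seed: int, nproc_per_node: int = 8) -> str:
    """Assemble the torchrun command line used in the fine-tuning scripts."""
    args = [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={nproc_per_node}",
        "train_video.py",
        "--base",
        model_config,        # e.g. configs/cogvideox_2b_lora.yaml or configs/cogvideox_2b.yaml
        "configs/sft.yaml",
        "--seed",
        str(seed),
    ]
    # shlex.join quotes any argument that would otherwise be split by the shell.
    return shlex.join(args)
```

Passing `configs/cogvideox_2b_lora.yaml` selects Lora fine-tuning, and `configs/cogvideox_2b.yaml` selects full-parameter fine-tuning.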

-### Fine-Tuning and Evaluation
+### Fine-tuning and Validation

 Run the fine-tuning script to start fine-tuning.

@ -385,45 +392,42 @@ bash finetune_single_gpu.sh # Single GPU
 bash finetune_multi_gpus.sh # Multi GPUs
 ```

-### Using the Fine-Tuned Model
+### Using the Fine-tuned Model

-The fine-tuned model cannot be merged; here is how to modify the inference configuration file `inference.sh`:
+The fine-tuned model cannot be merged. Here's how to modify the inference script `inference.sh`:

 ```
-run_cmd="$environs python sample_video.py --base configs/cogvideox_<model_parameters>_lora.yaml configs/inference.yaml --seed 42"
+run_cmd="$environs python sample_video.py --base configs/cogvideox_<model parameters>_lora.yaml configs/inference.yaml --seed 42"
 ```

-Then, execute the code:
+Then, run the code:

 ```
 bash inference.sh 
 ```

-### Converting to Huggingface Diffusers Supported Weights
+### Converting to Huggingface Diffusers-compatible Weights

-The SAT weight format is different from Huggingface's weight format and needs to be converted. Please run:
+The SAT weight format differs from Huggingface's format and requires conversion. Run:

-```shell
+```
 python ../tools/convert_weight_sat2hf.py
 ```

-### Exporting Huggingface Diffusers lora LoRA Weights from SAT Checkpoints
+### Exporting Lora Weights from SAT to Huggingface Diffusers

-After completing the training using the above steps, we get a SAT checkpoint with LoRA weights. You can find the file
-at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.
+Support is provided for exporting Lora weights from SAT to Huggingface Diffusers format.
+After training with the above steps, you'll find the SAT checkpoint with Lora weights at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.

-The script for exporting LoRA weights can be found in the CogVideoX repository at `tools/export_sat_lora_weight.py`.
-After exporting, you can use `load_cogvideox_lora.py` for inference.
+The export script `export_sat_lora_weight.py` is located in the CogVideoX repository under `tools/`. After exporting, use `load_cogvideox_lora.py` for inference.

 Export command:

-```bash
-python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
+```
+python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
 ```

-This training mainly modified the following model structures. The table below lists the corresponding structure mappings
-for converting to the HF (Hugging Face) format LoRA structure. As you can see, LoRA adds a low-rank weight to the
-model's attention structure.
+The following model structures were modified during training. Here is the mapping between SAT and HF Lora structures. Lora adds a low-rank weight to the attention structure of the model.

 ```
 'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
@ -436,5 +440,5 @@ model's attention structure.
 'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
 ```

-Using export_sat_lora_weight.py, you can convert the SAT checkpoint into the HF LoRA format.
+Using `export_sat_lora_weight.py` will convert these to the HF format Lora structure.
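
The mapping above amounts to a key rename on the checkpoint's state dict. The sketch below illustrates the idea with only the two entries visible in this excerpt; the remaining entries are elided here and are not reproduced. The function name and the way the layer prefix is carried over are assumptions for illustration, not the behavior of `export_sat_lora_weight.py`.

```python
from typing import Optional

# Only the two mapping entries visible in this excerpt; the full table has more.
SAT_TO_HF_LORA_SUFFIX = {
    "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
    "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight",
}


def rename_lora_key(sat_key: str) -> Optional[str]:
    """Swap a known SAT suffix for its HF counterpart, keeping any prefix unchanged.

    Keeping the prefix as-is is a simplifying assumption; the real export script
    may rewrite the layer prefix as well. Returns None for keys not in the table.
    """
    for sat_suffix, hf_suffix in SAT_TO_HF_LORA_SUFFIX.items():
        if sat_key.endswith(sat_suffix):
            return sat_key[: -len(sat_suffix)] + hf_suffix
    return None
```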
 ![alt text](../resources/hf_lora_weights.png)
--- a/sat/README_ja.md
+++ b/sat/README_ja.md
@ -1,27 +1,37 @@
-# SAT CogVideoX-2B
+# SAT CogVideoX

-[Read this in English.](./README_zh)
+[Read this in English.](./README.md)

 [中文阅读](./README_zh.md)

-このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer) ウェイトを使用した推論コードと、SAT
-ウェイトのファインチューニングコードが含まれています。
-
-このコードは、チームがモデルをトレーニングするために使用したフレームワークです。コメントが少なく、注意深く研究する必要があります。
+このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer)の重みを使用した推論コードと、SAT重みのファインチューニングコードが含まれています。
+このコードは、チームがモデルを訓練する際に使用したフレームワークです。コメントが少ないため、注意深く確認する必要があります。

 ## 推論モデル

-### 1. このフォルダに必要な依存関係が正しくインストールされていることを確認してください。
+### 1. このフォルダ内の必要な依存関係がすべてインストールされていることを確認してください

-```shell
+```
 pip install -r requirements.txt
 ```

-### 2. モデルウェイトをダウンロードします
+### 2. モデルの重みをダウンロード
+ まず、SATミラーからモデルの重みをダウンロードしてください。

-まず、SAT ミラーに移動してモデルの重みをダウンロードします。 CogVideoX-2B モデルの場合は、次のようにダウンロードしてください。
+#### CogVideoX1.5 モデル

-```shell
+```
+git lfs install
+git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
+```
+
+これにより、Transformers、VAE、T5 Encoderの3つのモデルがダウンロードされます。
+
+#### CogVideoX モデル
+
+CogVideoX-2B モデルについては、以下のようにダウンロードしてください：
+
+```
 mkdir CogVideoX-2b-sat
 cd CogVideoX-2b-sat
 wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
@ -32,12 +42,12 @@ mv 'index.html?dl=1' transformer.zip
 unzip transformer.zip
 ```

-CogVideoX-5B モデルの `transformers` ファイルを以下のリンクからダウンロードしてください （VAE ファイルは 2B と同じです）：
+CogVideoX-5B モデルの `transformers` ファイルをダウンロードしてください（VAEファイルは2Bと同じです）：

 + [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)
 + [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list)

-次に、モデルファイルを以下の形式にフォーマットする必要があります：
+モデルファイルを以下のように配置してください：

 ```
 .
@ -49,24 +59,24 @@ CogVideoX-5B モデルの `transformers` ファイルを以下のリンクから
    └── 3d-vae.pt
 ```

-モデルの重みファイルが大きいため、`git lfs`を使用することをお勧めいたします。`git lfs`
-のインストールについては、[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)をご参照ください。
+モデルの重みファイルが大きいため、`git lfs`の使用をお勧めします。
+`git lfs`のインストール方法は[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)を参照してください。

-```shell
+```
 git lfs install
 ```

-次に、T5 モデルをクローンします。これはトレーニングやファインチューニングには使用されませんが、使用する必要があります。
-> モデルを複製する際には、[Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)のモデルファイルの場所もご使用いただけます。
+次に、T5モデルをクローンします。このモデルはEncoderとしてのみ使用され、訓練やファインチューニングは必要ありません。
+> [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)上のモデルファイルも使用可能です。

-```shell
-git clone https://huggingface.co/THUDM/CogVideoX-2b.git #ハギングフェイス(huggingface.org)からモデルをダウンロードいただきます
-# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git #Modelscopeからモデルをダウンロードいただきます
+```
+git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Huggingfaceからモデルをダウンロード
+# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Modelscopeからダウンロード
 mkdir t5-v1_1-xxl
 mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
 ```

-上記の方法に従うことで、safetensor 形式の T5 ファイルを取得できます。これにより、Deepspeed でのファインチューニング中にエラーが発生しないようにします。
+これにより、Deepspeedファインチューニング中にエラーなくロードできるsafetensor形式のT5ファイルが作成されます。

 ```
 ├── added_tokens.json
@ -81,11 +91,11 @@ mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
 0 directories, 8 files
 ```

-### 3. `configs/cogvideox_2b.yaml` ファイルを変更します。
+### 3. `configs/cogvideox_*.yaml`ファイルを編集

 ```yaml
 model:
-  scale_factor: 1.15258426
+  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  log_keys:
    - txt
@ -123,7 +133,7 @@ model:
      num_attention_heads: 30

      transformer_args:
-        checkpoint_activations: True ## グラデーション チェックポイントを使用する
+        checkpoint_activations: True ## using gradient checkpointing
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
@ -161,14 +171,14 @@ model:
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
-            model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxlフォルダの絶対パス
+            model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxl 重みフォルダの絶対パス
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
-      ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.ptフォルダの絶対パス
+      ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.ptファイルの絶対パス
      ignore_keys: [ 'loss' ]

      loss_config:
@ -240,7 +250,7 @@ model:
          num_steps: 50
 ```

-### 4. `configs/inference.yaml` ファイルを変更します。
+### 4. `configs/inference.yaml`ファイルを編集

 ```yaml
 args:
@ -250,38 +260,36 @@ args:
  # load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter

  batch_size: 1
-  input_type: txt #TXTのテキストファイルを入力として選択されたり、CLIコマンドラインを入力として変更されたりいただけます
-  input_file: configs/test.txt #テキストファイルのパスで、これに対して編集がさせていただけます
-  sampling_num_frames: 13  # Must be 13, 11 or 9
+  input_type: txt # "txt"でプレーンテキスト入力、"cli"でコマンドライン入力を選択可能
+  input_file: configs/test.txt # プレーンテキストファイル、編集可能
+  sampling_num_frames: 13  # CogVideoX1.5-5Bでは42または22、CogVideoX-5B / 2Bでは13, 11, または9
  sampling_fps: 8
-  fp16: True # For CogVideoX-2B
-  #  bf16: True # For CogVideoX-5B
+  fp16: True # CogVideoX-2B用
+  # bf16: True # CogVideoX-5B用
  output_dir: outputs/
  force_inference: True
 ```

-+ 複数のプロンプトを保存するために txt を使用する場合は、`configs/test.txt`
-  を参照して変更してください。1行に1つのプロンプトを記述します。プロンプトの書き方がわからない場合は、最初に [このコード](../inference/convert_demo.py)
-  を使用して LLM によるリファインメントを呼び出すことができます。
-+ コマンドラインを入力として使用する場合は、次のように変更します。
+ 複数のプロンプトを含むテキストファイルを使用する場合、`configs/test.txt`を適宜編集してください。1行につき1プロンプトです。プロンプトの書き方が分からない場合は、[こちらのコード](../inference/convert_demo.py)を使用してLLMで補正できます。
+ コマンドライン入力を使用する場合、以下のように変更します：

-```yaml
+```
 input_type: cli
 ```

 これにより、コマンドラインからプロンプトを入力できます。

-出力ビデオのディレクトリを変更したい場合は、次のように変更できます：
+出力ビデオの保存場所を変更する場合は、以下を編集してください：

-```yaml
+```
 output_dir: outputs/
 ```

-デフォルトでは `.outputs/` フォルダに保存されます。
+デフォルトでは`outputs/`フォルダに保存されます。

-### 5. 推論コードを実行して推論を開始します。
+### 5. 推論コードを実行して推論を開始

-```shell
+```
 bash inference.sh
 ```

@ -289,7 +297,7 @@ bash inference.sh

 ### データセットの準備

-データセットの形式は次のようになります：
+データセットは以下の構造である必要があります：

 ```
 .
@ -303,123 +311,215 @@ bash inference.sh
    ├── ...
 ```

-各 txt ファイルは対応するビデオファイルと同じ名前であり、そのビデオのラベルを含んでいます。各ビデオはラベルと一対一で対応する必要があります。通常、1つのビデオに複数のラベルを持たせることはありません。
+各txtファイルは対応するビデオファイルと同じ名前で、ビデオのラベルを含んでいます。ビデオとラベルは一対一で対応させる必要があります。通常、1つのビデオに複数のラベルを使用することは避けてください。

-スタイルファインチューニングの場合、少なくとも50本のスタイルが似たビデオとラベルを準備し、フィッティングを容易にします。
+スタイルのファインチューニングの場合、スタイルが似たビデオとラベルを少なくとも50本準備し、フィッティングを促進します。

-### 設定ファイルの変更
+### 設定ファイルの編集

-`Lora` とフルパラメータ微調整の2つの方法をサポートしています。両方の微調整方法は、`transformer` 部分のみを微調整し、`VAE`
-部分には変更を加えないことに注意してください。`T5` はエンコーダーとしてのみ使用されます。以下のように `configs/sft.yaml` (
-フルパラメータ微調整用) ファイルを変更してください。
+`Lora`と全パラメータのファインチューニングの2種類をサポートしています。どちらも`transformer`部分のみをファインチューニングし、`VAE`部分は変更されず、`T5`はエンコーダーとしてのみ使用されます。
+以下のようにして`configs/sft.yaml`（全量ファインチューニング）ファイルを編集してください：

 ```
-  # checkpoint_activations: True ## 勾配チェックポイントを使用する場合 (設定ファイル内の2つの checkpoint_activations を True に設定する必要があります)
+  # checkpoint_activations: True ## using gradient checkpointing (configファイル内の2つの`checkpoint_activations`を両方Trueに設定)
  model_parallel_size: 1 # モデル並列サイズ
-  experiment_name: lora-disney  # 実験名 (変更しないでください)
-  mode: finetune # モード (変更しないでください)
-  load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer モデルのパス
-  no_load_rng: True # 乱数シードを読み込むかどうか
+  experiment_name: lora-disney  # 実験名（変更不要）
+  mode: finetune # モード（変更不要）
+  load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformerモデルのパス
+  no_load_rng: True # 乱数シードをロードするかどうか
  train_iters: 1000 # トレーニングイテレーション数
-  eval_iters: 1 # 評価イテレーション数
-  eval_interval: 100    # 評価間隔
-  eval_batch_size: 1  # 評価バッチサイズ
+  eval_iters: 1 # 検証イテレーション数
+  eval_interval: 100    # 検証間隔
+  eval_batch_size: 1  # 検証バッチサイズ
  save: ckpts # モデル保存パス 
-  save_interval: 100 # モデル保存間隔
+  save_interval: 100 # 保存間隔
  log_interval: 20 # ログ出力間隔
  train_data: [ "your train data path" ]
-  valid_data: [ "your val data path" ] # トレーニングデータと評価データは同じでも構いません
-  split: 1,0,0 # トレーニングセット、評価セット、テストセットの割合
-  num_workers: 8 # データローダーのワーカースレッド数
-  force_train: True # チェックポイントをロードするときに欠落したキーを許可 (T5 と VAE は別々にロードされます)
-  only_log_video_latents: True # VAE のデコードによるメモリオーバーヘッドを回避
+  valid_data: [ "your val data path" ] # トレーニングセットと検証セットは同じでも構いません
+  split: 1,0,0 # トレーニングセット、検証セット、テストセットの割合
+  num_workers: 8 # データローダーのワーカー数
+  force_train: True # チェックポイントをロードする際に`missing keys`を許可（T5とVAEは別途ロード）
+  only_log_video_latents: True # VAEのデコードによるメモリ使用量を抑える
  deepspeed:
    bf16:
-      enabled: False # CogVideoX-2B の場合は False に設定し、CogVideoX-5B の場合は True に設定
+      enabled: False # CogVideoX-2B 用は False、CogVideoX-5B 用は True に設定
    fp16:
-      enabled: True  # CogVideoX-2B の場合は True に設定し、CogVideoX-5B の場合は False に設定
+      enabled: True  # CogVideoX-2B 用は True、CogVideoX-5B 用は False に設定
+```
+```yaml
+args:
+  latent_channels: 16
+  mode: inference
+  load: "{absolute_path/to/your}/transformer" # Absolute path to CogVideoX-2b-sat/transformer folder
+  # load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter
+
+  batch_size: 1
+  input_type: txt # You can choose "txt" for plain text input or change to "cli" for command-line input
+  input_file: configs/test.txt # Plain text file, can be edited
+  sampling_num_frames: 13  # For CogVideoX1.5-5B it must be 42 or 22. For CogVideoX-5B / 2B, it must be 13, 11, or 9.
+  sampling_fps: 8
+  fp16: True # For CogVideoX-2B
+  # bf16: True # For CogVideoX-5B
+  output_dir: outputs/
+  force_inference: True
 ```

-Lora 微調整を使用したい場合は、`cogvideox_<model_parameters>_lora` ファイルも変更する必要があります。
-
-ここでは、`CogVideoX-2B` を参考にします。
+ If using a text file to save multiple prompts, modify `configs/test.txt` as needed. One prompt per line. If you are unsure how to write prompts, use [this code](../inference/convert_demo.py) to call an LLM for refinement.
+ To use command-line input, modify:

 ```
-model:
-  scale_factor: 1.15258426
-  disable_first_stage_autocast: true
-  not_trainable_prefixes: [ 'all' ] ## コメントを解除
-  log_keys:
-    - txt'
-
-  lora_config: ## コメントを解除
-    target: sat.model.finetune.lora2.LoraMixin
-    params:
-      r: 256
+input_type: cli
 ```

-### 実行スクリプトの変更
+This allows you to enter prompts from the command line.

-設定ファイルを選択するために `finetune_single_gpu.sh` または `finetune_multi_gpus.sh` を編集します。以下に2つの例を示します。
-
-1. `CogVideoX-2B` モデルを使用し、`Lora` 手法を利用する場合は、`finetune_single_gpu.sh` または `finetune_multi_gpus.sh`
-   を変更する必要があります。
+To modify the output video location, change:

 ```
-run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
+output_dir: outputs/
 ```

-2. `CogVideoX-2B` モデルを使用し、`フルパラメータ微調整` 手法を利用する場合は、`finetune_single_gpu.sh`
-   または `finetune_multi_gpus.sh` を変更する必要があります。
+Videos are saved to the `outputs/` folder by default.

-```
-run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
-```
-
-### 微調整と評価
-
-推論コードを実行して微調整を開始します。
-
-```
-bash finetune_single_gpu.sh # シングルGPU
-bash finetune_multi_gpus.sh # マルチGPU
-```
-
-### 微調整後のモデルの使用
-
-微調整されたモデルは統合できません。ここでは、推論設定ファイル `inference.sh` を変更する方法を示します。
-
-```
-run_cmd="$environs python sample_video.py --base configs/cogvideox_<model_parameters>_lora.yaml configs/inference.yaml --seed 42"
-```
-
-その後、次のコードを実行します。
+### 5. Run the Inference Code to Perform Inference

 ```
 bash inference.sh
 ```

-### Huggingface Diffusers サポートのウェイトに変換
+## Fine-tuning the Model

-SAT ウェイト形式は Huggingface のウェイト形式と異なり、変換が必要です。次のコマンドを実行してください：
+### Preparing the Dataset

-```shell
+The dataset should be structured as follows:
+
+```
+.
+├── labels
+│   ├── 1.txt
+│   ├── 2.txt
+│   ├── ...
+└── videos
+    ├── 1.mp4
+    ├── 2.mp4
+    ├── ...
+```
+
+Each txt file should have the same name as the corresponding video file and contain the label for that video. The videos and labels should correspond one-to-one. Generally, avoid using one video with multiple labels.
+
+For style fine-tuning, prepare at least 50 videos and labels with a similar style to facilitate fitting.
+
+### Modifying the Configuration File
+
+We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Note that both methods only fine-tune the `transformer` part. The `VAE` part is not modified, and `T5` is only used as an encoder.
+Modify `configs/sft.yaml` (full-parameter fine-tuning) as follows:
+
+```yaml
+  # checkpoint_activations: True ## using gradient checkpointing (both `checkpoint_activations` in the config file need to be set to True)
+  model_parallel_size: 1 # Model parallel size
+  experiment_name: lora-disney  # Experiment name (do not change)
+  mode: finetune # Mode (do not change)
+  load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to Transformer model
+  no_load_rng: True # Do not load the random number generator state from the checkpoint
+  train_iters: 1000 # Training iterations
+  eval_iters: 1 # Evaluation iterations
+  eval_interval: 100    # Evaluation interval
+  eval_batch_size: 1  # Evaluation batch size
+  save: ckpts # Model save path 
+  save_interval: 100 # Save interval
+  log_interval: 20 # Log output interval
+  train_data: [ "your train data path" ]
+  valid_data: [ "your val data path" ] # Training and validation sets can be the same
+  split: 1,0,0 # Proportion for training, validation, and test sets
+  num_workers: 8 # Number of data loader workers
+  force_train: True # Allow missing keys when loading checkpoint (T5 and VAE loaded separately)
+  only_log_video_latents: True # Avoid memory usage from VAE decoding
+  deepspeed:
+    bf16:
+      enabled: False # Set to False for CogVideoX-2B and True for CogVideoX-5B
+    fp16:
+      enabled: True  # Set to True for CogVideoX-2B and False for CogVideoX-5B
+```
+
+To use Lora fine-tuning, you also need to modify the `cogvideox_<model parameters>_lora` file:
+
+Here's an example using `CogVideoX-2B`:
+
+```yaml
+model:
+  scale_factor: 1.55258426
+  disable_first_stage_autocast: true
+  not_trainable_prefixes: [ 'all' ] ## Uncomment to unlock
+  log_keys:
+    - txt
+
+  lora_config: ## Uncomment to unlock
+    target: sat.model.finetune.lora2.LoraMixin
+    params:
+      r: 256
+```
+
+### Modify the Run Script
+
+Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` and select the config file. Below are two examples:
+
+1. If you want to use the `CogVideoX-2B` model with `Lora`, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:
+
+```
+run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
+```
+
+2. If you want to use the `CogVideoX-2B` model with full fine-tuning, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:
+
+```
+run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
+```
+
+### Fine-tuning and Validation
+
+Run the fine-tuning script to start fine-tuning.
+
+```
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
+```
+
+### Using the Fine-tuned Model
+
+The fine-tuned model cannot be merged. Here's how to modify the inference script `inference.sh`:
+
+```
+run_cmd="$environs python sample_video.py --base configs/cogvideox_<model parameters>_lora.yaml configs/inference.yaml --seed 42"
+```
+
+Then, run the code:
+
+```
+bash inference.sh 
+```
+
+### Converting to Huggingface Diffusers-compatible Weights
+
+The SAT weight format differs from Huggingface's format and requires conversion. Run:
+
+```
 python ../tools/convert_weight_sat2hf.py
 ```

-### SATチェックポイントからHuggingface Diffusers lora LoRAウェイトをエクスポート
+### Exporting Lora Weights from SAT to Huggingface Diffusers

-上記のステップを完了すると、LoRAウェイト付きのSATチェックポイントが得られます。ファイルは `{args.save}/1000/1000/mp_rank_00_model_states.pt` にあります。
+Support is provided for exporting Lora weights from SAT to Huggingface Diffusers format.
+After training with the above steps, you'll find the SAT checkpoint with Lora weights at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.

-LoRAウェイトをエクスポートするためのスクリプトは、CogVideoXリポジトリの `tools/export_sat_lora_weight.py` にあります。エクスポート後、`load_cogvideox_lora.py` を使用して推論を行うことができます。
+The export script `export_sat_lora_weight.py` is located in the CogVideoX repository under `tools/`. After exporting, use `load_cogvideox_lora.py` for inference.

-エクスポートコマンド:
+Export command:

-```bash
-python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
+```
+python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
 ```

-このトレーニングでは主に以下のモデル構造が変更されました。以下の表は、HF (Hugging Face) 形式のLoRA構造に変換する際の対応関係を示しています。ご覧の通り、LoRAはモデルの注意メカニズムに低ランクの重みを追加しています。
+The following model structures were modified during training. Here is the mapping between SAT and HF Lora structures. Lora adds a low-rank weight to the attention structure of the model.

 ```
 'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
@ -432,7 +532,5 @@ python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_nam
 'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
 ```

-export_sat_lora_weight.py を使用して、SATチェックポイントをHF LoRA形式に変換できます。
-
-
+Using `export_sat_lora_weight.py` will convert these to the HF format Lora structure.
 ![alt text](../resources/hf_lora_weights.png)
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@ -1,6 +1,6 @@
-# SAT CogVideoX-2B
+# SAT CogVideoX

-[Read this in English.](./README_zh)
+[Read this in English.](./README.md)

 [日本語で読む](./README_ja.md)

@ -20,6 +20,15 @@ pip install -r requirements.txt

 首先，前往 SAT 镜像下载模型权重。

+#### CogVideoX1.5 模型
+
+```shell
+git lfs install
+git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
+```
+此操作会下载 Transformers, VAE, T5 Encoder 这三个模型。
+
+#### CogVideoX 模型
 对于 CogVideoX-2B 模型，请按照如下方式下载:

 ```shell
@ -82,11 +91,11 @@ mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
 0 directories, 8 files
 ```

-### 3. 修改`configs/cogvideox_2b.yaml`中的文件。
+### 3. 修改`configs/cogvideox_*.yaml`中的文件。

 ```yaml
 model:
-  scale_factor: 1.15258426
+  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  log_keys:
    - txt
@ -253,7 +262,7 @@ args:
  batch_size: 1
  input_type: txt #可以选择txt纯文字档作为输入，或者改成cli命令行作为输入
  input_file: configs/test.txt #纯文字档，可以对此做编辑
-  sampling_num_frames: 13  # Must be 13, 11 or 9
+  sampling_num_frames: 13  #CogVideoX1.5-5B 必须是 42 或 22。 CogVideoX-5B / 2B 必须是 13 11 或 9。
  sampling_fps: 8
  fp16: True # For CogVideoX-2B
  #  bf16: True # For CogVideoX-5B
@ -346,7 +355,7 @@ Encoder 使用。

 ```yaml
 model:
-  scale_factor: 1.15258426
+  scale_factor: 1.55258426
  disable_first_stage_autocast: true
  not_trainable_prefixes: [ 'all' ] ## 解除注释
  log_keys:
--- a/sat/configs/cogvideox1.5_5b.yaml
+++ b/sat/configs/cogvideox1.5_5b.yaml
@ -0,0 +1,149 @@
+model:
+  scale_factor: 0.7
+  disable_first_stage_autocast: true
+  latent_input: true
+  log_keys:
+    - txt
+
+  denoiser_config:
+    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
+    params:
+      num_idx: 1000
+      quantize_c_noise: False
+
+      weighting_config:
+        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
+      scaling_config:
+        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
+      discretization_config:
+        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+  network_config:
+    target: dit_video_concat.DiffusionTransformer
+    params:
+      time_embed_dim: 512
+      elementwise_affine: True
+      num_frames: 81
+      time_compressed_rate: 4
+      latent_width: 300
+      latent_height: 300
+      num_layers: 42
+      patch_size: [2, 2, 2]
+      in_channels: 16
+      out_channels: 16
+      hidden_size: 3072
+      adm_in_channels: 256
+      num_attention_heads: 48
+
+      transformer_args:
+        checkpoint_activations: True
+        vocab_size: 1
+        max_sequence_length: 64
+        layernorm_order: pre
+        skip_init: false
+        model_parallel_size: 1
+        is_decoder: false
+
+      modules:
+        pos_embed_config:
+          target: dit_video_concat.Rotary3DPositionEmbeddingMixin
+          params:
+            hidden_size_head: 64
+            text_length: 224
+
+        patch_embed_config:
+          target: dit_video_concat.ImagePatchEmbeddingMixin
+          params:
+            text_hidden_size: 4096
+
+        adaln_layer_config:
+          target: dit_video_concat.AdaLNMixin
+          params:
+            qk_ln: True
+
+        final_layer_config:
+          target: dit_video_concat.FinalLayerMixin
+
+  conditioner_config:
+    target: sgm.modules.GeneralConditioner
+    params:
+      emb_models:
+          - is_trainable: false
+            input_key: txt
+            ucg_rate: 0.1
+            target: sgm.modules.encoders.modules.FrozenT5Embedder
+            params:
+              model_dir: "google/t5-v1_1-xxl"
+              max_length: 224
+
+
+  first_stage_config:
+    target : vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
+    params:
+        cp_size: 1
+        ckpt_path: "cogvideox-5b-sat/vae/3d-vae.pt"
+        ignore_keys: ['loss']
+
+        loss_config:
+          target: torch.nn.Identity
+
+        regularizer_config:
+          target: vae_modules.regularizers.DiagonalGaussianRegularizer
+
+        encoder_config:
+          target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
+          params:
+            double_z: true
+            z_channels: 16
+            resolution: 256
+            in_channels: 3
+            out_ch: 3
+            ch: 128
+            ch_mult: [1, 2, 2, 4]
+            attn_resolutions: []
+            num_res_blocks: 3
+            dropout: 0.0
+            gather_norm: True
+
+        decoder_config:
+          target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
+          params:
+            double_z: True
+            z_channels: 16
+            resolution: 256
+            in_channels: 3
+            out_ch: 3
+            ch: 128
+            ch_mult: [1, 2, 2, 4]
+            attn_resolutions: []
+            num_res_blocks: 3
+            dropout: 0.0
+            gather_norm: True
+
+  loss_fn_config:
+    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
+    params:
+      offset_noise_level: 0
+      sigma_sampler_config:
+        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
+        params:
+          uniform_sampling: True
+          group_num: 40
+          num_idx: 1000
+          discretization_config:
+            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+  sampler_config:
+    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
+    params:
+      num_steps: 50
+      verbose: True
+
+      discretization_config:
+        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+      guider_config:
+        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
+        params:
+          scale: 6
+          exp: 5
+          num_steps: 50
--- a/sat/configs/cogvideox1.5_5b_i2v.yaml
+++ b/sat/configs/cogvideox1.5_5b_i2v.yaml
@@ -0,0 +1,160 @@
+model:
+  scale_factor: 0.7
+  disable_first_stage_autocast: true
+  latent_input: false
+  noised_image_input: true
+  noised_image_all_concat: false
+  noised_image_dropout: 0.05
+  augmentation_dropout: 0.15
+  log_keys:
+    - txt
+
+  denoiser_config:
+    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
+    params:
+      num_idx: 1000
+      quantize_c_noise: False
+
+      weighting_config:
+        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
+      scaling_config:
+        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
+      discretization_config:
+        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+  network_config:
+    target: dit_video_concat.DiffusionTransformer
+    params:
+#      space_interpolation: 1.875
+      ofs_embed_dim: 512
+      time_embed_dim: 512
+      elementwise_affine: True
+      num_frames: 81
+      time_compressed_rate: 4
+      latent_width: 300
+      latent_height: 300
+      num_layers: 42
+      patch_size: [2, 2, 2]
+      in_channels: 32
+      out_channels: 16
+      hidden_size: 3072
+      adm_in_channels: 256
+      num_attention_heads: 48
+
+      transformer_args:
+        checkpoint_activations: True
+        vocab_size: 1
+        max_sequence_length: 64
+        layernorm_order: pre
+        skip_init: false
+        model_parallel_size: 1
+        is_decoder: false
+
+      modules:
+        pos_embed_config:
+          target: dit_video_concat.Rotary3DPositionEmbeddingMixin
+          params:
+            hidden_size_head: 64
+            text_length: 224
+
+        patch_embed_config:
+          target: dit_video_concat.ImagePatchEmbeddingMixin
+          params:
+            text_hidden_size: 4096
+
+
+        adaln_layer_config:
+          target: dit_video_concat.AdaLNMixin
+          params:
+            qk_ln: True
+
+        final_layer_config:
+          target: dit_video_concat.FinalLayerMixin
+
+  conditioner_config:
+    target: sgm.modules.GeneralConditioner
+    params:
+      emb_models:
+
+          - is_trainable: false
+            input_key: txt
+            ucg_rate: 0.1
+            target: sgm.modules.encoders.modules.FrozenT5Embedder
+            params:
+              model_dir: "google/t5-v1_1-xxl"
+              max_length: 224
+
+
+  first_stage_config:
+    target : vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
+    params:
+        cp_size: 1
+        ckpt_path: "cogvideox-5b-i2v-sat/vae/3d-vae.pt"
+        ignore_keys: ['loss']
+
+        loss_config:
+          target: torch.nn.Identity
+
+        regularizer_config:
+          target: vae_modules.regularizers.DiagonalGaussianRegularizer
+
+        encoder_config:
+          target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
+          params:
+            double_z: true
+            z_channels: 16
+            resolution: 256
+            in_channels: 3
+            out_ch: 3
+            ch: 128
+            ch_mult: [1, 2, 2, 4]
+            attn_resolutions: []
+            num_res_blocks: 3
+            dropout: 0.0
+            gather_norm: True
+
+        decoder_config:
+          target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
+          params:
+            double_z: True
+            z_channels: 16
+            resolution: 256
+            in_channels: 3
+            out_ch: 3
+            ch: 128
+            ch_mult: [1, 2, 2, 4]
+            attn_resolutions: []
+            num_res_blocks: 3
+            dropout: 0.0
+            gather_norm: True
+
+  loss_fn_config:
+    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
+    params:
+      fixed_frames: 0
+      offset_noise_level: 0.0
+      sigma_sampler_config:
+        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
+        params:
+          uniform_sampling: True
+          group_num: 40
+          num_idx: 1000
+          discretization_config:
+            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+  sampler_config:
+    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
+    params:
+      fixed_frames: 0
+      num_steps: 50
+      verbose: True
+
+      discretization_config:
+        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+      guider_config:
+        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
+        params:
+          scale: 6
+          exp: 5
+          num_steps: 50
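
The `guider_config` in both configs sets `scale: 6`, `exp: 5`, and `num_steps: 50` for `DynamicCFG`. A minimal sketch of how such a dynamic guidance schedule could behave follows, assuming the scale ramps from 1 up to `1 + scale` along a cosine curve whose progress is raised to the power `exp`. The formula and the function name `dynamic_cfg_scale` are assumptions for illustration; this diff does not show the guider's implementation.

```python
import math


def dynamic_cfg_scale(step_index: int, scale: float = 6.0, exp: float = 5.0,
                      num_steps: int = 50) -> float:
    """Hypothetical guidance-scale schedule (assumed formula, not taken from the diff).

    Starts at 1.0 (no extra guidance) and rises to 1.0 + scale at the last step.
    A larger `exp` keeps the scale near 1.0 for longer before it ramps up.
    """
    progress = step_index / num_steps  # fraction of sampling steps completed, in [0, 1]
    return 1.0 + scale * (1.0 - math.cos(math.pi * progress ** exp)) / 2.0


if __name__ == "__main__":
    # Print the assumed guidance scale at a few sampling steps.
    for step in (0, 10, 25, 40, 50):
        print(step, round(dynamic_cfg_scale(step), 3))
```

Under this assumption, guidance stays weak for most of the early steps and strengthens sharply near the end, which is one plausible reason to pair a high `exp` with a moderate `scale`.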
--- a/sat/configs/test.txt
+++ b/sat/configs/test.txt
@@ -1,4 +1,4 @@
 In the haunting backdrop of a warIn the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
 The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.
 A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
-A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
+A street artist, clad in a worn-out denim jacket and a colorful banana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.