Mirror of https://github.com/THUDM/CogVideo.git (synced 2025-06-16 04:59:16 +08:00)
Merge pull request #295 from THUDM/CogVideoX_dev
CogVideoX-5B-I2V model release
Commit f6b1ff42a7
179 README.md
@ -22,7 +22,16 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac

## Update and News

- 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**.
  This model allows inputting an image as a background combined with prompts to generate videos, providing greater
  controllability. With this release, the CogVideoX series now supports three tasks: text-to-video, video extension, and
  image-to-video generation. Feel free to try it out [online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool! You can now use local open-source
  models + FLUX + CogVideoX to automatically generate high-quality videos. Feel free
  to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
- 🔥 **News**: ```2024/9/15```: CogVideoX LoRA fine-tuning weights have been exported and tested successfully with
  the `diffusers` library. Please check the [tutorial](sat/README.md).
- 🔥 **News**: ```2024/8/29```: By adding `pipe.enable_sequential_cpu_offload()` and `pipe.vae.enable_slicing()` to the
  inference code of CogVideoX-5B, VRAM usage can be reduced to `5GB`. Please check the
  updated [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/27```: The **CogVideoX-2B** model's open-source license has been changed to the **Apache 2.0

@ -35,7 +44,7 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac

- 🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos
  generated by CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out
  by following the [tutorial](tools/venhancer/README.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`.
  Fine-tuning no longer requires installing `SwissArmyTransformer` from source. Additionally, the `Tied VAE` technique has been
@ -169,107 +178,115 @@ models we currently offer, along with their foundational information.

<th style="text-align: center;">Model Name</th>
<th style="text-align: center;">CogVideoX-2B</th>
<th style="text-align: center;">CogVideoX-5B</th>
<th style="text-align: center;">CogVideoX-5B-I2V</th>
</tr>
<tr>
  <td style="text-align: center;">Model Description</td>
  <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td>
  <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
  <td style="text-align: center;">CogVideoX-5B image-to-video version.</td>
</tr>
<tr>
  <td style="text-align: center;">Inference Precision</td>
  <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8*, INT8, not supported: INT4</td>
  <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
</tr>
<tr>
  <td style="text-align: center;">Single GPU Memory Usage<br></td>
  <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB*</b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
  <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB*</b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
</tr>
<tr>
  <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
  <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
  <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
</tr>
<tr>
  <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
  <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
  <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
</tr>
<tr>
  <td style="text-align: center;">Fine-tuning Precision</td>
  <td style="text-align: center;"><b>FP16</b></td>
  <td colspan="2" style="text-align: center;"><b>BF16</b></td>
</tr>
<tr>
  <td style="text-align: center;">Fine-tuning Memory Usage (per GPU)</td>
  <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62 GB (bs=1, SFT)</td>
  <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75 GB (bs=1, SFT)<br></td>
  <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75 GB (bs=1, SFT, 16 GPU)<br></td>
</tr>
<tr>
  <td style="text-align: center;">Prompt Language</td>
  <td colspan="3" style="text-align: center;">English*</td>
</tr>
<tr>
  <td style="text-align: center;">Maximum Prompt Length</td>
  <td colspan="3" style="text-align: center;">226 Tokens</td>
</tr>
<tr>
  <td style="text-align: center;">Video Length</td>
  <td colspan="3" style="text-align: center;">6 Seconds</td>
</tr>
<tr>
  <td style="text-align: center;">Frame Rate</td>
  <td colspan="3" style="text-align: center;">8 Frames per Second</td>
</tr>
<tr>
  <td style="text-align: center;">Video Resolution</td>
  <td colspan="3" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
</tr>
<tr>
  <td style="text-align: center;">Position Encoding</td>
  <td style="text-align: center;">3d_sincos_pos_embed</td>
  <td style="text-align: center;">3d_rope_pos_embed</td>
  <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
</tr>
<tr>
  <td style="text-align: center;">Download Link (Diffusers)</td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
</tr>
<tr>
  <td style="text-align: center;">Download Link (SAT)</td>
  <td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
</tr>
</table>

**Data Explanation**

+ When testing with the `diffusers` library, all optimizations provided by the library were enabled. This scheme has
  not been tested for actual memory usage on devices other than the **NVIDIA A100 / H100**. Generally, it can be
  adapted to all devices of the **NVIDIA Ampere architecture** and above. If the optimizations are disabled, memory
  consumption multiplies, with peak memory usage about 3 times the value in the table, while speed increases by about
  3-4 times. You can selectively disable some of the optimizations (a usage sketch follows at the end of this list),
  including:

```
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models will slow down inference. This is done so that GPUs with less memory can still run inference with
  minimal loss of video quality, at the cost of a significant drop in inference speed.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision.
  We recommend using the precision in which the model was trained for inference.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
  used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This
  allows the model to run on a free T4 Colab or on GPUs with less memory! Also note that TorchAO quantization is fully
  compatible with `torch.compile`, which can significantly improve inference speed (see the sketch after this list).
  `FP8` precision must be used on devices with `NVIDIA H100` and above, which requires installing the `torch`,
  `torchao`, `diffusers`, and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
+ The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference
  speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; prompts in other languages can be translated into English during refinement
  by a large language model.
+ The memory usage of model fine-tuning was tested in an `8 * H100` environment, and the program automatically
  uses `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number of GPUs or more must be
  used for fine-tuning.
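
As a concrete illustration of the notes above, here is a minimal sketch of single-GPU inference with these memory
optimizations enabled, plus an optional TorchAO quantization step. It is only an illustration, not the repository's
official demo ([cli_demo](inference/cli_demo.py) remains the reference); the prompt, output path, and the
commented-out `torchao` call are assumptions.

```python
# Minimal sketch: CogVideoX-5B inference with the memory optimizations discussed above.
# Illustrative only; see inference/cli_demo.py for the maintained version.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Memory optimizations from the table above; disable them selectively to trade VRAM for speed.
pipe.enable_sequential_cpu_offload()  # must be disabled for multi-GPU inference
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

# Optional (assumed torchao API, see the PytorchAO link above): INT8 weight-only quantization
# of the transformer to fit smaller GPUs, at the cost of slower inference.
# from torchao.quantization import quantize_, int8_weight_only
# quantize_(pipe.transformer, int8_weight_only())

video = pipe(
    prompt="A panda playing a guitar by a quiet stream in a bamboo forest, cinematic lighting",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

Disabling `enable_sequential_cpu_offload()` (and moving the pipeline to the GPU with `pipe.to("cuda")` instead) trades
roughly 3 times the peak memory for the 3-4x speedup mentioned in the first bullet.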

## Friendly Links

@ -279,20 +296,31 @@ works have already been adapted for CogVideoX, and we invite everyone to use the

+ [Xorbits Inference](https://github.com/xorbitsai/inference): A powerful and comprehensive distributed inference
  framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one
  click.
+ [ComfyUI-CogVideoXWrapper](https://github.com/kijai/ComfyUI-CogVideoXWrapper): Use the ComfyUI framework to integrate
  CogVideoX into your workflow.
+ [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSys provides a user-friendly, high-performance
  infrastructure for video generation, with full pipeline support and continuous integration of the latest models and
  techniques.
+ [AutoDL Space](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): A one-click deployment Huggingface
  Space image provided by community members.
+ [Colab Space](https://github.com/camenduru/CogVideoX-5B-jupyter): Run the CogVideoX-5B model using Jupyter Notebook on
  Colab.

## Project Structure

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples
of the **CogVideoX** open-source model.

### Quick Start with Colab

Here we provide three projects that can be run directly on free Colab T4 instances:

+ [CogVideoX-5B-T2V-Colab.ipynb](https://colab.research.google.com/drive/1pCe5s0bC_xuXbBlpvIH1z0kfdTLQPzCS?usp=sharing):
  CogVideoX-5B Text-to-Video Colab code.
+ [CogVideoX-5B-T2V-Int8-Colab.ipynb](https://colab.research.google.com/drive/1DUffhcjrU-uz7_cpuJO3E_D4BaJT7OPa?usp=sharing):
  CogVideoX-5B Quantized Text-to-Video Inference Colab code, which takes about 30 minutes per run.
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B Image-to-Video Colab code.

### Inference

+ [cli_demo](inference/cli_demo.py): A more detailed inference code explanation, including the significance of

@ -303,27 +331,22 @@ of the **CogVideoX** open-source model.

+ [diffusers_vae_demo](inference/cli_vae_demo.py): Code for running VAE inference separately.
+ [space demo](inference/gradio_composite_demo): The same GUI code as used in the Huggingface Space, with frame
  interpolation and super-resolution tools integrated.

<div style="text-align: center;">
    <img src="resources/web_demo.png" style="width: 100%; height: auto;" />
</div>

+ [convert_demo](inference/convert_demo.py): How to convert user input into long-form input suitable for CogVideoX.
  Since CogVideoX is trained on long texts, we need to transform the input text distribution to match the training data
  using an LLM. The script defaults to using GLM-4, but it can be replaced with GPT, Gemini, or any other large language
  model.
+ [gradio_web_demo](inference/gradio_composite_demo): A simple Gradio web application demonstrating how to use the
  CogVideoX-2B / 5B model to generate videos. Similar to our Huggingface Space, you can use this script to run a simple
  web application for video generation.
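
Since this release adds the CogVideoX-5B-I2V image-to-video model, a minimal sketch of image-to-video inference is
shown below. It assumes a `diffusers` build that already ships `CogVideoXImageToVideoPipeline`; the input image path
and prompt are placeholders.

```python
# Minimal sketch of image-to-video inference with CogVideoX-5B-I2V (assumed diffusers API;
# the image path and prompt are placeholders).
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_sequential_cpu_offload()  # same VRAM optimizations as described above
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

image = load_image("background.jpg")  # the image used as the background of the generated video
video = pipe(
    image=image,
    prompt="The scene slowly comes to life as a gentle breeze moves through the frame",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output_i2v.mp4", fps=8)
```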

### finetune

+ [finetune_demo](finetune/README.md): Fine-tuning scheme and details of the diffusers version of the CogVideoX model.

### sat

@ -335,8 +358,16 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py

This folder contains some tools for model conversion / caption generation, etc.

+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption/README.md): Caption tool, a model that understands videos and outputs descriptions in
  text.
+ [export_sat_lora_weight](tools/export_sat_lora_weight.py): SAT fine-tuning model export tool, exports the SAT LoRA
  Adapter in diffusers format.
+ [load_cogvideox_lora](tools/load_cogvideox_lora.py): Tool code for loading the diffusers version of a fine-tuned LoRA
  Adapter.
+ [llm_flux_cogvideox](tools/llm_flux_cogvideox/llm_flux_cogvideox.py): Automatically generate videos using an
  open-source local large language model + Flux + CogVideoX.

## CogVideo(ICLR'23)

The official repo for the

@ -376,23 +407,7 @@ hands-on practice on text-to-video generation. *The original input is in Chinese

}
```

## Open Source Project Plan

- [x] CogVideoX Model Open Source
- [x] CogVideoX Model Inference Example (CLI / Web Demo)
- [x] CogVideoX Online Experience Example (Huggingface Space)
- [x] CogVideoX Open Source Model API Interface Example (Huggingface)
- [x] CogVideoX Model Fine-Tuning Example (SAT)
- [ ] CogVideoX Model Fine-Tuning Example (Huggingface Diffusers)
- [X] CogVideoX-5B Open Source (Adapted to CogVideoX-2B Suite)
- [X] CogVideoX Technical Report Released
- [X] CogVideoX Technical Explanation Video
- [ ] CogVideoX Peripheral Tools
    - [X] Basic Video Super-Resolution / Frame Interpolation Suite
    - [ ] Inference Framework Adaptation
    - [ ] ComfyUI Full Ecosystem Tools

We welcome your contributions! You can click [here](resources/contribute.md) for more information.

## License Agreement

@ -401,5 +416,5 @@ The code in this repository is released under the [Apache 2.0 License](LICENSE).

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under
the [Apache 2.0 License](LICENSE).

The CogVideoX-5B model (Transformers module, including the I2V and T2V versions) is released under
the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
179 README_ja.md
@ -22,7 +22,13 @@

## Update and News

- 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**.
  This model takes an image as a background input and combines it with a prompt to generate a video, offering stronger
  controllability. The CogVideoX series now supports three tasks: text-to-video, video extension, and image-to-video
  generation. Feel free to try it [online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool! You can use a local open-source
  model + FLUX + CogVideoX to automatically generate high-quality videos. Feel free
  to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
- 🔥 **News**: ```2024/9/15```: CogVideoX LoRA fine-tuning weights have been exported and tested successfully with
  the `diffusers` library. Please see the [tutorial](sat/README_ja.md).
- 🔥 **News**: ```2024/8/29```: By adding `pipe.enable_sequential_cpu_offload()` and `pipe.vae.enable_slicing()` to the
  CogVideoX-5B inference code, VRAM usage can be reduced to `5GB`. Please check the
  updated [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/27```: The **CogVideoX-2B** model's open-source license has been changed to the **Apache 2.0 License**

@ -146,92 +152,91 @@ pip install -r requirements.txt

## Model Introduction

CogVideoX is an open-source video generation model sharing the same origin as [QingYing (清影)](https://chatglm.cn/video?fr=osm_cogvideox).
The table below lists the basic information of the video generation models we provide:

<table style="border-collapse: collapse; width: 100%;">
<tr>
  <th style="text-align: center;">Model Name</th>
  <th style="text-align: center;">CogVideoX-2B</th>
  <th style="text-align: center;">CogVideoX-5B</th>
  <th style="text-align: center;">CogVideoX-5B-I2V</th>
</tr>
<tr>
  <td style="text-align: center;">Model Description</td>
  <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td>
  <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
  <td style="text-align: center;">CogVideoX-5B image-to-video version.</td>
</tr>
<tr>
  <td style="text-align: center;">Inference Precision</td>
  <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8*, INT8, INT4 is not supported</td>
  <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, INT4 is not supported</td>
</tr>
<tr>
  <td style="text-align: center;">Single GPU Memory Usage<br></td>
  <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB*</b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
  <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB*</b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
</tr>
<tr>
  <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
  <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
  <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
</tr>
<tr>
  <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
  <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
  <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
</tr>
<tr>
  <td style="text-align: center;">Fine-tuning Precision</td>
  <td style="text-align: center;"><b>FP16</b></td>
  <td colspan="2" style="text-align: center;"><b>BF16</b></td>
</tr>
<tr>
  <td style="text-align: center;">Fine-tuning Memory Usage (per GPU)</td>
  <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62 GB (bs=1, SFT)</td>
  <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75 GB (bs=1, SFT)<br></td>
  <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75 GB (bs=1, SFT, 16 GPU)<br></td>
</tr>
<tr>
  <td style="text-align: center;">Prompt Language</td>
  <td colspan="3" style="text-align: center;">English*</td>
</tr>
<tr>
  <td style="text-align: center;">Maximum Prompt Length</td>
  <td colspan="3" style="text-align: center;">226 Tokens</td>
</tr>
<tr>
  <td style="text-align: center;">Video Length</td>
  <td colspan="3" style="text-align: center;">6 seconds</td>
</tr>
<tr>
  <td style="text-align: center;">Frame Rate</td>
  <td colspan="3" style="text-align: center;">8 frames per second</td>
</tr>
<tr>
  <td style="text-align: center;">Video Resolution</td>
  <td colspan="3" style="text-align: center;">720 * 480; other resolutions are not supported (including fine-tuning)</td>
</tr>
<tr>
  <td style="text-align: center;">Position Encoding</td>
  <td style="text-align: center;">3d_sincos_pos_embed</td>
  <td style="text-align: center;">3d_rope_pos_embed</td>
  <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
</tr>
<tr>
  <td style="text-align: center;">Download Link (Diffusers)</td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
</tr>
<tr>
  <td style="text-align: center;">Download Link (SAT)</td>
  <td colspan="3" style="text-align: center;"><a href="./sat/README_ja.md">SAT</a></td>
</tr>
</table>

**Data Explanation**

+ When testing with the `diffusers` library, all optimizations provided by the library were enabled. This scheme has
  not been tested for actual memory usage on devices other than the **NVIDIA A100 / H100**. Generally, it can be
  adapted to all devices of the **NVIDIA Ampere architecture** and above. If the optimizations are disabled, memory
  consumption multiplies, with peak memory usage about 3 times the value in the table, while speed increases by about
  3-4 times. You can selectively disable some of the optimizations, including:

```
pipe.enable_sequential_cpu_offload()

@ -239,21 +244,21 @@ pipe.vae.enable_slicing()

pipe.vae.enable_tiling()
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models slows down inference; this is done so that GPUs with less memory can still run inference with
  minimal loss of video quality, at the cost of a significant drop in inference speed.
+ The CogVideoX-2B model was trained in `FP16` precision, and the CogVideoX-5B models were trained in `BF16` precision.
  We recommend using the precision in which the model was trained for inference.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
  used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This
  makes it possible to run the model on a free T4 Colab or on GPUs with less memory. Just as importantly, TorchAO
  quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. `FP8`
  precision must be used on `NVIDIA H100` and above devices, which requires installing the `torch`, `torchao`,
  `diffusers`, and `accelerate` Python packages from source. `CUDA 12.4` is recommended.
+ The inference speed tests also used the above memory optimization scheme. Without memory optimization, inference
  speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; other languages can be translated into English during refinement by a large
  language model.
+ The memory usage of model fine-tuning was tested in an `8 * H100` environment, and the program automatically uses
  `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number of GPUs or more must be used
  for fine-tuning.
## Friendly Links

@ -261,17 +266,28 @@ pipe.vae.enable_tiling()

+ [Xorbits Inference](https://github.com/xorbitsai/inference): A powerful and comprehensive distributed inference
  framework that lets you deploy your own models or the latest open-source models with one click.
+ [ComfyUI-CogVideoXWrapper](https://github.com/kijai/ComfyUI-CogVideoXWrapper): Use the ComfyUI framework to integrate
  CogVideoX into your workflow.
+ [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSys provides an easy-to-use, high-performance video
  generation infrastructure with continuous integration of the latest models and techniques.
+ [AutoDL Image](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): A one-click deployment image of the
  Huggingface Space, provided by community members.
+ [Colab Space](https://github.com/camenduru/CogVideoX-5B-jupyter): Run the CogVideoX-5B model in a Jupyter Notebook on
  Colab.

## Project Structure

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples
of the **CogVideoX** open-source model.

### Quick Start with Colab

We provide three projects that can be run directly on free Colab T4 instances:

+ [CogVideoX-5B-T2V-Colab.ipynb](https://colab.research.google.com/drive/1pCe5s0bC_xuXbBlpvIH1z0kfdTLQPzCS?usp=sharing):
  CogVideoX-5B text-to-video Colab code.
+ [CogVideoX-5B-T2V-Int8-Colab.ipynb](https://colab.research.google.com/drive/1DUffhcjrU-uz7_cpuJO3E_D4BaJT7OPa?usp=sharing):
  CogVideoX-5B quantized text-to-video inference Colab code; one run takes about 30 minutes.
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B image-to-video Colab code.

### Inference

+ [cli_demo](inference/cli_demo.py): A more detailed explanation of the inference code, including the meaning of
  common parameters.
+ [cli_demo_quantization](inference/cli_demo_quantization.py):

@ -279,23 +295,20 @@ pipe.vae.enable_tiling()

  can also support running the model.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code currently requires 71GB of memory,
  but it will be optimized in the future.
+ [space demo](inference/gradio_composite_demo): The same GUI code as the Huggingface Space, with frame interpolation
  and super-resolution tools built in.

<div style="text-align: center;">
    <img src="resources/web_demo.png" style="width: 100%; height: auto;" />
</div>

+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Since
  CogVideoX is trained on long captions, the input text distribution needs to be aligned with the training data using
  an LLM. GLM-4 is used by default, but it can be replaced with other LLMs such as GPT or Gemini.
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple Gradio web UI demo showing how to generate videos with the
  CogVideoX-2B / 5B models. As with our Huggingface Space, you can use this script to launch a web demo.

### finetune

+ [train_cogvideox_lora](finetune/README_ja.md): A detailed explanation of the diffusers-based CogVideoX fine-tuning
  method. You can use this code to fine-tune CogVideoX on your own dataset.

### sat

@ -306,8 +319,14 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py

This folder contains tools for model conversion, caption generation, and so on.

+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption/README_ja.md): Caption tool, a model that understands videos and outputs descriptions in
  text.
+ [export_sat_lora_weight](tools/export_sat_lora_weight.py): SAT fine-tuning model export tool, which exports the SAT
  LoRA Adapter in diffusers format.
+ [load_cogvideox_lora](tools/load_cogvideox_lora.py): Tool code for loading the diffusers version of a fine-tuned LoRA
  Adapter.
+ [llm_flux_cogvideox](tools/llm_flux_cogvideox/llm_flux_cogvideox.py): Automatically generate videos using an
  open-source local large language model + Flux + CogVideoX.

## CogVideo(ICLR'23)

@ -347,23 +366,7 @@ CogVideoのデモは [https://models.aminer.cn/cogvideo](https://models.aminer.c

}
```

## Open Source Project Plan

- [x] CogVideoX Model Open Source
- [x] CogVideoX Model Inference Example (CLI / Web Demo)
- [x] CogVideoX Online Experience Example (Huggingface Space)
- [x] CogVideoX Open Source Model API Interface Example (Huggingface)
- [x] CogVideoX Model Fine-Tuning Example (SAT)
- [ ] CogVideoX Model Fine-Tuning Example (Huggingface Diffusers)
- [X] CogVideoX-5B Open Source (Adapted to CogVideoX-2B Suite)
- [X] CogVideoX Technical Report Released
- [X] CogVideoX Technical Explanation Video
- [ ] CogVideoX Peripheral Tools
    - [X] Basic Video Super-Resolution / Frame Interpolation Suite
    - [ ] Inference Framework Adaptation
    - [ ] ComfyUI Full Ecosystem Tools

We welcome your contributions! Click [here](resources/contribute_ja.md) for more information.

## License Agreement

@ -372,5 +375,5 @@ CogVideoのデモは [https://models.aminer.cn/cogvideo](https://models.aminer.c

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under
the [Apache 2.0 License](LICENSE).

The CogVideoX-5B model (Transformers module, including the image-to-video and text-to-video versions) is released under
the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).

117 README_zh.md
@ -23,7 +23,13 @@

## Project Updates

- 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**.
  This model can take an image as a background input and generate a video combined with a prompt, offering stronger
  controllability. The CogVideoX series now supports three tasks: text-to-video, video extension, and image-to-video
  generation. Feel free to [try it online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool: you can use a local open-source
  model + FLUX + CogVideoX to automatically generate high-quality videos. Feel free
  to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
- 🔥 **News**: ```2024/9/15```: CogVideoX LoRA fine-tuning weights have been exported and tested successfully with the
  `diffusers` library. Please see the [tutorial](sat/README_zh.md).
- 🔥 **News**: ```2024/8/29```: Adding `pipe.enable_sequential_cpu_offload()` and `pipe.vae.enable_slicing()` to the
  CogVideoX-5B inference code reduces VRAM usage to `5GB`. Please check the updated [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/27```: The open-source license of the **CogVideoX-2B** model has been changed to the
  **Apache 2.0 License**.
- 🔥 **News**: ```2024/8/27```: We have open-sourced the larger model of the CogVideoX series, **CogVideoX-5B**
@ -151,82 +157,83 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源

<th style="text-align: center;">Model Name</th>
<th style="text-align: center;">CogVideoX-2B</th>
<th style="text-align: center;">CogVideoX-5B</th>
<th style="text-align: center;">CogVideoX-5B-I2V</th>
</tr>
<tr>
  <td style="text-align: center;">Model Description</td>
  <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td>
  <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
  <td style="text-align: center;">CogVideoX-5B image-to-video version.</td>
</tr>
<tr>
  <td style="text-align: center;">Inference Precision</td>
  <td style="text-align: center;"><b>FP16* (recommended)</b>, BF16, FP32, FP8*, INT8; INT4 is not supported</td>
  <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8; INT4 is not supported</td>
</tr>
<tr>
  <td style="text-align: center;">Single GPU VRAM Consumption<br></td>
  <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB*</b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
  <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB*</b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
</tr>
<tr>
  <td style="text-align: center;">Multi-GPU Inference VRAM Consumption</td>
  <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
  <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
</tr>
<tr>
  <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
  <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
  <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
</tr>
<tr>
  <td style="text-align: center;">Fine-tuning Precision</td>
  <td style="text-align: center;"><b>FP16</b></td>
  <td colspan="2" style="text-align: center;"><b>BF16</b></td>
</tr>
<tr>
  <td style="text-align: center;">Fine-tuning VRAM Consumption (per GPU)</td>
  <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62 GB (bs=1, SFT)</td>
  <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75 GB (bs=1, SFT)<br></td>
  <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75 GB (bs=1, SFT, 16 GPU)<br></td>
</tr>
<tr>
  <td style="text-align: center;">Prompt Language</td>
  <td colspan="3" style="text-align: center;">English*</td>
</tr>
<tr>
  <td style="text-align: center;">Prompt Length Limit</td>
  <td colspan="3" style="text-align: center;">226 Tokens</td>
</tr>
<tr>
  <td style="text-align: center;">Video Length</td>
  <td colspan="3" style="text-align: center;">6 seconds</td>
</tr>
<tr>
  <td style="text-align: center;">Frame Rate</td>
  <td colspan="3" style="text-align: center;">8 frames per second</td>
</tr>
<tr>
  <td style="text-align: center;">Video Resolution</td>
  <td colspan="3" style="text-align: center;">720 * 480; other resolutions are not supported (including fine-tuning)</td>
</tr>
<tr>
  <td style="text-align: center;">Position Encoding</td>
  <td style="text-align: center;">3d_sincos_pos_embed</td>
  <td style="text-align: center;">3d_rope_pos_embed</td>
  <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
</tr>
<tr>
  <td style="text-align: center;">Download Link (Diffusers)</td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
  <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
</tr>
<tr>
  <td style="text-align: center;">Download Link (SAT)</td>
  <td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
</tr>
</table>

**Data Explanation**

+ When testing with the `diffusers` library, all optimizations provided by the library were enabled. This scheme has
  not been tested for actual VRAM / memory usage on devices other than the **NVIDIA A100 / H100**. Generally, it can be
  adapted to all devices of the **NVIDIA Ampere architecture** and above. If the optimizations are disabled, VRAM usage
  multiplies, with peak VRAM about 3 times the value in the table, while speed increases by about 3-4 times. You can
  selectively disable some of the optimizations, including:

```
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
```

@ -235,7 +242,7 @@ pipe.vae.enable_tiling()

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ Using INT8 models slows down inference; this is done so that GPUs with less VRAM can still run inference with minimal
  loss of video quality, at the cost of a significant drop in inference speed.
+ The CogVideoX-2B model was trained in `FP16` precision, and all CogVideoX-5B models were trained in `BF16` precision.
  We recommend using the precision in which the model was trained for inference.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
  can be used to quantize the text encoder, Transformer, and VAE modules to reduce the memory requirements of
  CogVideoX, making it possible to run the model on a free T4 Colab or on GPUs with less VRAM. Also note that TorchAO
  quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. On `NVIDIA H100`

@ -243,6 +250,7 @@ pipe.vae.enable_tiling()

  packages. `CUDA 12.4` is recommended.
+ The inference speed tests also used the above VRAM optimization scheme. Without VRAM optimization, inference speed
  increases by about 10%. Only the `diffusers` version of the model supports quantization.
+ The model only supports English input; other languages can be translated into English during refinement by a large
  language model.
+ The VRAM usage of model fine-tuning was tested in an `8 * H100` environment, and the program automatically uses
  `Zero 2` optimization. If a specific number of GPUs is marked in the table, that number of GPUs or more must be used
  for fine-tuning.

## Friendly Links

@ -253,13 +261,22 @@ pipe.vae.enable_tiling()

+ [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSys provides an easy-to-use, high-performance video
  generation infrastructure with full pipeline support and continuous integration of the latest models and techniques.
+ [AutoDL Image](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): A one-click deployment image of the
  Huggingface Space, provided by community members.
+ [Colab Space](https://github.com/camenduru/CogVideoX-5B-jupyter): Colab code for running the CogVideoX-5B model in a
  Jupyter Notebook.

## Full Project Code Structure

This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples
of the **CogVideoX** open-source model.

### Quick Start with Colab

Here are three projects that can be run directly on a free Colab T4 instance:

+ [CogVideoX-5B-T2V-Colab.ipynb](https://colab.research.google.com/drive/1pCe5s0bC_xuXbBlpvIH1z0kfdTLQPzCS?usp=sharing):
  CogVideoX-5B text-to-video Colab code.
+ [CogVideoX-5B-T2V-Int8-Colab.ipynb](https://colab.research.google.com/drive/1DUffhcjrU-uz7_cpuJO3E_D4BaJT7OPa?usp=sharing):
  CogVideoX-5B quantized text-to-video inference Colab code; one run takes about 30 minutes.
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B image-to-video Colab code.

### inference

+ [cli_demo](inference/cli_demo.py): A more detailed explanation of the inference code; the meanings of common
  parameters are covered here.

@ -268,25 +285,20 @@ pipe.vae.enable_tiling()

  it has only been tested, and `torch-nightly` and `torchao` must be installed from source; it is not recommended for
  production environments.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Code for running VAE inference separately.
+ [space demo](inference/gradio_composite_demo): The same GUI code as the Huggingface Space, with frame interpolation
  and super-resolution tools built in.

<div style="text-align: center;">
    <img src="resources/web_demo.png" style="width: 100%; height: auto;" />
</div>

+ [convert_demo](inference/convert_demo.py): How to convert user input into long-form input suitable for CogVideoX.
  Since CogVideoX is trained on long texts, the input text distribution needs to be converted by an LLM into the long
  texts used in training. GLM-4 is used by default, but it can be replaced with other LLMs such as GPT or Gemini.
+ [gradio_web_demo](inference/gradio_composite_demo/app.py): Exactly the same code implementation as the Huggingface
  Space, for quickly deploying the CogVideoX GUI experience.

### finetune

+ [train_cogvideox_lora](finetune/README_zh.md): Fine-tuning scheme and details of the diffusers version of the
  CogVideoX model.

### sat

+ [sat_demo](sat/README_zh.md): Contains the inference code and fine-tuning code for the SAT weights; it is recommended
  to build on CogVideoX

@ -298,6 +310,11 @@ PYTORCH_ENABLE_MPS_FALLBACK=1 python gradio_web_demo.py

+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption/README_zh.md): Caption tool, a model that understands videos and outputs descriptions in
  text.
+ [export_sat_lora_weight](tools/export_sat_lora_weight.py): SAT fine-tuning model export tool, which exports the SAT
  LoRA Adapter in diffusers format.
+ [load_cogvideox_lora](tools/load_cogvideox_lora.py): Tool code for loading the diffusers version of a fine-tuned LoRA
  Adapter.
+ [llm_flux_cogvideox](tools/llm_flux_cogvideox/llm_flux_cogvideox.py): Automatically generate videos using an
  open-source local large language model + Flux + CogVideoX.

## CogVideo(ICLR'23)

@ -337,22 +354,6 @@ CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.amine

}
```

## Open Source Project Plan

- [x] CogVideoX Model Open Source
- [x] CogVideoX Model Inference Example (CLI / Web Demo)
- [x] CogVideoX Online Experience Example (Huggingface Space)
- [x] CogVideoX Open Source Model API Interface Example (Huggingface)
- [x] CogVideoX Model Fine-Tuning Example (SAT)
- [ ] CogVideoX Model Fine-Tuning Example (Huggingface Diffusers)
- [X] CogVideoX-5B Open Source (Adapted to CogVideoX-2B Suite)
- [X] CogVideoX Technical Report Released
- [X] CogVideoX Technical Explanation Video
- [ ] CogVideoX Peripheral Tools
    - [X] Basic Video Super-Resolution / Frame Interpolation Suite
    - [ ] Inference Framework Adaptation
    - [ ] ComfyUI Full Ecosystem Tools

We welcome your contributions! Click [here](resources/contribute_zh.md) for more information.

## Model License

@ -361,6 +362,6 @@ CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.amine

The CogVideoX-2B model (including its corresponding Transformers module and VAE module) is released under the
[Apache 2.0 License](LICENSE).

The CogVideoX-5B model (Transformers module, including the image-to-video and text-to-video versions) is released
under the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).

152 finetune/README.md (new file)
@ -0,0 +1,152 @@

# CogVideoX diffusers Fine-tuning Guide

[中文阅读](./README_zh.md)

[日本語で読む](./README_ja.md)

This feature is not fully complete yet. If you want to check the fine-tuning for the SAT version, please
see [here](../sat/README_zh.md). Its dataset format is different from this version's.

## Hardware Requirements

+ CogVideoX-2B LoRA: 1 * A100
+ CogVideoX-2B SFT: 8 * A100
+ CogVideoX-5B/5B-I2V is not supported yet.

## Install Dependencies

Since the related code has not been merged into a diffusers release yet, you need to base your fine-tuning on the
diffusers branch. Please follow the steps below to install the dependencies:

```shell
git clone https://github.com/huggingface/diffusers.git
cd diffusers
git checkout cogvideox-lora-and-training
pip install -e .
```

## Prepare the Dataset

First, you need to prepare the dataset. The dataset format should be as follows, with `videos.txt` containing the list
of videos in the `videos` directory:

```
.
├── prompts.txt
├── videos
└── videos.txt
```

You can download
the [Disney Steamboat Willie](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset) dataset here.

This video fine-tuning dataset is used as a test case for fine-tuning.
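
For illustration only (the file names below are hypothetical, and the `#` header lines are labels for this example,
not part of the actual files), `prompts.txt` is assumed to hold one caption per line and `videos.txt` the matching
video path, relative to this directory, on the same line number:

```
# prompts.txt: one caption per line
A black-and-white cartoon mouse steers a steamboat down a river.
A black-and-white cartoon mouse whistles while spinning the ship's wheel.

# videos.txt: matching relative paths, same order as prompts.txt
videos/00000.mp4
videos/00001.mp4
```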
|
||||
|
||||
## Configuration Files and Execution
|
||||
|
||||
The `accelerate` configuration files are as follows:
|
||||
|
||||
+ `accelerate_config_machine_multi.yaml`: Suitable for multi-GPU use
|
||||
+ `accelerate_config_machine_single.yaml`: Suitable for single-GPU use
|
||||
|
||||
The configuration for the `finetune` script is as follows:
|
||||
|
||||
```shell
|
||||
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# This command sets the PyTorch CUDA memory allocation strategy to expandable segments to prevent OOM (Out of Memory) errors.

accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu  # Launch training using Accelerate with the specified config file for multi-GPU.

  train_cogvideox_lora.py  # This is the training script for LoRA fine-tuning of the CogVideoX model.

  --pretrained_model_name_or_path THUDM/CogVideoX-2b  # Path to the pretrained model you want to fine-tune, pointing to the CogVideoX-2b model.

  --cache_dir ~/.cache  # Directory for caching models downloaded from Hugging Face.

  --enable_tiling  # Enable VAE tiling to reduce memory usage by processing images in smaller chunks.

  --enable_slicing  # Enable VAE slicing to split the image into slices along the channel to save memory.

  --instance_data_root ~/disney/  # Root directory for instance data, i.e., the dataset used for training.

  --caption_column prompts.txt  # Specify the column or file containing instance prompts (text descriptions), in this case, the `prompts.txt` file.

  --video_column videos.txt  # Specify the column or file containing video paths, in this case, the `videos.txt` file.

  --validation_prompt "Mickey with the captain and friends:::Mickey and the bear"  # Validation prompts; multiple prompts are separated by the specified delimiter (e.g., `:::`).

  --validation_prompt_separator :::  # The separator for validation prompts, set to `:::` here.

  --num_validation_videos 1  # Number of videos to generate during validation, set to 1.

  --validation_epochs 2  # Number of epochs after which validation will be run, set to every 2 epochs.

  --seed 3407  # Set a random seed to ensure reproducibility, set to 3407.

  --rank 128  # Dimension of the LoRA update matrix, controls the size of the LoRA layers, set to 128.

  --mixed_precision bf16  # Use mixed precision training, set to `bf16` (bfloat16) to reduce memory usage and speed up training.

  --output_dir cogvideox-lora-single-gpu  # Output directory for storing model predictions and checkpoints.

  --height 480  # Height of the input videos, all videos will be resized to 480 pixels.

  --width 720  # Width of the input videos, all videos will be resized to 720 pixels.

  --fps 8  # Frame rate of the input videos, all videos will be processed at 8 frames per second.

  --max_num_frames 49  # Maximum number of frames per input video, videos will be truncated to 49 frames.

  --skip_frames_start 0  # Number of frames to skip from the start of each video, set to 0 to not skip any frames.

  --skip_frames_end 0  # Number of frames to skip from the end of each video, set to 0 to not skip any frames.

  --train_batch_size 1  # Training batch size per device, set to 1.

  --num_train_epochs 10  # Total number of training epochs, set to 10.

  --checkpointing_steps 500  # Save checkpoints every 500 steps.

  --gradient_accumulation_steps 1  # Gradient accumulation steps, perform an update every 1 step.

  --learning_rate 1e-4  # Initial learning rate, set to 1e-4.

  --optimizer AdamW  # Optimizer type, using the AdamW optimizer.

  --adam_beta1 0.9  # Beta1 parameter for the Adam optimizer, set to 0.9.

  --adam_beta2 0.95  # Beta2 parameter for the Adam optimizer, set to 0.95.
```

## Running the Script to Start Fine-tuning

Single GPU fine-tuning:

```shell
bash finetune_single_gpu.sh
```

Multi-GPU fine-tuning:

```shell
bash finetune_multi_gpus_1.sh # Needs to be run on each node
```

## Loading the Fine-tuned Model

+ Please refer to [cli_demo.py](../inference/cli_demo.py) for how to load the fine-tuned model; a minimal sketch of that loading path is shown below.
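The sketch mirrors the LoRA branch of `cli_demo.py`; the output directory, weight file name, adapter name, and prompt are just the example values used in this README, so adjust them to your run:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

lora_rank = 128  # must match the --rank used during fine-tuning

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
# Load the LoRA weights exported by train_cogvideox_lora.py
# (output directory and weight file name are the example values from above).
pipe.load_lora_weights(
    "cogvideox-lora-single-gpu", weight_name="pytorch_lora_weights.safetensors", adapter_name="test_1"
)
# Fold the LoRA update into the base weights; cli_demo.py uses lora_scale = 1 / lora_rank.
pipe.fuse_lora(lora_scale=1 / lora_rank)
pipe.to("cuda")

video = pipe(prompt="Mickey and the bear", num_inference_steps=50, num_frames=49).frames[0]
export_to_video(video, "output.mp4", fps=8)
```

After `fuse_lora`, the pipeline behaves like the stock one, so the usual memory-saving calls (e.g., `pipe.enable_sequential_cpu_offload()`) can still be applied for low-VRAM inference.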
## Best Practices

+ The training set includes 70 videos with a resolution of `200 x 480 x 720` (frames x height x width). By skipping frames during data preprocessing, we created two smaller datasets of 49 and 16 frames to speed up experimentation, since the maximum frame count recommended by the CogVideoX team is 49. We split the 70 videos into three groups of 10, 25, and 50 videos with similar conceptual nature.
+ Using 25 or more videos works best when training new concepts and styles.
+ It works better to train using identifier tokens specified with `--id_token`. This is similar to Dreambooth training, but regular fine-tuning without such tokens also works.
+ The original repository used `lora_alpha` set to 1. We found this value ineffective across multiple runs, likely due to differences in the backend and training setup. Our recommendation is to set `lora_alpha` equal to rank or rank // 2 (see the sketch after this list).
+ We recommend using a rank of 64 or higher.
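For intuition only, here is a hypothetical PEFT-style configuration spelling out what the `lora_alpha` recommendation means; the field names follow `peft.LoraConfig`, and the target modules listed are typical attention projections assumed for illustration rather than a guarantee of what `train_cogvideox_lora.py` wires up internally:

```python
from peft import LoraConfig

rank = 128
# The effective LoRA scale is lora_alpha / rank: lora_alpha == rank gives a scale of 1.0,
# while lora_alpha == rank // 2 halves the contribution of the LoRA update.
lora_config = LoraConfig(
    r=rank,
    lora_alpha=rank,  # or rank // 2
    init_lora_weights=True,
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # assumed attention projections
)
```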
|
142
finetune/README_ja.md
Normal file
142
finetune/README_ja.md
Normal file
@ -0,0 +1,142 @@
|
||||
# CogVideoX diffusers 微調整方法
|
||||
|
||||
[Read this in English.](./README.md)
|
||||
|
||||
[中文阅读](./README_zh.md)
|
||||
|
||||
|
||||
この機能はまだ完全に完成していません。SATバージョンの微調整を確認したい場合は、[こちら](../sat/README_ja.md)を参照してください。本バージョンとは異なるデータセット形式を使用しています。
|
||||
|
||||
## ハードウェア要件
|
||||
|
||||
+ CogVideoX-2B LORA: 1 * A100
|
||||
+ CogVideoX-2B SFT: 8 * A100
|
||||
+ CogVideoX-5B/5B-I2V まだサポートしていません
|
||||
|
||||
## 依存関係のインストール
|
||||
|
||||
関連コードはまだdiffusersのリリース版に統合されていないため、diffusersブランチを使用して微調整を行う必要があります。以下の手順に従って依存関係をインストールしてください:
|
||||
|
||||
```shell
|
||||
git clone https://github.com/huggingface/diffusers.git
|
||||
cd diffusers
|
||||
git checkout cogvideox-lora-and-training
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
## データセットの準備
|
||||
|
||||
まず、データセットを準備する必要があります。データセットの形式は以下のようになります。
|
||||
|
||||
```
|
||||
.
|
||||
├── prompts.txt
|
||||
├── videos
|
||||
└── videos.txt
|
||||
```
|
||||
|
||||
[ディズニースチームボートウィリー](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset)をここからダウンロードできます。
|
||||
|
||||
ビデオ微調整データセットはテスト用として使用されます。
|
||||
|
||||
## 設定ファイルと実行
|
||||
|
||||
`accelerate` 設定ファイルは以下の通りです:
|
||||
|
||||
+ accelerate_config_machine_multi.yaml 複数GPU向け
|
||||
+ accelerate_config_machine_single.yaml 単一GPU向け
|
||||
|
||||
`finetune` スクリプト設定ファイルの例:
|
||||
|
||||
```shell
|
||||
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
|
||||
# このコマンドは、OOM(メモリ不足)エラーを防ぐために、CUDAメモリ割り当てを拡張セグメントに設定します。
|
||||
|
||||
accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu # 複数のGPUで `accelerate` を使用してトレーニングを開始します。指定された設定ファイルを使用します。
|
||||
|
||||
train_cogvideox_lora.py # LoRA微調整用に CogVideoX モデルをトレーニングするスクリプトです。
|
||||
|
||||
--pretrained_model_name_or_path THUDM/CogVideoX-2b # 事前学習済みモデルのパスです。
|
||||
|
||||
--cache_dir ~/.cache # Hugging Faceからダウンロードされたモデルとデータセットのキャッシュディレクトリです。
|
||||
|
||||
--enable_tiling # VAEタイル化機能を有効にし、メモリ使用量を削減します。
|
||||
|
||||
--enable_slicing # VAEスライス機能を有効にして、チャネルでのスライス処理を行い、メモリを節約します。
|
||||
|
||||
--instance_data_root ~/disney/ # インスタンスデータのルートディレクトリです。
|
||||
|
||||
--caption_column prompts.txt # テキストプロンプトが含まれているファイルや列を指定します。
|
||||
|
||||
--video_column videos.txt # ビデオパスが含まれているファイルや列を指定します。
|
||||
|
||||
--validation_prompt "Mickey with the captain and friends:::Mickey and the bear" # 検証用のプロンプトを指定します。複数のプロンプトを指定するには `:::` 区切り文字を使用します。
|
||||
|
||||
--validation_prompt_separator ::: # 検証プロンプトの区切り文字を `:::` に設定します。
|
||||
|
||||
--num_validation_videos 1 # 検証中に生成するビデオの数を1に設定します。
|
||||
|
||||
--validation_epochs 2 # 何エポックごとに検証を行うかを2に設定します。
|
||||
|
||||
--seed 3407 # ランダムシードを3407に設定し、トレーニングの再現性を確保します。
|
||||
|
||||
--rank 128 # LoRAの更新マトリックスの次元を128に設定します。
|
||||
|
||||
--mixed_precision bf16 # 混合精度トレーニングを `bf16` (bfloat16) に設定します。
|
||||
|
||||
--output_dir cogvideox-lora-single-gpu # 出力ディレクトリを指定します。
|
||||
|
||||
--height 480 # 入力ビデオの高さを480ピクセルに設定します。
|
||||
|
||||
--width 720 # 入力ビデオの幅を720ピクセルに設定します。
|
||||
|
||||
--fps 8 # 入力ビデオのフレームレートを8 fpsに設定します。
|
||||
|
||||
--max_num_frames 49 # 入力ビデオの最大フレーム数を49に設定します。
|
||||
|
||||
--skip_frames_start 0 # 各ビデオの最初のフレームをスキップしません。
|
||||
|
||||
--skip_frames_end 0 # 各ビデオの最後のフレームをスキップしません。
|
||||
|
||||
--train_batch_size 1 # トレーニングバッチサイズを1に設定します。
|
||||
|
||||
--num_train_epochs 10 # トレーニングのエポック数を10に設定します。
|
||||
|
||||
--checkpointing_steps 500 # 500ステップごとにチェックポイントを保存します。
|
||||
|
||||
--gradient_accumulation_steps 1 # 1ステップごとに勾配を蓄積して更新します。
|
||||
|
||||
--learning_rate 1e-4 # 初期学習率を1e-4に設定します。
|
||||
|
||||
--optimizer AdamW # AdamWオプティマイザーを使用します。
|
||||
|
||||
--adam_beta1 0.9 # Adamのbeta1パラメータを0.9に設定します。
|
||||
|
||||
--adam_beta2 0.95 # Adamのbeta2パラメータを0.95に設定します。
|
||||
```
|
||||
|
||||
## 微調整を開始
|
||||
|
||||
単一GPU微調整:
|
||||
|
||||
```shell
|
||||
bash finetune_single_gpu.sh
|
||||
```
|
||||
|
||||
複数GPU微調整:
|
||||
|
||||
```shell
|
||||
bash finetune_multi_gpus_1.sh # 各ノードで実行する必要があります。
|
||||
```
|
||||
|
||||
## 微調整済みモデルのロード
|
||||
|
||||
+ 微調整済みのモデルをロードする方法については、[cli_demo.py](../inference/cli_demo.py) を参照してください。
|
||||
|
||||
## ベストプラクティス
|
||||
|
||||
+ 解像度が `200 x 480 x 720`(フレーム数 x 高さ x 幅)のトレーニングビデオが70本含まれています。データ前処理でフレームをスキップすることで、49フレームと16フレームの小さなデータセットを作成しました。これは実験を加速するためのもので、CogVideoXチームが推奨する最大フレーム数制限は49フレームです。
|
||||
+ 25本以上のビデオが新しい概念やスタイルのトレーニングに最適です。
|
||||
+ 現在、`--id_token` を指定して識別トークンを使用してトレーニングする方が効果的です。これはDreamboothトレーニングに似ていますが、通常の微調整でも機能します。
|
||||
+ 元のリポジトリでは `lora_alpha` を1に設定していましたが、複数の実行でこの値が効果的でないことがわかりました。モデルのバックエンドやトレーニング設定によるかもしれません。私たちの提案は、lora_alphaをrankと同じか、rank // 2に設定することです。
|
||||
+ Rank 64以上の設定を推奨します。
|
175
finetune/README_zh.md
Normal file
175
finetune/README_zh.md
Normal file
@ -0,0 +1,175 @@
|
||||
# CogVideoX diffusers 微调方案
|
||||
|
||||
[Read this in English](./README.md)
|
||||
|
||||
[日本語で読む](./README_ja.md)
|
||||
|
||||
本功能尚未完全完善,如果您想查看SAT版本微调,请查看[这里](../sat/README_zh.md)。其数据集格式与本版本不同。
|
||||
|
||||
## 硬件要求
|
||||
|
||||
+ CogVideoX-2B LORA: 1 * A100
|
||||
+ CogVideoX-2B SFT: 8 * A100
|
||||
+ CogVideoX-5B/5B-I2V 暂未支持
|
||||
|
||||
## 安装依赖
|
||||
|
||||
由于相关代码还没有被合并到diffusers发行版,你需要基于diffusers分支进行微调。请按照以下步骤安装依赖:
|
||||
|
||||
```shell
|
||||
git clone https://github.com/huggingface/diffusers.git
|
||||
cd diffusers
|
||||
git checkout cogvideox-lora-and-training
|
||||
pip install -e .
|
||||
```
|
||||
|
||||
## 准备数据集
|
||||
|
||||
首先,你需要准备数据集,数据集格式如下,其中,videos.txt 存放 videos 中的视频。
|
||||
|
||||
```
|
||||
.
|
||||
├── prompts.txt
|
||||
├── videos
|
||||
└── videos.txt
|
||||
```
|
||||
|
||||
你可以从这里下载 [迪士尼汽船威利号](https://huggingface.co/datasets/Wild-Heart/Disney-VideoGeneration-Dataset)
|
||||
|
||||
视频微调数据集作为测试微调。
|
||||
|
||||
## 配置文件和运行
|
||||
|
||||
`accelerate` 配置文件如下:
|
||||
|
||||
+ accelerate_config_machine_multi.yaml 适合多GPU使用
|
||||
+ accelerate_config_machine_single.yaml 适合单GPU使用
|
||||
|
||||
`finetune` 脚本配置文件如下:
|
||||
|
||||
```shell
|
||||
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
|
||||
# 这条命令设置了 PyTorch 的 CUDA 内存分配策略,将显存扩展为段式内存管理,以防止 OOM(Out of Memory)错误。
|
||||
|
||||
accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu \
|
||||
# 使用 Accelerate 启动训练,指定配置文件 `accelerate_config_machine_single.yaml`,并使用多 GPU。
|
||||
|
||||
train_cogvideox_lora.py \
|
||||
# 这是你要执行的训练脚本,用于 LoRA 微调 CogVideoX 模型。
|
||||
|
||||
--pretrained_model_name_or_path THUDM/CogVideoX-2b \
|
||||
# 预训练模型的路径,指向你要微调的 CogVideoX-2b 模型。
|
||||
|
||||
--cache_dir ~/.cache \
|
||||
# 模型缓存的目录,用于存储从 Hugging Face 下载的模型和数据集。
|
||||
|
||||
--enable_tiling \
|
||||
# 启用 VAE tiling 功能,通过将图像划分成更小的区块处理,减少显存占用。
|
||||
|
||||
--enable_slicing \
|
||||
# 启用 VAE slicing 功能,将图像在通道上切片处理,以节省显存。
|
||||
|
||||
--instance_data_root ~/disney/ \
|
||||
# 实例数据的根目录,训练时使用的数据集文件夹。
|
||||
|
||||
--caption_column prompts.txt \
|
||||
# 用于指定包含实例提示(文本描述)的列或文件,在本例中为 `prompts.txt` 文件。
|
||||
|
||||
--video_column videos.txt \
|
||||
# 用于指定包含视频路径的列或文件,在本例中为 `videos.txt` 文件。
|
||||
|
||||
--validation_prompt "Mickey with the captain and friends:::Mickey and the bear" \
|
||||
# 用于验证的提示语,多个提示语用指定分隔符(例如 `:::`)分开。
|
||||
|
||||
--validation_prompt_separator ::: \
|
||||
# 验证提示语的分隔符,在此设置为 `:::`。
|
||||
|
||||
--num_validation_videos 1 \
|
||||
# 验证期间生成的视频数量,设置为 1。
|
||||
|
||||
--validation_epochs 2 \
|
||||
# 每隔多少个 epoch 运行一次验证,设置为每 2 个 epoch 验证一次。
|
||||
|
||||
--seed 3407 \
|
||||
# 设置随机数种子,确保训练的可重复性,设置为 3407。
|
||||
|
||||
--rank 128 \
|
||||
# LoRA 更新矩阵的维度,控制 LoRA 层的参数大小,设置为 128。
|
||||
|
||||
--mixed_precision bf16 \
|
||||
# 使用混合精度训练,设置为 `bf16`(bfloat16),可以减少显存占用并加速训练。
|
||||
|
||||
--output_dir cogvideox-lora-single-gpu \
|
||||
# 输出目录,存放模型预测结果和检查点。
|
||||
|
||||
--height 480 \
|
||||
# 输入视频的高度,所有视频将被调整到 480 像素。
|
||||
|
||||
--width 720 \
|
||||
# 输入视频的宽度,所有视频将被调整到 720 像素。
|
||||
|
||||
--fps 8 \
|
||||
# 输入视频的帧率,所有视频将以每秒 8 帧处理。
|
||||
|
||||
--max_num_frames 49 \
|
||||
# 输入视频的最大帧数,视频将被截取到最多 49 帧。
|
||||
|
||||
--skip_frames_start 0 \
|
||||
# 每个视频从头部开始跳过的帧数,设置为 0,表示不跳过帧。
|
||||
|
||||
--skip_frames_end 0 \
|
||||
# 每个视频从尾部跳过的帧数,设置为 0,表示不跳过尾帧。
|
||||
|
||||
--train_batch_size 1 \
|
||||
# 训练的批次大小,每个设备的训练批次设置为 1。
|
||||
|
||||
--num_train_epochs 10 \
|
||||
# 训练的总 epoch 数,设置为 10。
|
||||
|
||||
--checkpointing_steps 500 \
|
||||
# 每经过 500 步保存一次检查点。
|
||||
|
||||
--gradient_accumulation_steps 1 \
|
||||
# 梯度累积步数,表示每进行 1 步才进行一次梯度更新。
|
||||
|
||||
--learning_rate 1e-4 \
|
||||
# 初始学习率,设置为 1e-4。
|
||||
|
||||
--optimizer AdamW \
|
||||
# 优化器类型,选择 AdamW 优化器。
|
||||
|
||||
--adam_beta1 0.9 \
|
||||
# Adam 优化器的 beta1 参数,设置为 0.9。
|
||||
|
||||
--adam_beta2 0.95 \
|
||||
# Adam 优化器的 beta2 参数,设置为 0.95。
|
||||
```
|
||||
|
||||
## 运行脚本,开始微调
|
||||
|
||||
单卡微调:
|
||||
|
||||
```shell
|
||||
bash finetune_single_gpu.sh
|
||||
```
|
||||
|
||||
多卡微调:
|
||||
|
||||
```shell
|
||||
bash finetune_multi_gpus_1.sh #需要在每个节点运行
|
||||
```
|
||||
|
||||
## 载入微调的模型
|
||||
|
||||
+ 请关注[cli_demo.py](../inference/cli_demo.py) 以了解如何加载微调的模型。
|
||||
|
||||
## 最佳实践
|
||||
|
||||
+ 包含70个分辨率为 `200 x 480 x 720`(帧数 x 高 x
|
||||
宽)的训练视频。通过数据预处理中的帧跳过,我们创建了两个较小的49帧和16帧数据集,以加快实验速度,因为CogVideoX团队建议的最大帧数限制是49帧。我们将70个视频分成三组,分别为10、25和50个视频。这些视频的概念性质相似。
|
||||
+ 25个及以上的视频在训练新概念和风格时效果最佳。
|
||||
+ 现使用可以通过 `--id_token` 指定的标识符token进行训练效果更好。这类似于 Dreambooth 训练,但不使用这种token的常规微调也可以工作。
|
||||
+ 原始仓库使用 `lora_alpha` 设置为 1。我们发现这个值在多次运行中效果不佳,可能是因为模型后端和训练设置的不同。我们的建议是将
|
||||
lora_alpha 设置为与 rank 相同或 rank // 2。
|
||||
+ 建议使用 rank 为 64 及以上的设置。
|
||||
|
26
finetune/accelerate_config_machine_multi.yaml
Normal file
26
finetune/accelerate_config_machine_multi.yaml
Normal file
@ -0,0 +1,26 @@
compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  deepspeed_hostfile: hostfile.txt
  deepspeed_multinode_launcher: pdsh
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
enable_cpu_affinity: true
main_process_ip: 10.250.128.19
main_process_port: 12355
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 32
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
24
finetune/accelerate_config_machine_single.yaml
Normal file
24
finetune/accelerate_config_machine_single.yaml
Normal file
@ -0,0 +1,24 @@
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
dynamo_backend: 'no'
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
40
finetune/finetune_multi_gpus_1.sh
Normal file
40
finetune/finetune_multi_gpus_1.sh
Normal file
@ -0,0 +1,40 @@
#!/bin/bash

export MODEL_PATH="THUDM/CogVideoX-2b"
export CACHE_PATH="~/.cache"
export DATASET_PATH="Disney-VideoGeneration-Dataset"
export OUTPUT_PATH="cogvideox-lora-multi-gpu"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES

accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu --machine_rank 0 \
  train_cogvideox_lora.py \
  --pretrained_model_name_or_path $MODEL_PATH \
  --cache_dir $CACHE_PATH \
  --enable_tiling \
  --enable_slicing \
  --instance_data_root $DATASET_PATH \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --validation_prompt "Mickey with the captain and friends:::Mickey and the bear" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 2 \
  --seed 3407 \
  --rank 128 \
  --mixed_precision bf16 \
  --output_dir $OUTPUT_PATH \
  --height 480 \
  --width 720 \
  --fps 8 \
  --max_num_frames 49 \
  --skip_frames_start 0 \
  --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 10 \
  --checkpointing_steps 500 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --optimizer AdamW \
  --adam_beta1 0.9 \
  --adam_beta2 0.95
40
finetune/finetune_single_gpu.sh
Normal file
40
finetune/finetune_single_gpu.sh
Normal file
@ -0,0 +1,40 @@
#!/bin/bash

export MODEL_PATH="THUDM/CogVideoX-2b"
export CACHE_PATH="~/.cache"
export DATASET_PATH="Disney-VideoGeneration-Dataset"
export OUTPUT_PATH="cogvideox-lora-single-gpu"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES

accelerate launch --config_file accelerate_config_machine_single.yaml --multi_gpu \
  train_cogvideox_lora.py \
  --pretrained_model_name_or_path $MODEL_PATH \
  --cache_dir $CACHE_PATH \
  --enable_tiling \
  --enable_slicing \
  --instance_data_root $DATASET_PATH \
  --caption_column prompts.txt \
  --video_column videos.txt \
  --validation_prompt "Mickey with the captain and friends:::Mickey and the bear" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 2 \
  --seed 3407 \
  --rank 128 \
  --mixed_precision bf16 \
  --output_dir $OUTPUT_PATH \
  --height 480 \
  --width 720 \
  --fps 8 \
  --max_num_frames 49 \
  --skip_frames_start 0 \
  --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 10 \
  --checkpointing_steps 500 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-4 \
  --optimizer AdamW \
  --adam_beta1 0.9 \
  --adam_beta2 0.95
2
finetune/hostfile.txt
Normal file
2
finetune/hostfile.txt
Normal file
@ -0,0 +1,2 @@
node1 slots=8
node2 slots=8
1592
finetune/train_cogvideox_lora.py
Normal file
1592
finetune/train_cogvideox_lora.py
Normal file
File diff suppressed because it is too large
Load Diff
@ -21,26 +21,30 @@ import argparse
|
||||
from typing import Literal
|
||||
|
||||
import torch
|
||||
from diffusers import (CogVideoXPipeline,
|
||||
CogVideoXDDIMScheduler,
|
||||
CogVideoXDPMScheduler,
|
||||
CogVideoXImageToVideoPipeline,
|
||||
CogVideoXVideoToVideoPipeline)
|
||||
from diffusers import (
|
||||
CogVideoXPipeline,
|
||||
CogVideoXDDIMScheduler,
|
||||
CogVideoXDPMScheduler,
|
||||
CogVideoXImageToVideoPipeline,
|
||||
CogVideoXVideoToVideoPipeline,
|
||||
)
|
||||
|
||||
from diffusers.utils import export_to_video, load_image, load_video
|
||||
|
||||
|
||||
def generate_video(
|
||||
prompt: str,
|
||||
model_path: str,
|
||||
output_path: str = "./output.mp4",
|
||||
image_or_video_path: str = "",
|
||||
num_inference_steps: int = 50,
|
||||
guidance_scale: float = 6.0,
|
||||
num_videos_per_prompt: int = 1,
|
||||
dtype: torch.dtype = torch.bfloat16,
|
||||
generate_type: str = Literal["t2v", "i2v", "v2v"], # i2v: image to video, v2v: video to video
|
||||
seed: int = 42,
|
||||
prompt: str,
|
||||
model_path: str,
|
||||
lora_path: str = None,
|
||||
lora_rank: int = 128,
|
||||
output_path: str = "./output.mp4",
|
||||
image_or_video_path: str = "",
|
||||
num_inference_steps: int = 50,
|
||||
guidance_scale: float = 6.0,
|
||||
num_videos_per_prompt: int = 1,
|
||||
dtype: torch.dtype = torch.bfloat16,
|
||||
generate_type: str = Literal["t2v", "i2v", "v2v"], # i2v: image to video, v2v: video to video
|
||||
seed: int = 42,
|
||||
):
|
||||
"""
|
||||
Generates a video based on the given prompt and saves it to the specified path.
|
||||
@ -48,12 +52,14 @@ def generate_video(
|
||||
Parameters:
|
||||
- prompt (str): The description of the video to be generated.
|
||||
- model_path (str): The path of the pre-trained model to be used.
|
||||
- lora_path (str): The path of the LoRA weights to be used.
|
||||
- lora_rank (int): The rank of the LoRA weights.
|
||||
- output_path (str): The path where the generated video will be saved.
|
||||
- num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
|
||||
- guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
|
||||
- num_videos_per_prompt (int): Number of videos to generate per prompt.
|
||||
- dtype (torch.dtype): The data type for computation (default is torch.bfloat16).
|
||||
- generate_type (str): The type of video generation (e.g., 't2v', 'i2v', 'v2v').
|
||||
- generate_type (str): The type of video generation (e.g., 't2v', 'i2v', 'v2v').·
|
||||
- seed (int): The seed for reproducibility.
|
||||
"""
|
||||
|
||||
@ -73,9 +79,16 @@ def generate_video(
|
||||
pipe = CogVideoXVideoToVideoPipeline.from_pretrained(model_path, torch_dtype=dtype)
|
||||
video = load_video(image_or_video_path)
|
||||
|
||||
# If you're using with lora, add this code
|
||||
if lora_path:
|
||||
pipe.load_lora_weights(lora_path, weight_name="pytorch_lora_weights.safetensors", adapter_name="test_1")
|
||||
pipe.fuse_lora(lora_scale=1 / lora_rank)
|
||||
|
||||
# 2. Set Scheduler.
|
||||
# Can be changed to `CogVideoXDPMScheduler` or `CogVideoXDDIMScheduler`.
|
||||
# We recommend using `CogVideoXDDIMScheduler` for CogVideoX-2B and `CogVideoXDPMScheduler` for CogVideoX-5B.
|
||||
# We recommend using `CogVideoXDDIMScheduler` for CogVideoX-2B.
|
||||
# using `CogVideoXDPMScheduler` for CogVideoX-5B / CogVideoX-5B-I2V.
|
||||
|
||||
# pipe.scheduler = CogVideoXDDIMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
|
||||
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
|
||||
|
||||
@ -83,27 +96,29 @@ def generate_video(
|
||||
# turn off if you have multiple GPUs or enough GPU memory(such as H100) and it will cost less time in inference
|
||||
# and enable to("cuda")
|
||||
|
||||
# pipe.enable_sequential_cpu_offload()
|
||||
pipe.to("cuda")
|
||||
# pipe.to("cuda")
|
||||
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
|
||||
pipe.vae.enable_slicing()
|
||||
pipe.vae.enable_tiling()
|
||||
|
||||
# 4. Generate the video frames based on the prompt.
|
||||
# `num_frames` is the Number of frames to generate.
|
||||
# This is the default value for 6 seconds video and 8 fps,so 48 frames and will plus 1 frame for the first frame and 49 frames.
|
||||
# This is the default for a 6-second video at 8 fps (48 frames), plus 1 extra frame for the first frame, giving 49 frames.
|
||||
if generate_type == "i2v":
|
||||
video_generate = pipe(
|
||||
prompt=prompt,
|
||||
image=image, # The path of the image to be used as the background of the video
|
||||
num_videos_per_prompt=num_videos_per_prompt, # Number of videos to generate per prompt
|
||||
num_inference_steps=num_inference_steps, # Number of inference steps
|
||||
num_frames=49, # Number of frames to generate,changed to 49 for diffusers version `0.31.0` and after.
|
||||
use_dynamic_cfg=True, ## This id used for DPM Sechduler, for DDIM scheduler, it should be False
|
||||
num_frames=49,  # Number of frames to generate, changed to 49 for diffusers version `0.30.3` and after.
use_dynamic_cfg=True,  # This is used for the DPM scheduler; for the DDIM scheduler, it should be False
|
||||
guidance_scale=guidance_scale,
|
||||
generator=torch.Generator().manual_seed(seed), # Set the seed for reproducibility
|
||||
).frames[0]
|
||||
elif generate_type == "t2v":
|
||||
video_generate = pipe(
|
||||
video_generate = pipe(
|
||||
prompt=prompt,
|
||||
num_videos_per_prompt=num_videos_per_prompt,
|
||||
num_inference_steps=num_inference_steps,
|
||||
@ -130,19 +145,31 @@ def generate_video(
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="Generate a video from a text prompt using CogVideoX")
|
||||
parser.add_argument("--prompt", type=str, required=True, help="The description of the video to be generated")
|
||||
parser.add_argument("--image_or_video_path", type=str, default=None,
|
||||
help="The path of the image to be used as the background of the video")
|
||||
parser.add_argument("--model_path", type=str, default="THUDM/CogVideoX-5b",
|
||||
help="The path of the pre-trained model to be used")
|
||||
parser.add_argument("--output_path", type=str, default="./output.mp4",
|
||||
help="The path where the generated video will be saved")
|
||||
parser.add_argument(
|
||||
"--image_or_video_path",
|
||||
type=str,
|
||||
default=None,
|
||||
help="The path of the image to be used as the background of the video",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--model_path", type=str, default="THUDM/CogVideoX-5b", help="The path of the pre-trained model to be used"
|
||||
)
|
||||
parser.add_argument("--lora_path", type=str, default=None, help="The path of the LoRA weights to be used")
|
||||
parser.add_argument("--lora_rank", type=int, default=128, help="The rank of the LoRA weights")
|
||||
parser.add_argument(
|
||||
"--output_path", type=str, default="./output.mp4", help="The path where the generated video will be saved"
|
||||
)
|
||||
parser.add_argument("--guidance_scale", type=float, default=6.0, help="The scale for classifier-free guidance")
|
||||
parser.add_argument("--num_inference_steps", type=int, default=50, help="Number of steps for the inference process")
|
||||
parser.add_argument(
|
||||
"--num_inference_steps", type=int, default=50, help="Number of steps for the inference process"
|
||||
)
|
||||
parser.add_argument("--num_videos_per_prompt", type=int, default=1, help="Number of videos to generate per prompt")
|
||||
parser.add_argument("--generate_type", type=str, default="t2v",
|
||||
help="The type of video generation (e.g., 't2v', 'i2v', 'v2v')")
|
||||
parser.add_argument("--dtype", type=str, default="bfloat16",
|
||||
help="The data type for computation (e.g., 'float16' or 'bfloat16')")
|
||||
parser.add_argument(
|
||||
"--generate_type", type=str, default="t2v", help="The type of video generation (e.g., 't2v', 'i2v', 'v2v')"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16' or 'bfloat16')"
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=42, help="The seed for reproducibility")
|
||||
|
||||
args = parser.parse_args()
|
||||
@ -150,8 +177,10 @@ if __name__ == "__main__":
|
||||
generate_video(
|
||||
prompt=args.prompt,
|
||||
model_path=args.model_path,
|
||||
image_or_video_path=args.image_or_video_path,
|
||||
lora_path=args.lora_path,
|
||||
lora_rank=args.lora_rank,
|
||||
output_path=args.output_path,
|
||||
image_or_video_path=args.image_or_video_path,
|
||||
num_inference_steps=args.num_inference_steps,
|
||||
guidance_scale=args.guidance_scale,
|
||||
num_videos_per_prompt=args.num_videos_per_prompt,
|
||||
|
@ -6,7 +6,7 @@ Note:
|
||||
Must install the `torchao`, `torch`, `diffusers`, and `accelerate` libraries FROM SOURCE to use the quantization feature.
Only NVIDIA GPUs like H100 or higher are supported for FP-8 quantization.
|
||||
|
||||
ALL quantization schemes must using with NVIDIA GPUs.
|
||||
ALL quantization schemes must be used with NVIDIA GPUs.
|
||||
|
||||
# Run the script:
|
||||
|
||||
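To make the requirement above concrete, here is a rough, hedged sketch of weight-only INT8 quantization of the CogVideoX transformer with `torchao` before assembling the pipeline. It is an illustration of the idea, not the script in this diff, and it assumes a recent source build of `torchao` providing `quantize_` and `int8_weight_only`:

```python
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

model_id = "THUDM/CogVideoX-5b"

# Quantize only the transformer; the VAE and text encoder stay in bf16.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # as noted in the script, drop this line when using FP8
```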
@ -83,7 +83,8 @@ def generate_video(
|
||||
|
||||
# Using with compile will run faster. First time infer will cost ~30min to compile.
|
||||
# pipe.transformer.to(memory_format=torch.channels_last)
|
||||
# for FP8 should remove pipe.enable_model_cpu_offload()
|
||||
|
||||
# for FP8 should remove pipe.enable_model_cpu_offload()
|
||||
pipe.enable_model_cpu_offload()
|
||||
|
||||
# This is not for FP8 and INT8 and should remove this line
|
||||
@ -95,7 +96,7 @@ def generate_video(
|
||||
num_videos_per_prompt=num_videos_per_prompt,
|
||||
num_inference_steps=num_inference_steps,
|
||||
num_frames=49,
|
||||
use_dynamic_cfg=True, ## This id used for DPM Sechduler, for DDIM scheduler, it should be False
|
||||
use_dynamic_cfg=True,
|
||||
guidance_scale=guidance_scale,
|
||||
generator=torch.Generator(device="cuda").manual_seed(42),
|
||||
).frames[0]
|
||||
|
@ -1,18 +1,22 @@
|
||||
"""
|
||||
This script is designed to demonstrate how to use the CogVideoX-2b VAE model for video encoding and decoding.
|
||||
It allows you to encode a video into a latent representation, decode it back into a video, or perform both operations sequentially.
|
||||
Before running the script, make sure to clone the CogVideoX Hugging Face model repository and set the `{your local diffusers path}` argument to the path of the cloned repository.
|
||||
Before running the script, make sure to clone the CogVideoX Hugging Face model repository and set the
|
||||
`{your local diffusers path}` argument to the path of the cloned repository.
|
||||
|
||||
Command 1: Encoding Video
|
||||
Encodes the video located at ../resources/videos/1.mp4 using the CogVideoX-2b VAE model.
|
||||
Memory Usage: ~34GB of GPU memory for encoding.
|
||||
If you do not have enough GPU memory, we provide a pre-encoded tensor file (encoded.pt) in the resources folder and you can still run the decoding command.
|
||||
Encodes the video located at ../resources/videos/1.mp4 using the CogVideoX-5b VAE model.
|
||||
Memory Usage: ~18GB of GPU memory for encoding.
|
||||
|
||||
If you do not have enough GPU memory, we provide a pre-encoded tensor file (encoded.pt) in the resources folder,
|
||||
and you can still run the decoding command.
|
||||
|
||||
$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --video_path ../resources/videos/1.mp4 --mode encode
|
||||
|
||||
Command 2: Decoding Video
|
||||
|
||||
Decodes the latent representation stored in encoded.pt back into a video.
|
||||
Memory Usage: ~19GB of GPU memory for decoding.
|
||||
Memory Usage: ~4GB of GPU memory for decoding.
|
||||
$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --encoded_path ./encoded.pt --mode decode
|
||||
|
||||
Command 3: Encoding and Decoding Video
|
||||
@ -24,9 +28,9 @@ $ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/v
|
||||
import argparse
|
||||
import torch
|
||||
import imageio
|
||||
import numpy as np
|
||||
from diffusers import AutoencoderKLCogVideoX
|
||||
from torchvision import transforms
|
||||
import numpy as np
|
||||
|
||||
|
||||
def encode_video(model_path, video_path, dtype, device):
|
||||
@ -42,7 +46,12 @@ def encode_video(model_path, video_path, dtype, device):
|
||||
Returns:
|
||||
- torch.Tensor: The encoded video frames.
|
||||
"""
|
||||
|
||||
model = AutoencoderKLCogVideoX.from_pretrained(model_path, torch_dtype=dtype).to(device)
|
||||
|
||||
model.enable_slicing()
|
||||
model.enable_tiling()
|
||||
|
||||
video_reader = imageio.get_reader(video_path, "ffmpeg")
|
||||
|
||||
frames = [transforms.ToTensor()(frame) for frame in video_reader]
|
||||
@ -80,13 +89,13 @@ def save_video(tensor, output_path):
|
||||
Saves the video frames to a video file.
|
||||
|
||||
Parameters:
|
||||
- tensor (torch.Tensor): The video frames tensor.
|
||||
- tensor (torch.Tensor): The video frames' tensor.
|
||||
- output_path (str): The path to save the output video.
|
||||
"""
|
||||
tensor = tensor.to(dtype=torch.float32)
|
||||
frames = tensor[0].squeeze(0).permute(1, 2, 3, 0).cpu().numpy()
|
||||
frames = np.clip(frames, 0, 1) * 255
|
||||
frames = frames.astype(np.uint8)
|
||||
|
||||
writer = imageio.get_writer(output_path + "/output.mp4", fps=8)
|
||||
for frame in frames:
|
||||
writer.append_data(frame)
|
||||
@ -103,7 +112,7 @@ if __name__ == "__main__":
|
||||
"--mode", type=str, choices=["encode", "decode", "both"], required=True, help="Mode: encode, decode, or both"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--dtype", type=str, default="float16", help="The data type for computation (e.g., 'float16' or 'float32')"
|
||||
"--dtype", type=str, default="bfloat16", help="The data type for computation (e.g., 'float16' or 'bfloat16')"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--device", type=str, default="cuda", help="The device to use for computation (e.g., 'cuda' or 'cpu')"
|
||||
@ -111,7 +120,7 @@ if __name__ == "__main__":
|
||||
args = parser.parse_args()
|
||||
|
||||
device = torch.device(args.device)
|
||||
dtype = torch.float16 if args.dtype == "float16" else torch.float32
|
||||
dtype = torch.float16 if args.dtype == "float16" else torch.bfloat16
|
||||
|
||||
if args.mode == "encode":
|
||||
assert args.video_path, "Video path must be provided for encoding."
|
||||
|
@ -3,7 +3,7 @@ THis is the main file for the gradio web demo. It uses the CogVideoX-5B model to
|
||||
set environment variable OPENAI_API_KEY to use the OpenAI API to enhance the prompt.
|
||||
|
||||
Usage:
|
||||
OpenAI_API_KEY=your_openai_api_key OpenAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_web_demo.py
|
||||
OPENAI_API_KEY=your_openai_api_key OPENAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_web_demo.py
|
||||
"""
|
||||
|
||||
import math
|
||||
@ -12,9 +12,20 @@ import random
|
||||
import threading
|
||||
import time
|
||||
|
||||
import cv2
|
||||
import tempfile
|
||||
import imageio_ffmpeg
|
||||
import gradio as gr
|
||||
import torch
|
||||
from diffusers import CogVideoXPipeline, CogVideoXDDIMScheduler, CogVideoXDPMScheduler
|
||||
from PIL import Image
|
||||
from diffusers import (
|
||||
CogVideoXPipeline,
|
||||
CogVideoXDPMScheduler,
|
||||
CogVideoXVideoToVideoPipeline,
|
||||
CogVideoXImageToVideoPipeline,
|
||||
CogVideoXTransformer3DModel,
|
||||
)
|
||||
from diffusers.utils import load_video, load_image
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
from diffusers.image_processor import VaeImageProcessor
|
||||
@ -31,18 +42,33 @@ snapshot_download(repo_id="AlexWortega/RIFE", local_dir="model_rife")
|
||||
|
||||
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to(device)
|
||||
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
|
||||
pipe_video = CogVideoXVideoToVideoPipeline.from_pretrained(
|
||||
"THUDM/CogVideoX-5b",
|
||||
transformer=pipe.transformer,
|
||||
vae=pipe.vae,
|
||||
scheduler=pipe.scheduler,
|
||||
tokenizer=pipe.tokenizer,
|
||||
text_encoder=pipe.text_encoder,
|
||||
torch_dtype=torch.bfloat16,
|
||||
).to(device)
|
||||
|
||||
# Unnecessary
|
||||
pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
|
||||
"THUDM/CogVideoX-5b-I2V",
|
||||
transformer=CogVideoXTransformer3DModel.from_pretrained(
|
||||
"THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
|
||||
),
|
||||
vae=pipe.vae,
|
||||
scheduler=pipe.scheduler,
|
||||
tokenizer=pipe.tokenizer,
|
||||
text_encoder=pipe.text_encoder,
|
||||
torch_dtype=torch.bfloat16,
|
||||
).to(device)
|
||||
|
||||
pipe.enable_model_cpu_offload()
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
pipe.vae.enable_slicing()
|
||||
pipe.vae.enable_tiling()
|
||||
|
||||
# Compile
|
||||
|
||||
# pipe.transformer.to(memory_format=torch.channels_last)
|
||||
# pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
|
||||
# pipe_image.transformer.to(memory_format=torch.channels_last)
|
||||
# pipe_image.transformer = torch.compile(pipe_image.transformer, mode="max-autotune", fullgraph=True)
|
||||
|
||||
os.makedirs("./output", exist_ok=True)
|
||||
os.makedirs("./gradio_tmp", exist_ok=True)
|
||||
@ -64,6 +90,80 @@ Video descriptions must have the same num of words as examples below. Extra word
|
||||
"""
|
||||
|
||||
|
||||
def resize_if_unfit(input_video, progress=gr.Progress(track_tqdm=True)):
|
||||
width, height = get_video_dimensions(input_video)
|
||||
|
||||
if width == 720 and height == 480:
|
||||
processed_video = input_video
|
||||
else:
|
||||
processed_video = center_crop_resize(input_video)
|
||||
return processed_video
|
||||
|
||||
|
||||
def get_video_dimensions(input_video_path):
|
||||
reader = imageio_ffmpeg.read_frames(input_video_path)
|
||||
metadata = next(reader)
|
||||
return metadata["size"]
|
||||
|
||||
|
||||
def center_crop_resize(input_video_path, target_width=720, target_height=480):
|
||||
cap = cv2.VideoCapture(input_video_path)
|
||||
|
||||
orig_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
|
||||
orig_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
|
||||
orig_fps = cap.get(cv2.CAP_PROP_FPS)
|
||||
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
|
||||
|
||||
width_factor = target_width / orig_width
|
||||
height_factor = target_height / orig_height
|
||||
resize_factor = max(width_factor, height_factor)
|
||||
|
||||
inter_width = int(orig_width * resize_factor)
|
||||
inter_height = int(orig_height * resize_factor)
|
||||
|
||||
target_fps = 8
|
||||
ideal_skip = max(0, math.ceil(orig_fps / target_fps) - 1)
|
||||
skip = min(5, ideal_skip) # Cap at 5
|
||||
|
||||
while (total_frames / (skip + 1)) < 49 and skip > 0:
|
||||
skip -= 1
|
||||
|
||||
processed_frames = []
|
||||
frame_count = 0
|
||||
total_read = 0
|
||||
|
||||
while frame_count < 49 and total_read < total_frames:
|
||||
ret, frame = cap.read()
|
||||
if not ret:
|
||||
break
|
||||
|
||||
if total_read % (skip + 1) == 0:
|
||||
resized = cv2.resize(frame, (inter_width, inter_height), interpolation=cv2.INTER_AREA)
|
||||
|
||||
start_x = (inter_width - target_width) // 2
|
||||
start_y = (inter_height - target_height) // 2
|
||||
cropped = resized[start_y : start_y + target_height, start_x : start_x + target_width]
|
||||
|
||||
processed_frames.append(cropped)
|
||||
frame_count += 1
|
||||
|
||||
total_read += 1
|
||||
|
||||
cap.release()
|
||||
|
||||
with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as temp_file:
|
||||
temp_video_path = temp_file.name
|
||||
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
|
||||
out = cv2.VideoWriter(temp_video_path, fourcc, target_fps, (target_width, target_height))
|
||||
|
||||
for frame in processed_frames:
|
||||
out.write(frame)
|
||||
|
||||
out.release()
|
||||
|
||||
return temp_video_path
|
||||
|
||||
|
||||
def convert_prompt(prompt: str, retry_times: int = 3) -> str:
|
||||
if not os.environ.get("OPENAI_API_KEY"):
|
||||
return prompt
|
||||
@ -103,7 +203,7 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
|
||||
"content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"',
|
||||
},
|
||||
],
|
||||
model="glm-4-0520",
|
||||
model="glm-4-plus",
|
||||
temperature=0.01,
|
||||
top_p=0.7,
|
||||
stream=False,
|
||||
@ -116,6 +216,9 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
|
||||
|
||||
def infer(
|
||||
prompt: str,
|
||||
image_input: str,
|
||||
video_input: str,
|
||||
video_strenght: float,
|
||||
num_inference_steps: int,
|
||||
guidance_scale: float,
|
||||
seed: int = -1,
|
||||
@ -123,16 +226,44 @@ def infer(
|
||||
):
|
||||
if seed == -1:
|
||||
seed = random.randint(0, 2**8 - 1)
|
||||
video_pt = pipe(
|
||||
prompt=prompt,
|
||||
num_videos_per_prompt=1,
|
||||
num_inference_steps=num_inference_steps,
|
||||
num_frames=49,
|
||||
use_dynamic_cfg=True,
|
||||
output_type="pt",
|
||||
guidance_scale=guidance_scale,
|
||||
generator=torch.Generator(device="cpu").manual_seed(seed),
|
||||
).frames
|
||||
|
||||
if video_input is not None:
|
||||
video = load_video(video_input)[:49] # Limit to 49 frames
|
||||
video_pt = pipe_video(
|
||||
video=video,
|
||||
prompt=prompt,
|
||||
num_inference_steps=num_inference_steps,
|
||||
num_videos_per_prompt=1,
|
||||
strength=video_strenght,
|
||||
use_dynamic_cfg=True,
|
||||
output_type="pt",
|
||||
guidance_scale=guidance_scale,
|
||||
generator=torch.Generator(device="cpu").manual_seed(seed),
|
||||
).frames
|
||||
elif image_input is not None:
|
||||
image_input = Image.fromarray(image_input).resize(size=(720, 480)) # Convert to PIL
|
||||
image = load_image(image_input)
|
||||
video_pt = pipe_image(
|
||||
image=image,
|
||||
prompt=prompt,
|
||||
num_inference_steps=num_inference_steps,
|
||||
num_videos_per_prompt=1,
|
||||
use_dynamic_cfg=True,
|
||||
output_type="pt",
|
||||
guidance_scale=guidance_scale,
|
||||
generator=torch.Generator(device="cpu").manual_seed(seed),
|
||||
).frames
|
||||
else:
|
||||
video_pt = pipe(
|
||||
prompt=prompt,
|
||||
num_videos_per_prompt=1,
|
||||
num_inference_steps=num_inference_steps,
|
||||
num_frames=49,
|
||||
use_dynamic_cfg=True,
|
||||
output_type="pt",
|
||||
guidance_scale=guidance_scale,
|
||||
generator=torch.Generator(device="cpu").manual_seed(seed),
|
||||
).frames
|
||||
|
||||
return (video_pt, seed)
|
||||
|
||||
@ -163,6 +294,8 @@ def delete_old_files():
|
||||
|
||||
|
||||
threading.Thread(target=delete_old_files, daemon=True).start()
|
||||
examples_videos = [["example_videos/horse.mp4"], ["example_videos/kitten.mp4"], ["example_videos/train_running.mp4"]]
|
||||
examples_images = [["example_images/beach.png"], ["example_images/street.png"], ["example_images/camping.png"]]
|
||||
|
||||
with gr.Blocks() as demo:
|
||||
gr.Markdown("""
|
||||
@ -170,17 +303,30 @@ with gr.Blocks() as demo:
|
||||
CogVideoX-5B Huggingface Space🤗
|
||||
</div>
|
||||
<div style="text-align: center;">
|
||||
<a href="https://huggingface.co/THUDM/CogVideoX-5B">🤗 5B Model Hub</a> |
|
||||
<a href="https://huggingface.co/THUDM/CogVideoX-5B">🤗 5B(T2V) Model Hub</a> |
|
||||
<a href="https://huggingface.co/THUDM/CogVideoX-5B-I2V">🤗 5B(I2V) Model Hub</a> |
|
||||
<a href="https://github.com/THUDM/CogVideo">🌐 Github</a> |
|
||||
<a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
|
||||
</div>
|
||||
|
||||
<div style="text-align: center;display: flex;justify-content: center;align-items: center;margin-top: 1em;margin-bottom: .5em;">
|
||||
<span>If the Space is too busy, duplicate it to use privately</span>
|
||||
<a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space?duplicate=true"><img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/duplicate-this-space-lg.svg" width="160" style="
|
||||
margin-left: .75em;
|
||||
"></a>
|
||||
</div>
|
||||
<div style="text-align: center; font-size: 15px; font-weight: bold; color: red; margin-bottom: 20px;">
|
||||
⚠️ This demo is for academic research and experiential use only.
|
||||
</div>
|
||||
""")
|
||||
with gr.Row():
|
||||
with gr.Column():
|
||||
with gr.Accordion("I2V: Image Input (cannot be used simultaneously with video input)", open=False):
|
||||
image_input = gr.Image(label="Input Image (will be cropped to 720 * 480)")
|
||||
examples_component_images = gr.Examples(examples_images, inputs=[examples_images], cache_examples=False)
|
||||
with gr.Accordion("V2V: Video Input (cannot be used simultaneously with image input)", open=False):
|
||||
video_input = gr.Video(label="Input Video (will be cropped to 49 frames, 6 seconds at 8fps)")
|
||||
strength = gr.Slider(0.1, 1.0, value=0.8, step=0.01, label="Strength")
|
||||
examples_component_videos = gr.Examples(examples_videos, inputs=[examples_videos], cache_examples=False)
|
||||
prompt = gr.Textbox(label="Prompt (Less than 200 Words)", placeholder="Enter your prompt here", lines=5)
|
||||
|
||||
with gr.Row():
|
||||
@ -188,7 +334,6 @@ with gr.Blocks() as demo:
|
||||
"✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one."
|
||||
)
|
||||
enhance_button = gr.Button("✨ Enhance Prompt(Optional)")
|
||||
|
||||
with gr.Group():
|
||||
with gr.Column():
|
||||
with gr.Row():
|
||||
@ -196,7 +341,7 @@ with gr.Blocks() as demo:
|
||||
label="Inference Seed (Enter a positive number, -1 for random)", value=-1
|
||||
)
|
||||
with gr.Row():
|
||||
enable_scale = gr.Checkbox(label="Super-Resolution (720 × 480 -> 1440 × 960)", value=False)
|
||||
enable_scale = gr.Checkbox(label="Super-Resolution (720 × 480 -> 2880 × 1920)", value=False)
|
||||
enable_rife = gr.Checkbox(label="Frame Interpolation (8fps -> 16fps)", value=False)
|
||||
gr.Markdown(
|
||||
"✨In this demo, we use [RIFE](https://github.com/hzwer/ECCV2022-RIFE) for frame interpolation and [Real-ESRGAN](https://github.com/xinntao/Real-ESRGAN) for upscaling(Super-Resolution).<br> The entire process is based on open-source solutions."
|
||||
@ -214,7 +359,7 @@ with gr.Blocks() as demo:
|
||||
gr.Markdown("""
|
||||
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
|
||||
<div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
|
||||
🎥 Video Gallery
|
||||
🎥 Video Gallery(For 5B)
|
||||
</div>
|
||||
<tr>
|
||||
<td style="width: 25%; vertical-align: top; font-size: 0.9em;">
|
||||
@ -275,13 +420,25 @@ with gr.Blocks() as demo:
|
||||
</table>
|
||||
""")
|
||||
|
||||
def generate(prompt, seed_value, scale_status, rife_status, progress=gr.Progress(track_tqdm=True)):
|
||||
def generate(
|
||||
prompt,
|
||||
image_input,
|
||||
video_input,
|
||||
video_strength,
|
||||
seed_value,
|
||||
scale_status,
|
||||
rife_status,
|
||||
progress=gr.Progress(track_tqdm=True)
|
||||
):
|
||||
latents, seed = infer(
|
||||
prompt,
|
||||
image_input,
|
||||
video_input,
|
||||
video_strength,
|
||||
num_inference_steps=50, # NOT Changed
|
||||
guidance_scale=7.0, # NOT Changed
|
||||
seed=seed_value,
|
||||
# progress=progress,
|
||||
progress=progress,
|
||||
)
|
||||
if scale_status:
|
||||
latents = utils.upscale_batch_and_concatenate(upscale_model, latents, device)
|
||||
@ -311,11 +468,13 @@ with gr.Blocks() as demo:
|
||||
|
||||
generate_button.click(
|
||||
generate,
|
||||
inputs=[prompt, seed_param, enable_scale, enable_rife],
|
||||
inputs=[prompt, image_input, video_input, strength, seed_param, enable_scale, enable_rife],
|
||||
outputs=[video_output, download_video_button, download_gif_button, seed_text],
|
||||
)
|
||||
|
||||
enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt])
|
||||
video_input.upload(resize_if_unfit, inputs=[video_input], outputs=[video_input])
|
||||
|
||||
if __name__ == "__main__":
|
||||
demo.queue(max_size=15)
|
||||
demo.launch()
|
||||
|
BIN
inference/gradio_composite_demo/example_images/beach.png
Normal file
BIN
inference/gradio_composite_demo/example_images/beach.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 376 KiB |
BIN
inference/gradio_composite_demo/example_images/camping.png
Normal file
BIN
inference/gradio_composite_demo/example_images/camping.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 473 KiB |
BIN
inference/gradio_composite_demo/example_images/street.png
Normal file
BIN
inference/gradio_composite_demo/example_images/street.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 467 KiB |
3
inference/gradio_composite_demo/example_videos/horse.mp4
Normal file
3
inference/gradio_composite_demo/example_videos/horse.mp4
Normal file
@ -0,0 +1,3 @@
|
||||
version https://git-lfs.github.com/spec/v1
|
||||
oid sha256:3c857bbc0d197c0751db9d6da9b5c85eafd163511ff9b0e10be65adf8ef9e352
|
||||
size 453387
|
BIN
inference/gradio_composite_demo/example_videos/kitten.mp4
Normal file
BIN
inference/gradio_composite_demo/example_videos/kitten.mp4
Normal file
Binary file not shown.
BIN
inference/gradio_composite_demo/example_videos/train_running.mp4
Normal file
BIN
inference/gradio_composite_demo/example_videos/train_running.mp4
Normal file
Binary file not shown.
@ -1,21 +1,19 @@
|
||||
spaces==0.29.3
|
||||
safetensors>=0.4.4
|
||||
spandrel>=0.3.4
|
||||
spaces>=0.29.3
|
||||
safetensors>=0.4.5
|
||||
spandrel>=0.4.0
|
||||
tqdm>=4.66.5
|
||||
opencv-python>=4.10.0.84
|
||||
scikit-video>=1.1.11
|
||||
diffusers>=0.30.1
|
||||
git+https://github.com/huggingface/diffusers.git@main
|
||||
transformers>=4.44.0
|
||||
accelerate>=0.33.0
|
||||
accelerate>=0.34.2
|
||||
opencv-python>=4.10.0.84
|
||||
sentencepiece>=0.2.0
|
||||
SwissArmyTransformer>=0.4.12
|
||||
numpy==1.26.0
|
||||
torch>=2.4.0
|
||||
torchvision>=0.19.0
|
||||
gradio>=4.42.0
|
||||
streamlit>=1.37.1
|
||||
imageio==2.34.2
|
||||
imageio-ffmpeg==0.5.1
|
||||
openai>=1.42.0
|
||||
moviepy==1.0.3
|
||||
gradio>=4.44.0
|
||||
imageio>=2.34.2
|
||||
imageio-ffmpeg>=0.5.1
|
||||
openai>=1.45.0
|
||||
moviepy>=1.0.3
|
||||
pillow==9.5.0
|
@ -8,9 +8,8 @@ import numpy as np
|
||||
import logging
|
||||
import skvideo.io
|
||||
from rife.RIFE_HDv3 import Model
|
||||
from huggingface_hub import hf_hub_download, snapshot_download
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||
|
||||
|
||||
@ -19,9 +18,8 @@ def pad_image(img, scale):
|
||||
tmp = max(32, int(32 / scale))
|
||||
ph = ((h - 1) // tmp + 1) * tmp
|
||||
pw = ((w - 1) // tmp + 1) * tmp
|
||||
padding = (0, pw - w, 0, ph - h)
|
||||
|
||||
return F.pad(img, padding), padding
|
||||
padding = (0, 0, pw - w, ph - h)
|
||||
return F.pad(img, padding)
|
||||
|
||||
|
||||
def make_inference(model, I0, I1, upscale_amount, n):
|
||||
@ -38,22 +36,15 @@ def make_inference(model, I0, I1, upscale_amount, n):
|
||||
|
||||
@torch.inference_mode()
|
||||
def ssim_interpolation_rife(model, samples, exp=1, upscale_amount=1, output_device="cpu"):
|
||||
print(f"samples dtype:{samples.dtype}")
|
||||
print(f"samples shape:{samples.shape}")
|
||||
|
||||
output = []
|
||||
# [f, c, h, w]
|
||||
for b in range(samples.shape[0]):
|
||||
frame = samples[b : b + 1]
|
||||
_, _, h, w = frame.shape
|
||||
|
||||
I0 = samples[b : b + 1]
|
||||
I1 = samples[b + 1 : b + 2] if b + 2 < samples.shape[0] else samples[-1:]
|
||||
|
||||
I0, padding = pad_image(I0, upscale_amount)
|
||||
I0 = I0.to(torch.float)
|
||||
I1, _ = pad_image(I1, upscale_amount)
|
||||
I1 = I1.to(torch.float)
|
||||
|
||||
I1 = pad_image(I1, upscale_amount)
|
||||
# [c, h, w]
|
||||
I0_small = F.interpolate(I0, (32, 32), mode="bilinear", align_corners=False)
|
||||
I1_small = F.interpolate(I1, (32, 32), mode="bilinear", align_corners=False)
|
||||
@ -61,24 +52,14 @@ def ssim_interpolation_rife(model, samples, exp=1, upscale_amount=1, output_devi
|
||||
ssim = ssim_matlab(I0_small[:, :3], I1_small[:, :3])
|
||||
|
||||
if ssim > 0.996:
|
||||
I1 = samples[b : b + 1]
|
||||
# print(f'upscale_amount:{upscale_amount}')
|
||||
# print(f'ssim:{upscale_amount}')
|
||||
# print(f'I0 shape:{I0.shape}')
|
||||
# print(f'I1 shape:{I1.shape}')
|
||||
I1, padding = pad_image(I1, upscale_amount)
|
||||
# print(f'I0 shape:{I0.shape}')
|
||||
# print(f'I1 shape:{I1.shape}')
|
||||
I1 = I0
|
||||
I1 = pad_image(I1, upscale_amount)
|
||||
I1 = make_inference(model, I0, I1, upscale_amount, 1)
|
||||
|
||||
# print(f'I0 shape:{I0.shape}')
|
||||
# print(f'I1[0] shape:{I1[0].shape}')
|
||||
I1 = I1[0]
|
||||
|
||||
# print(f'I1[0] unpadded shape:{I1.shape}')
|
||||
I1_small = F.interpolate(I1, (32, 32), mode="bilinear", align_corners=False)
|
||||
|
||||
I1_small = F.interpolate(I1[0], (32, 32), mode="bilinear", align_corners=False)
|
||||
ssim = ssim_matlab(I0_small[:, :3], I1_small[:, :3])
|
||||
frame = I1[padding[0]:, padding[2]:, padding[3]:,padding[1]:]
|
||||
frame = I1[0]
|
||||
I1 = I1[0]
|
||||
|
||||
tmp_output = []
|
||||
if ssim < 0.2:
|
||||
@ -88,13 +69,9 @@ def ssim_interpolation_rife(model, samples, exp=1, upscale_amount=1, output_devi
|
||||
else:
|
||||
tmp_output = make_inference(model, I0, I1, upscale_amount, 2**exp - 1) if exp else []
|
||||
|
||||
frame, _ = pad_image(frame, upscale_amount)
|
||||
print(f'frame shape:{frame.shape}')
|
||||
print(f'tmp_output[0] shape:{tmp_output[0].shape}')
|
||||
frame = pad_image(frame, upscale_amount)
|
||||
tmp_output = [frame] + tmp_output
|
||||
|
||||
for i, frame in enumerate(tmp_output):
|
||||
frame = F.interpolate(frame, size=(h, w))
|
||||
for i, frame in enumerate(tmp_output):
|
||||
output.append(frame.to(output_device))
|
||||
return output
|
||||
|
||||
@ -117,26 +94,14 @@ def frame_generator(video_capture):
|
||||
|
||||
|
||||
def rife_inference_with_path(model, video_path):
|
||||
# Open the video file
|
||||
video_capture = cv2.VideoCapture(video_path)
|
||||
fps = video_capture.get(cv2.CAP_PROP_FPS) # Get the frames per second
|
||||
tot_frame = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT)) # Total frames in the video
|
||||
tot_frame = video_capture.get(cv2.CAP_PROP_FRAME_COUNT)
|
||||
pt_frame_data = []
|
||||
pt_frame = skvideo.io.vreader(video_path)
|
||||
# Cyclic reading of the video frames
|
||||
while video_capture.isOpened():
|
||||
ret, frame = video_capture.read()
|
||||
|
||||
if not ret:
|
||||
break
|
||||
|
||||
# BGR to RGB
|
||||
frame_rgb = frame[..., ::-1]
|
||||
frame_rgb = frame_rgb.copy()
|
||||
tensor = torch.from_numpy(frame_rgb).float().to("cpu", non_blocking=True).float() / 255.0
|
||||
for frame in pt_frame:
|
||||
pt_frame_data.append(
|
||||
tensor.permute(2, 0, 1)
|
||||
) # to [c, h, w,]
|
||||
torch.from_numpy(np.transpose(frame, (2, 0, 1))).to("cpu", non_blocking=True).float() / 255.0
|
||||
)
|
||||
|
||||
pt_frame = torch.from_numpy(np.stack(pt_frame_data))
|
||||
pt_frame = pt_frame.to(device)
|
||||
@ -152,23 +117,13 @@ def rife_inference_with_path(model, video_path):
|
||||
|
||||
|
||||
def rife_inference_with_latents(model, latents):
|
||||
pbar = utils.ProgressBar(latents.shape[1], desc="RIFE inference")
|
||||
rife_results = []
|
||||
latents = latents.to(device)
|
||||
for i in range(latents.size(0)):
|
||||
# [f, c, w, h]
|
||||
latent = latents[i]
|
||||
|
||||
frames = ssim_interpolation_rife(model, latent)
|
||||
pt_image = torch.stack([frames[i].squeeze(0) for i in range(len(frames))]) # (to [f, c, w, h])
|
||||
rife_results.append(pt_image)
|
||||
|
||||
return torch.stack(rife_results)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
snapshot_download(repo_id="AlexWortega/RIFE", local_dir="model_rife")
|
||||
model = load_rife_model("model_rife")
|
||||
|
||||
video_path = rife_inference_with_path(model, "/mnt/ceph/develop/jiawei/CogVideo/sat/configs/outputs/1_In_the_heart_of_a_bustling_city,_a_young_woman_with_long,_flowing_brown_hair_and_a_radiant_smile_stands_out._She's_donne/0/000000.mp4")
|
||||
print(video_path)
|
@ -2,6 +2,10 @@
|
||||
This is the main file for the gradio web demo. It uses the CogVideoX-2B model to generate videos.
|
||||
set environment variable OPENAI_API_KEY to use the OpenAI API to enhance the prompt.
|
||||
|
||||
This demo only supports the text-to-video generation model.
|
||||
If you wish to use the image-to-video or video-to-video generation models,
|
||||
please use the gradio_composite_demo to implement the full GUI functionality.
|
||||
|
||||
Usage:
|
||||
OpenAI_API_KEY=your_openai_api_key OpenAI_BASE_URL=https://api.openai.com/v1 python inference/gradio_web_demo.py
|
||||
"""
|
||||
@ -18,11 +22,8 @@ from datetime import datetime, timedelta
|
||||
from openai import OpenAI
|
||||
import moviepy.editor as mp
|
||||
|
||||
dtype = torch.bfloat16
|
||||
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to("cuda")
|
||||
|
||||
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=dtype)
|
||||
pipe.enable_model_cpu_offload()
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
pipe.vae.enable_slicing()
|
||||
pipe.vae.enable_tiling()
|
||||
|
||||
@ -46,6 +47,7 @@ Video descriptions must have the same num of words as examples below. Extra word
|
||||
def convert_prompt(prompt: str, retry_times: int = 3) -> str:
|
||||
if not os.environ.get("OPENAI_API_KEY"):
|
||||
return prompt
|
||||
|
||||
client = OpenAI()
|
||||
text = prompt.strip()
|
||||
|
||||
@ -82,7 +84,7 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
|
||||
"content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"',
|
||||
},
|
||||
],
|
||||
model="glm-4-0520",
|
||||
model="glm-4-plus",
|
||||
temperature=0.01,
|
||||
top_p=0.7,
|
||||
stream=False,
|
||||
@ -144,19 +146,9 @@ threading.Thread(target=delete_old_files, daemon=True).start()
|
||||
with gr.Blocks() as demo:
|
||||
gr.Markdown("""
|
||||
<div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
|
||||
CogVideoX-2B Huggingface Space🤗
|
||||
</div>
|
||||
<div style="text-align: center;">
|
||||
<a href="https://huggingface.co/THUDM/CogVideoX-2B">🤗 2B Model Hub</a> |
|
||||
<a href="https://github.com/THUDM/CogVideo">🌐 Github</a> |
|
||||
<a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
|
||||
</div>
|
||||
CogVideoX Gradio Simple Space🤗
|
||||
""")
|
||||
|
||||
<div style="text-align: center; font-size: 15px; font-weight: bold; color: red; margin-bottom: 20px;">
|
||||
⚠️ This demo is for academic research and experiential use only.
|
||||
Users should strictly adhere to local laws and ethics.
|
||||
</div>
|
||||
""")
|
||||
with gr.Row():
|
||||
with gr.Column():
|
||||
prompt = gr.Textbox(label="Prompt (Less than 200 Words)", placeholder="Enter your prompt here", lines=5)
|
||||
@ -172,7 +164,6 @@ with gr.Blocks() as demo:
|
||||
"**Optional Parameters** (default values are recommended)<br>"
|
||||
"Increasing the number of inference steps will produce more detailed videos, but it will slow down the process.<br>"
|
||||
"50 steps are recommended for most cases.<br>"
|
||||
"For the 5B model, 50 steps will take approximately 350 seconds."
|
||||
)
|
||||
with gr.Row():
|
||||
num_inference_steps = gr.Number(label="Inference Steps", value=50)
|
||||
@ -185,42 +176,6 @@ with gr.Blocks() as demo:
|
||||
download_video_button = gr.File(label="📥 Download Video", visible=False)
|
||||
download_gif_button = gr.File(label="📥 Download GIF", visible=False)
|
||||
|
||||
gr.Markdown("""
|
||||
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
|
||||
<div style="text-align: center; font-size: 24px; font-weight: bold; margin-bottom: 20px;">
|
||||
Demo Videos with 50 Inference Steps and 6.0 Guidance Scale.
|
||||
</div>
|
||||
<tr>
|
||||
<td style="width: 25%; vertical-align: top; font-size: 0.8em;">
|
||||
<p>A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.</p>
|
||||
</td>
|
||||
<td style="width: 25%; vertical-align: top;">
|
||||
<video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay></video>
|
||||
</td>
|
||||
<td style="width: 25%; vertical-align: top; font-size: 0.8em;">
|
||||
<p>The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.</p>
|
||||
</td>
|
||||
<td style="width: 25%; vertical-align: top;">
|
||||
<video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay></video>
|
||||
</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="width: 25%; vertical-align: top; font-size: 0.8em;">
|
||||
<p>A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.</p>
|
||||
</td>
|
||||
<td style="width: 25%; vertical-align: top;">
|
||||
<video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay></video>
|
||||
</td>
|
||||
<td style="width: 25%; vertical-align: top; font-size: 0.8em;">
|
||||
<p>In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.</p>
|
||||
</td>
|
||||
<td style="width: 25%; vertical-align: top;">
|
||||
<video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay></video>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
""")
|
||||
|
||||
def generate(prompt, num_inference_steps, guidance_scale, model_choice, progress=gr.Progress(track_tqdm=True)):
|
||||
tensor = infer(prompt, num_inference_steps, guidance_scale, progress=progress)
|
||||
video_path = save_video(tensor)
|
||||
|
@ -1,219 +0,0 @@
|
||||
"""
|
||||
This script is used to create a Streamlit web application for generating videos using the CogVideoX model.
|
||||
|
||||
Run the script using Streamlit:
|
||||
$ export OPENAI_API_KEY=your OpenAI Key or ZhiupAI Key
|
||||
$ export OPENAI_BASE_URL=https://open.bigmodel.cn/api/paas/v4/ # using with ZhipuAI, Not using this when using OpenAI
|
||||
$ streamlit run web_demo.py
|
||||
"""
|
||||
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
import time
|
||||
from datetime import datetime
|
||||
from typing import List
|
||||
|
||||
import imageio
|
||||
import numpy as np
|
||||
import streamlit as st
|
||||
import torch
|
||||
from convert_demo import convert_prompt
|
||||
from diffusers import CogVideoXPipeline
|
||||
|
||||
|
||||
model_path: str = "THUDM/CogVideoX-2b"
|
||||
|
||||
|
||||
# Load the model at the start
|
||||
@st.cache_resource
|
||||
def load_model(model_path: str, dtype: torch.dtype, device: str) -> CogVideoXPipeline:
|
||||
"""
|
||||
Load the CogVideoX model.
|
||||
|
||||
Args:
|
||||
- model_path (str): Path to the model.
|
||||
- dtype (torch.dtype): Data type for model.
|
||||
- device (str): Device to load the model on.
|
||||
|
||||
Returns:
|
||||
- CogVideoXPipeline: Loaded model pipeline.
|
||||
"""
|
||||
pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
|
||||
pipe.enable_model_cpu_offload()
|
||||
return pipe
|
||||
|
||||
|
||||
# Define a function to generate video based on the provided prompt and model path
|
||||
def generate_video(
|
||||
pipe: CogVideoXPipeline,
|
||||
prompt: str,
|
||||
num_inference_steps: int = 50,
|
||||
guidance_scale: float = 6.0,
|
||||
num_videos_per_prompt: int = 1,
|
||||
device: str = "cuda",
|
||||
dtype: torch.dtype = torch.float16,
|
||||
) -> List[np.ndarray]:
|
||||
"""
|
||||
Generate a video based on the provided prompt and model path.
|
||||
|
||||
Args:
|
||||
- pipe (CogVideoXPipeline): The pipeline for generating videos.
|
||||
- prompt (str): Text prompt for video generation.
|
||||
- num_inference_steps (int): Number of inference steps.
|
||||
- guidance_scale (float): Guidance scale for generation.
|
||||
- num_videos_per_prompt (int): Number of videos to generate per prompt.
|
||||
- device (str): Device to run the generation on.
|
||||
- dtype (torch.dtype): Data type for the model.
|
||||
|
||||
Returns:
|
||||
- List[np.ndarray]: Generated video frames.
|
||||
"""
|
||||
prompt_embeds, _ = pipe.encode_prompt(
|
||||
prompt=prompt,
|
||||
negative_prompt=None,
|
||||
do_classifier_free_guidance=True,
|
||||
num_videos_per_prompt=num_videos_per_prompt,
|
||||
max_sequence_length=226,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
pipe.enable_model_cpu_offload()
|
||||
pipe.enable_sequential_cpu_offload()
|
||||
pipe.vae.enable_slicing()
|
||||
pipe.vae.enable_tiling()
|
||||
# Generate video
|
||||
video = pipe(
|
||||
num_inference_steps=num_inference_steps,
|
||||
guidance_scale=guidance_scale,
|
||||
prompt_embeds=prompt_embeds,
|
||||
negative_prompt_embeds=torch.zeros_like(prompt_embeds),
|
||||
).frames[0]
|
||||
return video
|
||||
|
||||
|
||||
def save_video(video: List[np.ndarray], path: str, fps: int = 8) -> None:
|
||||
"""
|
||||
Save the generated video to a file.
|
||||
|
||||
Args:
|
||||
- video (List[np.ndarray]): Video frames.
|
||||
- path (str): Path to save the video.
|
||||
- fps (int): Frames per second for the video.
|
||||
"""
|
||||
# Remove the first frame
|
||||
video = video[1:]
|
||||
|
||||
writer = imageio.get_writer(path, fps=fps, codec="libx264")
|
||||
for frame in video:
|
||||
np_frame = np.array(frame)
|
||||
writer.append_data(np_frame)
|
||||
|
||||
writer.close()
|
||||
|
||||
|
||||
def save_metadata(
|
||||
prompt: str,
|
||||
converted_prompt: str,
|
||||
num_inference_steps: int,
|
||||
guidance_scale: float,
|
||||
num_videos_per_prompt: int,
|
||||
path: str,
|
||||
) -> None:
|
||||
"""
|
||||
Save metadata to a JSON file.
|
||||
|
||||
Args:
|
||||
- prompt (str): Original prompt.
|
||||
- converted_prompt (str): Converted prompt.
|
||||
- num_inference_steps (int): Number of inference steps.
|
||||
- guidance_scale (float): Guidance scale.
|
||||
- num_videos_per_prompt (int): Number of videos per prompt.
|
||||
- path (str): Path to save the metadata.
|
||||
"""
|
||||
metadata = {
|
||||
"prompt": prompt,
|
||||
"converted_prompt": converted_prompt,
|
||||
"num_inference_steps": num_inference_steps,
|
||||
"guidance_scale": guidance_scale,
|
||||
"num_videos_per_prompt": num_videos_per_prompt,
|
||||
}
|
||||
with open(path, "w") as f:
|
||||
json.dump(metadata, f, indent=4)
|
||||
|
||||
|
||||
def main() -> None:
|
||||
"""
|
||||
Main function to run the Streamlit web application.
|
||||
"""
|
||||
st.set_page_config(page_title="CogVideoX-Demo", page_icon="🎥", layout="wide")
|
||||
st.write("# CogVideoX 🎥")
|
||||
dtype: torch.dtype = torch.float16
|
||||
device: str = "cuda"
|
||||
|
||||
global pipe
|
||||
pipe = load_model(model_path, dtype, device)
|
||||
|
||||
with st.sidebar:
|
||||
st.info("It will take some time to generate a video (~90 seconds per videos in 50 steps).", icon="ℹ️")
|
||||
num_inference_steps: int = st.number_input("Inference Steps", min_value=1, max_value=100, value=50)
|
||||
guidance_scale: float = st.number_input("Guidance Scale", min_value=0.0, max_value=20.0, value=6.0)
|
||||
num_videos_per_prompt: int = st.number_input("Videos per Prompt", min_value=1, max_value=10, value=1)
|
||||
|
||||
share_links_container = st.empty()
|
||||
|
||||
prompt: str = st.chat_input("Prompt")
|
||||
|
||||
if prompt:
|
||||
# Not Necessary, Suggestions
|
||||
with st.spinner("Refining prompts..."):
|
||||
converted_prompt = convert_prompt(prompt=prompt, retry_times=1)
|
||||
if converted_prompt is None:
|
||||
st.error("Failed to Refining the prompt, Using origin one.")
|
||||
|
||||
st.info(f"**Origin prompt:** \n{prompt} \n \n**Convert prompt:** \n{converted_prompt}")
|
||||
torch.cuda.empty_cache()
|
||||
|
||||
with st.spinner("Generating Video..."):
|
||||
start_time = time.time()
|
||||
video_paths = []
|
||||
|
||||
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||
output_dir = f"./output/{timestamp}"
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
metadata_path = os.path.join(output_dir, "config.json")
|
||||
save_metadata(
|
||||
prompt, converted_prompt, num_inference_steps, guidance_scale, num_videos_per_prompt, metadata_path
|
||||
)
|
||||
|
||||
for i in range(num_videos_per_prompt):
|
||||
video_path = os.path.join(output_dir, f"output_{i + 1}.mp4")
|
||||
|
||||
video = generate_video(
|
||||
pipe, converted_prompt or prompt, num_inference_steps, guidance_scale, 1, device, dtype
|
||||
)
|
||||
save_video(video, video_path, fps=8)
|
||||
video_paths.append(video_path)
|
||||
with open(video_path, "rb") as video_file:
|
||||
video_bytes: bytes = video_file.read()
|
||||
st.video(video_bytes, autoplay=True, loop=True, format="video/mp4")
|
||||
torch.cuda.empty_cache()
|
||||
|
||||
used_time: float = time.time() - start_time
|
||||
st.success(f"Videos generated in {used_time:.2f} seconds.")
|
||||
|
||||
# Create download links in the sidebar
|
||||
with share_links_container:
|
||||
st.sidebar.write("### Download Links:")
|
||||
for video_path in video_paths:
|
||||
video_name = os.path.basename(video_path)
|
||||
with open(video_path, "rb") as f:
|
||||
video_bytes: bytes = f.read()
|
||||
b64_video = base64.b64encode(video_bytes).decode()
|
||||
href = f'<a href="data:video/mp4;base64,{b64_video}" download="{video_name}">Download {video_name}</a>'
|
||||
st.sidebar.markdown(href, unsafe_allow_html=True)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@ -1,15 +1,14 @@
|
||||
diffusers>=0.30.1 #git+https://github.com/huggingface/diffusers.git@main#egg=diffusers is suggested
|
||||
transformers>=4.44.2 # The development team is working on version 4.44.2
|
||||
accelerate>=0.33.0 #git+https://github.com/huggingface/accelerate.git@main#egg=accelerate is suggested
|
||||
sentencepiece>=0.2.0 # T5 used
|
||||
SwissArmyTransformer>=0.4.12
|
||||
diffusers>=0.30.3
|
||||
accelerate>=0.34.2
|
||||
transformers>=4.44.2
|
||||
numpy==1.26.0
|
||||
torch>=2.4.0 # Tested in 2.2 2.3 2.4 and 2.5, The development team is working on version 2.4.0.
|
||||
torchvision>=0.19.0 # The development team is working on version 0.19.0.
|
||||
gradio>=4.42.0 # For HF gradio demo
|
||||
streamlit>=1.38.0 # For streamlit web demo
|
||||
imageio==2.34.2 # For diffusers inference export video
|
||||
imageio-ffmpeg==0.5.1 # For diffusers inference export video
|
||||
openai>=1.42.0 # For prompt refiner
|
||||
moviepy==1.0.3 # For export video
|
||||
torch>=2.4.0
|
||||
torchvision>=0.19.0
|
||||
sentencepiece>=0.2.0
|
||||
SwissArmyTransformer>=0.4.12
|
||||
gradio>=4.44.0
|
||||
imageio>=2.35.1
|
||||
imageio-ffmpeg>=0.5.1
|
||||
openai>=1.45.0
|
||||
moviepy>=1.0.3
|
||||
pillow==9.5.0
|
@ -8,16 +8,15 @@ will acknowledge your contribution on the project homepage.
|
||||
|
||||
## Model Algorithms
|
||||
|
||||
- Support for model quantization inference (Int4, Int8, etc. quantization engineering)
|
||||
- Support for multi-card inference / model inference concurrency engineering
|
||||
- Support for non-CUDA architecture inference devices
|
||||
- Support for model quantization inference (Int4 quantization project)
|
||||
- Optimization of model fine-tuning data loading (replacing the existing decord tool)
|
||||
|
||||
## Model Engineering / Secondary Development
|
||||
## Model Engineering
|
||||
|
||||
- Model fine-tuning examples / best prompt practices
|
||||
- Video super-resolution/frame interpolation for enhancing video generation quality.
|
||||
- Any peripheral tools for the model
|
||||
- Any minimal complete open-source projects using the CogVideoX open-source model
|
||||
- Model fine-tuning examples / Best prompt practices
|
||||
- Inference adaptation on different devices (e.g., MLX framework)
|
||||
- Any tools related to the model
|
||||
- Any minimal fully open-source project using the CogVideoX open-source model
|
||||
|
||||
## Code Standards
|
||||
|
||||
|
47
resources/contribute_ja.md
Normal file
47
resources/contribute_ja.md
Normal file
@ -0,0 +1,47 @@
|
||||
# コントリビューションガイド
|
||||
|
||||
本プロジェクトにはまだ多くの未完成の部分があります。
|
||||
|
||||
以下の分野でリポジトリへの貢献をお待ちしています。上記の作業を完了し、PRを提出してコミュニティと共有する意志がある場合、レビュー後、プロジェクトのホームページで貢献を認識します。
|
||||
|
||||
## モデルアルゴリズム
|
||||
|
||||
- モデル量子化推論のサポート (Int4量子化プロジェクト)
|
||||
- モデルのファインチューニングデータロードの最適化(既存のdecordツールの置き換え)
|
||||
|
||||
## モデルエンジニアリング
|
||||
|
||||
- モデルのファインチューニング例 / 最適なプロンプトの実践
|
||||
- 異なるデバイスでの推論適応(例: MLXフレームワーク)
|
||||
- モデルに関連するツール
|
||||
- CogVideoXオープンソースモデルを使用した、完全にオープンソースの最小プロジェクト
|
||||
|
||||
## コード標準
|
||||
|
||||
良いコードスタイルは一種の芸術です。本プロジェクトにはコードスタイルを標準化するための `pyproject.toml`
|
||||
設定ファイルを用意しています。以下の仕様に従ってコードを整理してください。
|
||||
|
||||
1. `ruff` ツールをインストールする
|
||||
|
||||
```shell
|
||||
pip install ruff
|
||||
```
|
||||
|
||||
次に、`ruff` ツールを実行します
|
||||
|
||||
```shell
|
||||
ruff check tools sat inference
|
||||
```
|
||||
|
||||
コードスタイルを確認します。問題がある場合は、`ruff format` コマンドを使用して自動修正できます。
|
||||
|
||||
```shell
|
||||
ruff format tools sat inference
|
||||
```
|
||||
|
||||
コードが標準に準拠したら、エラーはなくなるはずです。
|
||||
|
||||
## 命名規則
|
||||
|
||||
1. 英語名を使用してください。ピンインや他の言語の名前を使用しないでください。すべてのコメントは英語で記載してください。
|
||||
2. PEP8仕様に厳密に従い、単語をアンダースコアで区切ってください。a、b、cのような名前は使用しないでください。
|
@ -4,14 +4,13 @@
|
||||
|
||||
## 模型算法
|
||||
|
||||
- 模型量化推理支持 (Int4,Int8等量化工程)
|
||||
- 模型多卡推理支持 / 模型推理并发工程
|
||||
- 非 CUDA 架构 推理设备支持
|
||||
- 模型量化推理支持 (Int4量化工程)
|
||||
- 模型微调数据载入优化支持(替换现有的decord工具)
|
||||
|
||||
## 模型工程 / 模型二次开发
|
||||
## 模型工程
|
||||
|
||||
- 模型微调示例 / 最佳提示词实践
|
||||
- 视频超分/插帧,用于美化视频生成效果。
|
||||
- 不同设备上的推理适配(MLX等框架)
|
||||
- 任何模型周边工具
|
||||
- 任何使用CogVideoX开源模型制作的最小完整开源项目
|
||||
|
||||
@ -31,10 +30,10 @@ pip install ruff
|
||||
ruff check tools sat inference
|
||||
```
|
||||
|
||||
检查代码风格,如果有问题,您可以通过`ruff formate`命令自动修复。
|
||||
检查代码风格,如果有问题,您可以通过`ruff format .`命令自动修复。
|
||||
|
||||
```shell
|
||||
ruff formate tools sat inference
|
||||
ruff format tools sat inference
|
||||
```
|
||||
|
||||
如果您的代码符合规范,应该不会出现任何的错误。
|
||||
|
Binary file not shown.
Before Width: | Height: | Size: 597 KiB |
Binary file not shown.
Before Width: | Height: | Size: 4.6 MiB After Width: | Height: | Size: 605 KiB |
@ -22,6 +22,7 @@ pip install -r requirements.txt
|
||||
### 2. Download model weights
|
||||
|
||||
First, go to the SAT mirror to download the model weights. For the CogVideoX-2B model, please download as follows:
|
||||
|
||||
```shell
|
||||
mkdir CogVideoX-2b-sat
|
||||
cd CogVideoX-2b-sat
|
||||
@ -32,16 +33,14 @@ wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
|
||||
mv 'index.html?dl=1' transformer.zip
|
||||
unzip transformer.zip
|
||||
```
|
||||
For the CogVideoX-5B model, please download as follows (VAE files are the same):
|
||||
```shell
|
||||
mkdir CogVideoX-5b-sat
|
||||
cd CogVideoX-5b-sat
|
||||
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
|
||||
mv 'index.html?dl=1' vae.zip
|
||||
unzip vae.zip
|
||||
```
|
||||
Then, you need to go to [Tsinghua Cloud Disk](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) to download our model and unzip it.
|
||||
After sorting, the complete model structure of the two models should be as follows:
|
||||
|
||||
For the CogVideoX-5B model, please download the `transformers` files from the following links:
|
||||
(VAE files are the same as 2B)
|
||||
|
||||
+ [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)
|
||||
+ [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list)
|
||||
|
||||
Next, you need to format the model files as follows:
|
||||
|
||||
```
|
||||
.
|
||||
@ -53,7 +52,8 @@ After sorting, the complete model structure of the two models should be as follo
|
||||
└── 3d-vae.pt
|
||||
```
|
||||
|
||||
Due to large size of model weight file, using `git lfs` is recommended. Installation of `git lfs` can be found [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)
|
||||
Due to large size of model weight file, using `git lfs` is recommended. Installation of `git lfs` can be
|
||||
found [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)
|
||||
|
||||
Next, clone the T5 model, which is not used for training and fine-tuning, but must be used.
|
||||
> T5 model is available on [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) as well.
|
||||
@ -160,14 +160,14 @@ model:
|
||||
ucg_rate: 0.1
|
||||
target: sgm.modules.encoders.modules.FrozenT5Embedder
|
||||
params:
|
||||
model_dir: "{absolute_path/to/your/t5-v1_1-xxl}/t5-v1_1-xxl" # Absolute path to the CogVideoX-2b/t5-v1_1-xxl weights folder
|
||||
model_dir: "t5-v1_1-xxl" # Absolute path to the CogVideoX-2b/t5-v1_1-xxl weights folder
|
||||
max_length: 226
|
||||
|
||||
first_stage_config:
|
||||
target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
|
||||
params:
|
||||
cp_size: 1
|
||||
ckpt_path: "{absolute_path/to/your/t5-v1_1-xxl}/CogVideoX-2b-sat/vae/3d-vae.pt" # Absolute path to the CogVideoX-2b-sat/vae/3d-vae.pt folder
|
||||
ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # Absolute path to the CogVideoX-2b-sat/vae/3d-vae.pt folder
|
||||
ignore_keys: [ 'loss' ]
|
||||
|
||||
loss_config:
|
||||
@ -254,13 +254,14 @@ args:
|
||||
sampling_num_frames: 13 # Must be 13, 11 or 9
|
||||
sampling_fps: 8
|
||||
fp16: True # For CogVideoX-2B
|
||||
# bf16: True # For CogVideoX-5B
|
||||
output_dir: outputs/
|
||||
# bf16: True # For CogVideoX-5B
|
||||
output_dir: outputs/
|
||||
force_inference: True
|
||||
```
|
||||
|
||||
+ Modify `configs/test.txt` if multiple prompts is required, in which each line makes a prompt.
|
||||
+ For better prompt formatting, refer to [convert_demo.py](../inference/convert_demo.py), for which you should set the OPENAI_API_KEY as your environmental variable.
|
||||
+ Modify `configs/test.txt` if multiple prompts is required, in which each line makes a prompt.
|
||||
+ For better prompt formatting, refer to [convert_demo.py](../inference/convert_demo.py), for which you should set the
|
||||
OPENAI_API_KEY as your environmental variable.
|
||||
+ Modify `input_type` in `configs/inference.yaml` if using the command line as prompt input.
|
||||
|
||||
```yaml
|
||||
@ -408,28 +409,32 @@ python ../tools/convert_weight_sat2hf.py
|
||||
|
||||
### Exporting Hugging Face Diffusers LoRA Weights from SAT Checkpoints
|
||||
|
||||
After completing the training using the above steps, we get a SAT checkpoint with LoRA weights. You can find the file at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.
|
||||
After completing the training using the above steps, we get a SAT checkpoint with LoRA weights. You can find the file
|
||||
at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.
|
||||
|
||||
The script for exporting LoRA weights can be found in the CogVideoX repository at `tools/export_sat_lora_weight.py`. After exporting, you can use `load_cogvideox_lora.py` for inference.
|
||||
The script for exporting LoRA weights can be found in the CogVideoX repository at `tools/export_sat_lora_weight.py`.
|
||||
After exporting, you can use `load_cogvideox_lora.py` for inference.
|
||||
|
||||
Export command:
|
||||
|
||||
#### Export command:
|
||||
```bash
|
||||
python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
|
||||
```
|
||||
|
||||
This training mainly modified the following model structures. The table below lists the corresponding structure mappings for converting to the HF (Hugging Face) format LoRA structure. As you can see, LoRA adds a low-rank weight to the model's attention structure.
|
||||
This training mainly modified the following model structures. The table below lists the corresponding structure mappings
|
||||
for converting to the HF (Hugging Face) format LoRA structure. As you can see, LoRA adds a low-rank weight to the
|
||||
model's attention structure.
|
||||
|
||||
```
|
||||
|
||||
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
|
||||
'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
|
||||
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
|
||||
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
|
||||
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
|
||||
'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
|
||||
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
|
||||
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
|
||||
```
|
||||
|
||||
|
||||
Using export_sat_lora_weight.py, you can convert the SAT checkpoint into the HF LoRA format.
|
||||

|
||||
|
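For reference, a minimal sketch of loading the exported LoRA directory with `diffusers`, assuming a diffusers version with CogVideoX LoRA support; the prompt is illustrative, the output directory is the one used in the export command above, and the repository also provides `load_cogvideox_lora.py` for this purpose:

```python
# Illustrative only: load a LoRA directory produced by export_sat_lora_weight.py.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("{args.save}/export_hf_lora_weights_1/", adapter_name="cogvideox-lora")

video = pipe(
    prompt="Your fine-tuned subject doing something",  # placeholder prompt
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "lora_sample.mp4", fps=8)
```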
@ -32,18 +32,12 @@ mv 'index.html?dl=1' transformer.zip
|
||||
unzip transformer.zip
|
||||
```
|
||||
|
||||
CogVideoX-5B モデルの場合は、次のようにダウンロードしてください (VAE ファイルは同じです)。
|
||||
CogVideoX-5B モデルの `transformers` ファイルを以下のリンクからダウンロードしてください (VAE ファイルは 2B と同じです):
|
||||
|
||||
```shell
|
||||
mkdir CogVideoX-5b-sat
|
||||
cd CogVideoX-5b-sat
|
||||
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
|
||||
mv 'index.html?dl=1' vae.zip
|
||||
unzip vae.zip
|
||||
```
|
||||
+ [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)
|
||||
+ [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list)
|
||||
|
||||
次に、[Tsinghua Cloud Disk](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list) に移動してモデルをダウンロードし、解凍する必要があります。
|
||||
整理すると、2 つのモデルの完全なモデル構造は次のようになります。 モデル構造は次のようになります:
|
||||
次に、モデルファイルを以下の形式にフォーマットする必要があります:
|
||||
|
||||
```
|
||||
.
|
||||
@ -55,8 +49,9 @@ unzip vae.zip
|
||||
└── 3d-vae.pt
|
||||
```
|
||||
|
||||
モデルの重みファイルが大きいため、`git lfs`を使用することをお勧めいたします。`git lfs`
|
||||
のインストールについては、[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)をご参照ください。
|
||||
|
||||
モデルの重みファイルが大きいため、`git lfs`を使用することをお勧めいたします。`git lfs`のインストールについては、[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)をご参照ください。
|
||||
```shell
|
||||
git lfs install
|
||||
```
|
||||
@ -166,14 +161,14 @@ model:
|
||||
ucg_rate: 0.1
|
||||
target: sgm.modules.encoders.modules.FrozenT5Embedder
|
||||
params:
|
||||
model_dir: "{absolute_path/to/your/t5-v1_1-xxl}/t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxlフォルダの絶対パス
|
||||
model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxlフォルダの絶対パス
|
||||
max_length: 226
|
||||
|
||||
first_stage_config:
|
||||
target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
|
||||
params:
|
||||
cp_size: 1
|
||||
ckpt_path: "{absolute_path/to/your/t5-v1_1-xxl}/CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.ptフォルダの絶対パス
|
||||
ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.ptフォルダの絶対パス
|
||||
ignore_keys: [ 'loss' ]
|
||||
|
||||
loss_config:
|
||||
@ -244,6 +239,7 @@ model:
|
||||
exp: 5
|
||||
num_steps: 50
|
||||
```
|
||||
|
||||
### 4. `configs/inference.yaml` ファイルを変更します。
|
||||
|
||||
```yaml
|
||||
@ -259,7 +255,7 @@ args:
|
||||
sampling_num_frames: 13 # Must be 13, 11 or 9
|
||||
sampling_fps: 8
|
||||
fp16: True # For CogVideoX-2B
|
||||
# bf16: True # For CogVideoX-5B
|
||||
# bf16: True # For CogVideoX-5B
|
||||
output_dir: outputs/
|
||||
force_inference: True
|
||||
```
|
||||
@ -417,25 +413,23 @@ python ../tools/convert_weight_sat2hf.py
|
||||
|
||||
LoRAウェイトをエクスポートするためのスクリプトは、CogVideoXリポジトリの `tools/export_sat_lora_weight.py` にあります。エクスポート後、`load_cogvideox_lora.py` を使用して推論を行うことができます。
|
||||
|
||||
#### エクスポートコマンド:
|
||||
エクスポートコマンド:
|
||||
|
||||
```bash
|
||||
python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
|
||||
```
|
||||
|
||||
このトレーニングでは主に以下のモデル構造が変更されました。以下の表は、HF (Hugging Face) 形式のLoRA構造に変換する際の対応関係を示しています。ご覧の通り、LoRAはモデルの注意メカニズムに低ランクの重みを追加しています。
|
||||
|
||||
|
||||
|
||||
```
|
||||
|
||||
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
|
||||
'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
|
||||
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
|
||||
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
|
||||
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
|
||||
'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
|
||||
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
|
||||
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
|
||||
```
|
||||
|
||||
export_sat_lora_weight.py を使用して、SATチェックポイントをHF LoRA形式に変換できます。
|
||||
|
@ -18,7 +18,10 @@ pip install -r requirements.txt
|
||||
|
||||
### 2. 下载模型权重
|
||||
|
||||
首先,前往 SAT 镜像下载模型权重。对于 CogVideoX-2B 模型,请按照如下方式下载:
|
||||
首先,前往 SAT 镜像下载模型权重。
|
||||
|
||||
对于 CogVideoX-2B 模型,请按照如下方式下载:
|
||||
|
||||
```shell
|
||||
mkdir CogVideoX-2b-sat
|
||||
cd CogVideoX-2b-sat
|
||||
@ -29,28 +32,27 @@ wget https://cloud.tsinghua.edu.cn/f/556a3e1329e74f1bac45/?dl=1
|
||||
mv 'index.html?dl=1' transformer.zip
|
||||
unzip transformer.zip
|
||||
```
|
||||
对于 CogVideoX-5B 模型,请按照如下方式下载(VAE文件相同):
|
||||
```shell
|
||||
mkdir CogVideoX-5b-sat
|
||||
cd CogVideoX-5b-sat
|
||||
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
|
||||
mv 'index.html?dl=1' vae.zip
|
||||
unzip vae.zip
|
||||
```
|
||||
然后,您需要前往[清华云盘](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)下载我们的模型,并进行解压。
|
||||
整理之后, 两个模型的完整模型结构应该如下:
|
||||
|
||||
请按如下链接方式下载 CogVideoX-5B 模型的 `transformers` 文件(VAE 文件与 2B 相同):
|
||||
|
||||
+ [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)
|
||||
+ [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list)
|
||||
|
||||
接着,你需要将模型文件排版成如下格式:
|
||||
|
||||
```
|
||||
.
|
||||
├── transformer
|
||||
│ ├── 1000 (or 1)
|
||||
│ │ └── mp_rank_00_model_states.pt
|
||||
│ └── latest
|
||||
│ ├── 1000 (or 1)
|
||||
│ │ └── mp_rank_00_model_states.pt
|
||||
│ └── latest
|
||||
└── vae
|
||||
└── 3d-vae.pt
|
||||
```
|
||||
|
||||
由于模型的权重档案较大,建议使用`git lfs`。`git lfs`安装参见[这里](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)
|
||||
由于模型的权重档案较大,建议使用`git lfs`。`git lfs`
|
||||
安装参见[这里](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)
|
||||
|
||||
```shell
|
||||
git lfs install
|
||||
```
|
||||
@ -160,14 +162,14 @@ model:
|
||||
ucg_rate: 0.1
|
||||
target: sgm.modules.encoders.modules.FrozenT5Embedder
|
||||
params:
|
||||
model_dir: "{absolute_path/to/your/t5-v1_1-xxl}/t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxl权重文件夹的绝对路径
|
||||
model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxl 权重文件夹的绝对路径
|
||||
max_length: 226
|
||||
|
||||
first_stage_config:
|
||||
target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
|
||||
params:
|
||||
cp_size: 1
|
||||
ckpt_path: "{absolute_path/to/your/t5-v1_1-xxl}/CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.pt文件夹的绝对路径
|
||||
ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # CogVideoX-2b-sat/vae/3d-vae.pt文件夹的绝对路径
|
||||
ignore_keys: [ 'loss' ]
|
||||
|
||||
loss_config:
|
||||
@ -254,7 +256,7 @@ args:
|
||||
sampling_num_frames: 13 # Must be 13, 11 or 9
|
||||
sampling_fps: 8
|
||||
fp16: True # For CogVideoX-2B
|
||||
# bf16: True # For CogVideoX-5B
|
||||
# bf16: True # For CogVideoX-5B
|
||||
output_dir: outputs/
|
||||
force_inference: True
|
||||
```
|
||||
@ -292,9 +294,9 @@ bash inference.sh
|
||||
```
|
||||
.
|
||||
├── labels
|
||||
│ ├── 1.txt
|
||||
│ ├── 2.txt
|
||||
│ ├── ...
|
||||
│ ├── 1.txt
|
||||
│ ├── 2.txt
|
||||
│ ├── ...
|
||||
└── videos
|
||||
├── 1.mp4
|
||||
├── 2.mp4
|
||||
@ -406,28 +408,29 @@ python ../tools/convert_weight_sat2hf.py
|
||||
|
||||
### 从SAT权重文件 导出Huggingface Diffusers lora权重
|
||||
|
||||
支持了从SAT权重文件
|
||||
支持了从SAT权重文件
|
||||
在经过上面这些步骤训练之后,我们得到了一个sat带lora的权重,在{args.save}/1000/1000/mp_rank_00_model_states.pt你可以看到这个文件
|
||||
|
||||
导出的lora权重脚本在CogVideoX仓库 tools/export_sat_lora_weight.py ,导出后使用 load_cogvideox_lora.py 推理
|
||||
- 导出命令
|
||||
|
||||
导出命令:
|
||||
|
||||
```
|
||||
python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
|
||||
···
|
||||
```
|
||||
|
||||
这次训练主要修改了下面几个模型结构,下面列出了 转换为HF格式的lora结构对应关系,可以看到lora将模型注意力结构上增加一个低秩权重,
|
||||
|
||||
```
|
||||
|
||||
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
|
||||
'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
|
||||
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
|
||||
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
|
||||
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.1': 'attn1.to_k.lora_A.weight',
|
||||
'attention.query_key_value.matrix_A.2': 'attn1.to_v.lora_A.weight',
|
||||
'attention.query_key_value.matrix_B.0': 'attn1.to_q.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.1': 'attn1.to_k.lora_B.weight',
|
||||
'attention.query_key_value.matrix_B.2': 'attn1.to_v.lora_B.weight',
|
||||
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
|
||||
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
|
||||
```
|
||||
|
||||
|
||||
通过export_sat_lora_weight.py将它转换为HF格式的lora结构
|
||||

|
||||
|
@ -1,7 +1,7 @@
|
||||
model:
|
||||
scale_factor: 0.7 # different from cogvideox_2b_infer.yaml
|
||||
disable_first_stage_autocast: true
|
||||
not_trainable_prefixes: ['all'] ## Using Lora
|
||||
not_trainable_prefixes: ['all'] # Using Lora
|
||||
log_keys:
|
||||
- txt
|
||||
|
||||
@ -53,7 +53,7 @@ model:
|
||||
hidden_size_head: 64
|
||||
text_length: 226
|
||||
|
||||
lora_config: ## Using Lora
|
||||
lora_config: # Using Lora
|
||||
target: sat.model.finetune.lora2.LoraMixin
|
||||
params:
|
||||
r: 128
|
||||
|
@ -323,7 +323,6 @@ class SATVideoDiffusionEngine(nn.Module):
|
||||
if isinstance(c[k], torch.Tensor):
|
||||
c[k], uc[k] = map(lambda y: y[k][:N].to(self.device), (c, uc))
|
||||
|
||||
|
||||
if self.noised_image_input:
|
||||
image = x[:, :, 0:1]
|
||||
image = self.add_noise_to_first_frame(image)
|
||||
|
@ -1,8 +1,8 @@
|
||||
#! /bin/bash
|
||||
|
||||
echo "RUN on $(hostname), CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True"
|
||||
echo "RUN on $(hostname), CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
|
||||
|
||||
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_5b_i2v_lora.yaml configs/sft.yaml --seed $RANDOM"
|
||||
run_cmd="PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=8 train_video.py --base configs/test_cogvideox_5b_i2v_lora.yaml configs/test_sft.yaml --seed $RANDOM"
|
||||
|
||||
echo ${run_cmd}
|
||||
eval ${run_cmd}
|
||||
|
@ -225,7 +225,7 @@ class CrossAttention(nn.Module):
|
||||
|
||||
q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))
|
||||
|
||||
## old
|
||||
# old
|
||||
"""
|
||||
sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
|
||||
del q, k
|
||||
@ -241,7 +241,7 @@ class CrossAttention(nn.Module):
|
||||
|
||||
out = einsum('b i j, b j d -> b i d', sim, v)
|
||||
"""
|
||||
## new
|
||||
# new
|
||||
with sdp_kernel(**BACKEND_MAP[self.backend]):
|
||||
# print("dispatching into backend", self.backend, "q/k/v shape: ", q.shape, k.shape, v.shape)
|
||||
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask) # scale is dim_head ** -0.5 per default
|
||||
|
@ -34,7 +34,6 @@ def convert_module_to_f32(x):
|
||||
pass
|
||||
|
||||
|
||||
## go
|
||||
class AttentionPool2d(nn.Module):
|
||||
"""
|
||||
Adapted from CLIP: https://github.com/openai/CLIP/blob/main/clip/model.py
|
||||
|
@ -225,7 +225,7 @@ class CrossAttention(nn.Module):
|
||||
|
||||
q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v))
|
||||
|
||||
## old
|
||||
# old
|
||||
"""
|
||||
sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
|
||||
del q, k
|
||||
@ -241,7 +241,7 @@ class CrossAttention(nn.Module):
|
||||
|
||||
out = einsum('b i j, b j d -> b i d', sim, v)
|
||||
"""
|
||||
## new
|
||||
# new
|
||||
with sdp_kernel(**BACKEND_MAP[self.backend]):
|
||||
# print("dispatching into backend", self.backend, "q/k/v shape: ", q.shape, k.shape, v.shape)
|
||||
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask) # scale is dim_head ** -0.5 per default
|
||||
|
35
tools/llm_flux_cogvideox/generate.sh
Normal file
35
tools/llm_flux_cogvideox/generate.sh
Normal file
@ -0,0 +1,35 @@
|
||||
#!/bin/bash
|
||||
|
||||
NUM_VIDEOS=10
|
||||
INFERENCE_STEPS=50
|
||||
GUIDANCE_SCALE=7.0
|
||||
OUTPUT_DIR_PREFIX="outputs/gpu_"
|
||||
LOG_DIR_PREFIX="logs/gpu_"
|
||||
|
||||
VIDEO_MODEL_PATH="/share/official_pretrains/hf_home/CogVideoX-5b-I2V"
|
||||
LLM_MODEL_PATH="/share/home/zyx/Models/Meta-Llama-3.1-8B-Instruct"
|
||||
IMAGE_MODEL_PATH="/share/home/zyx/Models/FLUX.1-dev"
|
||||
|
||||
#VIDEO_MODEL_PATH="THUDM/CogVideoX-5B-I2V"
|
||||
#LLM_MODEL_PATH="THUDM/glm-4-9b-chat"
|
||||
#IMAGE_MODEL_PATH="black-forest-labs/FLUX.1-dev"
|
||||
|
||||
CUDA_DEVICES=${CUDA_VISIBLE_DEVICES:-"0"}
|
||||
|
||||
IFS=',' read -r -a GPU_ARRAY <<< "$CUDA_DEVICES"
|
||||
|
||||
for i in "${!GPU_ARRAY[@]}"
|
||||
do
|
||||
GPU=${GPU_ARRAY[$i]}
|
||||
echo "Starting task on GPU $GPU..."
|
||||
CUDA_VISIBLE_DEVICES=$GPU nohup python3 llm_flux_cogvideox.py \
|
||||
--caption_generator_model_id $LLM_MODEL_PATH \
|
||||
--image_generator_model_id $IMAGE_MODEL_PATH \
|
||||
--model_path $VIDEO_MODEL_PATH \
|
||||
--num_videos $NUM_VIDEOS \
|
||||
--image_generator_num_inference_steps $INFERENCE_STEPS \
|
||||
--guidance_scale $GUIDANCE_SCALE \
|
||||
--use_dynamic_cfg \
|
||||
--output_dir ${OUTPUT_DIR_PREFIX}${GPU} \
|
||||
> ${LOG_DIR_PREFIX}${GPU}.log 2>&1 &
|
||||
done
|
257
tools/llm_flux_cogvideox/llm_flux_cogvideox.py
Normal file
257
tools/llm_flux_cogvideox/llm_flux_cogvideox.py
Normal file
@ -0,0 +1,257 @@
|
||||
"""
|
||||
The original experimental code for this project can be found at:
|
||||
|
||||
https://gist.github.com/a-r-r-o-w/d070cce059ab4ceab3a9f289ff83c69c
|
||||
|
||||
By using this code, description prompts will be generated through a local large language model, and images will be
|
||||
generated using the black-forest-labs/FLUX.1-dev model, followed by video generation via CogVideoX.
|
||||
The entire process utilizes open-source solutions, without the need for any API keys.
|
||||
|
||||
You can use the generate.sh file in the same folder to automate running this code
|
||||
for batch generation of videos and images.
|
||||
|
||||
bash generate.sh
|
||||
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import gc
|
||||
import json
|
||||
import os
|
||||
import pathlib
|
||||
import random
|
||||
from typing import Any, Dict
|
||||
|
||||
from transformers import AutoTokenizer
|
||||
|
||||
os.environ["TORCH_LOGS"] = "+dynamo,recompiles,graph_breaks"
|
||||
os.environ["TORCHDYNAMO_VERBOSE"] = "1"
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import transformers
|
||||
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler, DiffusionPipeline
|
||||
from diffusers.utils.logging import get_logger
|
||||
from diffusers.utils import export_to_video
|
||||
|
||||
torch.set_float32_matmul_precision("high")
|
||||
|
||||
logger = get_logger(__name__)
|
||||
|
||||
SYSTEM_PROMPT = """
|
||||
You are part of a team of people that create videos using generative models. You use a video-generation model that can generate a video about anything you describe.
|
||||
|
||||
For example, if you respond with "A beautiful morning in the woods with the sun peeking through the trees", the video generation model will create a video exactly as described. Your task is to summarize the descriptions of videos provided by users, and create detailed prompts to feed into the generative model.
|
||||
|
||||
There are a few rules to follow:
|
||||
- You will only ever output a single video description per request.
|
||||
- If the user mentions to summarize the prompt in [X] words, make sure to not exceed the limit.
|
||||
|
||||
Your responses should just be the video generation prompt. Here are examples:
|
||||
- “A lone figure stands on a city rooftop at night, gazing up at the full moon. The moon glows brightly, casting a gentle light over the quiet cityscape. Below, the windows of countless homes shine with warm lights, creating a contrast between the bustling life below and the peaceful solitude above. The scene captures the essence of the Mid-Autumn Festival, where despite the distance, the figure feels connected to loved ones through the shared beauty of the moonlit sky.”
|
||||
- "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
|
||||
- "A street artist, clad in a worn-out denim jacket and a colorful banana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall"
|
||||
""".strip()
|
||||
|
||||
USER_PROMPT = """
|
||||
Could you generate a prompt for a video generation model?
|
||||
Please limit the prompt to [{0}] words.
|
||||
""".strip()
|
||||
|
||||
|
||||
def get_args():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
"--num_videos",
|
||||
type=int,
|
||||
default=5,
|
||||
help="Number of unique videos you would like to generate."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--model_path",
|
||||
type=str,
|
||||
default="THUDM/CogVideoX-5B",
|
||||
help="The path of Image2Video CogVideoX-5B",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--caption_generator_model_id",
|
||||
type=str,
|
||||
default="THUDM/glm-4-9b-chat",
|
||||
help="Caption generation model. default GLM-4-9B",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--caption_generator_cache_dir",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Cache directory for caption generation model."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--image_generator_model_id",
|
||||
type=str,
|
||||
default="black-forest-labs/FLUX.1-dev",
|
||||
help="Image generation model."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--image_generator_cache_dir",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Cache directory for image generation model."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--image_generator_num_inference_steps",
|
||||
type=int,
|
||||
default=50,
|
||||
help="Caption generation model."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--guidance_scale",
|
||||
type=float,
|
||||
default=7,
|
||||
help="Guidance scale to be use for generation."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--use_dynamic_cfg",
|
||||
action="store_true",
|
||||
help="Whether or not to use cosine dynamic guidance for generation [Recommended].",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--output_dir",
|
||||
type=str,
|
||||
default="outputs/",
|
||||
help="Location where generated images and videos should be stored.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--compile",
|
||||
action="store_true",
|
||||
help="Whether or not to compile the transformer of image and video generators."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--enable_vae_tiling",
|
||||
action="store_true",
|
||||
help="Whether or not to use VAE tiling when encoding/decoding."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--seed",
|
||||
type=int,
|
||||
default=42,
|
||||
help="Seed for reproducibility."
|
||||
)
|
||||
return parser.parse_args()
|
||||
|
||||
|
||||
def reset_memory():
|
||||
gc.collect()
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.reset_peak_memory_stats()
|
||||
torch.cuda.reset_accumulated_memory_stats()
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
def main(args: Dict[str, Any]) -> None:
|
||||
output_dir = pathlib.Path(args.output_dir)
|
||||
os.makedirs(output_dir.as_posix(), exist_ok=True)
|
||||
|
||||
random.seed(args.seed)
|
||||
np.random.seed(args.seed)
|
||||
torch.manual_seed(args.seed)
|
||||
torch.cuda.manual_seed_all(args.seed)
|
||||
|
||||
reset_memory()
|
||||
tokenizer = AutoTokenizer.from_pretrained(args.caption_generator_model_id, trust_remote_code=True)
|
||||
caption_generator = transformers.pipeline(
|
||||
"text-generation",
|
||||
model=args.caption_generator_model_id,
|
||||
device_map="auto",
|
||||
model_kwargs={
|
||||
"local_files_only": True,
|
||||
"cache_dir": args.caption_generator_cache_dir,
|
||||
"torch_dtype": torch.bfloat16,
|
||||
},
|
||||
trust_remote_code=True,
|
||||
tokenizer=tokenizer
|
||||
)
|
||||
|
||||
captions = []
|
||||
for i in range(args.num_videos):
|
||||
num_words = random.choice([50, 75, 100])
|
||||
user_prompt = USER_PROMPT.format(num_words)
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": SYSTEM_PROMPT},
|
||||
{"role": "user", "content": user_prompt},
|
||||
]
|
||||
|
||||
outputs = caption_generator(messages, max_new_tokens=226)
|
||||
caption = outputs[0]["generated_text"][-1]["content"]
|
||||
if caption.startswith("\"") and caption.endswith("\""):
|
||||
caption = caption[1:-1]
|
||||
captions.append(caption)
|
||||
logger.info(f"Generated caption: {caption}")
|
||||
|
||||
with open(output_dir / "captions.json", "w") as file:
|
||||
json.dump(captions, file)
|
||||
|
||||
del caption_generator
|
||||
reset_memory()
|
||||
|
||||
image_generator = DiffusionPipeline.from_pretrained(
|
||||
args.image_generator_model_id,
|
||||
cache_dir=args.image_generator_cache_dir,
|
||||
torch_dtype=torch.bfloat16
|
||||
)
|
||||
image_generator.to("cuda")
|
||||
|
||||
if args.compile:
|
||||
image_generator.transformer = torch.compile(image_generator.transformer, mode="max-autotune", fullgraph=True)
|
||||
|
||||
if args.enable_vae_tiling:
|
||||
image_generator.vae.enable_tiling()
|
||||
|
||||
images = []
|
||||
for index, caption in enumerate(captions):
|
||||
image = image_generator(
|
||||
prompt=caption,
|
||||
height=480,
|
||||
width=720,
|
||||
num_inference_steps=args.image_generator_num_inference_steps,
|
||||
guidance_scale=3.5,
|
||||
).images[0]
|
||||
filename = caption[:25].replace(".", "_").replace("'", "_").replace('"', "_").replace(",", "_")
|
||||
image.save(output_dir / f"{index}_{filename}.png")
|
||||
images.append(image)
|
||||
|
||||
del image_generator
|
||||
reset_memory()
|
||||
|
||||
video_generator = CogVideoXImageToVideoPipeline.from_pretrained(
|
||||
args.model_path, torch_dtype=torch.bfloat16).to("cuda")
|
||||
video_generator.scheduler = CogVideoXDPMScheduler.from_config(
|
||||
video_generator.scheduler.config,
|
||||
timestep_spacing="trailing")
|
||||
|
||||
if args.compile:
|
||||
video_generator.transformer = torch.compile(video_generator.transformer, mode="max-autotune", fullgraph=True)
|
||||
|
||||
if args.enable_vae_tiling:
|
||||
video_generator.vae.enable_tiling()
|
||||
|
||||
generator = torch.Generator().manual_seed(args.seed)
|
||||
for index, (caption, image) in enumerate(zip(captions, images)):
|
||||
video = video_generator(
|
||||
image=image,
|
||||
prompt=caption,
|
||||
height=480,
|
||||
width=720,
|
||||
num_frames=49,
|
||||
num_inference_steps=50,
|
||||
guidance_scale=args.guidance_scale,
|
||||
use_dynamic_cfg=args.use_dynamic_cfg,
|
||||
generator=generator,
|
||||
).frames[0]
|
||||
filename = caption[:25].replace(".", "_").replace("'", "_").replace('"', "_").replace(",", "_")
|
||||
export_to_video(video, output_dir / f"{index}_{filename}.mp4", fps=8)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
args = get_args()
|
||||
main(args)
|
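For reference, an illustrative single-GPU invocation of the new script; the model ids are the public defaults referenced in `generate.sh`, and paths should be adjusted to local checkpoints:

```shell
python3 llm_flux_cogvideox.py \
  --caption_generator_model_id THUDM/glm-4-9b-chat \
  --image_generator_model_id black-forest-labs/FLUX.1-dev \
  --model_path THUDM/CogVideoX-5B-I2V \
  --num_videos 5 \
  --use_dynamic_cfg \
  --output_dir outputs/
```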