commit ad855f622c
Author: zR
Date:   2024-08-06 02:41:08 +08:00
Parent: 2da594c831

    ADD PAPER

5 changed files with 46 additions and 41 deletions

@@ -4,11 +4,13 @@
 <div align="center">
 <img src=resources/logo.svg width="50%"/>
 </div>
 <p align="center">
 🤗 Experience on <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
 </p>
-</div>
+<p align="center">
+📚 Check here to view <a href="resources/CogVideoX.pdf" target="_blank">Paper</a>
+</p>
 <p align="center">
 👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
 </p>
@@ -55,18 +57,18 @@ to [清影](https://chatglm.cn/video).
 The table below shows the list of video generation models we currently provide,
 along with related basic information:

 | Model Name                                | CogVideoX-2B                                                 |
-|-------------------------------------------|--------------------------------------------------------------|
+|-------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
 | Prompt Language                           | English                                                      |
-| GPU Memory Required for Inference (FP16)  | 36GB (will be optimized before the PR is merged)             |
-| GPU Memory Required for Fine-tuning(bs=1) | 46.2GB                                                       |
+| GPU Memory Required for Inference (FP16)  | 36GB using diffusers (will be optimized before the PR is merged) and 25GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
+| GPU Memory Required for Fine-tuning(bs=1) | 42GB                                                         |
 | Prompt Max Length                         | 226 Tokens                                                   |
 | Video Length                              | 6 seconds                                                    |
 | Frames Per Second                         | 8 frames                                                     |
 | Resolution                                | 720 * 480                                                    |
 | Quantized Inference                       | Not Supported                                                |
 | Multi-card Inference                      | Not Supported                                                |
 | Download Link                             | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B) |

 ## Project Structure
@@ -89,7 +91,7 @@ of the **CogVideoX** open-source model.
 ### sat
-+ [sat_demo](sat/configs/README_zh.md): Contains the inference code and fine-tuning code of SAT weights. It is
++ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
   recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform
   rapid stacking and development.


@@ -5,11 +5,13 @@
 <div align="center">
 <img src=resources/logo.svg width="50%"/>
 </div>
 <p align="center">
 🤗 Experience the video generation model on the <a href="https://huggingface.co/spaces/THUDM/CogVideoX" target="_blank">CogVideoX Huggingface Space</a>
 </p>
-</div>
+<p align="center">
+📚 View the <a href="resources/CogVideoX.pdf" target="_blank">Paper</a>
+</p>
 <p align="center">
 👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
 </p>
@@ -52,18 +54,18 @@ CogVideoX is the open-source video generation model sharing the same origin as [清影](https://chatglm.cn/video).
 The table below shows the list of video generation models we currently provide, along with related basic information:

 | Model Name                        | CogVideoX-2B                                                 |
-|-----------------------------------|--------------------------------------------------------------|
+|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
 | Prompt Language                   | English                                                      |
-| GPU Memory for Inference (FP16)   | 36GB                                                         |
-| GPU Memory for Fine-tuning (bs=1) | 46.2GB                                                       |
+| GPU Memory for Inference (FP16)   | 36GB using diffusers (will be optimized before the PR is merged) and 25GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
+| GPU Memory for Fine-tuning (bs=1) | 42GB                                                         |
 | Prompt Max Length                 | 226 Tokens                                                   |
 | Video Length                      | 6 seconds                                                    |
 | Frames Per Second                 | 8 frames                                                     |
 | Resolution                        | 720 * 480                                                    |
 | Quantized Inference               | Not Supported                                                |
 | Multi-card Inference              | Not Supported                                                |
 | Download Link                     | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B) |

 ## Project Structure
@@ -77,12 +79,12 @@ CogVideoX is the open-source video generation model sharing the same origin as [清影](https://chatglm.cn/video).
 + [web_demo](inference/web_demo.py): A simple streamlit web application showing how to use the CogVideoX-2B model to generate videos.
 <div style="text-align: center;">
-    <img src="resources/web_demo.png" style="width: 100%%; height: auto;" />
+    <img src="resources/web_demo.png" style="width: 100%; height: auto;" />
 </div>
 ### sat
-+ [sat_demo](sat/configs/README_zh.md): Contains the inference and fine-tuning code for SAT weights. It is recommended to improve
++ [sat_demo](sat/README_zh.md): Contains the inference and fine-tuning code for SAT weights. It is recommended to improve
   on the CogVideoX model structure; innovative researchers can use this code for rapid stacking and development.
 ### tools

BIN resources/CogVideoX.pdf (new file; binary not shown)


@@ -5,7 +5,7 @@ args:
   batch_size: 1
   input_type: txt
   input_file: test.txt
-  sampling_num_frames: 13
+  sampling_num_frames: 13 # Must be 11, 13 or 19
   sampling_fps: 8
   fp16: True
   output_dir: outputs/
@@ -82,13 +82,13 @@ model:
     target: sgm.modules.GeneralConditioner
     params:
       emb_models:
-      - is_trainable: false
-        input_key: txt
-        ucg_rate: 0.1
-        target: sgm.modules.encoders.modules.FrozenT5Embedder
-        params:
-          model_dir: "google/t5-v1_1-xxl"
-          max_length: 226
+        - is_trainable: false
+          input_key: txt
+          ucg_rate: 0.1
+          target: sgm.modules.encoders.modules.FrozenT5Embedder
+          params:
+            model_dir: "google/t5-v1_1-xxl"
+            max_length: 226
     first_stage_config:
       target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
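The inline comment added to `sampling_num_frames` encodes a hard constraint: only 11, 13, or 19 frames are valid. A minimal sketch of a load-time check one could wire into a config loader; `check_sampling_num_frames` and the constant are hypothetical names for illustration, not part of the repo:

```python
# Hypothetical helper: enforces the constraint documented by the config
# comment (sampling_num_frames must be 11, 13 or 19).
VALID_SAMPLING_NUM_FRAMES = (11, 13, 19)


def check_sampling_num_frames(n: int) -> int:
    """Fail fast at config-load time instead of deep inside sampling."""
    if n not in VALID_SAMPLING_NUM_FRAMES:
        raise ValueError(
            f"sampling_num_frames must be one of {VALID_SAMPLING_NUM_FRAMES}, got {n}"
        )
    return n


check_sampling_num_frames(13)  # the shipped default passes
```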


@@ -177,12 +177,13 @@ def sampling_main(args, model_cls):
                 latent = 1.0 / model.scale_factor * samples_z
                 recons = []
-                for i in range(6):
+                loop_num = (T - 1) // 2
+                for i in range(loop_num):
                     if i == 0:
                         start_frame, end_frame = 0, 3
                     else:
                         start_frame, end_frame = i * 2 + 1, i * 2 + 3
-                    if i == 5:
+                    if i == loop_num - 1:
                         clear_fake_cp_cache = True
                     else:
                         clear_fake_cp_cache = False
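The rewritten loop generalizes the previously hardcoded `range(6)` to any latent frame count `T`: the first decode window covers frames 0–3, every later window adds two frames, so `(T - 1) // 2` windows tile all `T` frames. A self-contained sketch of that windowing; the helper name is illustrative, not from the repo:

```python
def decode_windows(T):
    """Return the (start_frame, end_frame) windows the loop above produces.

    First chunk covers frames [0, 3); each later chunk covers two frames;
    the final chunk ends exactly at frame T.
    """
    loop_num = (T - 1) // 2
    windows = []
    for i in range(loop_num):
        if i == 0:
            start_frame, end_frame = 0, 3
        else:
            start_frame, end_frame = i * 2 + 1, i * 2 + 3
        windows.append((start_frame, end_frame))
    return windows


# With the default 13 latent frames, loop_num is 6 -- the old hardcoded bound:
print(decode_windows(13)[0])   # (0, 3)
print(decode_windows(13)[-1])  # (11, 13)
```

For `T = 13` the windows are contiguous and end at frame 13, which is why `range(6)` happened to work before; the new bound also handles `T = 11` and `T = 19`.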