diff --git a/README.md b/README.md
index 3db483e..d5790fd 100644
--- a/README.md
+++ b/README.md
@@ -4,11 +4,13 @@
+

- 🤗 Experience on CogVideoX Huggingface Space

- +

+📚 Check here to view Paper +

👋 Join our WeChat and Discord

@@ -55,18 +57,18 @@ to [清影](https://chatglm.cn/video).
 
 The table below shows the list of video generation models we currently provide, along with related basic information:
 
-| Model Name                                | CogVideoX-2B                                                 |
-|-------------------------------------------|--------------------------------------------------------------|
-| Prompt Language                           | English                                                      |
-| GPU Memory Required for Inference (FP16)  | 36GB (will be optimized before the PR is merged)             |
-| GPU Memory Required for Fine-tuning(bs=1) | 46.2GB                                                       |
-| Prompt Max Length                         | 226 Tokens                                                   |
-| Video Length                              | 6 seconds                                                    |
-| Frames Per Second                         | 8 frames                                                     |
-| Resolution                                | 720 * 480                                                    |
-| Quantized Inference                       | Not Supported                                                |
-| Multi-card Inference                      | Not Supported                                                |
-| Download Link                             | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B) |
+| Model Name                                | CogVideoX-2B                                                                                                                         |
+|-------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------|
+| Prompt Language                           | English                                                                                                                              |
+| GPU Memory Required for Inference (FP16)  | 36GB using diffusers (will be optimized before the PR is merged) and 25GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
+| GPU Memory Required for Fine-tuning(bs=1) | 42GB                                                                                                                                 |
+| Prompt Max Length                         | 226 Tokens                                                                                                                           |
+| Video Length                              | 6 seconds                                                                                                                            |
+| Frames Per Second                         | 8 frames                                                                                                                             |
+| Resolution                                | 720 * 480                                                                                                                            |
+| Quantized Inference                       | Not Supported                                                                                                                        |
+| Multi-card Inference                      | Not Supported                                                                                                                        |
+| Download Link                             | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B)                                                                         |
 
 ## Project Structure
 
@@ -89,7 +91,7 @@ of the **CogVideoX** open-source model.
 
 ### sat
 
-+ [sat_demo](sat/configs/README_zh.md): Contains the inference code and fine-tuning code of SAT weights. It is
++ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
   recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better
   perform rapid stacking and development.
diff --git a/README_zh.md b/README_zh.md
index b831df2..e31ce5c 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -5,11 +5,13 @@
+

- 🤗 在 CogVideoX Huggingface Space 体验视频生成模型

- +

+📚 查看 论文 +

👋 加入我们的 微信Discord

@@ -52,18 +54,18 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
 
 下表战展示目前我们提供的视频生成模型列表,以及相关基础信息:
 
-| 模型名字           | CogVideoX-2B                                                 |
-|----------------|--------------------------------------------------------------|
-| 提示词语言          | English                                                      |
-| 推理显存消耗 (FP-16) | 36GB                                                         |
-| 微调显存消耗 (bs=1)  | 46.2GB                                                       |
-| 提示词长度上限        | 226 Tokens                                                   |
-| 视频长度           | 6 seconds                                                    |
-| 帧率(每秒)         | 8 frames                                                     |
-| 视频分辨率          | 720 * 480                                                    |
-| 量化推理           | 不支持                                                          |
-| 多卡推理           | 不支持                                                          |
-| 权重地址           | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B) |
+| 模型名字           | CogVideoX-2B                                                                                                                         |
+|----------------|--------------------------------------------------------------------------------------------------------------------------------------|
+| 提示词语言          | English                                                                                                                              |
+| 推理显存消耗 (FP-16) | 36GB using diffusers (will be optimized before the PR is merged) and 25GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
+| 微调显存消耗 (bs=1)  | 42GB                                                                                                                                 |
+| 提示词长度上限        | 226 Tokens                                                                                                                           |
+| 视频长度           | 6 seconds                                                                                                                            |
+| 帧率(每秒)         | 8 frames                                                                                                                             |
+| 视频分辨率          | 720 * 480                                                                                                                            |
+| 量化推理           | 不支持                                                                                                                                  |
+| 多卡推理           | 不支持                                                                                                                                  |
+| 权重地址           | 🤗 [CogVideoX-2B](https://huggingface.co/THUDM/CogVideoX-2B)                                                                         |
 
 ## 项目结构
 
@@ -77,12 +79,12 @@ CogVideoX是 [清影](https://chatglm.cn/video) 同源的开源版本视频生
 
 + [web_demo](inference/web_demo.py): 一个简单的streamlit网页应用,展示如何使用 CogVideoX-2B 模型生成视频。
- +
 ### sat
 
-+ [sat_demo](sat/configs/README_zh.md): 包含了 SAT 权重的推理代码和微调代码,推荐基于 CogVideoX
++ [sat_demo](sat/README_zh.md): 包含了 SAT 权重的推理代码和微调代码,推荐基于 CogVideoX
   模型结构进行改进,创新的研究者使用改代码以更好的进行快速的堆叠和开发。
 
 ### tools
diff --git a/resources/CogVideoX.pdf b/resources/CogVideoX.pdf
new file mode 100644
index 0000000..c577181
Binary files /dev/null and b/resources/CogVideoX.pdf differ
diff --git a/sat/configs/cogvideox_2b_infer.yaml b/sat/configs/cogvideox_2b_infer.yaml
index d3c799a..a9cc17c 100644
--- a/sat/configs/cogvideox_2b_infer.yaml
+++ b/sat/configs/cogvideox_2b_infer.yaml
@@ -5,7 +5,7 @@ args:
   batch_size: 1
   input_type: txt
   input_file: test.txt
-  sampling_num_frames: 13
+  sampling_num_frames: 13 # Must be 11, 13 or 19
   sampling_fps: 8
   fp16: True
   output_dir: outputs/
@@ -82,13 +82,13 @@ model:
     target: sgm.modules.GeneralConditioner
     params:
       emb_models:
-      - is_trainable: false
-        input_key: txt
-        ucg_rate: 0.1
-        target: sgm.modules.encoders.modules.FrozenT5Embedder
-        params:
-          model_dir: "google/t5-v1_1-xxl"
-          max_length: 226
+        - is_trainable: false
+          input_key: txt
+          ucg_rate: 0.1
+          target: sgm.modules.encoders.modules.FrozenT5Embedder
+          params:
+            model_dir: "google/t5-v1_1-xxl"
+            max_length: 226
 
   first_stage_config:
     target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
diff --git a/sat/sample_video.py b/sat/sample_video.py
index cce55ba..30d7794 100644
--- a/sat/sample_video.py
+++ b/sat/sample_video.py
@@ -177,12 +177,13 @@ def sampling_main(args, model_cls):
 
         latent = 1.0 / model.scale_factor * samples_z
         recons = []
-        for i in range(6):
+        loop_num = (T - 1) // 2
+        for i in range(loop_num):
             if i == 0:
                 start_frame, end_frame = 0, 3
             else:
                 start_frame, end_frame = i * 2 + 1, i * 2 + 3
-            if i == 5:
+            if i == loop_num - 1:
                 clear_fake_cp_cache = True
             else:
                 clear_fake_cp_cache = False
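The sample_video.py hunk above replaces the hardcoded chunk count of 6 with `loop_num = (T - 1) // 2`, so the VAE decode loop adapts to the latent frame count `T` instead of assuming `T = 13`. A minimal standalone sketch (`decode_chunks` is a hypothetical helper name, not from the repo) shows why this schedule, 3 frames first and then 2 at a time, exactly covers `T` frames for the frame counts the config allows:

```python
def decode_chunks(T):
    """Reproduce the frame-chunking schedule from the sample_video.py hunk:
    the first chunk decodes latent frames [0, 3), every later chunk decodes
    2 more frames, and the fake-CP cache is cleared only on the last chunk."""
    loop_num = (T - 1) // 2
    chunks = []
    for i in range(loop_num):
        if i == 0:
            start_frame, end_frame = 0, 3
        else:
            start_frame, end_frame = i * 2 + 1, i * 2 + 3
        clear_fake_cp_cache = i == loop_num - 1
        chunks.append((start_frame, end_frame, clear_fake_cp_cache))
    return chunks


for T in (11, 13, 19):  # the values cogvideox_2b_infer.yaml allows for sampling_num_frames
    chunks = decode_chunks(T)
    assert chunks[-1][1] == T                                      # chunks end exactly at frame T
    assert all(a[1] == b[0] for a, b in zip(chunks, chunks[1:]))   # and are contiguous
    assert sum(c[2] for c in chunks) == 1 and chunks[-1][2]        # cache cleared once, at the end
```

With the old hardcoded `range(6)`, only `T = 13` decoded cleanly: `T = 11` would index past the end of the latent, and `T = 19` would silently drop frames.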