diff --git a/README_zh.md b/README_zh.md
index 2e26810..61453aa 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -21,24 +21,26 @@
## 项目更新
+- 🔥 **News**: ``2024/8/7``: CogVideoX 已经合并入 `diffusers` 0.30.0版本,单张3090可以推理,详情请见[代码](inference/cli_demo.py)。
- 🔥 **News**: ``2024/8/6``: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。
- 🔥 **News**: ``2024/8/6``: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
-- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于 Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
-**性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
+- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于
+ Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
+ **性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
## 目录
跳转到指定部分:
- [快速开始](#快速开始)
- - [SAT](#sat)
- - [Diffusers](#Diffusers)
+ - [SAT](#sat)
+ - [Diffusers](#Diffusers)
- [CogVideoX-2B 视频作品](#cogvideox-2b-视频作品)
- [CogVideoX模型介绍](#模型介绍)
- [完整项目代码结构](#完整项目代码结构)
- - [Inference](#inference)
- - [SAT](#sat)
- - [Tools](#tools)
+ - [Inference](#inference)
+ - [SAT](#sat)
+ - [Tools](#tools)
- [开源项目规划](#开源项目规划)
- [模型协议](#模型协议)
- [CogVideo(ICLR'23)模型介绍](#cogvideoiclr23)
@@ -53,8 +55,9 @@
### SAT
-查看sat文件夹下的[sat_demo](sat/README.md):包含了 SAT 权重的推理代码和微调代码,推荐基于此代码进行 CogVideoX 模型结构的改进,研究者使用该代码可以更好的进行快速的迭代和开发。
- (18 GB 推理, 40GB lora微调)
+查看sat文件夹下的[sat_demo](sat/README.md):包含了 SAT 权重的推理代码和微调代码,推荐基于此代码进行 CogVideoX
+模型结构的改进,研究者使用该代码可以更好地进行快速迭代和开发。
+(18GB 推理,40GB LoRA 微调)
### Diffusers
@@ -64,7 +67,6 @@ pip install -r requirements.txt
查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(36GB 推理,显存优化以及微调代码正在开发)
-
## CogVideoX-2B 视频作品
@@ -93,19 +95,19 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
下表战展示目前我们提供的视频生成模型列表,以及相关基础信息:
-| 模型名字 | CogVideoX-2B |
-|---------------------|--------------------------------------------------------------------------------------------------------------------------------------|
-| 提示词语言 | English |
-| 推理显存消耗 (FP-16) | 36GB using diffusers (will be optimized before the PR is merged) and 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) |
-| 微调显存消耗 (bs=1) | 42GB |
-| 提示词长度上限 | 226 Tokens |
-| 视频长度 | 6 seconds |
-| 帧率(每秒) | 8 frames |
-| 视频分辨率 | 720 * 480 |
-| 量化推理 | 不支持 |
-| 多卡推理 | 不支持 |
-| 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
-| 下载地址 (SAT 模型) | [SAT](./sat/README_zh.md) |
+| 模型名 | CogVideoX-2B |
+|---------------------|-------------------------------------------------------------------------------------------------------------------------------|
+| 提示词语言 | English |
+| 单GPU推理 (FP-16) 显存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) <br> 23.9GB using diffusers                                   |
+| 多GPU推理 (FP-16) 显存消耗 | 20GB minimum per GPU using diffusers |
+| 微调显存消耗 (bs=1) | 42GB |
+| 提示词长度上限 | 226 Tokens |
+| 视频长度 | 6 seconds |
+| 帧率(每秒) | 8 frames |
+| 视频分辨率 | 720 * 480 |
+| 量化推理 | 不支持 |
+| 下载地址 (Diffusers 模型) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
+| 下载地址 (SAT 模型) | [SAT](./sat/README_zh.md) |
## 完整项目代码结构
@@ -115,7 +117,8 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
+ [diffusers_demo](inference/cli_demo.py): 更详细的推理代码讲解,常见参数的意义,在这里都会提及。
+ [diffusers_vae_demo](inference/cli_vae_demo.py): 单独执行VAE的推理代码,目前需要71GB显存,将来会优化。
-+ [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合 CogVideoX的长输入。因为CogVideoX是在长文本上训练的,所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4,也可以替换为GPT、Gemini等任意大语言模型。
++ [convert_demo](inference/convert_demo.py): 如何将用户的输入转换成适合
+ CogVideoX的长输入。因为CogVideoX是在长文本上训练的,所以我们需要把输入文本的分布通过LLM转换为和训练一致的长文本。脚本中默认使用GLM4,也可以替换为GPT、Gemini等任意大语言模型。
+ [gradio_web_demo](inference/gradio_web_demo.py): 一个简单的gradio网页应用,展示如何使用 CogVideoX-2B 模型生成视频。
@@ -140,9 +143,10 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): 将 SAT 模型权重转换为 Huggingface 模型权重。
+ [caption_demo](tools/caption/README_zh.md): Caption 工具,对视频理解并用文字输出的模型。
+## CogVideo(ICLR'23)
-## CogVideo(ICLR'23)
- [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) 的官方repo位于[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)。
+[CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
+的官方repo位于[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)。
**CogVideo可以生成高帧率视频,下面展示了一个32帧的4秒视频。**
@@ -155,11 +159,12 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
-CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/)。您可以在这里体验文本到视频生成。*原始输入为中文。*
+CogVideo的demo网站在[https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/)。您可以在这里体验文本到视频生成。
+*原始输入为中文。*
## 引用
-🌟 如果您发现我们的工作有所帮助,欢迎引用我们的文章,留下宝贵的stars
+🌟 如果您发现我们的工作有所帮助,欢迎引用我们的文章,留下宝贵的stars
```
@article{yang2024cogvideox,
diff --git a/inference/cli_demo.py b/inference/cli_demo.py
index a1bb764..0358ce7 100644
--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -22,7 +22,7 @@ from diffusers import CogVideoXPipeline
def export_to_video_imageio(
- video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
+ video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
) -> str:
"""
Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX)
@@ -38,17 +38,34 @@ def export_to_video_imageio(
def generate_video(
- prompt: str,
- model_path: str,
- output_path: str = "./output.mp4",
- num_inference_steps: int = 50,
- guidance_scale: float = 6.0,
- num_videos_per_prompt: int = 1,
- device: str = "cuda",
- dtype: torch.dtype = torch.float16,
+ prompt: str,
+ model_path: str,
+ output_path: str = "./output.mp4",
+ num_inference_steps: int = 50,
+ guidance_scale: float = 6.0,
+ num_videos_per_prompt: int = 1,
+ device: str = "cuda",
+ dtype: torch.dtype = torch.float16,
):
+ """
+ Generates a video based on the given prompt and saves it to the specified path.
+
+ Parameters:
+ - prompt (str): The description of the video to be generated.
+ - model_path (str): The path of the pre-trained model to be used.
+ - output_path (str): The path where the generated video will be saved.
+ - num_inference_steps (int): Number of steps for the inference process. More steps can result in better quality.
+ - guidance_scale (float): The scale for classifier-free guidance. Higher values can lead to better alignment with the prompt.
+ - num_videos_per_prompt (int): Number of videos to generate per prompt.
+ - device (str): The device to use for computation (e.g., "cuda" or "cpu").
+ - dtype (torch.dtype): The data type for computation (default is torch.float16).
+ """
+
# Load the pre-trained CogVideoX pipeline with the specified precision (float16) and move it to the specified device
- pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype).to(device)
+    # To enable multi-GPU inference (2 or more GPUs, each with more than 20GB of memory),
+    # pass device_map="balanced" to from_pretrained and remove `pipe.enable_model_cpu_offload()`.
+ pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
+ pipe.enable_model_cpu_offload()
# Encode the prompt to get the prompt embeddings
prompt_embeds, _ = pipe.encode_prompt(
@@ -60,18 +77,19 @@ def generate_video(
device=device, # Device to use for computation
dtype=dtype, # Data type for computation
)
- # Must enable model CPU offload to avoid OOM issue on GPU with 24GB memory
- pipe.enable_model_cpu_offload()
+
# Generate the video frames using the pipeline
video = pipe(
- num_inference_steps=num_inference_steps, # Number of inference steps
+ num_inference_steps=5, # Number of inference steps
guidance_scale=guidance_scale, # Guidance scale for classifier-free guidance
prompt_embeds=prompt_embeds, # Encoded prompt embeddings
negative_prompt_embeds=torch.zeros_like(prompt_embeds), # Not Supported negative prompt
).frames[0]
+
# Export the generated frames to a video file. fps must be 8
export_to_video_imageio(video, output_path, fps=8)
+
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Generate a video from a text prompt using CogVideoX")
parser.add_argument("--prompt", type=str, required=True, help="The description of the video to be generated")
diff --git a/inference/gradio_web_demo.py b/inference/gradio_web_demo.py
index 9f36254..4b4cad0 100644
--- a/inference/gradio_web_demo.py
+++ b/inference/gradio_web_demo.py
@@ -16,7 +16,8 @@ import PIL
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"
-pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype).to(device)
+pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype)
+pipe.enable_model_cpu_offload()
sys_prompt = """You are part of a team of bots that creates videos. You work with an assistant bot that will draw anything you say in square brackets.
@@ -104,7 +105,7 @@ def infer(
device=device,
dtype=dtype,
)
- pipe.enable_model_cpu_offload()
+
video = pipe(
num_inference_steps=num_inference_steps,
guidance_scale=guidance_scale,
diff --git a/inference/streamlit_web_demo.py b/inference/streamlit_web_demo.py
index 6df62db..342d85b 100644
--- a/inference/streamlit_web_demo.py
+++ b/inference/streamlit_web_demo.py
@@ -39,7 +39,9 @@ def load_model(model_path: str, dtype: torch.dtype, device: str) -> CogVideoXPip
Returns:
- CogVideoXPipeline: Loaded model pipeline.
"""
- return CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype).to(device)
+ pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
+ pipe.enable_model_cpu_offload()
+ return pipe
# Define a function to generate video based on the provided prompt and model path
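All three demos in this series switch from `pipe.to(device)` to `pipe.enable_model_cpu_offload()`. The sketch below (not part of the patch) illustrates that trade-off: offloading keeps only the currently executing sub-model on the GPU, which lowers peak VRAM (the READMEs quote roughly 24GB) at some cost in speed. The prompt is a placeholder and the printed figure depends on hardware.

```python
# Illustrative only -- the single-GPU, low-memory loading pattern used by the demos above.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16)

# Instead of pipe.to("cuda"), which keeps the text encoder, transformer and VAE resident
# on the GPU for the whole run, model CPU offload moves each sub-model onto the GPU only
# while it executes and back to CPU RAM afterwards.
pipe.enable_model_cpu_offload()

video = pipe(prompt="A sailboat drifting at sunset.", num_inference_steps=50).frames[0]
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.1f} GB")
```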
From 9ffa0bea284935e37fe155e6b2a1c2db2f95c6fc Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:28:54 +0800
Subject: [PATCH 07/11] 2
---
README.md | 6 +++---
README_zh.md | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/README.md b/README.md
index fba09da..8ce33cf 100644
--- a/README.md
+++ b/README.md
@@ -20,11 +20,11 @@
## Update and News
-- 🔥 **News**: `2024/8/7`: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed
+- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed
on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
-- 🔥 **News**: ``2024/8/6``: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct
+- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct
the video almost losslessly.
-- 🔥 **News**: ``2024/8/6``: We have open-sourced **CogVideoX-2B**,the first model in the CogVideoX series of video
+- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video
generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now you can see in `CogVideo` branch),the **first**
open-sourced pretrained text-to-video model, and you can
diff --git a/README_zh.md b/README_zh.md
index 61453aa..274090a 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -21,9 +21,9 @@
## 项目更新
-- 🔥 **News**: ``2024/8/7``: CogVideoX 已经合并入 `diffusers` 0.30.0版本,单张3090可以推理,详情请见[代码](inference/cli_demo.py)。
-- 🔥 **News**: ``2024/8/6``: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。
-- 🔥 **News**: ``2024/8/6``: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
+- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers` 0.30.0版本,单张3090可以推理,详情请见[代码](inference/cli_demo.py)。
+- 🔥 **News**: ```2024/8/6```: 我们开源 **3D Causal VAE**,用于 **CogVideoX-2B**,可以几乎无损地重构视频。
+- 🔥 **News**: ```2024/8/6```: 我们开源 CogVideoX 系列视频生成模型的第一个模型, **CogVideoX-2B**。
- 🌱 **Source**: ```2022/5/19```: 我们开源了 CogVideo 视频生成模型(现在你可以在 `CogVideo` 分支中看到),这是首个开源的基于
Transformer 的大型文本生成视频模型,您可以访问 [ICLR'23 论文](https://arxiv.org/abs/2205.15868) 查看技术细节。
**性能更强,参数量更大的模型正在到来的路上~,欢迎关注**
From 71399f755812a05de3f59debc7bf0d8eac668ad3 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:43:50 +0800
Subject: [PATCH 08/11] update GPU memory to 23.9GB
---
README.md | 1 -
README_zh.md | 2 +-
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 8ce33cf..159ce21 100644
--- a/README.md
+++ b/README.md
@@ -73,7 +73,6 @@ pip install -r requirements.txt
Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
significance of common parameters.
-(36GB for inference, smaller memory and fine-tuned code are under development)
## CogVideoX-2B Gallery
diff --git a/README_zh.md b/README_zh.md
index 274090a..970d58c 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -65,7 +65,7 @@
pip install -r requirements.txt
```
-查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(36GB 推理,显存优化以及微调代码正在开发)
+查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。
## CogVideoX-2B 视频作品
From 54546d0f8907ebd568153d36a119a9368bbdbbe2 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:45:36 +0800
Subject: [PATCH 09/11] fix MODEL_LICENSE
---
Model_License => MODEL_LICENSE | 0
inference/cli_demo.py | 2 +-
2 files changed, 1 insertion(+), 1 deletion(-)
rename Model_License => MODEL_LICENSE (100%)
diff --git a/Model_License b/MODEL_LICENSE
similarity index 100%
rename from Model_License
rename to MODEL_LICENSE
diff --git a/inference/cli_demo.py b/inference/cli_demo.py
index 0358ce7..d069f02 100644
--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -80,7 +80,7 @@ def generate_video(
# Generate the video frames using the pipeline
video = pipe(
- num_inference_steps=5, # Number of inference steps
+ num_inference_steps=num_inference_steps, # Number of inference steps
guidance_scale=guidance_scale, # Guidance scale for classifier-free guidance
prompt_embeds=prompt_embeds, # Encoded prompt embeddings
negative_prompt_embeds=torch.zeros_like(prompt_embeds), # Not Supported negative prompt
From 6fc9de04dc99f12c88c5dcf1a9932d7d8c403b39 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Wed, 7 Aug 2024 19:53:40 +0800
Subject: [PATCH 10/11] restore
---
README.md | 1 +
README_zh.md | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/README.md b/README.md
index 159ce21..b85f813 100644
--- a/README.md
+++ b/README.md
@@ -73,6 +73,7 @@ pip install -r requirements.txt
Then follow [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
significance of common parameters.
+(24GB for inference, fine-tuning code is under development)
## CogVideoX-2B Gallery
diff --git a/README_zh.md b/README_zh.md
index 970d58c..bf97f15 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -65,7 +65,7 @@
pip install -r requirements.txt
```
-查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。
+查看[diffusers_demo](inference/cli_demo.py):包含对推理代码更详细的解释,包括各种关键的参数。(24GB 推理,微调代码正在开发)
## CogVideoX-2B 视频作品
From 8c0d0eb42712fa42d5a25a2f595cdb05fcc75fa6 Mon Sep 17 00:00:00 2001
From: zR <2448370773@qq.com>
Date: Fri, 9 Aug 2024 13:46:06 +0800
Subject: [PATCH 11/11] update multi gpus finetune script
---
README.md | 4 ++++
README_zh.md | 2 +-
sat/README.md | 11 ++++++++---
sat/README_zh.md | 7 +++++--
sat/data_video.py | 2 +-
sat/finetune_multi_gpus.sh | 10 ++++++++++
sat/{finetune.sh => finetune_single_gpu.sh} | 0
7 files changed, 29 insertions(+), 7 deletions(-)
create mode 100644 sat/finetune_multi_gpus.sh
rename sat/{finetune.sh => finetune_single_gpu.sh} (100%)
diff --git a/README.md b/README.md
index b85f813..0801638 100644
--- a/README.md
+++ b/README.md
@@ -60,6 +60,8 @@ the quality of the generated video.
### SAT
+**Please make sure your Python version is between 3.10 and 3.12, inclusive.**
+
Follow instructions in [sat_demo](sat/README.md): Contains the inference code and fine-tuning code of SAT weights. It is
recommended to improve based on the CogVideoX model structure. Innovative researchers use this code to better perform
rapid stacking and development.
@@ -67,6 +69,8 @@ rapid stacking and development.
### Diffusers
+**Please make sure your Python version is between 3.10 and 3.12, inclusive.**
+
```
pip install -r requirements.txt
```
diff --git a/README_zh.md b/README_zh.md
index bf97f15..5bd0e7e 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -93,7 +93,7 @@ pip install -r requirements.txt
CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源版本视频生成模型。
-下表战展示目前我们提供的视频生成模型列表,以及相关基础信息:
+下表展示目前我们提供的视频生成模型列表,以及相关基础信息:
| 模型名 | CogVideoX-2B |
|---------------------|-------------------------------------------------------------------------------------------------------------------------------|
diff --git a/sat/README.md b/sat/README.md
index 7325be0..f55445e 100644
--- a/sat/README.md
+++ b/sat/README.md
@@ -117,8 +117,12 @@ bash inference.sh
### Preparing the Environment
-Please note that currently, SAT needs to be installed from the source code for proper fine-tuning. We will address this
-issue in future stable releases.
+Please note that currently, SAT needs to be installed from the source code for proper fine-tuning.
+
+You need to install SAT from source because the features required for fine-tuning have not yet been
+released in the pip package.
+
+We will address this issue in future stable releases.
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
@@ -197,7 +201,8 @@ model:
1. Run the inference code to start fine-tuning.
```shell
-bash finetune.sh
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
```
### Converting to Huggingface Diffusers Supported Weights
diff --git a/sat/README_zh.md b/sat/README_zh.md
index 61f00f6..3335e52 100644
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@@ -112,7 +112,9 @@ bash inference.sh
### 准备环境
-请注意,目前,SAT需要从源码安装,才能正常微调, 我们将会在未来的稳定版本解决这个问题。
+请注意,目前,SAT需要从源码安装,才能正常微调。
+这是因为你需要使用尚未发布到 pip 包版本的最新代码所支持的功能。
+我们将会在未来的稳定版本解决这个问题。
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
@@ -189,7 +191,8 @@ model:
1. 运行推理代码,即可开始微调。
```shell
-bash finetune.sh
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
```
### 转换到 Huggingface Diffusers 库支持的权重
diff --git a/sat/data_video.py b/sat/data_video.py
index ccfea46..3783340 100644
--- a/sat/data_video.py
+++ b/sat/data_video.py
@@ -425,7 +425,7 @@ class SFTDataset(Dataset):
self.videos_list.append(tensor_frms)
# caption
- caption_path = os.path.join(root, filename.replace("videos", "labels").replace(".mp4", ".txt"))
+ caption_path = os.path.join(root, filename.replace(".mp4", ".txt")).replace("videos", "labels")
if os.path.exists(caption_path):
caption = open(caption_path, "r").read().splitlines()[0]
else:
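A quick illustrative check (not part of the patch) of why `caption_path` was reordered in `data_video.py`: when the `videos` directory sits in the path prefix rather than in the file name, the old expression only rewrites the file name, while the new one rewrites the whole joined path. The paths below are hypothetical.

```python
import os

root = "/data/videos/part0"   # hypothetical dataset root containing a videos/ component
filename = "clip_001.mp4"

# Old expression: replace() runs on the bare filename, so the directory stays under videos/.
old = os.path.join(root, filename.replace("videos", "labels").replace(".mp4", ".txt"))
# New expression: replace() runs on the joined path, so the directory becomes labels/.
new = os.path.join(root, filename.replace(".mp4", ".txt")).replace("videos", "labels")

print(old)  # /data/videos/part0/clip_001.txt
print(new)  # /data/labels/part0/clip_001.txt
```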
diff --git a/sat/finetune_multi_gpus.sh b/sat/finetune_multi_gpus.sh
new file mode 100644
index 0000000..d6b6383
--- /dev/null
+++ b/sat/finetune_multi_gpus.sh
@@ -0,0 +1,10 @@
+#! /bin/bash
+
+echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
+
+run_cmd="torchrun --standalone --nproc_per_node=4 train_video.py --base configs/cogvideox_2b_sft.yaml --seed $RANDOM“
+
+echo ${run_cmd}
+eval ${run_cmd}
+
+echo "DONE on `hostname`"
\ No newline at end of file
diff --git a/sat/finetune.sh b/sat/finetune_single_gpu.sh
similarity index 100%
rename from sat/finetune.sh
rename to sat/finetune_single_gpu.sh