Update GPU Memory Cost to 24GB (#90)

update GPU memory to 23.9GB
This commit is contained in:
zR 2024-08-07 19:55:41 +08:00 committed by GitHub
parent 0a3eb32b33
commit d0a9e21dad
9 changed files with 228 additions and 114 deletions

README.md
View File

@ -20,20 +20,50 @@
## Update and News
- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed
on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct
the video almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now available on the `CogVideo` branch), the **first** open-source pretrained text-to-video model; you can check the [ICLR'23 CogVideo Paper](https://arxiv.org/abs/2205.15868) for technical details.
**More powerful models with larger parameter sizes are on the way~ Stay tuned!**
## Table of Contents
Jump to a specific section:
- [Quick Start](#Quick-Start)
- [SAT](#sat)
- [Diffusers](#Diffusers)
- [CogVideoX-2B Video Works](#cogvideox-2b-gallery)
- [Introduction to the CogVideoX Model](#Model-Introduction)
- [Full Project Structure](#project-structure)
- [Inference](#inference)
- [SAT](#sat)
- [Tools](#tools)
- [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23)
- [Citations](#Citation)
- [Open Source Project Plan](#Open-Source-Project-Plan)
- [Model License](#Model-License)
## Quick Start
### Prompt Optimization
Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use the GLM-4 model to
optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly affects
the quality of the generated video.
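As a rough illustration of what that guide does, the sketch below sends a short user prompt to an OpenAI-compatible chat endpoint and asks for a detailed caption. The model name, system prompt, and client configuration here are placeholders, not the exact values used in `convert_demo.py`.
```python
# Minimal sketch of the prompt-upsampling idea behind convert_demo.py.
# Assumes an OpenAI-compatible endpoint serving GLM-4; the real script
# defines its own system prompt and parameters.
from openai import OpenAI

client = OpenAI()  # reads API key / base_url from the environment

def enhance_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="glm-4",  # placeholder model name; swap in GPT, Gemini, etc.
        messages=[
            {"role": "system", "content": "Expand the user's idea into a detailed video caption."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(enhance_prompt("A cat playing piano on a rainy night"))
```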
### SAT
Follow the instructions in [sat_demo](sat/README.md): it contains the inference and fine-tuning code for SAT weights. We recommend building on the CogVideoX model structure; innovative researchers can use this code for rapid iteration and development.
(18GB for inference, 40GB for LoRA fine-tuning)
### Diffusers
@ -41,8 +71,9 @@ Follow instructions in [sat_demo](sat/README.md): Contains the inference code an
pip install -r requirements.txt
```
Then follow [diffusers_demo](inference/cli_demo.py): a more detailed explanation of the inference code, covering the significance of common parameters.
(24GB for inference; fine-tuning code is under development)
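For orientation, here is a minimal sketch of diffusers-based inference in the spirit of `cli_demo.py`. It assumes diffusers 0.30.0 or newer with the CogVideoX pipeline; the prompt and output path are placeholders, and the repository's own demo additionally uses an imageio-based export helper to avoid a "green screen" artifact.
```python
# Minimal single-GPU sketch (assumes diffusers >= 0.30.0 with CogVideoXPipeline).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()  # keeps peak GPU memory near the ~24GB figure above

video = pipe(
    prompt="A panda playing guitar in a bamboo forest, cinematic lighting",  # placeholder prompt
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```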
## CogVideoX-2B Gallery
@ -77,14 +108,14 @@ along with related basic information:
| Model Name | CogVideoX-2B |
|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Prompt Language | English |
| Single GPU Inference (FP16) | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) <br> 23.9GB using diffusers |
| Multi GPUs Inference (FP16) | 20GB minimum per GPU using diffusers |
| GPU Memory Required for Fine-tuning(bs=1) | 40GB |
| Prompt Max Length | 226 Tokens |
| Video Length | 6 seconds |
| Frames Per Second | 8 frames |
| Resolution | 720 * 480 |
| Quantized Inference | Not Supported |
| Download Link (HF diffusers Model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
| Download Link (SAT Model) | [SAT](./sat/README.md) |
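The multi-GPU row corresponds to loading the pipeline with a balanced device map instead of CPU offloading, as noted in the comment added to `inference/cli_demo.py` in this commit. A sketch, assuming two or more GPUs with more than 20GB of memory each:
```python
# Multi-GPU sketch: shard the pipeline across GPUs with a balanced device map.
# Assumes 2+ GPUs with more than 20GB of memory each; do not combine with
# pipe.enable_model_cpu_offload() in this configuration.
import torch
from diffusers import CogVideoXPipeline

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
    device_map="balanced",  # split submodules across the available GPUs
)
video = pipe(prompt="A lighthouse in a storm", num_inference_steps=50, guidance_scale=6.0).frames[0]
```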
@ -95,16 +126,25 @@ of the **CogVideoX** open-source model.
### Inference
+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, mentioning the
significance of common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Executing the VAE inference code alone currently requires 71GB of
memory, but it will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long captions, an LLM is used to convert the input text so that it matches the training distribution. The script uses GLM-4 by default, but it can be replaced with any other LLM such as GPT, Gemini, etc.
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple gradio web UI demonstrating how to use the CogVideoX-2B model to generate videos.
<div style="text-align: center;">
<img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
</div>
+ [streamlit_web_demo](inference/streamlit_web_demo.py): A simple streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.
<div style="text-align: center;">
<img src="resources/web_demo.png" style="width: 100%; height: auto;" />
@ -112,40 +152,25 @@ of the **CogVideoX** open-source model.
### sat
+ [sat_demo](sat/README.md): Contains the inference and fine-tuning code for SAT weights. We recommend building on the CogVideoX model structure; innovative researchers can use this code for rapid iteration and development.
### Tools
This folder contains some tools for model conversion / caption generation, etc.
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Convert SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption): Caption tool, a model that understands videos and outputs them in text.
## CogVideo(ICLR'23)
The official repo for the
paper: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo)
**CogVideo is able to generate relatively high-frame-rate videos.**
A 4-second clip of 32 frames is shown below.
![High-frame-rate sample](https://raw.githubusercontent.com/THUDM/CogVideo/CogVideo/assets/appendix-sample-highframerate.png)
@ -155,8 +180,8 @@ A 4-second clip of 32 frames is shown below.
</div>
The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can get
hands-on practice on text-to-video generation. *The original input is in Chinese.*
## Citation
@ -175,3 +200,23 @@ The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.a
year={2022}
}
```
## Open Source Project Plan
- [x] Open source CogVideoX model
- [x] Open source 3D Causal VAE used in CogVideoX.
- [x] CogVideoX model inference example (CLI / Web Demo)
- [x] CogVideoX online experience demo (Huggingface Space)
- [x] CogVideoX open source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open source CogVideoX-Pro (adapted for CogVideoX-2B suite)
- [x] Release CogVideoX technical report
We welcome your contributions. You can click [here](resources/contribute.md) for more information.
## Model License
The code in this repository is released under the [Apache 2.0 License](LICENSE).
The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).

View File

@ -21,18 +21,43 @@
## Update and News
- 🔥 **News**: ```2024/8/7```: CogVideoX has been merged into `diffusers` version 0.30.0. Inference can now run on a single 3090 GPU; see the [code](inference/cli_demo.py) for details.
- 🔥 **News**: ```2024/8/6```: We have open-sourced the **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct videos almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced the CogVideo video generation model (now available on the `CogVideo` branch), the first open-source large Transformer-based text-to-video model; see the [ICLR'23 paper](https://arxiv.org/abs/2205.15868) for technical details.
**More powerful models with larger parameter sizes are on the way. Stay tuned!**
## Table of Contents
Jump to a specific section:
- [Quick Start](#quick-start)
- [SAT](#sat)
- [Diffusers](#diffusers)
- [CogVideoX-2B Gallery](#cogvideox-2b-gallery)
- [Introduction to the CogVideoX Model](#model-introduction)
- [Full Project Code Structure](#full-project-code-structure)
- [Inference](#inference)
- [SAT](#sat)
- [Tools](#tools)
- [Open Source Project Plan](#open-source-project-plan)
- [Model License](#model-license)
- [Introduction to the CogVideo(ICLR'23) Model](#cogvideoiclr23)
- [Citation](#citation)
## Quick Start
### Prompt Optimization
Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use the GLM-4 model to optimize the prompt. This is important because the model is trained on long prompts, and a good prompt directly affects the quality of the generated video.
### SAT
See [sat_demo](sat/README.md) under the sat folder: it contains the inference and fine-tuning code for SAT weights. We recommend building on this code to improve the CogVideoX model structure; researchers can use it for rapid iteration and development.
(18GB for inference, 40GB for LoRA fine-tuning)
### Diffusers
@ -40,7 +65,7 @@
pip install -r requirements.txt
```
See [diffusers_demo](inference/cli_demo.py): a more detailed explanation of the inference code, covering the key parameters. (24GB for inference; fine-tuning code is under development)
## CogVideoX-2B Gallery
@ -70,21 +95,21 @@ CogVideoX is the open-source video generation model sharing the same origin as [QingYing](https://chatglm.cn/video?fr=osm_cogvideox).
The table below shows the video generation models we currently provide, along with related basic information:
| Model Name | CogVideoX-2B |
|-----------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| Prompt Language | English |
| Single GPU Inference (FP16) Memory Usage | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer) <br> 23.9GB using diffusers |
| Multi GPU Inference (FP16) Memory Usage | 20GB minimum per GPU using diffusers |
| Fine-tuning Memory Usage (bs=1) | 42GB |
| Prompt Max Length | 226 Tokens |
| Video Length | 6 seconds |
| Frames Per Second | 8 frames |
| Resolution | 720 * 480 |
| Quantized Inference | Not Supported |
| Download Link (Diffusers Model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
| Download Link (SAT Model) | [SAT](./sat/README_zh.md) |
## Full Project Code Structure
This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples of the **CogVideoX** open-source model.
@ -92,14 +117,15 @@ CogVideoX is the open-source video generation model sharing the same origin as [QingYing](https://chatglm.cn/video?fr=osm_cogvideox).
+ [diffusers_demo](inference/cli_demo.py): A more detailed walkthrough of the inference code, covering the meaning of common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code on its own currently requires 71GB of GPU memory; this will be optimized in the future.
+ [convert_demo](inference/convert_demo.py): How to convert user input into a long prompt suitable for CogVideoX. Because CogVideoX is trained on long text, an LLM is used to convert the input text so that its distribution matches the training data. The script uses GLM-4 by default, but it can also be replaced with GPT, Gemini, or any other large language model.
+ [gradio_web_demo](inference/gradio_web_demo.py): A simple gradio web application demonstrating how to use the CogVideoX-2B model to generate videos.
<div style="text-align: center;">
<img src="resources/gradio_demo.png" style="width: 100%; height: auto;" />
</div>
+ [streamlit_web_demo](inference/streamlit_web_demo.py): A simple streamlit web application demonstrating how to use the CogVideoX-2B model to generate videos.
<div style="text-align: center;">
<img src="resources/web_demo.png" style="width: 100%; height: auto;" />
@ -117,27 +143,10 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption/README_zh.md): Caption tool, a model that understands videos and describes them in text.
## CogVideo(ICLR'23)
The official repo for the paper [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo).
**CogVideo can generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.**
@ -150,11 +159,12 @@ CogVideoX is the open-source video generation model sharing the same origin as [QingYing](https://chatglm.cn/video?fr=osm_cogvideox)
<video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="80%" controls autoplay></video>
</div>
The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/), where you can try text-to-video generation hands-on. *The original input is in Chinese.*
## Citation
🌟 If you find our work helpful, please cite our papers and leave us a star.
```
@article{yang2024cogvideox,
@ -168,4 +178,23 @@ The demo for CogVideo is at [https://models.aminer.cn/cogvideo](https://models.amine
journal={arXiv preprint arXiv:2205.15868},
year={2022}
}
```
## Open Source Project Plan
- [x] Open source CogVideoX model
- [x] CogVideoX model inference example (CLI / Web Demo)
- [x] CogVideoX online experience demo (Huggingface Space)
- [x] CogVideoX open source model API interface example (Huggingface)
- [x] CogVideoX model fine-tuning example (SAT)
- [ ] CogVideoX model fine-tuning example (Huggingface / SAT)
- [ ] Open source CogVideoX-Pro (adapted for the CogVideoX-2B suite)
- [x] Release the CogVideoX technical report
We welcome your contributions. You can click [here](resources/contribute_zh.md) for more information.
## Model License
The code in this repository is released under the [Apache 2.0 License](LICENSE).
The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).

View File

@ -22,7 +22,7 @@ from diffusers import CogVideoXPipeline
def export_to_video_imageio(
video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
) -> str:
"""
Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX)
@ -38,14 +38,14 @@ def export_to_video_imageio(
def generate_video(
prompt: str,
model_path: str,
output_path: str = "./output.mp4",
num_inference_steps: int = 50,
guidance_scale: float = 6.0,
num_videos_per_prompt: int = 1,
device: str = "cuda",
dtype: torch.dtype = torch.float16,
):
"""
Generates a video based on the given prompt and saves it to the specified path.
@ -62,7 +62,10 @@ def generate_video(
"""
# Load the pre-trained CogVideoX pipeline with the specified precision (float16) and move it to the specified device
# To enable multi-GPU inference (2 or more GPUs, each with more than 20GB of memory),
# add device_map="balanced" to the from_pretrained call below and remove `pipe.enable_model_cpu_offload()`.
pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
pipe.enable_model_cpu_offload()
# Encode the prompt to get the prompt embeddings
prompt_embeds, _ = pipe.encode_prompt(

View File

@ -16,7 +16,8 @@ import PIL
dtype = torch.bfloat16
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=dtype)
pipe.enable_model_cpu_offload()
sys_prompt = """You are part of a team of bots that creates videos. You work with an assistant bot that will draw anything you say in square brackets.

View File

@ -39,7 +39,9 @@ def load_model(model_path: str, dtype: torch.dtype, device: str) -> CogVideoXPip
Returns:
- CogVideoXPipeline: Loaded model pipeline.
"""
pipe = CogVideoXPipeline.from_pretrained(model_path, torch_dtype=dtype)
pipe.enable_model_cpu_offload()
return pipe
# Define a function to generate video based on the provided prompt and model path
@ -76,7 +78,7 @@ def generate_video(
device=device,
dtype=dtype,
)
# Generate video
video = pipe(
num_inference_steps=num_inference_steps,

View File

@ -1,8 +1,9 @@
diffusers>=0.30.0
SwissArmyTransformer==0.4.11 # Inference
torch==2.4.0
torchvision==0.19.0
gradio==4.40.0 # For HF gradio demo
streamlit==1.37.0 # For web demo
opencv-python==4.10
imageio-ffmpeg==0.5.1
openai==1.38.0

View File

@ -1,6 +1,7 @@
# SAT CogVideoX-2B
This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
fine-tuning code for SAT weights.
This code is the framework used by the team to train the model. It has few comments and requires careful study.
@ -41,12 +42,27 @@ Then unzip, the model structure should look like this:
Next, clone the T5 model. It is not used for training or fine-tuning, but it is required.
```shell
git lfs install
git clone https://huggingface.co/THUDM/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```
The **tf_model.h5** file is not needed and can be deleted.
Following the steps above, you will obtain a T5 model in safetensors format that DeepSpeed can load without errors during fine-tuning; a quick loading check is sketched after the file listing below.
```
├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json
0 directories, 8 files
```
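To verify the assembled folder before fine-tuning, the tokenizer and encoder can be loaded with `transformers`. This snippet is only an illustration and assumes it is run from the directory containing the `t5-v1_1-xxl` folder shown above.
```python
# Sanity check: confirm the safetensors T5 weights and tokenizer load cleanly.
# Assumes the t5-v1_1-xxl folder assembled above sits in the current directory.
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained("t5-v1_1-xxl", torch_dtype="auto")
print(type(tokenizer).__name__, encoder.config.d_model)  # class name and hidden size if loading succeeded
```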
3. Modify the file `configs/cogvideox_2b_infer.yaml`.
@ -101,6 +117,9 @@ bash inference.sh
### Preparing the Environment
Please note that currently, SAT needs to be installed from the source code for proper fine-tuning. We will address this
issue in future stable releases.
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
cd SwissArmyTransformer
@ -130,7 +149,8 @@ For style fine-tuning, please prepare at least 50 videos and labels with similar
### Modifying the Configuration File
We support both `Lora` and `full-parameter fine-tuning` methods. Please note that both fine-tuning methods only apply to
the `transformer` part. The `VAE part` is not modified. `T5` is only used as an Encoder.
the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows.

View File

@ -41,13 +41,24 @@ unzip transformer.zip
Next, clone the T5 model. It is not used for training or fine-tuning, but it is required.
```shell
git lfs install
git clone https://huggingface.co/THUDM/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```
Following the steps above, you will obtain a T5 model in safetensors format, which can be read by DeepSpeed during fine-tuning without errors.
```
├── added_tokens.json
├── config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── spiece.model
└── tokenizer_config.json
0 directories, 8 files
```
The **tf_model.h5** file is not needed and can be deleted.
3. Modify the file `configs/cogvideox_2b_infer.yaml`.
```yaml
@ -101,6 +112,8 @@ bash inference.sh
### Preparing the Environment
Please note that SAT currently needs to be installed from source for fine-tuning to work correctly; we will address this in a future stable release.
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
cd SwissArmyTransformer