diff --git a/README.md b/README.md index 63ac86a..441f01c 100644 --- a/README.md +++ b/README.md @@ -21,9 +21,11 @@ ## Update and News -- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed +- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be + performed on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py). -- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct +- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can + reconstruct the video almost losslessly. - 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models. @@ -55,9 +57,9 @@ Jump to a specific section: ### Prompt Optimization -Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use the GLM-4 model to -optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly affects -the quality of the generated video. +Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like +GLM-4 (or other comparable products, such as GPT-4) to optimize the prompt. This is crucial because the model is trained +with long prompts, and a good prompt directly impacts the quality of the generated video. ### SAT @@ -124,6 +126,15 @@ along with related basic information: | Download Link (HF diffusers Model) | 🀗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🀖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) | | Download Link (SAT Model) | [SAT](./sat/README.md) | +## Friendly Links + +We highly welcome contributions from the community and actively contribute to the open-source community. The following +works have already been adapted for CogVideoX, and we invite everyone to use them: + ++ [Xorbits Inference](https://github.com/xorbitsai/inference): A powerful and comprehensive distributed inference + framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one + click. 

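For quick reference, the single-GPU inference path that `inference/cli_demo.py` wraps can be sketched in a few lines of `diffusers`. This is a minimal, non-authoritative example, assuming `diffusers>=0.30.0`, the `THUDM/CogVideoX-2b` weights from Hugging Face, and a ~24 GB GPU (the model table lists 23.9 GB with diffusers); the prompt text and output filename are placeholders.

```python
import imageio
import numpy as np
import torch
from diffusers import CogVideoXPipeline

# Load the 2B model in FP16 on a single GPU.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")

prompt = "A panda playing a guitar in a sunlit bamboo forest, cinematic lighting."  # placeholder prompt
video = pipe(
    prompt=prompt,
    num_inference_steps=50,   # default used by the CLI demo
    guidance_scale=6.0,       # default used by the CLI demo
    num_videos_per_prompt=1,
).frames[0]

# Write the frames with imageio, mirroring export_to_video_imageio in the demo scripts
# (used there to avoid the "green screen" issue some encoders produce).
writer = imageio.get_writer("output.mp4", fps=8)  # 6-second clip at 8 fps, 720*480
for frame in video:
    writer.append_data(np.array(frame))
writer.close()
```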
+ ## Project Structure This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples diff --git a/README_ja.md b/README_ja.md index f697e48..de7dabf 100644 --- a/README_ja.md +++ b/README_ja.md @@ -21,10 +21,13 @@ ## 曎新ずニュヌス -- 🔥 **ニュヌス**: ```2024/8/7```: CogVideoX は `diffusers` バヌゞョン 0.30.0 に統合されたした。単䞀の 3090 GPU で掚論を実行できたす。詳现に぀いおは [コヌド](inference/cli_demo.py) を参照しおください。 +- 🔥 **ニュヌス**: ```2024/8/7```: CogVideoX は `diffusers` バヌゞョン 0.30.0 に統合されたした。単䞀の 3090 GPU + で掚論を実行できたす。詳现に぀いおは [コヌド](inference/cli_demo.py) を参照しおください。 - 🔥 **ニュヌス**: ```2024/8/6```: **CogVideoX-2B** で䜿甚される **3D Causal VAE** もオヌプン゜ヌス化したした。これにより、ビデオをほが無損倱で再構築できたす。 - 🔥 **ニュヌス**: ```2024/8/6```: **CogVideoX-2B**、CogVideoXシリヌズのビデオ生成モデルの最初のモデルをオヌプン゜ヌス化したした。 -- 🌱 **゜ヌス**: ```2022/5/19```: **CogVideo** (珟圚 `CogVideo` ブランチで確認できたす) をオヌプン゜ヌス化したした。これは、最初のオヌプン゜ヌスの事前孊習枈みテキストからビデオ生成モデルであり、技術的な詳现に぀いおは [ICLR'23 CogVideo 論文](https://arxiv.org/abs/2205.15868) をご芧ください。 +- 🌱 **゜ヌス**: ```2022/5/19```: **CogVideo** (珟圚 `CogVideo` ブランチで確認できたす) + をオヌプン゜ヌス化したした。これは、最初のオヌプン゜ヌスの事前孊習枈みテキストからビデオ生成モデルであり、技術的な詳现に぀いおは [ICLR'23 CogVideo 論文](https://arxiv.org/abs/2205.15868) + をご芧ください。 **より匷力なモデルが、より倧きなパラメヌタサむズで登堎予定です。お楜しみに** @@ -50,11 +53,13 @@ ### プロンプトの最適化 -モデルを実行する前に、[このガむド](inference/convert_demo.py) を参照しお、GLM-4 モデルを䜿甚しおプロンプトを最適化する方法を確認しおください。これは重芁です。モデルは長いプロンプトでトレヌニングされおいるため、良いプロンプトは生成されるビデオの品質に盎接圱響したす。 +モデルを実行する前に、[こちら](inference/convert_demo.py) +を参考にしお、GLM-4たたは同等の補品、䟋えばGPT-4の倧芏暡モデルを䜿甚しおどのようにモデルを最適化するかをご確認ください。これは非垞に重芁です。モデルは長いプロンプトでトレヌニングされおいるため、良いプロンプトがビデオ生成の品質に盎接圱響を䞎えたす。 ### SAT -[sat_demo](sat/README.md) の指瀺に埓っおください: SATりェむトの掚論コヌドず埮調敎コヌドが含たれおいたす。CogVideoXモデル構造に基づいお改善するこずをお勧めしたす。革新的な研究者は、このコヌドを䜿甚しお迅速なスタッキングず開発を行うこずができたす。 +[sat_demo](sat/README.md) の指瀺に埓っおください: +SATりェむトの掚論コヌドず埮調敎コヌドが含たれおいたす。CogVideoXモデル構造に基づいお改善するこずをお勧めしたす。革新的な研究者は、このコヌドを䜿甚しお迅速なスタッキングず開発を行うこずができたす。 (掚論には18GB、lora埮調敎には40GBが必芁です) ### Diffusers @@ -94,19 +99,26 @@ CogVideoXは、[枅圱](https://chatglm.cn/video?fr=osm_cogvideox) ず同源の 以䞋の衚は、珟圚提䟛しおいるビデオ生成モデルのリストず関連する基本情報を瀺しおいたす: -| モデル名 | CogVideoX-2B | -|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| プロンプト蚀語 | 英語 | -| 単䞀GPU掚論 (FP16) | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)
23.9GB using diffusers | +| モデル名 | CogVideoX-2B | +|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| プロンプト蚀語 | 英語 | +| 単䞀GPU掚論 (FP16) | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)
23.9GB using diffusers | | 耇数GPU掚論 (FP16) | 20GB minimum per GPU using diffusers | -| 埮調敎に必芁なGPUメモリ(bs=1) | 40GB | -| プロンプトの最倧長 | 226 トヌクン | -| ビデオの長さ | 6秒 | -| フレヌムレヌト | 8フレヌム | -| 解像床 | 720 * 480 | -| 量子化掚論 | サポヌトされおいたせん | -| ダりンロヌドリンク (HF diffusers モデル) | 🀗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🀖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) | -| ダりンロヌドリンク (SAT モデル) | [SAT](./sat/README.md) | +| 埮調敎に必芁なGPUメモリ(bs=1) | 40GB | +| プロンプトの最倧長 | 226 トヌクン | +| ビデオの長さ | 6秒 | +| フレヌムレヌト | 8フレヌム | +| 解像床 | 720 * 480 | +| 量子化掚論 | サポヌトされおいたせん | +| ダりンロヌドリンク (HF diffusers モデル) | 🀗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🀖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [💫 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) | +| ダりンロヌドリンク (SAT モデル) | [SAT](./sat/README.md) | + +## 友奜的リンク + +コミュニティからの貢献を倧歓迎し、私たちもオヌプン゜ヌスコミュニティに積極的に貢献しおいたす。以䞋の䜜品はすでにCogVideoXに察応しおおり、ぜひご利甚ください + ++ [Xorbits Inference](https://github.com/xorbitsai/inference): + 匷力で包括的な分散掚論フレヌムワヌクであり、ワンクリックで独自のモデルや最新のオヌプン゜ヌスモデルを簡単にデプロむできたす。 ## プロゞェクト構造 @@ -116,14 +128,17 @@ CogVideoXは、[枅圱](https://chatglm.cn/video?fr=osm_cogvideox) ず同源の + [diffusers_demo](inference/cli_demo.py): 掚論コヌドの詳现な説明が含たれおおり、䞀般的なパラメヌタの意味に぀いおも蚀及しおいたす。 + [diffusers_vae_demo](inference/cli_vae_demo.py): VAE掚論コヌドの実行には珟圚71GBのメモリが必芁ですが、将来的には最適化される予定です。 -+ [convert_demo](inference/convert_demo.py): ナヌザヌ入力をCogVideoXに適した圢匏に倉換する方法。CogVideoXは長いキャプションでトレヌニングされおいるため、入力テキストをLLMを䜿甚しおトレヌニング分垃ず䞀臎させる必芁がありたす。デフォルトではGLM4を䜿甚したすが、GPT、Geminiなどの他のLLMに眮き換えるこずもできたす。 -+ [gradio_web_demo](inference/gradio_web_demo.py): CogVideoX-2Bモデルを䜿甚しおビデオを生成する方法を瀺すシンプルなgradio Web UI。 ++ [convert_demo](inference/convert_demo.py): + ナヌザヌ入力をCogVideoXに適した圢匏に倉換する方法。CogVideoXは長いキャプションでトレヌニングされおいるため、入力テキストをLLMを䜿甚しおトレヌニング分垃ず䞀臎させる必芁がありたす。デフォルトではGLM4を䜿甚したすが、GPT、Geminiなどの他のLLMに眮き換えるこずもできたす。 ++ [gradio_web_demo](inference/gradio_web_demo.py): CogVideoX-2Bモデルを䜿甚しおビデオを生成する方法を瀺すシンプルなgradio + Web UI。
-+ [streamlit_web_demo](inference/streamlit_web_demo.py): CogVideoX-2Bモデルを䜿甚しおビデオを生成する方法を瀺すシンプルなstreamlit Webアプリケヌション。 ++ [streamlit_web_demo](inference/streamlit_web_demo.py): CogVideoX-2Bモデルを䜿甚しおビデオを生成する方法を瀺すシンプルなstreamlit + Webアプリケヌション。
@@ -131,13 +146,14 @@ CogVideoXは、[枅圱](https://chatglm.cn/video?fr=osm_cogvideox) ず同源の ### sat -+ [sat_demo](sat/README.md): SATりェむトの掚論コヌドず埮調敎コヌドが含たれおいたす。CogVideoXモデル構造に基づいお改善するこずをお勧めしたす。革新的な研究者は、このコヌドを䜿甚しお迅速なスタッキングず開発を行うこずができたす。 ++ [sat_demo](sat/README.md): + SATりェむトの掚論コヌドず埮調敎コヌドが含たれおいたす。CogVideoXモデル構造に基づいお改善するこずをお勧めしたす。革新的な研究者は、このコヌドを䜿甚しお迅速なスタッキングず開発を行うこずができたす。 ### ツヌル このフォルダには、モデル倉換/キャプション生成などのツヌルが含たれおいたす。 -+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): SATモデルのりェむトをHuggingfaceモデルのりェむトに倉換したす。 ++ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): SATモデルのりェむトをHuggingfaceモデルのりェむトに倉換したす。 + [caption_demo](tools/caption): キャプションツヌル、ビデオを理解し、テキストで出力するモデル。 ## プロゞェクト蚈画 @@ -161,7 +177,9 @@ CogVideoXは、[枅圱](https://chatglm.cn/video?fr=osm_cogvideox) ず同源の モデルのりェむトず実装コヌドは [CogVideoX LICENSE](MODEL_LICENSE) の䞋で公開されおいたす。 ## CogVideo(ICLR'23) -論文の公匏リポゞトリ: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) は [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo) にありたす。 + +論文の公匏リポゞトリ: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) +は [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo) にありたす。 **CogVideoは比范的高フレヌムレヌトのビデオを生成するこずができたす。** 32フレヌムの4秒間のクリップが以䞋に瀺されおいたす。 @@ -174,8 +192,8 @@ CogVideoXは、[枅圱](https://chatglm.cn/video?fr=osm_cogvideox) ず同源の
-CogVideoのデモは [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/) で䜓隓できたす。*元の入力は䞭囜語です。* - +CogVideoのデモは [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/) で䜓隓できたす。 +*元の入力は䞭囜語です。* ## 匕甚 diff --git a/README_zh.md b/README_zh.md index 410a443..33025d1 100644 --- a/README_zh.md +++ b/README_zh.md @@ -22,7 +22,8 @@ ## 项目曎新 -- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers` 0.30.0版本单匠3090可以掚理诊情请见[代码](inference/cli_demo.py)。 +- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers` + 0.30.0版本单匠3090可以掚理诊情请见[代码](inference/cli_demo.py)。 - 🔥 **News**: ```2024/8/6```: 我们匀源 **3D Causal VAE**甚于 **CogVideoX-2B**可以几乎无损地重构视频。 - 🔥 **News**: ```2024/8/6```: 我们匀源 CogVideoX 系列视频生成暡型的第䞀䞪暡型, **CogVideoX-2B**。 - 🌱 **Source**: ```2022/5/19```: 我们匀源了 CogVideo 视频生成暡型现圚䜠可以圚 `CogVideo` 分支䞭看到这是銖䞪匀源的基于 @@ -51,8 +52,8 @@ ### 提瀺词䌘化 -圚匀始运行暡型之前请参考[这里](inference/convert_demo.py) 查看我们是怎么䜿甚GLM-4倧暡型对暡型进行䌘化的这埈重芁 -由于暡型是圚长提瀺词䞋训练的䞀额奜的盎接圱响了视频生成的莚量。 +圚匀始运行暡型之前请参考[这里](inference/convert_demo.py) 查看我们是怎么䜿甚GLM-4(或者同级别的其他产品䟋劂GPT-4)倧暡型对暡型进行䌘化的这埈重芁 +由于暡型是圚长提瀺词䞋训练的䞀䞪奜的提瀺词盎接圱响了视频生成的莚量。 ### SAT @@ -96,19 +97,25 @@ CogVideoX是 [枅圱](https://chatglm.cn/video?fr=osm_cogvideox) 同源的匀源 䞋衚展瀺目前我们提䟛的视频生成暡型列衚以及盞关基础信息: -| 暡型名 | CogVideoX-2B | -|---------------------|-------------------------------------------------------------------------------------------------------------------------------| -| 提瀺词语蚀 | English | -| 单GPU掚理 (FP-16) 星存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)
23.9GB using diffusers | -| 倚GPU掚理 (FP-16) 星存消耗 | 20GB minimum per GPU using diffusers | -| 埮调星存消耗 (bs=1) | 42GB | -| 提瀺词长床䞊限 | 226 Tokens | -| 视频长床 | 6 seconds | -| 垧率每秒 | 8 frames | -| 视频分蟚率 | 720 * 480 | -| 量化掚理 | 䞍支持 | +| 暡型名 | CogVideoX-2B | +|---------------------|---------------------------------------------------------------------------------------------------------------------------------| +| 提瀺词语蚀 | English | +| 单GPU掚理 (FP-16) 星存消耗 | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)
23.9GB using diffusers | +| 倚GPU掚理 (FP-16) 星存消耗 | 20GB minimum per GPU using diffusers | +| 埮调星存消耗 (bs=1) | 42GB | +| 提瀺词长床䞊限 | 226 Tokens | +| 视频长床 | 6 seconds | +| 垧率每秒 | 8 frames | +| 视频分蟚率 | 720 * 480 | +| 量化掚理 | 䞍支持 | | 䞋蜜地址 (Diffusers æš¡åž‹) | 🀗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🀖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) | -| 䞋蜜地址 (SAT æš¡åž‹) | [SAT](./sat/README_zh.md) | +| 䞋蜜地址 (SAT æš¡åž‹) | [SAT](./sat/README_zh.md) | + +## 友情铟接 + +我们非垞欢迎来自瀟区的莡献并积极的莡献匀源瀟区。以䞋䜜品已经对CogVideoX进行了适配欢迎倧家䜿甚: + ++ [Xorbits Inference](https://github.com/xorbitsai/inference): 性胜区倧䞔功胜党面的分垃匏掚理框架蜻束䞀键郚眲䜠自己的暡型或内眮的前沿匀源暡型。 ## 完敎项目代码结构 diff --git a/inference/cli_demo.py b/inference/cli_demo.py index d069f02..8b0813e 100644 --- a/inference/cli_demo.py +++ b/inference/cli_demo.py @@ -22,7 +22,7 @@ from diffusers import CogVideoXPipeline def export_to_video_imageio( - video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8 + video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8 ) -> str: """ Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX) @@ -38,14 +38,14 @@ def export_to_video_imageio( def generate_video( - prompt: str, - model_path: str, - output_path: str = "./output.mp4", - num_inference_steps: int = 50, - guidance_scale: float = 6.0, - num_videos_per_prompt: int = 1, - device: str = "cuda", - dtype: torch.dtype = torch.float16, + prompt: str, + model_path: str, + output_path: str = "./output.mp4", + num_inference_steps: int = 50, + guidance_scale: float = 6.0, + num_videos_per_prompt: int = 1, + device: str = "cuda", + dtype: torch.dtype = torch.float16, ): """ Generates a video based on the given prompt and saves it to the specified path. diff --git a/inference/cli_vae_demo.py b/inference/cli_vae_demo.py index b133f20..18b9a95 100644 --- a/inference/cli_vae_demo.py +++ b/inference/cli_vae_demo.py @@ -1,14 +1,24 @@ """ -This script demonstrates how to encode video frames using a pre-trained CogVideoX model with 🀗 Huggingface Diffusers. +This script is designed to demonstrate how to use the CogVideoX-2b VAE model for video encoding and decoding. +It allows you to encode a video into a latent representation, decode it back into a video, or perform both operations sequentially. +Before running the script, make sure to clone the CogVideoX Hugging Face model repository and set the `{your local diffusers path}` argument to the path of the cloned repository. -Note: - This script requires the `diffusers>=0.30.0` library to be installed. - If the video appears “completely green” and cannot be viewed, please switch to a different player to watch it. This is a normal phenomenon. - Cost 71GB of GPU memory for encoding a 6s video at 720p resolution. +Command 1: Encoding Video +Encodes the video located at ../resources/videos/1.mp4 using the CogVideoX-2b VAE model. +Memory Usage: ~34GB of GPU memory for encoding. +If you do not have enough GPU memory, we provide a pre-encoded tensor file (encoded.pt) in the resources folder and you can still run the decoding command. 
+$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --video_path ../resources/videos/1.mp4 --mode encode -Run the script: - $ python cli_demo.py --model_path THUDM/CogVideoX-2b --video_path path/to/video.mp4 --output_path path/to/output +Command 2: Decoding Video +Decodes the latent representation stored in encoded.pt back into a video. +Memory Usage: ~19GB of GPU memory for decoding. +$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --encoded_path ./encoded.pt --mode decode + +Command 3: Encoding and Decoding Video +Encodes the video located at ../resources/videos/1.mp4 and then immediately decodes it. +Memory Usage: 34GB for encoding + 19GB for decoding (sequentially). +$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --video_path ../resources/videos/1.mp4 --mode both """ import argparse @@ -19,7 +29,7 @@ from diffusers import AutoencoderKLCogVideoX from torchvision import transforms -def vae_demo(model_path, video_path, dtype, device): +def encode_video(model_path, video_path, dtype, device): """ Loads a pre-trained AutoencoderKLCogVideoX model and encodes the video frames. @@ -32,50 +42,58 @@ def vae_demo(model_path, video_path, dtype, device): Returns: - torch.Tensor: The encoded video frames. """ - # Load the pre-trained model model = AutoencoderKLCogVideoX.from_pretrained(model_path, torch_dtype=dtype).to(device) - - # Load video frames video_reader = imageio.get_reader(video_path, "ffmpeg") - frames = [] - for frame in video_reader: - frames.append(frame) + + frames = [transforms.ToTensor()(frame) for frame in video_reader] video_reader.close() - # Transform frames to Tensor - transform = transforms.Compose( - [ - transforms.ToTensor(), - ] - ) - frames_tensor = torch.stack([transform(frame) for frame in frames]).to(device) + frames_tensor = torch.stack(frames).to(device).permute(1, 0, 2, 3).unsqueeze(0).to(dtype) - # Add batch dimension and reshape to [1, 3, 49, 480, 720] - frames_tensor = frames_tensor.permute(1, 0, 2, 3).unsqueeze(0).to(dtype).to(device) - - # Run the model with Encoder and Decoder with torch.no_grad(): - output = model(frames_tensor) + encoded_frames = model.encode(frames_tensor)[0].sample() + return encoded_frames - return output + +def decode_video(model_path, encoded_tensor_path, dtype, device): + """ + Loads a pre-trained AutoencoderKLCogVideoX model and decodes the encoded video frames. + + Parameters: + - model_path (str): The path to the pre-trained model. + - encoded_tensor_path (str): The path to the encoded tensor file. + - dtype (torch.dtype): The data type for computation. + - device (str): The device to use for computation (e.g., "cuda" or "cpu"). + + Returns: + - torch.Tensor: The decoded video frames. + """ + model = AutoencoderKLCogVideoX.from_pretrained(model_path, torch_dtype=dtype).to(device) + encoded_frames = torch.load(encoded_tensor_path, weights_only=True).to(device).to(dtype) + with torch.no_grad(): + decoded_frames = [] + for i in range(6): # 6 seconds + start_frame, end_frame = (0, 3) if i == 0 else (2 * i + 1, 2 * i + 3) + current_frames = model.decode(encoded_frames[:, :, start_frame:end_frame]).sample + decoded_frames.append(current_frames) + model.clear_fake_context_parallel_cache() + + decoded_frames = torch.cat(decoded_frames, dim=2) + return decoded_frames def save_video(tensor, output_path): """ - Saves the encoded video frames to a video file. + Saves the video frames to a video file. 
Parameters: - - tensor (torch.Tensor): The encoded video frames. + - tensor (torch.Tensor): The video frames tensor. - output_path (str): The path to save the output video. """ - # Remove batch dimension and permute back to [49, 480, 720, 3] frames = tensor[0].squeeze(0).permute(1, 2, 3, 0).cpu().numpy() + frames = np.clip(frames, 0, 1) * 255 + frames = frames.astype(np.uint8) - # Clip values to [0, 1] and convert to uint8 - frames = np.clip(frames, 0, 1) - frames = (frames * 255).astype(np.uint8) - - # Save frames to video writer = imageio.get_writer(output_path + "/output.mp4", fps=30) for frame in frames: writer.append_data(frame) @@ -83,10 +101,14 @@ if __name__ == "__main__": - parser = argparse.ArgumentParser(description="Convert a CogVideoX model to Diffusers") + parser = argparse.ArgumentParser(description="CogVideoX encode/decode demo") parser.add_argument("--model_path", type=str, required=True, help="The path to the CogVideoX model") - parser.add_argument("--video_path", type=str, required=True, help="The path to the video file") - parser.add_argument("--output_path", type=str, default="./", help="The path to save the output video") + parser.add_argument("--video_path", type=str, help="The path to the video file (for encoding)") + parser.add_argument("--encoded_path", type=str, help="The path to the encoded tensor file (for decoding)") + parser.add_argument("--output_path", type=str, default=".", help="The path to save the output file") + parser.add_argument( + "--mode", type=str, choices=["encode", "decode", "both"], required=True, help="Mode: encode, decode, or both" + ) parser.add_argument( "--dtype", type=str, default="float16", help="The data type for computation (e.g., 'float16' or 'float32')" ) @@ -95,9 +117,22 @@ if __name__ == "__main__": ) args = parser.parse_args() - # Set device and dtype device = torch.device(args.device) dtype = torch.float16 if args.dtype == "float16" else torch.float32 - output = vae_demo(args.model_path, args.video_path, dtype, device) - save_video(output, args.output_path) + if args.mode == "encode": + assert args.video_path, "Video path must be provided for encoding." + encoded_output = encode_video(args.model_path, args.video_path, dtype, device) + torch.save(encoded_output, args.output_path + "/encoded.pt") + print(f"Finished encoding the video to a tensor and saved it to {args.output_path}/encoded.pt") + elif args.mode == "decode": + assert args.encoded_path, "Encoded tensor path must be provided for decoding." + decoded_output = decode_video(args.model_path, args.encoded_path, dtype, device) + save_video(decoded_output, args.output_path) + print(f"Finished decoding the video and saved it to {args.output_path}/output.mp4") + elif args.mode == "both": + assert args.video_path, "Video path must be provided for encoding." + encoded_output = encode_video(args.model_path, args.video_path, dtype, device) + torch.save(encoded_output, args.output_path + "/encoded.pt") + decoded_output = decode_video(args.model_path, args.output_path + "/encoded.pt", dtype, device) + save_video(decoded_output, args.output_path) diff --git a/inference/encoded.pt b/inference/encoded.pt new file mode 100644 index 0000000..4367fec Binary files /dev/null and b/inference/encoded.pt differ diff --git a/inference/gradio_web_demo.py b/inference/gradio_web_demo.py index 4b4cad0..b81c0ba 100644 --- a/inference/gradio_web_demo.py +++ b/inference/gradio_web_demo.py @@ -34,7 +34,7 @@ Video descriptions must have the same num of words as examples below. 

Extra word def export_to_video_imageio( - video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8 + video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8 ) -> str: """ Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX) @@ -62,20 +62,34 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str: response = client.chat.completions.create( messages=[ {"role": "system", "content": sys_prompt}, - {"role": "user", - "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "a girl is on the beach"'}, - {"role": "assistant", - "content": "A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat, white blouse, light blue jeans, and chic boots, against a backdrop of soft sky and sea. Moments later, she is seen mid-twirl, arms exuberant, with the lighting suggesting dawn or dusk. Then, she runs along the beach, her attire complemented by an off-white scarf and black ankle boots, the tranquil sea behind her. Finally, she holds a paper airplane, her pose reflecting joy and freedom, with the ocean's gentle waves and the sky's soft pastel hues enhancing the serene ambiance."}, - {"role": "user", - "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "A man jogging on a football field"'}, - {"role": "assistant", - "content": "A determined man in athletic attire, including a blue long-sleeve shirt, black shorts, and blue socks, jogs around a snow-covered soccer field, showcasing his solitary exercise in a quiet, overcast setting. His long dreadlocks, focused expression, and the serene winter backdrop highlight his dedication to fitness. As he moves, his attire, consisting of a blue sports sweatshirt, black athletic pants, gloves, and sneakers, grips the snowy ground. He is seen running past a chain-link fence enclosing the playground area, with a basketball hoop and children's slide, suggesting a moment of solitary exercise amidst the empty field."}, - {"role": "user", - "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : " A woman is dancing, HD footage, close-up"'}, - {"role": "assistant", - "content": "A young woman with her hair in an updo and wearing a teal hoodie stands against a light backdrop, initially looking over her shoulder with a contemplative expression. She then confidently makes a subtle dance move, suggesting rhythm and movement. Next, she appears poised and focused, looking directly at the camera. Her expression shifts to one of introspection as she gazes downward slightly. Finally, she dances with confidence, her left hand over her heart, symbolizing a poignant moment, all while dressed in the same teal hoodie against a plain, light-colored background."}, - {"role": "user", - "content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"'}, + { + "role": "user", + "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "a girl is on the beach"', + }, + { + "role": "assistant", + "content": "A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat, white blouse, light blue jeans, and chic boots, against a backdrop of soft sky and sea. 
Moments later, she is seen mid-twirl, arms exuberant, with the lighting suggesting dawn or dusk. Then, she runs along the beach, her attire complemented by an off-white scarf and black ankle boots, the tranquil sea behind her. Finally, she holds a paper airplane, her pose reflecting joy and freedom, with the ocean's gentle waves and the sky's soft pastel hues enhancing the serene ambiance.", + }, + { + "role": "user", + "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "A man jogging on a football field"', + }, + { + "role": "assistant", + "content": "A determined man in athletic attire, including a blue long-sleeve shirt, black shorts, and blue socks, jogs around a snow-covered soccer field, showcasing his solitary exercise in a quiet, overcast setting. His long dreadlocks, focused expression, and the serene winter backdrop highlight his dedication to fitness. As he moves, his attire, consisting of a blue sports sweatshirt, black athletic pants, gloves, and sneakers, grips the snowy ground. He is seen running past a chain-link fence enclosing the playground area, with a basketball hoop and children's slide, suggesting a moment of solitary exercise amidst the empty field.", + }, + { + "role": "user", + "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : " A woman is dancing, HD footage, close-up"', + }, + { + "role": "assistant", + "content": "A young woman with her hair in an updo and wearing a teal hoodie stands against a light backdrop, initially looking over her shoulder with a contemplative expression. She then confidently makes a subtle dance move, suggesting rhythm and movement. Next, she appears poised and focused, looking directly at the camera. Her expression shifts to one of introspection as she gazes downward slightly. 
Finally, she dances with confidence, her left hand over her heart, symbolizing a poignant moment, all while dressed in the same teal hoodie against a plain, light-colored background.", + }, + { + "role": "user", + "content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"', + }, ], model="glm-4-0520", temperature=0.01, @@ -88,12 +102,7 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str: return prompt -def infer( - prompt: str, - num_inference_steps: int, - guidance_scale: float, - progress=gr.Progress(track_tqdm=True) -): +def infer(prompt: str, num_inference_steps: int, guidance_scale: float, progress=gr.Progress(track_tqdm=True)): torch.cuda.empty_cache() prompt_embeds, _ = pipe.encode_prompt( @@ -113,7 +122,6 @@ def infer( negative_prompt_embeds=torch.zeros_like(prompt_embeds), ).frames[0] - return video @@ -124,11 +132,12 @@ def save_video(tensor): export_to_video_imageio(tensor[1:], video_path) return video_path + def convert_to_gif(video_path): clip = mp.VideoFileClip(video_path) clip = clip.set_fps(8) clip = clip.resize(height=240) - gif_path = video_path.replace('.mp4', '.gif') + gif_path = video_path.replace(".mp4", ".gif") clip.write_gif(gif_path, fps=8) return gif_path @@ -137,7 +146,7 @@ def delete_old_files(): while True: now = datetime.now() cutoff = now - timedelta(minutes=10) - output_dir = './output' + output_dir = "./output" for filename in os.listdir(output_dir): file_path = os.path.join(output_dir, filename) if os.path.isfile(file_path): @@ -169,13 +178,16 @@ with gr.Blocks() as demo: prompt = gr.Textbox(label="Prompt (Less than 200 Words)", placeholder="Enter your prompt here", lines=5) with gr.Row(): gr.Markdown( - "✹Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one.") + "✹Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one." + ) enhance_button = gr.Button("✹ Enhance Prompt(Optional)") with gr.Column(): - gr.Markdown("**Optional Parameters** (default values are recommended)
" - "Turn Inference Steps larger if you want more detailed video, but it will be slower.
" - "50 steps are recommended for most cases. will cause 120 seconds for inference.
") + gr.Markdown( + "**Optional Parameters** (default values are recommended)
" + "Turn Inference Steps larger if you want more detailed video, but it will be slower.
" + "50 steps are recommended for most cases. will cause 120 seconds for inference.
" + ) with gr.Row(): num_inference_steps = gr.Number(label="Inference Steps", value=50) guidance_scale = gr.Number(label="Guidance Scale", value=6.0) @@ -222,7 +234,6 @@ with gr.Blocks() as demo: """) - def generate(prompt, num_inference_steps, guidance_scale, progress=gr.Progress(track_tqdm=True)): tensor = infer(prompt, num_inference_steps, guidance_scale, progress=progress) video_path = save_video(tensor) @@ -232,22 +243,16 @@ with gr.Blocks() as demo: return video_path, video_update, gif_update - def enhance_prompt_func(prompt): return convert_prompt(prompt, retry_times=1) - generate_button.click( generate, inputs=[prompt, num_inference_steps, guidance_scale], - outputs=[video_output, download_video_button, download_gif_button] + outputs=[video_output, download_video_button, download_gif_button], ) - enhance_button.click( - enhance_prompt_func, - inputs=[prompt], - outputs=[prompt] - ) + enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt]) if __name__ == "__main__": demo.launch(server_name="127.0.0.1", server_port=7870, share=True) diff --git a/sat/README_ja.md b/sat/README_ja.md index de5def7..deb830f 100644 --- a/sat/README_ja.md +++ b/sat/README_ja.md @@ -1,6 +1,7 @@ # SAT CogVideoX-2B -このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer) りェむトを䜿甚した掚論コヌドず、SAT りェむトのファむンチュヌニングコヌドが含たれおいたす。 +このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer) りェむトを䜿甚した掚論コヌドず、SAT +りェむトのファむンチュヌニングコヌドが含たれおいたす。 このコヌドは、チヌムがモデルをトレヌニングするために䜿甚したフレヌムワヌクです。コメントが少なく、泚意深く研究する必芁がありたす。 @@ -86,7 +87,9 @@ first_stage_config: ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE モデルパス ``` -+ 耇数のプロンプトを保存するために txt を䜿甚する堎合は、`configs/test.txt` を参照しお倉曎しおください。1行に1぀のプロンプトを蚘述したす。プロンプトの曞き方がわからない堎合は、最初に [このコヌド](../inference/convert_demo.py) を䜿甚しお LLM によるリファむンメントを呌び出すこずができたす。 ++ 耇数のプロンプトを保存するために txt を䜿甚する堎合は、`configs/test.txt` + を参照しお倉曎しおください。1行に1぀のプロンプトを蚘述したす。プロンプトの曞き方がわからない堎合は、最初に [このコヌド](../inference/convert_demo.py) + を䜿甚しお LLM によるリファむンメントを呌び出すこずができたす。 + コマンドラむンを入力ずしお䜿甚する堎合は、次のように倉曎したす。 ```yaml @@ -113,7 +116,8 @@ bash inference.sh ### 環境の準備 -珟圚、SAT は゜ヌスコヌドからむンストヌルする必芁があり、正垞にファむンチュヌニングを行うためにはこれが必芁です。この問題は将来の安定版で解決される予定です。 +ご泚意ください、珟圚、SATを正垞にファむンチュヌニングするためには、゜ヌスコヌドからむンストヌルする必芁がありたす。 +これは、ただpipパッケヌゞバヌゞョンにリリヌスされおいない最新の機胜を䜿甚する必芁があるためです。この問題は、今埌の安定版で解決する予定です。 ``` git clone https://github.com/THUDM/SwissArmyTransformer.git @@ -143,7 +147,9 @@ pip install -e . ### 蚭定ファむルの倉曎 -`Lora` ず 党パラメヌタファむンチュヌニングの2぀の方法をサポヌトしおいたす。これらのファむンチュヌニング方法は `transformer` 郚分にのみ適甚されたす。`VAE` 郚分は倉曎されたせん。`T5` ぱンコヌダヌずしおのみ䜿甚されたす。 +`Lora` ず +党パラメヌタファむンチュヌニングの2぀の方法をサポヌトしおいたす。これらのファむンチュヌニング方法は `transformer` +郚分にのみ適甚されたす。`VAE` 郚分は倉曎されたせん。`T5` ぱンコヌダヌずしおのみ䜿甚されたす。 `configs/cogvideox_2b_sft.yaml` (党量ファむンチュヌニング甚) を次のように倉曎したす。 @@ -190,7 +196,8 @@ model: 1. 掚論コヌドを実行しおファむンチュヌニングを開始したす。 ```shell -bash finetune.sh +bash finetune_single_gpu.sh # Single GPU +bash finetune_multi_gpus.sh # Multi GPUs ``` ### Huggingface Diffusers サポヌトのりェむトに倉換 diff --git a/sat/README_zh.md b/sat/README_zh.md index 8566313..3335e52 100644 --- a/sat/README_zh.md +++ b/sat/README_zh.md @@ -112,7 +112,6 @@ bash inference.sh ### 准倇环境 - 请泚意目前SAT需芁从源码安装才胜正垞埮调。 这是因䞺䜠需芁䜿甚还没发型到pip包版本的最新代码所支持的功胜。 我们将䌚圚未来的皳定版本解决这䞪问题。