diff --git a/README.md b/README.md
index 0bc7173..9f8122e 100644
--- a/README.md
+++ b/README.md
@@ -171,49 +171,49 @@ models we currently offer, along with their foundational information.
@@ -221,38 +221,37 @@ models we currently offer, along with their foundational information.

Resulting table:

| Model Name | CogVideoX1.5-5B (Latest) | CogVideoX1.5-5B-I2V (Latest) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|---|---|
| Release Date | November 8, 2024 | November 8, 2024 | August 6, 2024 | August 27, 2024 | September 19, 2024 |
| Video Resolution | 1360 * 768 | 256 <= W <= 1360<br>256 <= H <= 768<br>W,H % 16 == 0 | 720 * 480 | 720 * 480 | 720 * 480 |
| Inference Precision | BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4 | FP16* (Recommended), BF16, FP32, FP8*, INT8, Not supported: INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8, Not supported: INT4 |
| Single GPU Memory Usage | SAT BF16: 66GB | SAT BF16: 66GB | SAT FP16: 18GB<br>diffusers FP16: 4GB minimum*<br>diffusers INT8 (torchao): 3.6GB minimum* | SAT BF16: 26GB<br>diffusers BF16: 5GB minimum*<br>diffusers INT8 (torchao): 4.4GB minimum* | SAT BF16: 26GB<br>diffusers BF16: 5GB minimum*<br>diffusers INT8 (torchao): 4.4GB minimum* |
| Multi-GPU Memory Usage | Not supported | Not supported | FP16: 10GB* using diffusers | BF16: 15GB* using diffusers | BF16: 15GB* using diffusers |
| Inference Speed<br>(Step = 50, FP/BF16) | Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video) | Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video) | Single A100: ~90 seconds<br>Single H100: ~45 seconds | Single A100: ~180 seconds<br>Single H100: ~90 seconds | Single A100: ~180 seconds<br>Single H100: ~90 seconds |
| Prompt Language | English* | English* | English* | English* | English* |
| Prompt Token Limit | 224 Tokens | 224 Tokens | 226 Tokens | 226 Tokens | 226 Tokens |
| Video Length | 5 or 10 seconds | 5 or 10 seconds | 6 seconds | 6 seconds | 6 seconds |
| Frame Rate | 16 frames / second | 16 frames / second | 8 frames / second | 8 frames / second | 8 frames / second |
| Position Encoding | 3d_rope_pos_embed | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
| Download Link (Diffusers) | Coming Soon | Coming Soon | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel |
| Download Link (SAT) | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | SAT | SAT | SAT |
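The CogVideoX1.5-5B-I2V row constrains input resolution to 256 <= W <= 1360 and 256 <= H <= 768, with both dimensions divisible by 16. A minimal sketch of that check (the helper name is hypothetical, not part of the repo):

```python
def is_valid_i2v_resolution(width: int, height: int) -> bool:
    """Check a resolution against the CogVideoX1.5-5B-I2V constraints
    from the table: 256 <= W <= 1360, 256 <= H <= 768, W,H % 16 == 0."""
    return (
        256 <= width <= 1360
        and 256 <= height <= 768
        and width % 16 == 0
        and height % 16 == 0
    )

# The CogVideoX1.5-5B text-to-video resolution satisfies the constraints:
print(is_valid_i2v_resolution(1360, 768))   # True
# 1920 * 1080 exceeds both the width and height caps:
print(is_valid_i2v_resolution(1920, 1080))  # False
```

Note that the older models' fixed 720 * 480 also passes this check, since both dimensions are multiples of 16 and within range.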
diff --git a/README_ja.md b/README_ja.md
index 1bc3d13..9962d1b 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -163,88 +163,87 @@ CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源の

Resulting table:

| モデル名 | CogVideoX1.5-5B (最新) | CogVideoX1.5-5B-I2V (最新) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|---|---|
| 公開日 | 2024年11月8日 | 2024年11月8日 | 2024年8月6日 | 2024年8月27日 | 2024年9月19日 |
| ビデオ解像度 | 1360 * 768 | 256 <= W <= 1360<br>256 <= H <= 768<br>W,H % 16 == 0 | 720 * 480 | 720 * 480 | 720 * 480 |
| 推論精度 | BF16(推奨), FP16, FP32, FP8*, INT8, INT4非対応 | BF16(推奨), FP16, FP32, FP8*, INT8, INT4非対応 | FP16*(推奨), BF16, FP32, FP8*, INT8, INT4非対応 | BF16(推奨), FP16, FP32, FP8*, INT8, INT4非対応 | BF16(推奨), FP16, FP32, FP8*, INT8, INT4非対応 |
| 単一GPUメモリ消費量 | SAT BF16: 66GB | SAT BF16: 66GB | SAT FP16: 18GB<br>diffusers FP16: 4GB以上*<br>diffusers INT8(torchao): 3.6GB以上* | SAT BF16: 26GB<br>diffusers BF16: 5GB以上*<br>diffusers INT8(torchao): 4.4GB以上* | SAT BF16: 26GB<br>diffusers BF16: 5GB以上*<br>diffusers INT8(torchao): 4.4GB以上* |
| 複数GPU推論メモリ消費量 | 非対応 | 非対応 | FP16: 10GB* diffusers使用 | BF16: 15GB* diffusers使用 | BF16: 15GB* diffusers使用 |
| 推論速度<br>(ステップ数 = 50, FP/BF16) | 単一A100: 約1000秒(5秒動画)<br>単一H100: 約550秒(5秒動画) | 単一A100: 約1000秒(5秒動画)<br>単一H100: 約550秒(5秒動画) | 単一A100: 約90秒<br>単一H100: 約45秒 | 単一A100: 約180秒<br>単一H100: 約90秒 | 単一A100: 約180秒<br>単一H100: 約90秒 |
| プロンプト言語 | 英語* | 英語* | 英語* | 英語* | 英語* |
| プロンプト長さの上限 | 224トークン | 224トークン | 226トークン | 226トークン | 226トークン |
| ビデオ長さ | 5秒または10秒 | 5秒または10秒 | 6秒 | 6秒 | 6秒 |
| フレームレート | 16フレーム/秒 | 16フレーム/秒 | 8フレーム/秒 | 8フレーム/秒 | 8フレーム/秒 |
| 位置エンコーディング | 3d_rope_pos_embed | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
| ダウンロードリンク (Diffusers) | 近日公開 | 近日公開 | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel |
| ダウンロードリンク (SAT) | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | SAT | SAT | SAT |
diff --git a/README_zh.md b/README_zh.md
index a88cc36..c66fc85 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -154,49 +154,49 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
@@ -204,39 +204,37 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源

Resulting table:

| 模型名 | CogVideoX1.5-5B (最新) | CogVideoX1.5-5B-I2V (最新) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
|---|---|---|---|---|---|
| 发布时间 | 2024年11月8日 | 2024年11月8日 | 2024年8月6日 | 2024年8月27日 | 2024年9月19日 |
| 视频分辨率 | 1360 * 768 | 256 <= W <= 1360<br>256 <= H <= 768<br>W,H % 16 == 0 | 720 * 480 | 720 * 480 | 720 * 480 |
| 推理精度 | BF16(推荐), FP16, FP32, FP8*, INT8, 不支持INT4 | BF16(推荐), FP16, FP32, FP8*, INT8, 不支持INT4 | FP16*(推荐), BF16, FP32, FP8*, INT8, 不支持INT4 | BF16(推荐), FP16, FP32, FP8*, INT8, 不支持INT4 | BF16(推荐), FP16, FP32, FP8*, INT8, 不支持INT4 |
| 单GPU显存消耗 | SAT BF16: 66GB | SAT BF16: 66GB | SAT FP16: 18GB<br>diffusers FP16: 4GB起*<br>diffusers INT8(torchao): 3.6G起* | SAT BF16: 26GB<br>diffusers BF16: 5GB起*<br>diffusers INT8(torchao): 4.4G起* | SAT BF16: 26GB<br>diffusers BF16: 5GB起*<br>diffusers INT8(torchao): 4.4G起* |
| 多GPU推理显存消耗 | 不支持 | 不支持 | FP16: 10GB* using diffusers | BF16: 15GB* using diffusers | BF16: 15GB* using diffusers |
| 推理速度<br>(Step = 50, FP/BF16) | 单卡A100: ~1000秒(5秒视频)<br>单卡H100: ~550秒(5秒视频) | 单卡A100: ~1000秒(5秒视频)<br>单卡H100: ~550秒(5秒视频) | 单卡A100: ~90秒<br>单卡H100: ~45秒 | 单卡A100: ~180秒<br>单卡H100: ~90秒 | 单卡A100: ~180秒<br>单卡H100: ~90秒 |
| 提示词语言 | English* | English* | English* | English* | English* |
| 提示词长度上限 | 224 Tokens | 224 Tokens | 226 Tokens | 226 Tokens | 226 Tokens |
| 视频长度 | 5 秒 或 10 秒 | 5 秒 或 10 秒 | 6 秒 | 6 秒 | 6 秒 |
| 帧率 | 16 帧 / 秒 | 16 帧 / 秒 | 8 帧 / 秒 | 8 帧 / 秒 | 8 帧 / 秒 |
| 位置编码 | 3d_rope_pos_embed | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
| 下载链接 (Diffusers) | 即将推出 | 即将推出 | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel |
| 下载链接 (SAT) | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel | SAT | SAT | SAT |
diff --git a/sat/dit_video_concat.py b/sat/dit_video_concat.py
index b55a3f1..22c3821 100644
--- a/sat/dit_video_concat.py
+++ b/sat/dit_video_concat.py
@@ -7,7 +7,6 @@ import numpy as np
 import torch
 from torch import nn
 import torch.nn.functional as F
-
 from sat.model.base_model import BaseModel, non_conflict
 from sat.model.mixins import BaseMixin
 from sat.transformer_defaults import HOOKS_DEFAULT, attention_fn_default
diff --git a/sat/inference.sh b/sat/inference.sh
index a22ef87..240b4df 100755
--- a/sat/inference.sh
+++ b/sat/inference.sh
@@ -4,7 +4,7 @@
 echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
 
 environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
 
-run_cmd="$environs python sample_video.py --base configs/cogvideox1.5_5b.yaml configs/test_inference.yaml --seed $RANDOM"
+run_cmd="$environs python sample_video.py --base configs/test_cogvideox_5b.yaml configs/test_inference.yaml --seed $RANDOM"
 echo ${run_cmd}
 eval ${run_cmd}
diff --git a/sat/vae_modules/autoencoder.py b/sat/vae_modules/autoencoder.py
index 9642fb4..226d955 100644
--- a/sat/vae_modules/autoencoder.py
+++ b/sat/vae_modules/autoencoder.py
@@ -1,17 +1,13 @@
 import logging
 import math
 import re
-import random
 from abc import abstractmethod
 from contextlib import contextmanager
 from typing import Any, Dict, List, Optional, Tuple, Union
 
-import numpy as np
 import pytorch_lightning as pl
 import torch
 import torch.distributed
-import torch.nn as nn
-from einops import rearrange
 from packaging import version
 
 from vae_modules.ema import LitEma
@@ -56,17 +52,6 @@ class AbstractAutoencoder(pl.LightningModule):
         if version.parse(torch.__version__) >= version.parse("2.0.0"):
             self.automatic_optimization = False
 
-        # def apply_ckpt(self, ckpt: Union[None, str, dict]):
-        #     if ckpt is None:
-        #         return
-        #     if isinstance(ckpt, str):
-        #         ckpt = {
-        #             "target": "sgm.modules.checkpoint.CheckpointEngine",
-        #             "params": {"ckpt_path": ckpt},
-        #         }
-        #     engine = instantiate_from_config(ckpt)
-        #     engine(self)
-    def apply_ckpt(self, ckpt: Union[None, str, dict]):
-        if ckpt is None:
-            return
@@ -85,6 +70,18 @@ class AbstractAutoencoder(pl.LightningModule):
             print("Unexpected keys: ", unexpected_keys)
         print(f"Restored from {path}")
 
+    def apply_ckpt(self, ckpt: Union[None, str, dict]):
+        if ckpt is None:
+            return
+        if isinstance(ckpt, str):
+            ckpt = {
+                "target": "sgm.modules.checkpoint.CheckpointEngine",
+                "params": {"ckpt_path": ckpt},
+            }
+        engine = instantiate_from_config(ckpt)
+        engine(self)
+
+
     @abstractmethod
     def get_input(self, batch) -> Any:
         raise NotImplementedError()
@@ -216,12 +213,13 @@ class AutoencodingEngine(AbstractAutoencoder):
         return self.decoder.get_last_layer()
 
     def encode(
-        self,
-        x: torch.Tensor,
-        return_reg_log: bool = False,
-        unregularized: bool = False,
+        self,
+        x: torch.Tensor,
+        return_reg_log: bool = False,
+        unregularized: bool = False,
+        **kwargs,
     ) -> Union[torch.Tensor, Tuple[torch.Tensor, dict]]:
-        z = self.encoder(x)
+        z = self.encoder(x, **kwargs)
         if unregularized:
             return z, dict()
         z, reg_log = self.regularization(z)
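The relocated `apply_ckpt` wraps a bare checkpoint path into a `{"target": ..., "params": ...}` dict and hands it to `instantiate_from_config`, the SAT/sgm convention where `target` is a dotted import path and `params` are keyword arguments. A minimal standalone sketch of that convention, assuming only the dotted-path behavior (the repo imports its own implementation rather than this one):

```python
import importlib


def instantiate_from_config(config: dict):
    """Sketch of the `target`/`params` convention: resolve the dotted
    import path in `target` and call it with the `params` kwargs.
    Illustrative only; not the repo's actual helper."""
    module_path, attr_name = config["target"].rsplit(".", 1)
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**config.get("params", {}))


# Example with a stdlib target, mirroring how apply_ckpt wraps a plain
# checkpoint path into {"target": ..., "params": {"ckpt_path": path}}:
od = instantiate_from_config({"target": "collections.OrderedDict", "params": {}})
print(type(od).__name__)  # OrderedDict
```

This indirection is why moving `apply_ckpt` below `init_from_ckpt` is purely organizational: dispatch happens through the config dict at call time, not through definition order.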