32 KiB
CogVideo & CogVideoX
ð€ Huggingface Space ãŸã㯠ð€ ModelScope Space 㧠CogVideoX-5B ã¢ãã«ããªã³ã©ã€ã³ã§äœéšããŠãã ãã
ð è«æãšäœ¿çšããã¥ã¡ã³ãã衚瀺ããŸãã
ð WeChat ãš Discord ã«åå
ð æž åœ± ãš APIãã©ãããã©ãŒã ã蚪åããŠããã倧èŠæš¡ãªåçšãããªçæã¢ãã«ãäœéš.
æŽæ°ãšãã¥ãŒã¹
- ð¥ð¥
2025/03/24
: CogKit 㯠CogView4 ããã³ CogVideoX ã·ãªãŒãºã®åŸ®èª¿æŽãšæšè«ã®ããã®ãã¬ãŒã ã¯ãŒã¯ã§ãããã®ããŒã«ãããã掻çšããããšã§ãç§ãã¡ã®ãã«ãã¢ãŒãã«çæã¢ãã«ãæ倧éã«æŽ»çšã§ããŸãã - ãã¥ãŒã¹:
2025/02/28
: DDIM Inverse ãCogVideoX-5B
ãšCogVideoX1.5-5B
ã§ãµããŒããããŸããã詳现㯠ãã¡ã ãã芧ãã ããã - ãã¥ãŒã¹:
2025/01/08
: ç§ãã¡ã¯diffusers
ããŒãžã§ã³ã®ã¢ãã«ãããŒã¹ã«ããLora
埮調æŽçšã®ã³ãŒããæŽæ°ããŸãããããå°ãªãVRAMïŒãããªã¡ã¢ãªïŒã§åäœããŸãã詳现ã«ã€ããŠã¯ãã¡ããã芧ãã ããã - ãã¥ãŒã¹:
2024/11/15
:CogVideoX1.5
ã¢ãã«ã®diffusersããŒãžã§ã³ããªãªãŒã¹ããŸãããããããªãã©ã¡ãŒã¿èª¿æŽã§ä»¥åã®ã³ãŒãããã®ãŸãŸå©çšå¯èœã§ãã - ãã¥ãŒã¹:
2024/11/08
:CogVideoX1.5
ã¢ãã«ããªãªãŒã¹ããŸãããCogVideoX1.5 㯠CogVideoX ãªãŒãã³ãœãŒã¹ã¢ãã«ã®ã¢ããã°ã¬ãŒãããŒãžã§ã³ã§ãã CogVideoX1.5-5B ã·ãªãŒãºã¢ãã«ã¯ã10ç§ é·ã®åç»ãšããé«ã解å床ããµããŒãããŠãããCogVideoX1.5-5B-I2V
ã¯ä»»æã®è§£å床ã§ã®åç»çæã«å¯Ÿå¿ããŠããŸãã SAT ã³ãŒãã¯ãã§ã«æŽæ°ãããŠãããdiffusers
ããŒãžã§ã³ã¯çŸåšé©å¿äžã§ãã SAT ããŒãžã§ã³ã®ã³ãŒã㯠ãã¡ã ããããŠã³ããŒãã§ããŸãã - ð¥ ãã¥ãŒã¹:
2024/10/13
: ã³ã¹ãåæžã®ãããåäžã®4090 GPUã§CogVideoX-5B
ã埮調æŽã§ãããã¬ãŒã ã¯ãŒã¯ cogvideox-factory ããªãªãŒã¹ãããŸãããè€æ°ã®è§£å床ã§ã®åŸ®èª¿æŽã«å¯Ÿå¿ããŠããŸãããã²ãå©çšãã ããïŒ - ð¥ãã¥ãŒã¹:
2024/10/10
: æè¡å ±åæžãæŽæ°ãããã詳现ãªãã¬ãŒãã³ã°æ å ±ãšãã¢ãè¿œå ããŸããã - ð¥ ãã¥ãŒã¹:
2024/10/10
: æè¡å ±åæžãæŽæ°ããŸããããã¡ã ãã¯ãªãã¯ããŠã芧ãã ãããããã«ãã¬ãŒãã³ã°ã®è©³çŽ°ãšãã¢ãè¿œå ããŸããããã¢ãèŠãã«ã¯ãã¡ã ãã¯ãªãã¯ããŠãã ããã - ð¥ãã¥ãŒã¹:
2024/10/09
: é£æžã®æè¡ããã¥ã¡ã³ã ã§CogVideoXã®åŸ®èª¿æŽã¬ã€ããå ¬éããŠããŸããåé ã®èªç±åºŠãããã«é«ãããããå ¬éãããŠããããã¥ã¡ã³ãå ã®ãã¹ãŠã®äŸãå®å šã«åçŸå¯èœã§ãã - ð¥ãã¥ãŒã¹:
2024/9/19
: CogVideoXã·ãªãŒãºã®ç»åçæãããªã¢ãã« CogVideoX-5B-I2V ããªãŒãã³ãœãŒã¹åããŸããããã®ã¢ãã«ã¯ãç»åãèæ¯å ¥åãšããŠäœ¿çšããããã³ããã¯ãŒããšçµã¿åãããŠãããªãçæããããšãã§ããããé«ãå¶åŸ¡æ§ãæäŸããŸããããã«ãããCogVideoXã·ãªãŒãºã®ã¢ãã«ã¯ãããã¹ããããããªçæããããªã®ç¶ç¶ãç»åãããããªçæã®3ã€ã®ã¿ã¹ã¯ããµããŒãããããã«ãªããŸããããªã³ã©ã€ã³ã§ã®äœéš ãã楜ãã¿ãã ããã - ð¥ ãã¥ãŒã¹:
2024/9/19
: CogVideoXã®ãã¬ãŒãã³ã°ããã»ã¹ã§ãããªããŒã¿ãããã¹ãèšè¿°ã«å€æããããã«äœ¿çšããããã£ãã·ã§ã³ã¢ãã« CogVLM2-Caption ããªãŒãã³ãœãŒã¹åããŸãããããŠã³ããŒãããŠãå©çšãã ããã - ð¥
2024/8/27
: CogVideoXã·ãªãŒãºã®ãã倧ããªã¢ãã« CogVideoX-5B ããªãŒãã³ãœãŒã¹åããŸãããã¢ãã«ã®æšè«æ§èœãå€§å¹ ã«æé©åããæšè«ã®ããŒãã«ãå€§å¹ ã«äžããŸãããGTX 1080TI
ãªã©ã®æ§åGPU㧠CogVideoX-2B ããRTX 3060
ãªã©ã®ãã¹ã¯ãããGPU㧠CogVideoX-5B ã¢ãã«ãå®è¡ã§ããŸããäŸåé¢ä¿ãæŽæ°ã»ã€ã³ã¹ããŒã«ããããã«ãèŠä»¶ ãå³å®ããæšè«ã³ãŒã㯠cli_demo ãåç §ããŠãã ãããããã«ãCogVideoX-2B ã¢ãã«ã®ãªãŒãã³ãœãŒã¹ã©ã€ã»ã³ã¹ã Apache 2.0 ã©ã€ã»ã³ã¹ ã«å€æŽãããŸããã - ð¥
2024/8/6
: CogVideoX-2B çšã® 3D Causal VAE ããªãŒãã³ãœãŒã¹åããŸãããããã«ããããããªãã»ãŒç¡æ倱ã§åæ§ç¯ããããšãã§ããŸãã - ð¥
2024/8/6
: CogVideoXã·ãªãŒãºã®ãããªçæã¢ãã«ã®æåã®ã¢ãã«ãCogVideoX-2B ããªãŒãã³ãœãŒã¹åããŸããã - ð± ãœãŒã¹:
2022/5/19
: CogVideoãããªçæã¢ãã«ããªãŒãã³ãœãŒã¹åããŸããïŒçŸåšãCogVideo
ãã©ã³ãã§ç¢ºèªã§ããŸãïŒãããã¯ããã©ã³ã¹ãã©ãŒããŒã«åºã¥ãåã®ãªãŒãã³ãœãŒã¹å€§èŠæš¡ããã¹ãçæãããªã¢ãã«ã§ããæè¡çãªè©³çŽ°ã«ã€ããŠã¯ãICLR'23è«æ ãã芧ãã ããã
ãã匷åãªã¢ãã«ãããã倧ããªãã©ã¡ãŒã¿ãµã€ãºã§ç»å Žäºå®ã§ããã楜ãã¿ã«ïŒ
ç®æ¬¡
ç¹å®ã®ã»ã¯ã·ã§ã³ã«ãžã£ã³ãïŒ
- ã¯ã€ãã¯ã¹ã¿ãŒã
- Gallery
- ã¢ãã«çŽ¹ä»
- å奜çãªã³ã¯
- ãããžã§ã¯ãæ§é
- CogVideo(ICLR'23)
- åŒçš
- ã©ã€ã»ã³ã¹å¥çŽ
ã¯ã€ãã¯ã¹ã¿ãŒã
ããã³ããã®æé©å
ã¢ãã«ãå®è¡ããåã«ããã¡ã ãåèã«ããŠãGLM-4ïŒãŸãã¯åçã®è£œåãäŸãã°GPT-4ïŒã®å€§èŠæš¡ã¢ãã«ã䜿çšããŠã©ã®ããã«ã¢ãã«ãæé©åããããã確èªãã ãããããã¯éåžžã«éèŠã§ããã¢ãã«ã¯é·ãããã³ããã§ãã¬ãŒãã³ã°ãããŠãããããè¯ãããã³ããããããªçæã®å質ã«çŽæ¥åœ±é¿ãäžããŸãã
SAT
sat_demo ã®æ瀺ã«åŸã£ãŠãã ãã: SATãŠã§ã€ãã®æšè«ã³ãŒããšåŸ®èª¿æŽã³ãŒããå«ãŸããŠããŸããCogVideoXã¢ãã«æ§é ã«åºã¥ããŠæ¹åããããšããå§ãããŸããé©æ°çãªç 究è ã¯ããã®ã³ãŒãã䜿çšããŠè¿ éãªã¹ã¿ããã³ã°ãšéçºãè¡ãããšãã§ããŸãã
Diffusers
pip install -r requirements.txt
次㫠diffusers_demo ãåç §ããŠãã ãã: æšè«ã³ãŒãã®è©³çŽ°ãªèª¬æãå«ãŸããŠãããäžè¬çãªãã©ã¡ãŒã¿ã®æå³ã«ã€ããŠãèšåããŠããŸãã
éååæšè«ã®è©³çŽ°ã«ã€ããŠã¯ãdiffusers-torchao ãåç §ããŠãã ãããDiffusers ãš TorchAO ã䜿çšããããšã§ãéååæšè«ãå¯èœãšãªããã¡ã¢ãªå¹çã®è¯ãæšè«ããã³ã³ãã€ã«æã«å Žåã«ãã£ãŠã¯é床ã®åäžãæåŸ ã§ããŸããA100 ããã³ H100 äžã§ã®ããŸããŸãªèšå®ã«ãããã¡ã¢ãªããã³æéã®ãã³ãããŒã¯ã®å®å šãªãªã¹ãã¯ãdiffusers-torchao ã«å ¬éãããŠããŸãã
Gallery
CogVideoX-5B
CogVideoX-2B
ã®ã£ã©ãªãŒã®å¯Ÿå¿ããããã³ããã¯ãŒãã衚瀺ããã«ã¯ããã¡ããã¯ãªãã¯ããŠãã ãã
ã¢ãã«çŽ¹ä»
CogVideoXã¯ãæž åœ± ãšåæºã®ãªãŒãã³ãœãŒã¹çãããªçæã¢ãã«ã§ãã 以äžã®è¡šã«ãæäŸããŠãããããªçæã¢ãã«ã®åºæ¬æ å ±ã瀺ããŸã:
ã¢ãã«å | CogVideoX1.5-5B (ææ°) | CogVideoX1.5-5B-I2V (ææ°) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
---|---|---|---|---|---|
å ¬éæ¥ | 2024幎11æ8æ¥ | 2024幎11æ8æ¥ | 2024幎8æ6æ¥ | 2024幎8æ27æ¥ | 2024幎9æ19æ¥ |
ãããªè§£å床 | 1360 * 768 | Min(W, H) = 768 768 †Max(W, H) †1360 Max(W, H) % 16 = 0 |
720 * 480 | ||
ãã¬ãŒã æ° | 16N + 1 (N <= 10) ã§ããå¿ èŠããããŸã (ããã©ã«ã 81) | 8N + 1 (N <= 6) ã§ããå¿ èŠããããŸã (ããã©ã«ã 49) | |||
æšè«ç²ŸåºŠ | BF16(æšå¥š), FP16, FP32ïŒFP8*ïŒINT8ïŒINT4éå¯Ÿå¿ | FP16*(æšå¥š), BF16, FP32ïŒFP8*ïŒINT8ïŒINT4éå¯Ÿå¿ | BF16(æšå¥š), FP16, FP32ïŒFP8*ïŒINT8ïŒINT4éå¯Ÿå¿ | ||
åäžGPUã¡ã¢ãªæ¶è²»é |
SAT BF16: 76GB diffusers BF16ïŒ10GBãã* diffusers INT8(torchao)ïŒ7GBãã* |
SAT FP16: 18GB diffusers FP16: 4GB以äž* diffusers INT8(torchao): 3.6GB以äž* |
SAT BF16: 26GB diffusers BF16 : 5GB以äž* diffusers INT8(torchao): 4.4GB以äž* |
||
è€æ°GPUæšè«ã¡ã¢ãªæ¶è²»é | BF16: 24GB* using diffusers |
FP16: 10GB* diffusersäœ¿çš |
BF16: 15GB* diffusersäœ¿çš |
||
æšè«é床 (Step = 50, FP/BF16) |
ã·ã³ã°ã«A100: ~1000ç§(5ç§ãããª) ã·ã³ã°ã«H100: ~550ç§(5ç§ãããª) |
ã·ã³ã°ã«A100: ~90ç§ ã·ã³ã°ã«H100: ~45ç§ |
ã·ã³ã°ã«A100: ~180ç§ ã·ã³ã°ã«H100: ~90ç§ |
||
ããã³ããèšèª | è±èª* | ||||
ããã³ããé·ãã®äžé | 224ããŒã¯ã³ | 226ããŒã¯ã³ | |||
ãããªé·ã | 5ç§ãŸãã¯10ç§ | 6ç§ | |||
ãã¬ãŒã ã¬ãŒã | 16ãã¬ãŒã /ç§ | 8ãã¬ãŒã /ç§ | |||
äœçœ®ãšã³ã³ãŒãã£ã³ã° | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed | |
ããŠã³ããŒããªã³ã¯ (Diffusers) | ð€ HuggingFace ð€ ModelScope ð£ WiseModel |
ð€ HuggingFace ð€ ModelScope ð£ WiseModel |
ð€ HuggingFace ð€ ModelScope ð£ WiseModel |
ð€ HuggingFace ð€ ModelScope ð£ WiseModel |
ð€ HuggingFace ð€ ModelScope ð£ WiseModel |
ããŠã³ããŒããªã³ã¯ (SAT) | ð€ HuggingFace ð€ ModelScope ð£ WiseModel |
SAT |
ããŒã¿è§£èª¬
- diffusersã©ã€ãã©ãªã䜿çšããŠãã¹ãããéã«ã¯ã
diffusers
ã©ã€ãã©ãªãæäŸããå šãŠã®æé©åãæå¹ã«ãªã£ãŠããŸãããã®æ¹æ³ã¯ NVIDIA A100 / H100以å€ã®ããã€ã¹ã§ã®ã¡ã¢ãª/ã¡ã¢ãªæ¶è²»ã®ãã¹ãã¯è¡ã£ãŠããŸãããéåžžããã®æ¹æ³ã¯NVIDIA Ampereã¢ãŒããã¯ã㣠以äžã®å šãŠã®ããã€ã¹ã«é©å¿ã§ããŸããæé©åãç¡å¹ã«ãããšãã¡ã¢ãªæ¶è²»ã¯åå¢ããããŒã¯ã¡ã¢ãªäœ¿çšéã¯è¡šã®3åã«ãªããŸãããé床ã¯çŽ3ã4ååäžããŸãã以äžã®æé©åãéšåçã«ç¡å¹ã«ããããšãå¯èœã§ã:
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
- ãã«ãGPUã§æšè«ããå Žåã
enable_sequential_cpu_offload()
æé©åãç¡å¹ã«ããå¿ èŠããããŸãã - INT8ã¢ãã«ã䜿çšãããšæšè«é床ãäœäžããŸãããããã¯ã¡ã¢ãªã®å°ãªãGPUã§æ£åžžã«æšè«ãè¡ãããããªå質ã®æ倱ãæå°éã«æããããã®æªçœ®ã§ããæšè«é床ã¯å€§å¹ ã«äœäžããŸãã
- CogVideoX-2Bã¢ãã«ã¯
FP16
粟床ã§ãã¬ãŒãã³ã°ãããŠãããCogVideoX-5Bã¢ãã«ã¯BF16
粟床ã§ãã¬ãŒãã³ã°ãããŠããŸããæšè«æã«ã¯ã¢ãã«ããã¬ãŒãã³ã°ããã粟床ã䜿çšããããšããå§ãããŸãã - PytorchAOããã³Optimum-quanto
ã¯ãCogVideoXã®ã¡ã¢ãªèŠä»¶ãåæžããããã«ããã¹ããšã³ã³ãŒãããã©ã³ã¹ãã©ãŒããããã³VAEã¢ãžã¥ãŒã«ãéååããããã«äœ¿çšã§ããŸããããã«ãããç¡æã®T4
Colabãããå°ãªãã¡ã¢ãªã®GPUã§ã¢ãã«ãå®è¡ããããšãå¯èœã«ãªããŸããåæ§ã«éèŠãªã®ã¯ãTorchAOã®éååã¯
torch.compile
ãšå®å šã«äºææ§ããããæšè«é床ãå€§å¹ ã«åäžãããããšãã§ããç¹ã§ããNVIDIA H100
ããã³ãã以äžã®ããã€ã¹ã§ã¯FP8
粟床ã䜿çšããå¿ èŠããããŸããããã«ã¯ãtorch
ãtorchao
Pythonããã±ãŒãžã®ãœãŒã¹ã³ãŒãããã®ã€ã³ã¹ããŒã«ãå¿ èŠã§ããCUDA 12.4
ã®äœ¿çšããå§ãããŸãã - æšè«é床ãã¹ããåæ§ã«ãäžèšã®ã¡ã¢ãªæé©åæ¹æ³ã䜿çšããŠããŸããã¡ã¢ãªæé©åã䜿çšããªãå Žåãæšè«é床ã¯çŽ10ïŒ
åäžããŸãã
diffusers
ããŒãžã§ã³ã®ã¢ãã«ã®ã¿ãéååããµããŒãããŠããŸãã - ã¢ãã«ã¯è±èªå ¥åã®ã¿ããµããŒãããŠãããä»ã®èšèªã¯å€§èŠæš¡ã¢ãã«ã®æ¹åãéããŠè±èªã«ç¿»èš³ã§ããŸãã
å奜çãªã³ã¯
ã³ãã¥ããã£ããã®è²¢ç®ã倧æè¿ããç§ãã¡ããªãŒãã³ãœãŒã¹ã³ãã¥ããã£ã«ç©æ¥µçã«è²¢ç®ããŠããŸãã以äžã®äœåã¯ãã§ã«CogVideoXã«å¯Ÿå¿ããŠããããã²ãå©çšãã ããïŒ
- RIFLEx-CogVideoXïŒ RIFLExã¯åç»ã®é·ããå€æ¿ããææ³ã§ããã£ã1è¡ã®ã³ãŒãã§åç»ã®é·ããå ã®2åã«å»¶é·ã§ããŸããRIFLExã¯ãã¬ãŒãã³ã°äžèŠã®æšè«ããµããŒãããã ãã§ãªããCogVideoXãããŒã¹ã«ãã¡ã€ã³ãã¥ãŒãã³ã°ããã¢ãã«ãæäŸããŠããŸããå ã®é·ãã®åç»ã§ããã1000ã¹ãããã®ãã¡ã€ã³ãã¥ãŒãã³ã°ãè¡ãã ãã§ãé·ãå€æ¿èœåãå€§å¹ ã«åäžãããããšãã§ããŸãã
- CogVideoX-Fun: CogVideoX-Funã¯ãCogVideoXã¢ãŒããã¯ãã£ãåºã«ããæ¹è¯ãã€ãã©ã€ã³ã§ãèªç±ãªè§£å床ãšè€æ°ã®èµ·åæ¹æ³ããµããŒãããŠããŸãã
- CogStudio: CogVideo ã® Gradio Web UI ã®å¥ã®ãªããžããªãããé«æ©èœãª Web UI ããµããŒãããŸãã
- Xorbits Inference: 匷åã§å æ¬çãªåæ£æšè«ãã¬ãŒã ã¯ãŒã¯ã§ãããã¯ã³ã¯ãªãã¯ã§ç¬èªã®ã¢ãã«ãææ°ã®ãªãŒãã³ãœãŒã¹ã¢ãã«ãç°¡åã«ãããã€ã§ããŸãã
- ComfyUI-CogVideoXWrapper ComfyUIãã¬ãŒã ã¯ãŒã¯ã䜿çšããŠãCogVideoXãã¯ãŒã¯ãããŒã«çµ±åããŸãã
- VideoSys: VideoSysã¯ã䜿ããããé«æ§èœãªãããªçæã€ã³ãã©ãæäŸããææ°ã®ã¢ãã«ãæè¡ãç¶ç¶çã«çµ±åããŠããŸãã
- AutoDLã€ã¡ãŒãž: ã³ãã¥ããã£ã¡ã³ããŒãæäŸããHuggingface Spaceã€ã¡ãŒãžã®ã¯ã³ã¯ãªãã¯ãããã€ã¡ã³ãã
- ã€ã³ããªã¢ãã¶ã€ã³åŸ®èª¿æŽã¢ãã«: ã¯ãCogVideoXãåºç€ã«ãã埮調æŽã¢ãã«ã§ãã€ã³ããªã¢ãã¶ã€ã³å°çšã«èšèšãããŠããŸãã
- xDiT: xDiTã¯ãè€æ°ã®GPUã¯ã©ã¹ã¿ãŒäžã§DiTsã䞊åæšè«ããããã®ãšã³ãžã³ã§ããxDiTã¯ãªã¢ã«ã¿ã€ã ã®ç»åããã³ãããªçæãµãŒãã¹ããµããŒãããŠããŸãã
- CogVideoX-Interpolation: ããŒãã¬ãŒã è£éçæã«ãããŠããã倧ããªæè»æ§ãæäŸããããšãç®çãšãããCogVideoXæ§é ãåºã«ããä¿®æ£çã®ãã€ãã©ã€ã³ã
- DiffSynth-Studio: DiffSynth Studioã¯ãæ¡æ£ãšã³ãžã³ã§ããããã¹ããšã³ã³ãŒããŒãUNetãVAEãªã©ãå«ãã¢ãŒããã¯ãã£ãåæ§ç¯ãããªãŒãã³ãœãŒã¹ã³ãã¥ããã£ã¢ãã«ãšã®äºææ§ãç¶æãã€ã€ãèšç®æ§èœãåäžãããŸããããã®ãã¬ãŒã ã¯ãŒã¯ã¯CogVideoXã«é©å¿ããŠããŸãã
- CogVideoX-Controlnet: CogVideoXã¢ãã«ãå«ãã·ã³ãã«ãªControlNetã¢ãžã¥ãŒã«ã®ã³ãŒãã
- VideoTuna: VideoTuna ã¯ãããã¹ããããããªãç»åãããããªãããã¹ãããç»åçæã®ããã®è€æ°ã®AIãããªçæã¢ãã«ãçµ±åããæåã®ãªããžããªã§ãã
- ConsisID: äžè²«æ§ã®ããé¡ãä¿æããããã«ãåšæ³¢æ°å解ã䜿çšããCogVideoX-5Bã«åºã¥ããã¢ã€ãã³ãã£ãã£ä¿æåããã¹ãããåç»çæã¢ãã«ã
- ã¹ããããã€ã¹ããããã¥ãŒããªã¢ã«: Windowsããã³ã¯ã©ãŠãã§ã®CogVideoX1.5-5B-I2Vã¢ãã«ã®ã€ã³ã¹ããŒã«ãšæé©åã«é¢ããã¹ããããã€ã¹ãããã¬ã€ããFurkanGozukaraæ°ã®å°œåãšãµããŒãã«æè¬ããããŸãïŒ
ãããžã§ã¯ãæ§é
ãã®ãªãŒãã³ãœãŒã¹ãªããžããªã¯ãCogVideoX ãªãŒãã³ãœãŒã¹ã¢ãã«ã®åºæ¬çãªäœ¿çšæ¹æ³ãšåŸ®èª¿æŽã®äŸãè¿ éã«éå§ããããã®ã¬ã€ãã§ãã
Colabã§ã®ã¯ã€ãã¯ã¹ã¿ãŒã
ç¡æã®Colab T4äžã§çŽæ¥å®è¡ã§ãã3ã€ã®ãããžã§ã¯ããæäŸããŠããŸãã
- CogVideoX-5B-T2V-Colab.ipynb: CogVideoX-5B ããã¹ããããããªãžã®çæçšColabã³ãŒãã
- CogVideoX-5B-T2V-Int8-Colab.ipynb: CogVideoX-5B ããã¹ããããããªãžã®éååæšè«çšColabã³ãŒãã1åã®å®è¡ã«çŽ30åããããŸãã
- CogVideoX-5B-I2V-Colab.ipynb: CogVideoX-5B ç»åãããããªãžã®çæçšColabã³ãŒãã
- CogVideoX-5B-V2V-Colab.ipynb: CogVideoX-5B ãããªãããããªãžã®çæçšColabã³ãŒãã
Inference
- cli_demo: æšè«ã³ãŒãã®è©³çŽ°ãªèª¬æãå«ãŸããŠãããäžè¬çãªãã©ã¡ãŒã¿ã®æå³ã«ã€ããŠãèšåããŠããŸãã
- cli_demo_quantization: éååã¢ãã«æšè«ã³ãŒãã§ãäœã¡ã¢ãªã®ããã€ã¹ã§ãå®è¡å¯èœã§ãããŸãããã®ã³ãŒããå€æŽããŠãFP8 粟床㮠CogVideoX ã¢ãã«ã®å®è¡ããµããŒãããããšãã§ããŸãã
- diffusers_vae_demo: VAEæšè«ã³ãŒãã®å®è¡ã«ã¯çŸåš71GBã®ã¡ã¢ãªãå¿ èŠã§ãããå°æ¥çã«ã¯æé©åãããäºå®ã§ãã
- space demo: Huggingface SpaceãšåãGUIã³ãŒãã§ããã¬ãŒã è£éãè¶ è§£åããŒã«ãçµã¿èŸŒãŸããŠããŸãã
- convert_demo: ãŠãŒã¶ãŒå ¥åãCogVideoXã«é©ãã圢åŒã«å€æããæ¹æ³ãCogVideoXã¯é·ããã£ãã·ã§ã³ã§ãã¬ãŒãã³ã°ãããŠãããããå ¥åããã¹ããLLMã䜿çšããŠãã¬ãŒãã³ã°ååžãšäžèŽãããå¿ èŠããããŸããããã©ã«ãã§ã¯GLM-4ã䜿çšããŸãããGPTãGeminiãªã©ã®ä»ã®LLMã«çœ®ãæããããšãã§ããŸãã
- gradio_web_demo: CogVideoX-2B / 5B ã¢ãã«ã䜿çšããŠåç»ãçæããæ¹æ³ã瀺ããã·ã³ãã«ãª Gradio Web UI ãã¢ã§ããç§ãã¡ã® Huggingface Space ãšåæ§ã«ããã®ã¹ã¯ãªããã䜿çšã㊠Web ãã¢ãèµ·åããããšãã§ããŸãã
finetune
- train_cogvideox_lora: CogVideoX diffusers 埮調æŽæ¹æ³ã®è©³çŽ°ãªèª¬æãå«ãŸããŠããŸãããã®ã³ãŒãã䜿çšããŠãèªåã®ããŒã¿ã»ãã㧠CogVideoX ã埮調æŽããããšãã§ããŸãã
sat
- sat_demo: SATãŠã§ã€ãã®æšè«ã³ãŒããšåŸ®èª¿æŽã³ãŒããå«ãŸããŠããŸããCogVideoXã¢ãã«æ§é ã«åºã¥ããŠæ¹åããããšããå§ãããŸããé©æ°çãªç 究è ã¯ããã®ã³ãŒãã䜿çšããŠè¿ éãªã¹ã¿ããã³ã°ãšéçºãè¡ãããšãã§ããŸãã
ããŒã«
ãã®ãã©ã«ãã«ã¯ãã¢ãã«å€æ/ãã£ãã·ã§ã³çæãªã©ã®ããŒã«ãå«ãŸããŠããŸãã
- convert_weight_sat2hf: SAT ã¢ãã«ã®éã¿ã Huggingface ã¢ãã«ã®éã¿ã«å€æããŸãã
- caption_demo: Caption ããŒã«ããããªãç解ããŠããã¹ãã§åºåããã¢ãã«ã
- export_sat_lora_weight: SAT ãã¡ã€ã³ãã¥ãŒãã³ã°ã¢ãã«ã®ãšã¯ã¹ããŒãããŒã«ãSAT Lora Adapter ã diffusers 圢åŒã§ãšã¯ã¹ããŒãããŸãã
- load_cogvideox_lora: diffusers çã®ãã¡ã€ã³ãã¥ãŒãã³ã°ããã Lora Adapter ãããŒãããããã®ããŒã«ã³ãŒãã
- llm_flux_cogvideox: ãªãŒãã³ãœãŒã¹ã®ããŒã«ã«å€§èŠæš¡èšèªã¢ãã« + Flux + CogVideoX ã䜿çšããŠèªåçã«åç»ãçæããŸãã
- parallel_inference_xditïŒ xDiT ã«ãã£ãŠãµããŒãããããããªçæããã»ã¹ãè€æ°ã® GPU ã§äžŠååããŸãã
- cogvideox-factory: CogVideoXã®äœã³ã¹ã埮調æŽãã¬ãŒã ã¯ãŒã¯ã§ã
diffusers
ããŒãžã§ã³ã®ã¢ãã«ã«é©å¿ããŠããŸããããå€ãã®è§£å床ã«å¯Ÿå¿ããåäžã®4090 GPUã§CogVideoX-5Bã®åŸ®èª¿æŽãå¯èœã§ãã
CogVideo(ICLR'23)
è«æã®å ¬åŒãªããžããª: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers 㯠CogVideo branch ã«ãããŸãã
CogVideoã¯æ¯èŒçé«ãã¬ãŒã ã¬ãŒãã®ãããªãçæããããšãã§ããŸãã 32ãã¬ãŒã ã®4ç§éã®ã¯ãªããã以äžã«ç€ºãããŠããŸãã
CogVideoã®ãã¢ã¯ https://models.aminer.cn/cogvideo ã§äœéšã§ããŸãã å ã®å ¥åã¯äžåœèªã§ãã
åŒçš
ð ç§ãã¡ã®ä»äºã圹ç«ã€ãšæãããå Žåããã²ã¹ã¿ãŒãä»ããŠããã ããè«æãåŒçšããŠãã ããã
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
journal={arXiv preprint arXiv:2205.15868},
year={2022}
}
ã©ã€ã»ã³ã¹å¥çŽ
ãã®ãªããžããªã®ã³ãŒã㯠Apache 2.0 License ã®äžã§å ¬éãããŠããŸãã
CogVideoX-2B ã¢ãã« (察å¿ããTransformersã¢ãžã¥ãŒã«ãVAEã¢ãžã¥ãŒã«ãå«ã) 㯠Apache 2.0 License ã®äžã§å ¬éãããŠããŸãã
CogVideoX-5B ã¢ãã«ïŒTransformers ã¢ãžã¥ãŒã«ãç»åçæãããªãšããã¹ãçæãããªã®ããŒãžã§ã³ãå«ãïŒ ã¯ CogVideoX LICENSE ã®äžã§å ¬éãããŠããŸãã