mirror of
https://github.com/THUDM/CogVideo.git
synced 2025-04-06 03:57:56 +08:00
416 lines
29 KiB
Markdown
416 lines
29 KiB
Markdown
# CogVideo & CogVideoX
|
||
|
||
[Read this in English](./README.md)
|
||
|
||
[äžæé
读](./README_zh.md)
|
||
|
||
<div align="center">
|
||
<img src=resources/logo.svg width="50%"/>
|
||
</div>
|
||
<p align="center">
|
||
<a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B" target="_blank"> ð€ Huggingface Space</a> ãŸã㯠<a href="https://modelscope.cn/studios/ZhipuAI/CogVideoX-5b-demo" target="_blank"> ð€ ModelScope Space</a> 㧠CogVideoX-5B ã¢ãã«ããªã³ã©ã€ã³ã§äœéšããŠãã ãã
|
||
</p>
|
||
<p align="center">
|
||
ð <a href="https://arxiv.org/abs/2408.06072" target="_blank">è«æ</a>ãš<a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh" target="_blank">䜿çšããã¥ã¡ã³ã</a>ã衚瀺ããŸãã
|
||
</p>
|
||
<p align="center">
|
||
ð <a href="resources/WECHAT.md" target="_blank">WeChat</a> ãš <a href="https://discord.gg/dCGfUsagrD" target="_blank">Discord</a> ã«åå
|
||
</p>
|
||
<p align="center">
|
||
ð <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">æž
圱</a> ãš <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">APIãã©ãããã©ãŒã </a> ã蚪åããŠããã倧èŠæš¡ãªåçšãããªçæã¢ãã«ãäœéš.
|
||
</p>
|
||
|
||
## æŽæ°ãšãã¥ãŒã¹
|
||
|
||
- ð¥ð¥ ãã¥ãŒã¹: ```2024/11/08```: `CogVideoX1.5` ã¢ãã«ããªãªãŒã¹ããŸãããCogVideoX1.5 㯠CogVideoX ãªãŒãã³ãœãŒã¹ã¢ãã«ã®ã¢ããã°ã¬ãŒãããŒãžã§ã³ã§ãã
|
||
CogVideoX1.5-5B ã·ãªãŒãºã¢ãã«ã¯ã10ç§ é·ã®åç»ãšããé«ã解å床ããµããŒãããŠããã`CogVideoX1.5-5B-I2V` ã¯ä»»æã®è§£å床ã§ã®åç»çæã«å¯Ÿå¿ããŠããŸãã
|
||
SAT ã³ãŒãã¯ãã§ã«æŽæ°ãããŠããã`diffusers` ããŒãžã§ã³ã¯çŸåšé©å¿äžã§ãã
|
||
SAT ããŒãžã§ã³ã®ã³ãŒã㯠[ãã¡ã](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) ããããŠã³ããŒãã§ããŸãã
|
||
- ð¥ **ãã¥ãŒã¹**: ```2024/10/13```: ã³ã¹ãåæžã®ãããåäžã®4090 GPUã§`CogVideoX-5B`
|
||
ã埮調æŽã§ãããã¬ãŒã ã¯ãŒã¯ [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory)
|
||
ããªãªãŒã¹ãããŸãããè€æ°ã®è§£å床ã§ã®åŸ®èª¿æŽã«å¯Ÿå¿ããŠããŸãããã²ãå©çšãã ããïŒ
|
||
- ð¥**ãã¥ãŒã¹**: ```2024/10/10```:
|
||
æè¡å ±åæžãæŽæ°ãããã詳现ãªãã¬ãŒãã³ã°æ
å ±ãšãã¢ãè¿œå ããŸããã
|
||
- ð¥ **ãã¥ãŒã¹**: ```2024/10/10```: æè¡å ±åæžãæŽæ°ããŸããã[ãã¡ã](https://arxiv.org/pdf/2408.06072)
|
||
ãã¯ãªãã¯ããŠã芧ãã ãããããã«ãã¬ãŒãã³ã°ã®è©³çŽ°ãšãã¢ãè¿œå ããŸããããã¢ãèŠãã«ã¯[ãã¡ã](https://yzy-thu.github.io/CogVideoX-demo/)
|
||
ãã¯ãªãã¯ããŠãã ããã
|
||
- ð¥**ãã¥ãŒã¹**: ```2024/10/09```: é£æžã®[æè¡ããã¥ã¡ã³ã](https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh)
|
||
ã§CogVideoXã®åŸ®èª¿æŽã¬ã€ããå
¬éããŠããŸããåé
ã®èªç±åºŠãããã«é«ãããããå
¬éãããŠããããã¥ã¡ã³ãå
ã®ãã¹ãŠã®äŸãå®å
šã«åçŸå¯èœã§ãã
|
||
- ð¥**ãã¥ãŒã¹**: ```2024/9/19```: CogVideoXã·ãªãŒãºã®ç»åçæãããªã¢ãã« **CogVideoX-5B-I2V**
|
||
ããªãŒãã³ãœãŒã¹åããŸããããã®ã¢ãã«ã¯ãç»åãèæ¯å
¥åãšããŠäœ¿çšããããã³ããã¯ãŒããšçµã¿åãããŠãããªãçæããããšãã§ããããé«ãå¶åŸ¡æ§ãæäŸããŸããããã«ãããCogVideoXã·ãªãŒãºã®ã¢ãã«ã¯ãããã¹ããããããªçæããããªã®ç¶ç¶ãç»åãããããªçæã®3ã€ã®ã¿ã¹ã¯ããµããŒãããããã«ãªããŸããããªã³ã©ã€ã³ã§ã®[äœéš](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space)
|
||
ãã楜ãã¿ãã ããã
|
||
- ð¥ **ãã¥ãŒã¹**: ```2024/9/19```:
|
||
CogVideoXã®ãã¬ãŒãã³ã°ããã»ã¹ã§ãããªããŒã¿ãããã¹ãèšè¿°ã«å€æããããã«äœ¿çšããããã£ãã·ã§ã³ã¢ãã« [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption)
|
||
ããªãŒãã³ãœãŒã¹åããŸãããããŠã³ããŒãããŠãå©çšãã ããã
|
||
- ð¥ ```2024/8/27```: CogVideoXã·ãªãŒãºã®ãã倧ããªã¢ãã« **CogVideoX-5B**
|
||
ããªãŒãã³ãœãŒã¹åããŸãããã¢ãã«ã®æšè«æ§èœã倧å¹
ã«æé©åããæšè«ã®ããŒãã«ã倧å¹
ã«äžããŸããã`GTX 1080TI` ãªã©ã®æ§åGPUã§
|
||
**CogVideoX-2B** ãã`RTX 3060` ãªã©ã®ãã¹ã¯ãããGPU㧠**CogVideoX-5B**
|
||
ã¢ãã«ãå®è¡ã§ããŸããäŸåé¢ä¿ãæŽæ°ã»ã€ã³ã¹ããŒã«ããããã«ã[èŠä»¶](requirements.txt)
|
||
ãå³å®ããæšè«ã³ãŒã㯠[cli_demo](inference/cli_demo.py) ãåç
§ããŠãã ãããããã«ã**CogVideoX-2B** ã¢ãã«ã®ãªãŒãã³ãœãŒã¹ã©ã€ã»ã³ã¹ã
|
||
**Apache 2.0 ã©ã€ã»ã³ã¹** ã«å€æŽãããŸããã
|
||
- ð¥ ```2024/8/6```: **CogVideoX-2B** çšã® **3D Causal VAE** ããªãŒãã³ãœãŒã¹åããŸãããããã«ããããããªãã»ãŒç¡æ倱ã§åæ§ç¯ããããšãã§ããŸãã
|
||
- ð¥ ```2024/8/6```: CogVideoXã·ãªãŒãºã®ãããªçæã¢ãã«ã®æåã®ã¢ãã«ã**CogVideoX-2B** ããªãŒãã³ãœãŒã¹åããŸããã
|
||
- ð± **ãœãŒã¹**: ```2022/5/19```: CogVideoãããªçæã¢ãã«ããªãŒãã³ãœãŒã¹åããŸããïŒçŸåšã`CogVideo`
|
||
ãã©ã³ãã§ç¢ºèªã§ããŸãïŒãããã¯ããã©ã³ã¹ãã©ãŒããŒã«åºã¥ãåã®ãªãŒãã³ãœãŒã¹å€§èŠæš¡ããã¹ãçæãããªã¢ãã«ã§ããæè¡çãªè©³çŽ°ã«ã€ããŠã¯ã[ICLR'23è«æ](https://arxiv.org/abs/2205.15868)
|
||
ãã芧ãã ããã
|
||
|
||
**ãã匷åãªã¢ãã«ãããã倧ããªãã©ã¡ãŒã¿ãµã€ãºã§ç»å Žäºå®ã§ããã楜ãã¿ã«ïŒ**
|
||
|
||
## ç®æ¬¡
|
||
|
||
ç¹å®ã®ã»ã¯ã·ã§ã³ã«ãžã£ã³ãïŒ
|
||
|
||
- [ã¯ã€ãã¯ã¹ã¿ãŒã](#ã¯ã€ãã¯ã¹ã¿ãŒã)
|
||
- [SAT](#sat)
|
||
- [Diffusers](#Diffusers)
|
||
- [CogVideoX-2B ã®ã£ã©ãªãŒ](#CogVideoX-2B-ã®ã£ã©ãªãŒ)
|
||
- [ã¢ãã«çŽ¹ä»](#ã¢ãã«çŽ¹ä»)
|
||
- [ãããžã§ã¯ãæ§é ](#ãããžã§ã¯ãæ§é )
|
||
- [æšè«](#æšè«)
|
||
- [sat](#sat)
|
||
- [ããŒã«](#ããŒã«)=
|
||
- [CogVideo(ICLR'23)ã¢ãã«çŽ¹ä»](#CogVideoICLR23)
|
||
- [åŒçš](#åŒçš)
|
||
- [ã©ã€ã»ã³ã¹å¥çŽ](#ã©ã€ã»ã³ã¹å¥çŽ)
|
||
|
||
## ã¯ã€ãã¯ã¹ã¿ãŒã
|
||
|
||
### ããã³ããã®æé©å
|
||
|
||
ã¢ãã«ãå®è¡ããåã«ã[ãã¡ã](inference/convert_demo.py)
|
||
ãåèã«ããŠãGLM-4ïŒãŸãã¯åçã®è£œåãäŸãã°GPT-4ïŒã®å€§èŠæš¡ã¢ãã«ã䜿çšããŠã©ã®ããã«ã¢ãã«ãæé©åããããã確èªãã ãããããã¯éåžžã«éèŠã§ããã¢ãã«ã¯é·ãããã³ããã§ãã¬ãŒãã³ã°ãããŠãããããè¯ãããã³ããããããªçæã®å質ã«çŽæ¥åœ±é¿ãäžããŸãã
|
||
|
||
### SAT
|
||
|
||
[sat_demo](sat/README.md) ã®æ瀺ã«åŸã£ãŠãã ãã:
|
||
SATãŠã§ã€ãã®æšè«ã³ãŒããšåŸ®èª¿æŽã³ãŒããå«ãŸããŠããŸããCogVideoXã¢ãã«æ§é ã«åºã¥ããŠæ¹åããããšããå§ãããŸããé©æ°çãªç 究è
ã¯ããã®ã³ãŒãã䜿çšããŠè¿
éãªã¹ã¿ããã³ã°ãšéçºãè¡ãããšãã§ããŸãã
|
||
|
||
### Diffusers
|
||
|
||
```
|
||
pip install -r requirements.txt
|
||
```
|
||
|
||
次㫠[diffusers_demo](inference/cli_demo.py) ãåç
§ããŠãã ãã: æšè«ã³ãŒãã®è©³çŽ°ãªèª¬æãå«ãŸããŠãããäžè¬çãªãã©ã¡ãŒã¿ã®æå³ã«ã€ããŠãèšåããŠããŸãã
|
||
|
||
éååæšè«ã®è©³çŽ°ã«ã€ããŠã¯ã[diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao/) ãåç
§ããŠãã ãããDiffusers
|
||
ãš TorchAO ã䜿çšããããšã§ãéååæšè«ãå¯èœãšãªããã¡ã¢ãªå¹çã®è¯ãæšè«ããã³ã³ãã€ã«æã«å Žåã«ãã£ãŠã¯é床ã®åäžãæåŸ
ã§ããŸããA100
|
||
ããã³ H100
|
||
äžã§ã®ããŸããŸãªèšå®ã«ãããã¡ã¢ãªããã³æéã®ãã³ãããŒã¯ã®å®å
šãªãªã¹ãã¯ã[diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao)
|
||
ã«å
¬éãããŠããŸãã
|
||
|
||
## Gallery
|
||
|
||
### CogVideoX-5B
|
||
|
||
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
|
||
<tr>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/cf5953ea-96d3-48fd-9907-c4708752c714" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/fe0a78e6-b669-4800-8cf0-b5f9b5145b52" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/c182f606-8f8c-421d-b414-8487070fcfcb" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/7db2bbce-194d-434d-a605-350254b6c298" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
</tr>
|
||
<tr>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/62b01046-8cab-44cc-bd45-4d965bb615ec" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/d78e552a-4b3f-4b81-ac3f-3898079554f6" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/30894f12-c741-44a2-9e6e-ddcacc231e5b" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/926575ca-7150-435b-a0ff-4900a963297b" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
|
||
### CogVideoX-2B
|
||
|
||
<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
|
||
<tr>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/ea3af39a-3160-4999-90ec-2f7863c5b0e9" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/9de41efd-d4d1-4095-aeda-246dd834e91d" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/941d6661-6a8d-4a1b-b912-59606f0b2841" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
<td>
|
||
<video src="https://github.com/user-attachments/assets/938529c4-91ae-4f60-b96b-3c3947fa63cb" width="100%" controls autoplay loop></video>
|
||
</td>
|
||
</tr>
|
||
</table>
|
||
|
||
ã®ã£ã©ãªãŒã®å¯Ÿå¿ããããã³ããã¯ãŒãã衚瀺ããã«ã¯ã[ãã¡ã](resources/galary_prompt.md)ãã¯ãªãã¯ããŠãã ãã
|
||
|
||
## ã¢ãã«çŽ¹ä»
|
||
|
||
CogVideoXã¯ã[æž
圱](https://chatglm.cn/video?fr=osm_cogvideox) ãšåæºã®ãªãŒãã³ãœãŒã¹çãããªçæã¢ãã«ã§ãã
|
||
以äžã®è¡šã«ãæäŸããŠãããããªçæã¢ãã«ã®åºæ¬æ
å ±ã瀺ããŸã:
|
||
|
||
<table style="border-collapse: collapse; width: 100%;">
|
||
<tr>
|
||
<th style="text-align: center;">ã¢ãã«å</th>
|
||
<th style="text-align: center;">CogVideoX-2B</th>
|
||
<th style="text-align: center;">CogVideoX-5B</th>
|
||
<th style="text-align: center;">CogVideoX-5B-I2V</th>
|
||
<th style="text-align: center;">CogVideoX1.5-5B</th>
|
||
<th style="text-align: center;">CogVideoX1.5-5B-I2V</th>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ãªãªãŒã¹æ¥</td>
|
||
<th style="text-align: center;">2024幎8æ6æ¥</th>
|
||
<th style="text-align: center;">2024幎8æ27æ¥</th>
|
||
<th style="text-align: center;">2024幎9æ19æ¥</th>
|
||
<th style="text-align: center;">2024幎11æ8æ¥</th>
|
||
<th style="text-align: center;">2024幎11æ8æ¥</th>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ãããªè§£å床</td>
|
||
<td colspan="3" style="text-align: center;">720 * 480</td>
|
||
<td colspan="1" style="text-align: center;">1360 * 768</td>
|
||
<td colspan="1" style="text-align: center;">256 <= W <=1360<br>256 <= H <=768<br> W,H % 16 == 0</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">æšè«ç²ŸåºŠ</td>
|
||
<td style="text-align: center;"><b>FP16*(æšå¥š)</b>, BF16, FP32, FP8*, INT8, INT4ã¯é察å¿</td>
|
||
<td colspan="2" style="text-align: center;"><b>BF16(æšå¥š)</b>, FP16, FP32, FP8*, INT8, INT4ã¯é察å¿</td>
|
||
<td colspan="2" style="text-align: center;"><b>BF16</b></td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ã·ã³ã°ã«GPUã¡ã¢ãªæ¶è²»</td>
|
||
<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB<br><b>diffusers FP16: 4GBãã*</b><br><b>diffusers INT8(torchao): 3.6GBãã*</b></td>
|
||
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB<br><b>diffusers BF16: 5GBãã*</b><br><b>diffusers INT8(torchao): 4.4GBãã*</b></td>
|
||
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB<br></td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ãã«ãGPUã¡ã¢ãªæ¶è²»</td>
|
||
<td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
|
||
<td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
|
||
<td colspan="2" style="text-align: center;"><b>ãµããŒããªã</b><br></td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">æšè«é床<br>(ã¹ãããæ° = 50, FP/BF16)</td>
|
||
<td style="text-align: center;">åäžA100: çŽ90ç§<br>åäžH100: çŽ45ç§</td>
|
||
<td colspan="2" style="text-align: center;">åäžA100: çŽ180ç§<br>åäžH100: çŽ90ç§</td>
|
||
<td colspan="2" style="text-align: center;">åäžA100: çŽ1000ç§(5ç§åç»)<br>åäžH100: çŽ550ç§(5ç§åç»)</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ããã³ããèšèª</td>
|
||
<td colspan="5" style="text-align: center;">è±èª*</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ããã³ããããŒã¯ã³å¶é</td>
|
||
<td colspan="3" style="text-align: center;">226ããŒã¯ã³</td>
|
||
<td colspan="2" style="text-align: center;">224ããŒã¯ã³</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ãããªã®é·ã</td>
|
||
<td colspan="3" style="text-align: center;">6ç§</td>
|
||
<td colspan="2" style="text-align: center;">5ç§ãŸãã¯10ç§</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ãã¬ãŒã ã¬ãŒã</td>
|
||
<td colspan="3" style="text-align: center;">8 ãã¬ãŒã / ç§</td>
|
||
<td colspan="2" style="text-align: center;">16 ãã¬ãŒã / ç§</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">äœçœ®ãšã³ã³ãŒãã£ã³ã°</td>
|
||
<td style="text-align: center;">3d_sincos_pos_embed</td>
|
||
<td style="text-align: center;">3d_rope_pos_embed</td>
|
||
<td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
|
||
<td style="text-align: center;">3d_rope_pos_embed</td>
|
||
<td style="text-align: center;">3d_rope_pos_embed</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ããŠã³ããŒããªã³ã¯ (Diffusers)</td>
|
||
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">ð€ HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">ð€ ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">ð£ WiseModel</a></td>
|
||
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">ð€ HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">ð€ ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">ð£ WiseModel</a></td>
|
||
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">ð€ HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">ð€ ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">ð£ WiseModel</a></td>
|
||
<td colspan="2" style="text-align: center;">è¿æ¥å
Ž</td>
|
||
</tr>
|
||
<tr>
|
||
<td style="text-align: center;">ããŠã³ããŒããªã³ã¯ (SAT)</td>
|
||
<td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
|
||
<td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">ð€ HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">ð€ ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">ð£ WiseModel</a></td>
|
||
</tr>
|
||
</table>
|
||
|
||
**ããŒã¿è§£èª¬**
|
||
|
||
+ diffusersã©ã€ãã©ãªã䜿çšããŠãã¹ãããéã«ã¯ã`diffusers`ã©ã€ãã©ãªãæäŸããå
šãŠã®æé©åãæå¹ã«ãªã£ãŠããŸãããã®æ¹æ³ã¯
|
||
**NVIDIA A100 / H100**以å€ã®ããã€ã¹ã§ã®ã¡ã¢ãª/ã¡ã¢ãªæ¶è²»ã®ãã¹ãã¯è¡ã£ãŠããŸãããéåžžããã®æ¹æ³ã¯**NVIDIA
|
||
Ampereã¢ãŒããã¯ãã£**
|
||
以äžã®å
šãŠã®ããã€ã¹ã«é©å¿ã§ããŸããæé©åãç¡å¹ã«ãããšãã¡ã¢ãªæ¶è²»ã¯åå¢ããããŒã¯ã¡ã¢ãªäœ¿çšéã¯è¡šã®3åã«ãªããŸãããé床ã¯çŽ3ã4ååäžããŸãã以äžã®æé©åãéšåçã«ç¡å¹ã«ããããšãå¯èœã§ã:
|
||
|
||
```
|
||
pipe.enable_sequential_cpu_offload()
|
||
pipe.vae.enable_slicing()
|
||
pipe.vae.enable_tiling()
|
||
```
|
||
|
||
+ ãã«ãGPUã§æšè«ããå Žåã`enable_sequential_cpu_offload()`æé©åãç¡å¹ã«ããå¿
èŠããããŸãã
|
||
+ INT8ã¢ãã«ã䜿çšãããšæšè«é床ãäœäžããŸãããããã¯ã¡ã¢ãªã®å°ãªãGPUã§æ£åžžã«æšè«ãè¡ãããããªå質ã®æ倱ãæå°éã«æããããã®æªçœ®ã§ããæšè«é床ã¯å€§å¹
ã«äœäžããŸãã
|
||
+ CogVideoX-2Bã¢ãã«ã¯`FP16`粟床ã§ãã¬ãŒãã³ã°ãããŠãããCogVideoX-5Bã¢ãã«ã¯`BF16`
|
||
粟床ã§ãã¬ãŒãã³ã°ãããŠããŸããæšè«æã«ã¯ã¢ãã«ããã¬ãŒãã³ã°ããã粟床ã䜿çšããããšããå§ãããŸãã
|
||
+ [PytorchAO](https://github.com/pytorch/ao)ããã³[Optimum-quanto](https://github.com/huggingface/optimum-quanto/)
|
||
ã¯ãCogVideoXã®ã¡ã¢ãªèŠä»¶ãåæžããããã«ããã¹ããšã³ã³ãŒãããã©ã³ã¹ãã©ãŒããããã³VAEã¢ãžã¥ãŒã«ãéååããããã«äœ¿çšã§ããŸããããã«ãããç¡æã®T4
|
||
Colabãããå°ãªãã¡ã¢ãªã®GPUã§ã¢ãã«ãå®è¡ããããšãå¯èœã«ãªããŸããåæ§ã«éèŠãªã®ã¯ãTorchAOã®éååã¯`torch.compile`
|
||
ãšå®å
šã«äºææ§ããããæšè«é床ã倧å¹
ã«åäžãããããšãã§ããç¹ã§ãã`NVIDIA H100`ããã³ãã以äžã®ããã€ã¹ã§ã¯`FP8`
|
||
粟床ã䜿çšããå¿
èŠããããŸããããã«ã¯ã`torch`ã`torchao`ã`diffusers`ã`accelerate`
|
||
Pythonããã±ãŒãžã®ãœãŒã¹ã³ãŒãããã®ã€ã³ã¹ããŒã«ãå¿
èŠã§ãã`CUDA 12.4`ã®äœ¿çšããå§ãããŸãã
|
||
+ æšè«é床ãã¹ããåæ§ã«ãäžèšã®ã¡ã¢ãªæé©åæ¹æ³ã䜿çšããŠããŸããã¡ã¢ãªæé©åã䜿çšããªãå Žåãæšè«é床ã¯çŽ10ïŒ
åäžããŸãã
|
||
`diffusers`ããŒãžã§ã³ã®ã¢ãã«ã®ã¿ãéååããµããŒãããŠããŸãã
|
||
+ ã¢ãã«ã¯è±èªå
¥åã®ã¿ããµããŒãããŠãããä»ã®èšèªã¯å€§èŠæš¡ã¢ãã«ã®æ¹åãéããŠè±èªã«ç¿»èš³ã§ããŸãã
|
||
+ ã¢ãã«ã®ãã¡ã€ã³ãã¥ãŒãã³ã°ã«äœ¿çšãããã¡ã¢ãªã¯`8 * H100`ç°å¢ã§ãã¹ããããŠããŸããããã°ã©ã ã¯èªåçã«`Zero 2`
|
||
æé©åã䜿çšããŠããŸããè¡šã«å
·äœçãªGPUæ°ãèšèŒãããŠããå Žåããã¡ã€ã³ãã¥ãŒãã³ã°ã«ã¯ãã®æ°ä»¥äžã®GPUãå¿
èŠã§ãã
|
||
|
||
## å奜çãªã³ã¯
|
||
|
||
ã³ãã¥ããã£ããã®è²¢ç®ã倧æè¿ããç§ãã¡ããªãŒãã³ãœãŒã¹ã³ãã¥ããã£ã«ç©æ¥µçã«è²¢ç®ããŠããŸãã以äžã®äœåã¯ãã§ã«CogVideoXã«å¯Ÿå¿ããŠããããã²ãå©çšãã ããïŒ
|
||
|
||
+ [CogVideoX-Fun](https://github.com/aigc-apps/CogVideoX-Fun):
|
||
CogVideoX-Funã¯ãCogVideoXã¢ãŒããã¯ãã£ãåºã«ããæ¹è¯ãã€ãã©ã€ã³ã§ãèªç±ãªè§£å床ãšè€æ°ã®èµ·åæ¹æ³ããµããŒãããŠããŸãã
|
||
+ [CogStudio](https://github.com/pinokiofactory/cogstudio): CogVideo ã® Gradio Web UI ã®å¥ã®ãªããžããªãããé«æ©èœãª Web
|
||
UI ããµããŒãããŸãã
|
||
+ [Xorbits Inference](https://github.com/xorbitsai/inference):
|
||
匷åã§å
æ¬çãªåæ£æšè«ãã¬ãŒã ã¯ãŒã¯ã§ãããã¯ã³ã¯ãªãã¯ã§ç¬èªã®ã¢ãã«ãææ°ã®ãªãŒãã³ãœãŒã¹ã¢ãã«ãç°¡åã«ãããã€ã§ããŸãã
|
||
+ [ComfyUI-CogVideoXWrapper](https://github.com/kijai/ComfyUI-CogVideoXWrapper)
|
||
ComfyUIãã¬ãŒã ã¯ãŒã¯ã䜿çšããŠãCogVideoXãã¯ãŒã¯ãããŒã«çµ±åããŸãã
|
||
+ [VideoSys](https://github.com/NUS-HPC-AI-Lab/VideoSys): VideoSysã¯ã䜿ããããé«æ§èœãªãããªçæã€ã³ãã©ãæäŸããææ°ã®ã¢ãã«ãæè¡ãç¶ç¶çã«çµ±åããŠããŸãã
|
||
+ [AutoDLã€ã¡ãŒãž](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): ã³ãã¥ããã£ã¡ã³ããŒãæäŸããHuggingface
|
||
Spaceã€ã¡ãŒãžã®ã¯ã³ã¯ãªãã¯ãããã€ã¡ã³ãã
|
||
+ [ã€ã³ããªã¢ãã¶ã€ã³åŸ®èª¿æŽã¢ãã«](https://huggingface.co/collections/bertjiazheng/koolcogvideox-66e4762f53287b7f39f8f3ba):
|
||
ã¯ãCogVideoXãåºç€ã«ãã埮調æŽã¢ãã«ã§ãã€ã³ããªã¢ãã¶ã€ã³å°çšã«èšèšãããŠããŸãã
|
||
+ [xDiT](https://github.com/xdit-project/xDiT):
|
||
xDiTã¯ãè€æ°ã®GPUã¯ã©ã¹ã¿ãŒäžã§DiTsã䞊åæšè«ããããã®ãšã³ãžã³ã§ããxDiTã¯ãªã¢ã«ã¿ã€ã ã®ç»åããã³ãããªçæãµãŒãã¹ããµããŒãããŠããŸãã
|
||
+ [CogVideoX-Interpolation](https://github.com/feizc/CogvideX-Interpolation):
|
||
ããŒãã¬ãŒã è£éçæã«ãããŠããã倧ããªæè»æ§ãæäŸããããšãç®çãšãããCogVideoXæ§é ãåºã«ããä¿®æ£çã®ãã€ãã©ã€ã³ã
|
||
+ [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio): DiffSynth
|
||
Studioã¯ãæ¡æ£ãšã³ãžã³ã§ããããã¹ããšã³ã³ãŒããŒãUNetãVAEãªã©ãå«ãã¢ãŒããã¯ãã£ãåæ§ç¯ãããªãŒãã³ãœãŒã¹ã³ãã¥ããã£ã¢ãã«ãšã®äºææ§ãç¶æãã€ã€ãèšç®æ§èœãåäžãããŸããããã®ãã¬ãŒã ã¯ãŒã¯ã¯CogVideoXã«é©å¿ããŠããŸãã
|
||
|
||
## ãããžã§ã¯ãæ§é
|
||
|
||
ãã®ãªãŒãã³ãœãŒã¹ãªããžããªã¯ã**CogVideoX** ãªãŒãã³ãœãŒã¹ã¢ãã«ã®åºæ¬çãªäœ¿çšæ¹æ³ãšåŸ®èª¿æŽã®äŸãè¿
éã«éå§ããããã®ã¬ã€ãã§ãã
|
||
|
||
### Colabã§ã®ã¯ã€ãã¯ã¹ã¿ãŒã
|
||
|
||
ç¡æã®Colab T4äžã§çŽæ¥å®è¡ã§ãã3ã€ã®ãããžã§ã¯ããæäŸããŠããŸãã
|
||
|
||
+ [CogVideoX-5B-T2V-Colab.ipynb](https://colab.research.google.com/drive/1pCe5s0bC_xuXbBlpvIH1z0kfdTLQPzCS?usp=sharing):
|
||
CogVideoX-5B ããã¹ããããããªãžã®çæçšColabã³ãŒãã
|
||
+ [CogVideoX-5B-T2V-Int8-Colab.ipynb](https://colab.research.google.com/drive/1DUffhcjrU-uz7_cpuJO3E_D4BaJT7OPa?usp=sharing):
|
||
CogVideoX-5B ããã¹ããããããªãžã®éååæšè«çšColabã³ãŒãã1åã®å®è¡ã«çŽ30åããããŸãã
|
||
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
|
||
CogVideoX-5B ç»åãããããªãžã®çæçšColabã³ãŒãã
|
||
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing):
|
||
CogVideoX-5B ãããªãããããªãžã®çæçšColabã³ãŒãã
|
||
|
||
### Inference
|
||
|
||
+ [cli_demo](inference/cli_demo.py): æšè«ã³ãŒãã®è©³çŽ°ãªèª¬æãå«ãŸããŠãããäžè¬çãªãã©ã¡ãŒã¿ã®æå³ã«ã€ããŠãèšåããŠããŸãã
|
||
+ [cli_demo_quantization](inference/cli_demo_quantization.py):
|
||
éååã¢ãã«æšè«ã³ãŒãã§ãäœã¡ã¢ãªã®ããã€ã¹ã§ãå®è¡å¯èœã§ãããŸãããã®ã³ãŒããå€æŽããŠãFP8 粟床㮠CogVideoX
|
||
ã¢ãã«ã®å®è¡ããµããŒãããããšãã§ããŸãã
|
||
+ [diffusers_vae_demo](inference/cli_vae_demo.py): VAEæšè«ã³ãŒãã®å®è¡ã«ã¯çŸåš71GBã®ã¡ã¢ãªãå¿
èŠã§ãããå°æ¥çã«ã¯æé©åãããäºå®ã§ãã
|
||
+ [space demo](inference/gradio_composite_demo): Huggingface SpaceãšåãGUIã³ãŒãã§ããã¬ãŒã è£éãè¶
解åããŒã«ãçµã¿èŸŒãŸããŠããŸãã
|
||
|
||
<div style="text-align: center;">
|
||
<img src="resources/web_demo.png" style="width: 100%; height: auto;" />
|
||
</div>
|
||
|
||
+ [convert_demo](inference/convert_demo.py):
|
||
ãŠãŒã¶ãŒå
¥åãCogVideoXã«é©ãã圢åŒã«å€æããæ¹æ³ãCogVideoXã¯é·ããã£ãã·ã§ã³ã§ãã¬ãŒãã³ã°ãããŠãããããå
¥åããã¹ããLLMã䜿çšããŠãã¬ãŒãã³ã°ååžãšäžèŽãããå¿
èŠããããŸããããã©ã«ãã§ã¯GLM-4ã䜿çšããŸãããGPTãGeminiãªã©ã®ä»ã®LLMã«çœ®ãæããããšãã§ããŸãã
|
||
+ [gradio_web_demo](inference/gradio_web_demo.py): CogVideoX-2B / 5B ã¢ãã«ã䜿çšããŠåç»ãçæããæ¹æ³ã瀺ããã·ã³ãã«ãª
|
||
Gradio Web UI ãã¢ã§ããç§ãã¡ã® Huggingface Space ãšåæ§ã«ããã®ã¹ã¯ãªããã䜿çšã㊠Web ãã¢ãèµ·åããããšãã§ããŸãã
|
||
|
||
### finetune
|
||
|
||
+ [train_cogvideox_lora](finetune/README_ja.md): CogVideoX diffusers 埮調æŽæ¹æ³ã®è©³çŽ°ãªèª¬æãå«ãŸããŠããŸãããã®ã³ãŒãã䜿çšããŠãèªåã®ããŒã¿ã»ããã§
|
||
CogVideoX ã埮調æŽããããšãã§ããŸãã
|
||
|
||
### sat
|
||
|
||
+ [sat_demo](sat/README.md):
|
||
SATãŠã§ã€ãã®æšè«ã³ãŒããšåŸ®èª¿æŽã³ãŒããå«ãŸããŠããŸããCogVideoXã¢ãã«æ§é ã«åºã¥ããŠæ¹åããããšããå§ãããŸããé©æ°çãªç 究è
ã¯ããã®ã³ãŒãã䜿çšããŠè¿
éãªã¹ã¿ããã³ã°ãšéçºãè¡ãããšãã§ããŸãã
|
||
|
||
### ããŒã«
|
||
|
||
ãã®ãã©ã«ãã«ã¯ãã¢ãã«å€æ/ãã£ãã·ã§ã³çæãªã©ã®ããŒã«ãå«ãŸããŠããŸãã
|
||
|
||
+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): SAT ã¢ãã«ã®éã¿ã Huggingface ã¢ãã«ã®éã¿ã«å€æããŸãã
|
||
+ [caption_demo](tools/caption/README_ja.md): Caption ããŒã«ããããªãç解ããŠããã¹ãã§åºåããã¢ãã«ã
|
||
+ [export_sat_lora_weight](tools/export_sat_lora_weight.py): SAT ãã¡ã€ã³ãã¥ãŒãã³ã°ã¢ãã«ã®ãšã¯ã¹ããŒãããŒã«ãSAT Lora
|
||
Adapter ã diffusers 圢åŒã§ãšã¯ã¹ããŒãããŸãã
|
||
+ [load_cogvideox_lora](tools/load_cogvideox_lora.py): diffusers çã®ãã¡ã€ã³ãã¥ãŒãã³ã°ããã Lora Adapter
|
||
ãããŒãããããã®ããŒã«ã³ãŒãã
|
||
+ [llm_flux_cogvideox](tools/llm_flux_cogvideox/llm_flux_cogvideox.py): ãªãŒãã³ãœãŒã¹ã®ããŒã«ã«å€§èŠæš¡èšèªã¢ãã« +
|
||
Flux + CogVideoX ã䜿çšããŠèªåçã«åç»ãçæããŸãã
|
||
+ [parallel_inference_xdit](tools/parallel_inference/parallel_inference_xdit.py)ïŒ
|
||
[xDiT](https://github.com/xdit-project/xDiT)
|
||
ã«ãã£ãŠãµããŒãããããããªçæããã»ã¹ãè€æ°ã® GPU ã§äžŠååããŸãã
|
||
+ [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory): CogVideoXã®äœã³ã¹ã埮調æŽãã¬ãŒã ã¯ãŒã¯ã§ã
|
||
`diffusers`ããŒãžã§ã³ã®ã¢ãã«ã«é©å¿ããŠããŸããããå€ãã®è§£å床ã«å¯Ÿå¿ããåäžã®4090 GPUã§CogVideoX-5Bã®åŸ®èª¿æŽãå¯èœã§ãã
|
||
|
||
## CogVideo(ICLR'23)
|
||
|
||
è«æã®å
¬åŒãªããžããª: [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
|
||
㯠[CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo) ã«ãããŸãã
|
||
|
||
**CogVideoã¯æ¯èŒçé«ãã¬ãŒã ã¬ãŒãã®ãããªãçæããããšãã§ããŸãã**
|
||
32ãã¬ãŒã ã®4ç§éã®ã¯ãªããã以äžã«ç€ºãããŠããŸãã
|
||
|
||

|
||
|
||

|
||
<div align="center">
|
||
<video src="https://github.com/user-attachments/assets/2fa19651-e925-4a2a-b8d6-b3f216d490ba" width="80%" controls autoplay></video>
|
||
</div>
|
||
|
||
|
||
CogVideoã®ãã¢ã¯ [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/) ã§äœéšã§ããŸãã
|
||
*å
ã®å
¥åã¯äžåœèªã§ãã*
|
||
|
||
## åŒçš
|
||
|
||
ð ç§ãã¡ã®ä»äºã圹ç«ã€ãšæãããå Žåããã²ã¹ã¿ãŒãä»ããŠããã ããè«æãåŒçšããŠãã ããã
|
||
|
||
```
|
||
@article{yang2024cogvideox,
|
||
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
|
||
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
|
||
journal={arXiv preprint arXiv:2408.06072},
|
||
year={2024}
|
||
}
|
||
@article{hong2022cogvideo,
|
||
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
|
||
author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
|
||
journal={arXiv preprint arXiv:2205.15868},
|
||
year={2022}
|
||
}
|
||
```
|
||
|
||
ããªãã®è²¢ç®ããåŸ
ã¡ããŠããŸãïŒè©³çŽ°ã¯[ãã¡ã](resources/contribute_ja.md)ãã¯ãªãã¯ããŠãã ããã
|
||
|
||
## ã©ã€ã»ã³ã¹å¥çŽ
|
||
|
||
ãã®ãªããžããªã®ã³ãŒã㯠[Apache 2.0 License](LICENSE) ã®äžã§å
¬éãããŠããŸãã
|
||
|
||
CogVideoX-2B ã¢ãã« (察å¿ããTransformersã¢ãžã¥ãŒã«ãVAEã¢ãžã¥ãŒã«ãå«ã) ã¯
|
||
[Apache 2.0 License](LICENSE) ã®äžã§å
¬éãããŠããŸãã
|
||
|
||
CogVideoX-5B ã¢ãã«ïŒTransformers ã¢ãžã¥ãŒã«ãç»åçæãããªãšããã¹ãçæãããªã®ããŒãžã§ã³ãå«ãïŒ ã¯
|
||
[CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE) ã®äžã§å
¬éãããŠããŸãã
|