diff --git a/README.md b/README.md
index a27782c..4e83ec2 100644
--- a/README.md
+++ b/README.md
@@ -22,7 +22,10 @@ Experience the CogVideoX-5B model online at CogVideoX-2B
CogVideoX-5B |
CogVideoX-5B-I2V |
+ CogVideoX1.5-5B |
+ CogVideoX1.5-5B-I2V |
- Model Description |
- Entry-level model, balancing compatibility. Low cost for running and secondary development. |
- Larger model with higher video generation quality and better visual effects. |
- CogVideoX-5B image-to-video version. |
+ Release Date |
+ August 6, 2024 |
+ August 27, 2024 |
+ September 19, 2024 |
+ November 8, 2024 |
+ November 8, 2024 |
+
+
+ Video Resolution |
+ 720 * 480 |
+ 1360 * 768 |
+ 256 <= W <= 1360, 256 <= H <= 768, W,H % 16 == 0 |
Inference Precision |
FP16*(recommended), BF16, FP32, FP8*, INT8, not supported: INT4 |
- BF16 (recommended), FP16, FP32, FP8*, INT8, not supported: INT4 |
+ BF16 (recommended), FP16, FP32, FP8*, INT8, not supported: INT4 |
+ BF16 |
- Single GPU Memory Usage
|
- SAT FP16: 18GB diffusers FP16: from 4GB* diffusers INT8 (torchao): from 3.6GB* |
- SAT BF16: 26GB diffusers BF16: from 5GB* diffusers INT8 (torchao): from 4.4GB* |
+ Single GPU Memory Usage |
+ SAT FP16: 18GB diffusers FP16: from 4GB* diffusers INT8 (torchao): from 3.6GB* |
+ SAT BF16: 26GB diffusers BF16: from 5GB* diffusers INT8 (torchao): from 4.4GB* |
+ SAT BF16: 66GB
|
- Multi-GPU Inference Memory Usage |
+ Multi-GPU Memory Usage |
FP16: 10GB* using diffusers
|
BF16: 15GB* using diffusers
|
+ Not supported
|
Inference Speed (Step = 50, FP/BF16) |
Single A100: ~90 seconds Single H100: ~45 seconds |
Single A100: ~180 seconds Single H100: ~90 seconds |
-
-
- Fine-tuning Precision |
- FP16 |
- BF16 |
-
-
- Fine-tuning Memory Usage |
- 47 GB (bs=1, LORA) 61 GB (bs=2, LORA) 62GB (bs=1, SFT) |
- 63 GB (bs=1, LORA) 80 GB (bs=2, LORA) 75GB (bs=1, SFT)
|
- 78 GB (bs=1, LORA) 75GB (bs=1, SFT, 16GPU)
|
+ Single A100: ~1000 seconds (5-second video) Single H100: ~550 seconds (5-second video) |
Prompt Language |
- English* |
+ English* |
- Maximum Prompt Length |
+ Prompt Token Limit |
226 Tokens |
+ 224 Tokens |
Video Length |
- 6 Seconds |
+ 6 seconds |
+ 5 or 10 seconds |
Frame Rate |
- 8 Frames / Second |
+ 8 frames / second |
+ 16 frames / second |
- Video Resolution |
- 720 x 480, no support for other resolutions (including fine-tuning) |
-
-
- Position Encoding |
+ Positional Encoding |
3d_sincos_pos_embed |
3d_sincos_pos_embed |
3d_rope_pos_embed + learnable_pos_embed |
+ 3d_sincos_pos_embed |
+ 3d_rope_pos_embed + learnable_pos_embed |
Download Link (Diffusers) |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
+ Coming Soon |
Download Link (SAT) |
- SAT |
+ SAT |
+ 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
@@ -422,7 +430,7 @@ hands-on practice on text-to-video generation. *The original input is in Chinese
We welcome your contributions! You can click [here](resources/contribute.md) for more information.
-## License Agreement
+## Model License
The code in this repository is released under the [Apache 2.0 License](LICENSE).
diff --git a/README_ja.md b/README_ja.md
index 69b46b6..aa7ae37 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -1,6 +1,6 @@
# CogVideo & CogVideoX
-[Read this in English](./README_zh.md)
+[Read this in English](./README.md)
[中文阅读](./README_zh.md)
@@ -22,9 +22,14 @@
## 更新とニュース
-- 🔥🔥 **ニュース**: ```2024/10/13```: コスト削減のため、単一の4090 GPUで`CogVideoX-5B`
+- 🔥🔥 **ニュース**: ```2024/11/08```: `CogVideoX1.5` モデルをリリースしました。CogVideoX1.5 は CogVideoX オープンソースモデルのアップグレードバージョンです。
+CogVideoX1.5-5B シリーズモデルは、10秒長の動画とより高い解像度をサポートしており、`CogVideoX1.5-5B-I2V` は任意の解像度での動画生成に対応しています。
+SAT コードはすでに更新されており、`diffusers` バージョンは現在適応中です。
+SAT バージョンのコードは [こちら](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) からダウンロードできます。
+- 🔥 **ニュース**: ```2024/10/13```: コスト削減のため、単一の4090 GPUで`CogVideoX-5B`
を微調整できるフレームワーク [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory)
- がリリースされました。複数の解像度での微調整に対応しています。ぜひご利用ください！- 🔥**ニュース**: ```2024/10/10```:
+ がリリースされました。複数の解像度での微調整に対応しています。ぜひご利用ください！
+- 🔥**ニュース**: ```2024/10/10```:
技術報告書を更新し、より詳細なトレーニング情報とデモを追加しました。
-- 🔥 **ニュース**: ```2024/10/10```: 技術報告書を更新しました。[こちら](https://arxiv.org/pdf/2408.06072)
をクリックしてご覧ください。さらにトレーニングの詳細とデモを追加しました。デモを見るには[こちら](https://yzy-thu.github.io/CogVideoX-demo/)
@@ -34,7 +39,7 @@
- 🔥**ニュース**: ```2024/9/19```: CogVideoXシリーズの画像生成ビデオモデル **CogVideoX-5B-I2V** をオープンソース化しました。このモデルは、画像を背景入力として使用し、プロンプトワードと組み合わせてビデオを生成することができ、より高い制御性を提供します。これにより、CogVideoXシリーズのモデルは、テキストからビデオ生成、ビデオの継続、画像からビデオ生成の3つのタスクをサポートするようになりました。オンラインでの[体験](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space)をお楽しみください。
-- 🔥🔥 **ニュース**: ```2024/9/19```:
+- 🔥 **ニュース**: ```2024/9/19```:
CogVideoXのトレーニングプロセスでビデオデータをテキスト記述に変換するために使用されるキャプションモデル [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption)
をオープンソース化しました。ダウンロードしてご利用ください。
- 🔥 ```2024/8/27```: CogVideoXシリーズのより大きなモデル **CogVideoX-5B**
@@ -63,11 +68,10 @@
- [プロジェクト構造](#プロジェクト構造)
- [推論](#推論)
  - [sat](#sat)
-  - [ツール](#ツール)
-- [プロジェクト計画](#プロジェクト計画)
-- [モデルライセンス](#モデルライセンス)
+  - [ツール](#ツール)
- [CogVideo(ICLR'23)モデル紹介](#CogVideoICLR23)
- [引用](#引用)
+- [ライセンス契約](#ライセンス契約)
## クイックスタート
@@ -156,79 +160,91 @@ pip install -r requirements.txt
CogVideoXは、[清影](https://chatglm.cn/video?fr=osm_cogvideox) と同源のオープンソース版ビデオ生成モデルです。
以下の表に、提供しているビデオ生成モデルの基本情報を示します:
-
+
モデル名 |
CogVideoX-2B |
CogVideoX-5B |
- CogVideoX-5B-I2V |
+ CogVideoX-5B-I2V |
+ CogVideoX1.5-5B |
+ CogVideoX1.5-5B-I2V |
+
+
+ リリース日 |
+ 2024年8月6日 |
+ 2024年8月27日 |
+ 2024年9月19日 |
+ 2024年11月8日 |
+ 2024年11月8日 |
+
+
+ ビデオ解像度 |
+ 720 * 480 |
+ 1360 * 768 |
+ 256 <= W <= 1360, 256 <= H <= 768, W,H % 16 == 0 |
推論精度 |
FP16*(推奨), BF16, FP32, FP8*, INT8, INT4は非対応 |
BF16(推奨), FP16, FP32, FP8*, INT8, INT4は非対応 |
-
-
- 単一GPUのメモリ消費
 |
- SAT FP16: 18GB diffusers FP16: 4GBから* diffusers INT8(torchao): 3.6GBから* |
- SAT BF16: 26GB diffusers BF16: 5GBから* diffusers INT8(torchao): 4.4GBから* |
-
-
- マルチGPUのメモリ消費 |
- FP16: 10GB* using diffusers
 |
- BF16: 15GB* using diffusers
 |
-
-
- 推論速度 (ステップ = 50, FP/BF16) |
- 単一A100: 約90秒 単一H100: 約45秒 |
- 単一A100: 約180秒 単一H100: 約90秒 |
-
-
- ファインチューニング精度 |
- FP16 |
BF16 |
- ファインチューニング時のメモリ消費 |
- 47 GB (bs=1, LORA) 61 GB (bs=2, LORA) 62GB (bs=1, SFT) |
- 63 GB (bs=1, LORA) 80 GB (bs=2, LORA) 75GB (bs=1, SFT)
 |
- 78 GB (bs=1, LORA) 75GB (bs=1, SFT, 16GPU)
 |
+ シングルGPUメモリ消費 |
+ SAT FP16: 18GB diffusers FP16: 4GBから* diffusers INT8(torchao): 3.6GBから* |
+ SAT BF16: 26GB diffusers BF16: 5GBから* diffusers INT8(torchao): 4.4GBから* |
+ SAT BF16: 66GB
|
+
+
+ マルチGPUメモリ消費 |
+ FP16: 10GB* using diffusers
|
+ BF16: 15GB* using diffusers
|
+ サポートなし
 |
+
+
+ 推論速度 (ステップ数 = 50, FP/BF16) |
+ 単一A100: 約90秒 単一H100: 約45秒 |
+ 単一A100: 約180秒 単一H100: 約90秒 |
+ 単一A100: 約1000秒(5秒動画) 単一H100: 約550秒(5秒動画) |
プロンプト言語 |
- 英語* |
+ 英語* |
- プロンプトの最大トークン数 |
+ プロンプトトークン制限 |
226トークン |
+ 224トークン |
ビデオの長さ |
6秒 |
+ 5秒または10秒 |
フレームレート |
- 8フレーム/秒 |
-
-
- ビデオ解像度 |
- 720 * 480、他の解像度は非対応(ファインチューニング含む) |
+ 8 フレーム / 秒 |
+ 16 フレーム / 秒 |
位置エンコーディング |
3d_sincos_pos_embed |
3d_sincos_pos_embed |
3d_rope_pos_embed + learnable_pos_embed |
+ 3d_sincos_pos_embed |
+ 3d_rope_pos_embed + learnable_pos_embed |
ダウンロードリンク (Diffusers) |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
+ 近日公開 |
ダウンロードリンク (SAT) |
- SAT |
+ SAT |
+ 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
diff --git a/README_zh.md b/README_zh.md
index 9f84f84..3574e7d 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -1,10 +1,9 @@
# CogVideo & CogVideoX
-[Read this in English](./README_zh.md)
+[Read this in English](./README.md)
[日本語で読む](./README_ja.md)
-
@@ -23,7 +22,9 @@
## 项目更新
-- 🔥🔥 **News**: ```2024/10/13```: 成本更低，单卡4090可微调`CogVideoX-5B`
+- 🔥🔥 **News**: ```2024/11/08```: 我们发布 `CogVideoX1.5` 模型。CogVideoX1.5 是 CogVideoX 开源模型的升级版本。
+CogVideoX1.5-5B 系列模型支持 **10秒** 长度的视频和更高的分辨率，其中 `CogVideoX1.5-5B-I2V` 支持 **任意分辨率** 的视频生成，SAT代码已经更新。`diffusers`版本还在适配中。SAT版本代码前往 [这里](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT) 下载。
+- 🔥**News**: ```2024/10/13```: 成本更低，单卡4090可微调 `CogVideoX-5B`
的微调框架[cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory)已经推出，多种分辨率微调，欢迎使用。
- 🔥 **News**: ```2024/10/10```: 我们更新了我们的技术报告,请点击 [这里](https://arxiv.org/pdf/2408.06072)
查看，附上了更多的训练细节和demo，关于demo，点击[这里](https://yzy-thu.github.io/CogVideoX-demo/) 查看。
@@ -58,10 +59,9 @@
- [Inference](#inference)
- [SAT](#sat)
- [Tools](#tools)
-- [开源项目规划](#开源项目规划)
-- [模型协议](#模型协议)
- [CogVideo(ICLR'23)模型介绍](#cogvideoiclr23)
- [引用](#引用)
+- [模型协议](#模型协议)
## 快速开始
@@ -157,62 +157,72 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
CogVideoX-2B |
CogVideoX-5B |
CogVideoX-5B-I2V |
+ CogVideoX1.5-5B |
+ CogVideoX1.5-5B-I2V |
+
+
+ 发布时间 |
+ 2024年8月6日 |
+ 2024年8月27日 |
+ 2024年9月19日 |
+ 2024年11月8日 |
+ 2024年11月8日 |
+
+
+ 视频分辨率 |
+ 720 * 480 |
+ 1360 * 768 |
+ 256 <= W <= 1360, 256 <= H <= 768, W,H % 16 == 0 |
推理精度 |
FP16*(推荐), BF16, FP32，FP8*，INT8，不支持INT4 |
BF16(推荐), FP16, FP32，FP8*，INT8，不支持INT4 |
+ BF16 |
单GPU显存消耗
 |
SAT FP16: 18GB diffusers FP16: 4GB起* diffusers INT8(torchao): 3.6G起* |
SAT BF16: 26GB diffusers BF16: 5GB起* diffusers INT8(torchao): 4.4G起* |
+ SAT BF16: 66GB
|
多GPU推理显存消耗 |
FP16: 10GB* using diffusers
|
BF16: 15GB* using diffusers
|
+ 不支持
|
推理速度 (Step = 50, FP/BF16) |
单卡A100: ~90秒 单卡H100: ~45秒 |
单卡A100: ~180秒 单卡H100: ~90秒 |
-
-
- 微调精度 |
- FP16 |
- BF16 |
-
-
- 微调显存消耗 |
- 47 GB (bs=1, LORA) 61 GB (bs=2, LORA) 62GB (bs=1, SFT) |
- 63 GB (bs=1, LORA) 80 GB (bs=2, LORA) 75GB (bs=1, SFT)
|
- 78 GB (bs=1, LORA) 75GB (bs=1, SFT, 16GPU)
|
+ 单卡A100: ~1000秒(5秒视频) 单卡H100: ~550秒(5秒视频) |
提示词语言 |
- English* |
+ English* |
提示词长度上限 |
226 Tokens |
+ 224 Tokens |
视频长度 |
6 秒 |
+ 5 秒 或 10 秒 |
帧率 |
8 帧 / 秒 |
+ 16 帧 / 秒 |
- 视频分辨率 |
- 720 * 480，不支持其他分辨率(含微调) |
-
-
位置编码 |
3d_sincos_pos_embed |
- 3d_sincos_pos_embed |
+ 3d_sincos_pos_embed |
+ 3d_rope_pos_embed + learnable_pos_embed |
+ 3d_sincos_pos_embed |
3d_rope_pos_embed + learnable_pos_embed |
@@ -220,10 +230,13 @@ CogVideoX是 [清影](https://chatglm.cn/video?fr=osm_cogvideox) 同源的开源
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
+ 即将推出 |
下载链接 (SAT) |
SAT |
+ 🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel |
+
diff --git a/sat/README.md b/sat/README.md
index 48c4552..c67e15c 100644
--- a/sat/README.md
+++ b/sat/README.md
@@ -1,29 +1,39 @@
-# SAT CogVideoX-2B
+# SAT CogVideoX
-[中文阅读](./README_zh.md)
+[中文阅读](./README_zh.md)
[日本語で読む](./README_ja.md)
-This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
-fine-tuning code for SAT weights.
+This folder contains inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights, along with fine-tuning code for SAT weights.
-This code is the framework used by the team to train the model. It has few comments and requires careful study.
+This code framework was used by our team during model training. There are few comments, so careful study is required.
## Inference Model
-### 1. Ensure that you have correctly installed the dependencies required by this folder.
+### 1. Make sure you have installed all dependencies in this folder
-```shell
+```
pip install -r requirements.txt
```
-### 2. Download the model weights
+### 2. Download the Model Weights
-### 2. Download model weights
+First, download the model weights from the SAT mirror.
-First, go to the SAT mirror to download the model weights. For the CogVideoX-2B model, please download as follows:
+#### CogVideoX1.5 Model
-```shell
+```
+git lfs install
+git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
+```
+
+This command downloads three models: Transformers, VAE, and T5 Encoder.
+
+#### CogVideoX Model
+
+For the CogVideoX-2B model, download as follows:
+
+```
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
@@ -34,13 +44,12 @@ mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```
-For the CogVideoX-5B model, please download the `transformers` file as follows link:
-(VAE files are the same as 2B)
+Download the `transformers` file for the CogVideoX-5B model (the VAE file is the same as for 2B):
+ [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)
+ [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list)
-Next, you need to format the model files as follows:
+Arrange the model files in the following structure:
```
.
@@ -52,20 +61,24 @@ Next, you need to format the model files as follows:
âââ 3d-vae.pt
```
-Due to large size of model weight file, using `git lfs` is recommended. Installation of `git lfs` can be
-found [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)
+Since model weight files are large, it's recommended to use `git lfs`.
+See [here](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing) for `git lfs` installation.
-Next, clone the T5 model, which is not used for training and fine-tuning, but must be used.
-> T5 model is available on [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) as well.
+```
+git lfs install
+```
-```shell
-git clone https://huggingface.co/THUDM/CogVideoX-2b.git
+Next, clone the T5 model, which is used as an encoder and doesn't require training or fine-tuning.
+> You may also use the model file location on [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b).
+
+```
+git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Huggingface
+# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from Modelscope
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```
-By following the above approach, you will obtain a safetensor format T5 file. Ensure that there are no errors when
-loading it into Deepspeed in Finetune.
+This will yield a safetensor format T5 file that can be loaded without error during Deepspeed fine-tuning.
```
âââ added_tokens.json
@@ -80,11 +93,11 @@ loading it into Deepspeed in Finetune.
0 directories, 8 files
```
-### 3. Modify the file in `configs/cogvideox_2b.yaml`.
+### 3. Modify the `configs/cogvideox_*.yaml` file
```yaml
model:
- scale_factor: 1.15258426
+ scale_factor: 1.55258426
disable_first_stage_autocast: true
log_keys:
- txt
@@ -160,14 +173,14 @@ model:
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
- model_dir: "t5-v1_1-xxl" # Absolute path to the CogVideoX-2b/t5-v1_1-xxl weights folder
+ model_dir: "t5-v1_1-xxl" # absolute path to CogVideoX-2b/t5-v1_1-xxl weight folder
max_length: 226
first_stage_config:
target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
- ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # Absolute path to the CogVideoX-2b-sat/vae/3d-vae.pt folder
+ ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # absolute path to CogVideoX-2b-sat/vae/3d-vae.pt file
ignore_keys: [ 'loss' ]
loss_config:
@@ -239,48 +252,46 @@ model:
num_steps: 50
```
-### 4. Modify the file in `configs/inference.yaml`.
+### 4. Modify the `configs/inference.yaml` file
```yaml
args:
latent_channels: 16
mode: inference
- load: "{absolute_path/to/your}/transformer" # Absolute path to the CogVideoX-2b-sat/transformer folder
+ load: "{absolute_path/to/your}/transformer" # Absolute path to CogVideoX-2b-sat/transformer folder
# load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter
batch_size: 1
- input_type: txt # You can choose txt for pure text input, or change to cli for command line input
- input_file: configs/test.txt # Pure text file, which can be edited
- sampling_num_frames: 13 # Must be 13, 11 or 9
+ input_type: txt # You can choose "txt" for plain text input or change to "cli" for command-line input
+ input_file: configs/test.txt # Plain text file, can be edited
+ sampling_num_frames: 13 # For CogVideoX1.5-5B it must be 42 or 22. For CogVideoX-5B / 2B, it must be 13, 11, or 9.
sampling_fps: 8
fp16: True # For CogVideoX-2B
- # bf16: True # For CogVideoX-5B
+ # bf16: True # For CogVideoX-5B
output_dir: outputs/
force_inference: True
```
-+ Modify `configs/test.txt` if multiple prompts is required, in which each line makes a prompt.
-+ For better prompt formatting, refer to [convert_demo.py](../inference/convert_demo.py), for which you should set the
- OPENAI_API_KEY as your environmental variable.
-+ Modify `input_type` in `configs/inference.yaml` if use command line as prompt iuput.
++ If using a text file to save multiple prompts, modify `configs/test.txt` as needed. One prompt per line. If you are unsure how to write prompts, use [this code](../inference/convert_demo.py) to call an LLM for refinement.
++ To use command-line input, modify:
-```yaml
+```
input_type: cli
```
-This allows input from the command line as prompts.
+This allows you to enter prompts from the command line.
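The one-prompt-per-line convention for `configs/test.txt` can be sketched in a few lines of Python (a hypothetical helper for illustration, not code from this repo):

```python
def parse_prompts(text: str) -> list[str]:
    # One prompt per line; blank lines are ignored.
    return [line.strip() for line in text.splitlines() if line.strip()]

def load_prompts(path: str) -> list[str]:
    # Read a prompt file such as configs/test.txt.
    with open(path, encoding="utf-8") as f:
        return parse_prompts(f.read())
```

Each non-empty line becomes one generation request.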
-Change `output_dir` if you wish to modify the address of the output video
+To modify the output video location, change:
-```yaml
+```
output_dir: outputs/
```
-It is saved by default in the `.outputs/` folder.
+The default location is the `outputs/` folder.
-### 5. Run the inference code to perform inference.
+### 5. Run the Inference Code
-```shell
+```
bash inference.sh
```
@@ -288,95 +299,91 @@ bash inference.sh
### Preparing the Dataset
-The dataset format should be as follows:
+The dataset should be structured as follows:
```
.
âââ labels
-â  âââ 1.txt
-â  âââ 2.txt
-â  âââ ...
+â âââ 1.txt
+â âââ 2.txt
+â âââ ...
âââ videos
âââ 1.mp4
âââ 2.mp4
âââ ...
```
-Each text file shares the same name as its corresponding video, serving as the label for that video. Videos and labels
-should be matched one-to-one. Generally, a single video should not be associated with multiple labels.
+Each txt file should have the same name as the corresponding video file and contain the label for that video. The videos and labels should correspond one-to-one. Generally, avoid using one video with multiple labels.
-For style fine-tuning, please prepare at least 50 videos and labels with similar styles to ensure proper fitting.
+For style fine-tuning, prepare at least 50 videos and labels with a similar style to facilitate fitting.
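Since videos and labels must correspond one-to-one, a quick sanity check over the layout above can catch mismatches before training starts (a hypothetical helper, not part of the repo):

```python
from pathlib import Path

def missing_pairs(label_stems, video_stems):
    # Report stems that appear on only one side of the labels/videos layout.
    labels, videos = set(label_stems), set(video_stems)
    problems = [f"video {s}.mp4 has no label" for s in sorted(videos - labels)]
    problems += [f"label {s}.txt has no video" for s in sorted(labels - videos)]
    return problems

def check_dataset(root: str):
    # root contains labels/*.txt and videos/*.mp4 as shown above.
    labels = [p.stem for p in Path(root, "labels").glob("*.txt")]
    videos = [p.stem for p in Path(root, "videos").glob("*.mp4")]
    return missing_pairs(labels, videos)
```

An empty result means every video has exactly one label file.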
-### Modifying Configuration Files
+### Modifying the Configuration File
-We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Please note that both methods only fine-tune
-the `transformer` part and do not modify the `VAE` section. `T5` is used solely as an Encoder. Please modify
-the `configs/sft.yaml` (for full-parameter fine-tuning) file as follows:
+We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Note that both methods only fine-tune the `transformer` part. The `VAE` part is not modified, and `T5` is only used as an encoder.
+Modify the files in `configs/sft.yaml` (full fine-tuning) as follows:
-```
- # checkpoint_activations: True ## Using gradient checkpointing (Both checkpoint_activations in the config file need to be set to True)
+```yaml
+ # checkpoint_activations: True ## using gradient checkpointing (both `checkpoint_activations` in the config file need to be set to True)
model_parallel_size: 1 # Model parallel size
- experiment_name: lora-disney # Experiment name (do not modify)
- mode: finetune # Mode (do not modify)
- load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path
- no_load_rng: True # Whether to load random seed
+ experiment_name: lora-disney # Experiment name (do not change)
+ mode: finetune # Mode (do not change)
+ load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to Transformer model
+ no_load_rng: True # Whether to load random number seed
train_iters: 1000 # Training iterations
eval_iters: 1 # Evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Evaluation batch size
- save: ckpts # Model save path
- save_interval: 100 # Model save interval
+ save: ckpts # Model save path
+ save_interval: 100 # Save interval
log_interval: 20 # Log output interval
train_data: [ "your train data path" ]
- valid_data: [ "your val data path" ] # Training and validation datasets can be the same
- split: 1,0,0 # Training, validation, and test set ratio
- num_workers: 8 # Number of worker threads for data loader
- force_train: True # Allow missing keys when loading checkpoint (T5 and VAE are loaded separately)
- only_log_video_latents: True # Avoid memory overhead caused by VAE decode
+ valid_data: [ "your val data path" ] # Training and validation sets can be the same
+ split: 1,0,0 # Proportion for training, validation, and test sets
+ num_workers: 8 # Number of data loader workers
+ force_train: True # Allow missing keys when loading checkpoint (T5 and VAE loaded separately)
+ only_log_video_latents: True # Avoid memory usage from VAE decoding
deepspeed:
bf16:
- enabled: False # For CogVideoX-2B set to False and for CogVideoX-5B set to True
+ enabled: False # Set to False for CogVideoX-2B and True for CogVideoX-5B
fp16:
- enabled: True # For CogVideoX-2B set to True and for CogVideoX-5B set to False
+ enabled: True # Set to True for CogVideoX-2B and False for CogVideoX-5B
```
-If you wish to use Lora fine-tuning, you also need to modify the `cogvideox__lora` file:
+To use Lora fine-tuning, you also need to modify the `cogvideox__lora` file:
-Here, take `CogVideoX-2B` as a reference:
+Here's an example using `CogVideoX-2B`:
```
model:
- scale_factor: 1.15258426
+ scale_factor: 1.55258426
disable_first_stage_autocast: true
- not_trainable_prefixes: [ 'all' ] ## Uncomment
+ not_trainable_prefixes: [ 'all' ] ## Uncomment
log_keys:
- - txt'
+ - txt
- lora_config: ## Uncomment
+ lora_config: ## Uncomment
target: sat.model.finetune.lora2.LoraMixin
params:
r: 256
```
-### Modifying Run Scripts
+### Modify the Run Script
-Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` to select the configuration file. Below are two examples:
+Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` and select the config file. Below are two examples:
-1. If you want to use the `CogVideoX-2B` model and the `Lora` method, you need to modify `finetune_single_gpu.sh`
- or `finetune_multi_gpus.sh`:
+1. If you want to use the `CogVideoX-2B` model with `Lora`, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```
-2. If you want to use the `CogVideoX-2B` model and the `full-parameter fine-tuning` method, you need to
- modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`:
+2. If you want to use the `CogVideoX-2B` model with full fine-tuning, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```
-### Fine-Tuning and Evaluation
+### Fine-tuning and Validation
Run the inference code to start fine-tuning.
@@ -385,45 +392,42 @@ bash finetune_single_gpu.sh # Single GPU
bash finetune_multi_gpus.sh # Multi GPUs
```
-### Using the Fine-Tuned Model
+### Using the Fine-tuned Model
-The fine-tuned model cannot be merged; here is how to modify the inference configuration file `inference.sh`:
+The fine-tuned model cannot be merged. Here's how to modify the inference configuration file `inference.sh`:
```
-run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42"
+run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42"
```
-Then, execute the code:
+Then, run the code:
```
bash inference.sh
```
-### Converting to Huggingface Diffusers Supported Weights
+### Converting to Huggingface Diffusers-compatible Weights
-The SAT weight format is different from Huggingface's weight format and needs to be converted. Please run:
+The SAT weight format is different from Huggingface's and requires conversion. Run:
-```shell
+```
python ../tools/convert_weight_sat2hf.py
```
-### Exporting Huggingface Diffusers lora LoRA Weights from SAT Checkpoints
+### Exporting Lora Weights from SAT to Huggingface Diffusers
-After completing the training using the above steps, we get a SAT checkpoint with LoRA weights. You can find the file
-at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.
+Support is provided for exporting Lora weights from SAT to Huggingface Diffusers format.
+After training with the above steps, you'll find the SAT checkpoint with Lora weights at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.
-The script for exporting LoRA weights can be found in the CogVideoX repository at `tools/export_sat_lora_weight.py`.
-After exporting, you can use `load_cogvideox_lora.py` for inference.
+The export script `export_sat_lora_weight.py` is located in the CogVideoX repository under `tools/`. After exporting, use `load_cogvideox_lora.py` for inference.
Export command:
-```bash
-python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
+```
+python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
```
-This training mainly modified the following model structures. The table below lists the corresponding structure mappings
-for converting to the HF (Hugging Face) format LoRA structure. As you can see, LoRA adds a low-rank weight to the
-model's attention structure.
+The following model structures were modified during training. Here is the mapping between SAT and HF Lora structures. Lora adds a low-rank weight to the attention structure of the model.
```
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
@@ -436,5 +440,5 @@ model's attention structure.
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
```
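Applied mechanically, the key mapping above amounts to a suffix rename over the checkpoint's state dict. A minimal sketch follows (the real `tools/export_sat_lora_weight.py` handles more projections and per-layer prefixes):

```python
# Two entries from the mapping above; the full table covers the q/k/v/out
# projections for both lora_A and lora_B.
SAT_TO_HF = {
    "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
    "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight",
}

def rename_lora_keys(state_dict: dict) -> dict:
    # Keys look like "<layer prefix>.<sat suffix>"; only the suffix is remapped.
    renamed = {}
    for key, value in state_dict.items():
        for sat_suffix, hf_suffix in SAT_TO_HF.items():
            if key.endswith(sat_suffix):
                key = key[: -len(sat_suffix)] + hf_suffix
                break
        renamed[key] = value
    return renamed
```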
-Using export_sat_lora_weight.py, you can convert the SAT checkpoint into the HF LoRA format.
-
+Using `export_sat_lora_weight.py` will convert these to the HF format Lora structure.
+
\ No newline at end of file
diff --git a/sat/README_ja.md b/sat/README_ja.md
index ee1abcd..3685ba3 100644
--- a/sat/README_ja.md
+++ b/sat/README_ja.md
@@ -1,27 +1,37 @@
-# SAT CogVideoX-2B
+# SAT CogVideoX
-[Read this in English.](./README_zh)
+[Read this in English.](./README.md)
[中文阅读](./README_zh.md)
-このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer) ウェイトを使用した推論コードと、SAT
-ウェイトのファインチューニングコードが含まれています。
-
-このコードは、チームがモデルをトレーニングするために使用したフレームワークです。コメントが少なく、注意深く研究する必要があります。
+このフォルダには、[SAT](https://github.com/THUDM/SwissArmyTransformer)の重みを使用した推論コードと、SAT重みのファインチューニングコードが含まれています。
+このコードは、チームがモデルを訓練する際に使用したフレームワークです。コメントが少ないため、注意深く確認する必要があります。
## 推論モデル
-### 1. このフォルダに必要な依存関係が正しくインストールされていることを確認してください。
+### 1. このフォルダ内の必要な依存関係がすべてインストールされていることを確認してください
-```shell
+```
pip install -r requirements.txt
```
-### 2. モデルウェイトをダウンロードします
+### 2. モデルの重みをダウンロード
+まず、SATミラーからモデルの重みをダウンロードしてください。
-まず、SATミラーに移動してモデルの重みをダウンロードします。CogVideoX-2B モデルの場合は、次のようにダウンロードしてください。
+#### CogVideoX1.5 モデル
-```shell
+```
+git lfs install
+git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
+```
+
+これにより、Transformers、VAE、T5 Encoderの3つのモデルがダウンロードされます。
+
+#### CogVideoX モデル
+
+CogVideoX-2B モデルについては、以下のようにダウンロードしてください：
+
+```
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
@@ -32,12 +42,12 @@ mv 'index.html?dl=1' transformer.zip
unzip transformer.zip
```
-CogVideoX-5B モデルの `transformers` ファイルを以下のリンクからダウンロードしてください（VAE ファイルは 2B と同じです）：
+CogVideoX-5B モデルの `transformers` ファイルをダウンロードしてください（VAEファイルは2Bと同じです）：
+ [CogVideoX-5B](https://cloud.tsinghua.edu.cn/d/fcef5b3904294a6885e5/?p=%2F&mode=list)
+ [CogVideoX-5B-I2V](https://cloud.tsinghua.edu.cn/d/5cc62a2d6e7d45c0a2f6/?p=%2F1&mode=list)
-次に、モデルファイルを以下の形式にフォーマットする必要があります：
+モデルファイルを以下のように配置してください：
```
.
@@ -49,24 +59,24 @@ CogVideoX-5B ã¢ãã«ã® `transformers` ãã¡ã€ã«ã以äžã®ãªã³ã¯ãã
âââ 3d-vae.pt
```
-モデルの重みファイルが大きいため、`git lfs`を使用することをお勧めします。`git lfs`
-のインストールについては、[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)をご参照ください。
+モデルの重みファイルが大きいため、`git lfs`の使用をお勧めします。
+`git lfs`のインストール方法は[こちら](https://github.com/git-lfs/git-lfs?tab=readme-ov-file#installing)を参照してください。
-```shell
+```
git lfs install
```
-次に、T5 モデルをクローンします。これはトレーニングやファインチューニングには使用されませんが、使用する必要があります。
-> モデルを複製する際には、[Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)のモデルファイルの場所もご使用いただけます。
+次に、T5モデルをクローンします。このモデルはEncoderとしてのみ使用され、訓練やファインチューニングは必要ありません。
+> [Modelscope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b)上のモデルファイルも使用可能です。
-```shell
-git clone https://huggingface.co/THUDM/CogVideoX-2b.git # ハギングフェイス(huggingface.org)からモデルをダウンロードいただきます
-# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Modelscopeからモデルをダウンロードいただきます
+```
+git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Huggingfaceからモデルをダウンロード
+# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Modelscopeからダウンロード
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```
-上記の方法に従うことで、safetensor 形式の T5 ファイルを取得できます。これにより、Deepspeed でのファインチューニング中にエラーが発生しないようにします。
+これにより、Deepspeedファインチューニング中にエラーなくロードできるsafetensor形式のT5ファイルが作成されます。
```
âââ added_tokens.json
@@ -81,11 +91,11 @@ mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
0 directories, 8 files
```
-### 3. `configs/cogvideox_2b.yaml` ファイルを変更します。
+### 3. `configs/cogvideox_*.yaml`ファイルを編集
```yaml
model:
- scale_factor: 1.15258426
+ scale_factor: 1.55258426
disable_first_stage_autocast: true
log_keys:
- txt
@@ -123,7 +133,7 @@ model:
num_attention_heads: 30
transformer_args:
- checkpoint_activations: True ## グラデーションチェックポイントを使用する
+ checkpoint_activations: True ## using gradient checkpointing
vocab_size: 1
max_sequence_length: 64
layernorm_order: pre
@@ -161,14 +171,14 @@ model:
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
- model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxlフォルダの絶対パス
+ model_dir: "t5-v1_1-xxl" # CogVideoX-2b/t5-v1_1-xxl 重みフォルダの絶対パス
max_length: 226
first_stage_config:
target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
- ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # Absolute path to the CogVideoX-2b-sat/vae/3d-vae.pt folder
+ ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt" # Absolute path to the CogVideoX-2b-sat/vae/3d-vae.pt file
ignore_keys: [ 'loss' ]
loss_config:
@@ -240,7 +250,7 @@ model:
num_steps: 50
```
-### 4. Modify the `configs/inference.yaml` file.
+### 4. Edit the `configs/inference.yaml` file
```yaml
args:
@@ -250,38 +260,36 @@ args:
# load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter
batch_size: 1
- input_type: txt # Select a txt plain-text file as input, or change to cli for command-line input
- input_file: configs/test.txt # Path to the text file, which you can edit
- sampling_num_frames: 13 # Must be 13, 11 or 9
+ input_type: txt # Choose "txt" for plain-text input or "cli" for command-line input
+ input_file: configs/test.txt # Plain text file, can be edited
+ sampling_num_frames: 13 # Must be 42 or 22 for CogVideoX1.5-5B; 13, 11, or 9 for CogVideoX-5B / 2B
sampling_fps: 8
- fp16: True # For CogVideoX-2B
- # bf16: True # For CogVideoX-5B
+ fp16: True # For CogVideoX-2B
+ # bf16: True # For CogVideoX-5B
output_dir: outputs/
force_inference: True
```
-+ If you use a txt file to store multiple prompts, refer to `configs/test.txt` and edit it, with one prompt per line. If you are unsure how to write prompts, you can first use [this code](../inference/convert_demo.py) to call an LLM for refinement.
-+ To use command-line input, change the following:
++ If you use a text file containing multiple prompts, edit `configs/test.txt` as needed, with one prompt per line. If you are unsure how to write prompts, you can use [this code](../inference/convert_demo.py) to refine them with an LLM.
++ To use command-line input, change it as follows:
-```yaml
+```
input_type: cli
```
This allows you to enter prompts from the command line.
-To change the output video directory, you can modify:
+To change where output videos are saved, edit the following:
-```yaml
+```
output_dir: outputs/
```
-By default, output is saved in the `.outputs/` folder.
+By default, output is saved in the `.outputs/` folder.
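Since the prompt file is read one prompt per line, a quick way to preview what the model will receive is a small script. This is an illustrative sketch only (the `read_prompts` helper is not part of the repository):

```python
from pathlib import Path

def read_prompts(path: str) -> list[str]:
    """Return one prompt per non-empty line, mirroring how configs/test.txt is laid out."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]

# Usage (assuming you run from the sat/ directory):
#   prompts = read_prompts("configs/test.txt")
```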
-### 5. Run the inference code to start inference.
+### 5. Run the Inference Code to Start Inference
-```shell
+```
bash inference.sh
```
@@ -289,7 +297,7 @@ bash inference.sh
### Preparing the Dataset
-The dataset format is as follows:
+The dataset must be structured as follows:
```
.
@@ -303,123 +311,215 @@ bash inference.sh
âââ ...
```
-Each txt file must have the same name as its corresponding video file and contain the label for that video. Videos and labels must correspond one-to-one; in general, one video should not have multiple labels.
+Each txt file has the same name as its corresponding video file and contains the label for that video. Videos and labels must correspond one-to-one; avoid using multiple labels for one video.
-For style fine-tuning, prepare at least 50 videos and labels with a similar style to make fitting easier.
+For style fine-tuning, prepare at least 50 videos and labels with a similar style to facilitate fitting.
-### Modifying the Configuration File
+### Editing the Configuration File
-We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Note that both methods fine-tune only the `transformer` part and make no changes to the `VAE` part; `T5` is used only as an encoder. Edit the `configs/sft.yaml` (full-parameter fine-tuning) file as follows:
+We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Both fine-tune only the `transformer` part; the `VAE` part is not changed, and `T5` is used only as an encoder.
+Edit the `configs/sft.yaml` (full fine-tuning) file as follows:
```
- # checkpoint_activations: True ## use gradient checkpointing (both checkpoint_activations entries in the config file must be set to True)
+ # checkpoint_activations: True ## using gradient checkpointing (set both `checkpoint_activations` entries in the config file to True)
model_parallel_size: 1 # Model parallel size
- experiment_name: lora-disney # Experiment name (do not change)
- mode: finetune # Mode (do not change)
- load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to the Transformer model
- no_load_rng: True # Whether to load the random seed
+ experiment_name: lora-disney # Experiment name (no need to change)
+ mode: finetune # Mode (no need to change)
+ load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to the Transformer model
+ no_load_rng: True # Whether to load the random seed
train_iters: 1000 # Number of training iterations
- eval_iters: 1 # Number of evaluation iterations
- eval_interval: 100 # Evaluation interval
- eval_batch_size: 1 # Evaluation batch size
- save: ckpts # Model save path
- save_interval: 100 # Model save interval
+ eval_iters: 1 # Number of validation iterations
+ eval_interval: 100 # Validation interval
+ eval_batch_size: 1 # Validation batch size
+ save: ckpts # Model save path
+ save_interval: 100 # Save interval
log_interval: 20 # Log output interval
train_data: [ "your train data path" ]
- valid_data: [ "your val data path" ] # Training and evaluation data can be the same
- split: 1,0,0 # Ratio of training, evaluation, and test sets
- num_workers: 8 # Number of data loader worker threads
- force_train: True # Allow missing keys when loading a checkpoint (T5 and VAE are loaded separately)
- only_log_video_latents: True # Avoid memory overhead from VAE decoding
+ valid_data: [ "your val data path" ] # Training and validation sets can be the same
+ split: 1,0,0 # Ratio of training, validation, and test sets
+ num_workers: 8 # Number of data loader workers
+ force_train: True # Allow `missing keys` when loading a checkpoint (T5 and VAE are loaded separately)
+ only_log_video_latents: True # Reduce memory usage from VAE decoding
deepspeed:
bf16:
- enabled: False # Set to False for CogVideoX-2B and True for CogVideoX-5B
+ enabled: False # Set to False for CogVideoX-2B and True for CogVideoX-5B
fp16:
- enabled: True # Set to True for CogVideoX-2B and False for CogVideoX-5B
+ enabled: True # Set to True for CogVideoX-2B and False for CogVideoX-5B
+```
+```yaml
+args:
+ latent_channels: 16
+ mode: inference
+ load: "{absolute_path/to/your}/transformer" # Absolute path to CogVideoX-2b-sat/transformer folder
+ # load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter
+
+ batch_size: 1
+ input_type: txt # You can choose "txt" for plain text input or change to "cli" for command-line input
+ input_file: configs/test.txt # Plain text file, can be edited
+ sampling_num_frames: 13 # For CogVideoX1.5-5B it must be 42 or 22. For CogVideoX-5B / 2B, it must be 13, 11, or 9.
+ sampling_fps: 8
+ fp16: True # For CogVideoX-2B
+ # bf16: True # For CogVideoX-5B
+ output_dir: outputs/
+ force_inference: True
```
-If you want to use Lora fine-tuning, you need to modify the `cogvideox__lora` file.
-
-Here, `CogVideoX-2B` is used as an example.
++ If using a text file to save multiple prompts, modify `configs/test.txt` as needed. One prompt per line. If you are unsure how to write prompts, use [this code](../inference/convert_demo.py) to call an LLM for refinement.
++ To use command-line input, modify:
```
+input_type: cli
+```
+
+This allows you to enter prompts from the command line.
+
+To modify the output video location, change:
+
+```
+output_dir: outputs/
+```
+
+The default location is the `.outputs/` folder.
+
+### 5. Run the Inference Code to Perform Inference
+
+```
+bash inference.sh
+```
+
+## Fine-tuning the Model
+
+### Preparing the Dataset
+
+The dataset should be structured as follows:
+
+```
+.
+âââ labels
+â âââ 1.txt
+â âââ 2.txt
+â âââ ...
+âââ videos
+ âââ 1.mp4
+ âââ 2.mp4
+ âââ ...
+```
+
+Each txt file should have the same name as the corresponding video file and contain the label for that video. The videos and labels should correspond one-to-one. Generally, avoid using one video with multiple labels.
+
+For style fine-tuning, prepare at least 50 videos and labels with a similar style to facilitate fitting.
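Since labels and videos must match one-to-one by filename, a quick sanity check before training can catch mismatches early. This is an illustrative sketch, not a script shipped with the repository:

```python
from pathlib import Path

def check_dataset(root: str) -> list[str]:
    """Return basenames that are missing either their label (.txt) or their video (.mp4)."""
    labels = {p.stem for p in Path(root, "labels").glob("*.txt")}
    videos = {p.stem for p in Path(root, "videos").glob("*.mp4")}
    # The symmetric difference holds names present on only one side.
    return sorted(labels ^ videos)

# Usage: an empty result means labels and videos correspond one-to-one.
#   mismatched = check_dataset("path/to/dataset")
```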
+
+### Modifying the Configuration File
+
+We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Note that both methods only fine-tune the `transformer` part. The `VAE` part is not modified, and `T5` is only used as an encoder.
+Modify the `configs/sft.yaml` file (full fine-tuning) as follows:
+
+```yaml
+ # checkpoint_activations: True ## using gradient checkpointing (both `checkpoint_activations` in the config file need to be set to True)
+ model_parallel_size: 1 # Model parallel size
+ experiment_name: lora-disney # Experiment name (do not change)
+ mode: finetune # Mode (do not change)
+ load: "{your_CogVideoX-2b-sat_path}/transformer" ## Path to Transformer model
+ no_load_rng: True # Whether to load the random seed
+ train_iters: 1000 # Training iterations
+ eval_iters: 1 # Evaluation iterations
+ eval_interval: 100 # Evaluation interval
+ eval_batch_size: 1 # Evaluation batch size
+ save: ckpts # Model save path
+ save_interval: 100 # Save interval
+ log_interval: 20 # Log output interval
+ train_data: [ "your train data path" ]
+ valid_data: [ "your val data path" ] # Training and validation sets can be the same
+ split: 1,0,0 # Proportion for training, validation, and test sets
+ num_workers: 8 # Number of data loader workers
+ force_train: True # Allow missing keys when loading checkpoint (T5 and VAE loaded separately)
+ only_log_video_latents: True # Avoid memory usage from VAE decoding
+ deepspeed:
+ bf16:
+ enabled: False # Set to False for CogVideoX-2B and True for CogVideoX-5B
+ fp16:
+ enabled: True # Set to True for CogVideoX-2B and False for CogVideoX-5B
+```
+
+To use Lora fine-tuning, you also need to modify the `cogvideox__lora` file:
+
+Here's an example using `CogVideoX-2B`:
+
+```yaml
model:
- scale_factor: 1.15258426
+ scale_factor: 1.55258426
disable_first_stage_autocast: true
- not_trainable_prefixes: [ 'all' ] ## ã³ã¡ã³ãã解é€
+ not_trainable_prefixes: [ 'all' ] ## Uncomment to unlock
log_keys:
- - txt'
+ - txt
- lora_config: ## ã³ã¡ã³ãã解é€
+ lora_config: ## Uncomment to unlock
target: sat.model.finetune.lora2.LoraMixin
params:
r: 256
```
-### Modifying the Run Script
+### Modify the Run Script
-Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` to select the config file. Two examples follow.
+Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` and select the config file. Below are two examples:
-1. To use the `CogVideoX-2B` model with the `Lora` method, you need to modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`.
+1. If you want to use the `CogVideoX-2B` model with `Lora`, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```
-2. To use the `CogVideoX-2B` model with full-parameter fine-tuning, you need to modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`.
+2. If you want to use the `CogVideoX-2B` model with full fine-tuning, modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` as follows:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```
-### Fine-tuning and Evaluation
+### Fine-tuning and Validation
-Run the inference code to start fine-tuning.
+Run the fine-tuning script to start fine-tuning.
```
-bash finetune_single_gpu.sh # Single GPU
-bash finetune_multi_gpus.sh # Multi GPUs
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
```
-### Using the Fine-tuned Model
+### Using the Fine-tuned Model
-The fine-tuned model cannot be merged. Here we show how to modify the inference configuration file `inference.sh`.
+The fine-tuned model cannot be merged. Here is how to modify the inference configuration file `inference.sh`:
```
-run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42"
+run_cmd="$environs python sample_video.py --base configs/cogvideox__lora.yaml configs/inference.yaml --seed 42"
```
-Then, run the code:
+Then, run the code:
```
bash inference.sh
```
-### Converting to Huggingface Diffusers-Supported Weights
+### Converting to Huggingface Diffusers-compatible Weights
-The SAT weight format differs from Huggingface's weight format and requires conversion. Run the following command:
+The SAT weight format is different from Huggingface's format and requires conversion. Run:
-```shell
+```
python ../tools/convert_weight_sat2hf.py
```
-### Exporting Huggingface Diffusers LoRA Weights from a SAT Checkpoint
+### Exporting Lora Weights from SAT to Huggingface Diffusers
-After completing the steps above, you will have a SAT checkpoint with LoRA weights. The file is at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.
+Support is provided for exporting Lora weights from SAT to Huggingface Diffusers format.
+After training with the above steps, you will find the SAT model with Lora weights at `{args.save}/1000/1000/mp_rank_00_model_states.pt`.
-The script for exporting LoRA weights is at `tools/export_sat_lora_weight.py` in the CogVideoX repository. After exporting, you can use `load_cogvideox_lora.py` for inference.
+The export script `export_sat_lora_weight.py` is located in the CogVideoX repository under `tools/`. After exporting, use `load_cogvideox_lora.py` for inference.
-Export command:
+Export command:
-```bash
-python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
+```
+python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_name}-09-09-21-10/1000/mp_rank_00_model_states.pt --lora_save_directory {args.save}/export_hf_lora_weights_1/
```
-This training mainly modified the following model structures. The table below shows the mapping to the HF (Hugging Face) LoRA format. As shown, LoRA adds low-rank weights to the model's attention mechanism.
+The following model structures were modified during training. Here is the mapping between SAT and HF Lora structures. Lora adds a low-rank weight to the attention structure of the model.
```
'attention.query_key_value.matrix_A.0': 'attn1.to_q.lora_A.weight',
@@ -431,8 +531,6 @@ python tools/export_sat_lora_weight.py --sat_pt_path {args.save}/{experiment_nam
'attention.dense.matrix_A.0': 'attn1.to_out.0.lora_A.weight',
'attention.dense.matrix_B.0': 'attn1.to_out.0.lora_B.weight'
```
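The mapping above can be applied mechanically when renaming a state dict. A minimal sketch covering only the keys listed here (`export_sat_lora_weight.py` remains the authoritative implementation):

```python
# Subset of the SAT -> HF Diffusers Lora key mapping listed above.
SAT_TO_HF = {
    "attention.query_key_value.matrix_A.0": "attn1.to_q.lora_A.weight",
    "attention.dense.matrix_A.0": "attn1.to_out.0.lora_A.weight",
    "attention.dense.matrix_B.0": "attn1.to_out.0.lora_B.weight",
}

def rename_lora_keys(state_dict: dict) -> dict:
    """Rename known SAT Lora keys to their HF equivalents; unknown keys pass through."""
    return {SAT_TO_HF.get(key, key): value for key, value in state_dict.items()}
```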
-
-You can use export_sat_lora_weight.py to convert the SAT checkpoint to the HF LoRA format.
-
-
+Using `export_sat_lora_weight.py` will convert these to the HF format Lora structure.
+
\ No newline at end of file
diff --git a/sat/README_zh.md b/sat/README_zh.md
index c605da8..c25c6b7 100644
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@@ -1,6 +1,6 @@
-# SAT CogVideoX-2B
+# SAT CogVideoX
-[Read this in English.](./README_zh)
+[Read this in English.](./README.md)
[æ¥æ¬èªã§èªã](./README_ja.md)
@@ -20,6 +20,15 @@ pip install -r requirements.txt
First, go to the SAT mirror and download the model weights.
+#### CogVideoX1.5 Models
+
+```shell
+git lfs install
+git clone https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
+```
+This operation downloads three models: Transformers, VAE, and T5 Encoder.
+
+#### CogVideoX Models
For the CogVideoX-2B model, download it as follows:
```shell
@@ -82,11 +91,11 @@ mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
0 directories, 8 files
```
-### 3. Modify the `configs/cogvideox_2b.yaml` file.
+### 3. Modify the `configs/cogvideox_*.yaml` file.
```yaml
model:
- scale_factor: 1.15258426
+ scale_factor: 1.55258426
disable_first_stage_autocast: true
log_keys:
- txt
@@ -253,7 +262,7 @@ args:
batch_size: 1
input_type: txt # Choose a txt plain-text file as input, or change to cli for command-line input
input_file: configs/test.txt # Plain text file, can be edited
- sampling_num_frames: 13 # Must be 13, 11 or 9
+ sampling_num_frames: 13 # Must be 42 or 22 for CogVideoX1.5-5B; 13, 11, or 9 for CogVideoX-5B / 2B.
sampling_fps: 8
fp16: True # For CogVideoX-2B
# bf16: True # For CogVideoX-5B
@@ -346,7 +355,7 @@ Encoder is used.
```yaml
model:
- scale_factor: 1.15258426
+ scale_factor: 1.55258426
disable_first_stage_autocast: true
not_trainable_prefixes: [ 'all' ] ## Uncomment
log_keys:
diff --git a/sat/configs/cogvideox1.5_5b.yaml b/sat/configs/cogvideox1.5_5b.yaml
new file mode 100644
index 0000000..0000ec2
--- /dev/null
+++ b/sat/configs/cogvideox1.5_5b.yaml
@@ -0,0 +1,149 @@
+model:
+ scale_factor: 0.7
+ disable_first_stage_autocast: true
+ latent_input: true
+ log_keys:
+ - txt
+
+ denoiser_config:
+ target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
+ params:
+ num_idx: 1000
+ quantize_c_noise: False
+
+ weighting_config:
+ target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
+ scaling_config:
+ target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
+ discretization_config:
+ target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+ network_config:
+ target: dit_video_concat.DiffusionTransformer
+ params:
+ time_embed_dim: 512
+ elementwise_affine: True
+ num_frames: 81
+ time_compressed_rate: 4
+ latent_width: 300
+ latent_height: 300
+ num_layers: 42
+ patch_size: [2, 2, 2]
+ in_channels: 16
+ out_channels: 16
+ hidden_size: 3072
+ adm_in_channels: 256
+ num_attention_heads: 48
+
+ transformer_args:
+ checkpoint_activations: True
+ vocab_size: 1
+ max_sequence_length: 64
+ layernorm_order: pre
+ skip_init: false
+ model_parallel_size: 1
+ is_decoder: false
+
+ modules:
+ pos_embed_config:
+ target: dit_video_concat.Rotary3DPositionEmbeddingMixin
+ params:
+ hidden_size_head: 64
+ text_length: 224
+
+ patch_embed_config:
+ target: dit_video_concat.ImagePatchEmbeddingMixin
+ params:
+ text_hidden_size: 4096
+
+ adaln_layer_config:
+ target: dit_video_concat.AdaLNMixin
+ params:
+ qk_ln: True
+
+ final_layer_config:
+ target: dit_video_concat.FinalLayerMixin
+
+ conditioner_config:
+ target: sgm.modules.GeneralConditioner
+ params:
+ emb_models:
+ - is_trainable: false
+ input_key: txt
+ ucg_rate: 0.1
+ target: sgm.modules.encoders.modules.FrozenT5Embedder
+ params:
+ model_dir: "google/t5-v1_1-xxl"
+ max_length: 224
+
+
+ first_stage_config:
+ target : vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
+ params:
+ cp_size: 1
+ ckpt_path: "cogvideox-5b-sat/vae/3d-vae.pt"
+ ignore_keys: ['loss']
+
+ loss_config:
+ target: torch.nn.Identity
+
+ regularizer_config:
+ target: vae_modules.regularizers.DiagonalGaussianRegularizer
+
+ encoder_config:
+ target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
+ params:
+ double_z: true
+ z_channels: 16
+ resolution: 256
+ in_channels: 3
+ out_ch: 3
+ ch: 128
+ ch_mult: [1, 2, 2, 4]
+ attn_resolutions: []
+ num_res_blocks: 3
+ dropout: 0.0
+ gather_norm: True
+
+ decoder_config:
+ target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
+ params:
+ double_z: True
+ z_channels: 16
+ resolution: 256
+ in_channels: 3
+ out_ch: 3
+ ch: 128
+ ch_mult: [1, 2, 2, 4]
+ attn_resolutions: []
+ num_res_blocks: 3
+ dropout: 0.0
+ gather_norm: True
+
+ loss_fn_config:
+ target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
+ params:
+ offset_noise_level: 0
+ sigma_sampler_config:
+ target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
+ params:
+ uniform_sampling: True
+ group_num: 40
+ num_idx: 1000
+ discretization_config:
+ target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+ sampler_config:
+ target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
+ params:
+ num_steps: 50
+ verbose: True
+
+ discretization_config:
+ target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+ guider_config:
+ target: sgm.modules.diffusionmodules.guiders.DynamicCFG
+ params:
+ scale: 6
+ exp: 5
+ num_steps: 50
diff --git a/sat/configs/cogvideox1.5_5b_i2v.yaml b/sat/configs/cogvideox1.5_5b_i2v.yaml
new file mode 100644
index 0000000..c65f0b7
--- /dev/null
+++ b/sat/configs/cogvideox1.5_5b_i2v.yaml
@@ -0,0 +1,160 @@
+model:
+ scale_factor: 0.7
+ disable_first_stage_autocast: true
+ latent_input: false
+ noised_image_input: true
+ noised_image_all_concat: false
+ noised_image_dropout: 0.05
+ augmentation_dropout: 0.15
+ log_keys:
+ - txt
+
+ denoiser_config:
+ target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
+ params:
+ num_idx: 1000
+ quantize_c_noise: False
+
+ weighting_config:
+ target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
+ scaling_config:
+ target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
+ discretization_config:
+ target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+ network_config:
+ target: dit_video_concat.DiffusionTransformer
+ params:
+# space_interpolation: 1.875
+ ofs_embed_dim: 512
+ time_embed_dim: 512
+ elementwise_affine: True
+ num_frames: 81
+ time_compressed_rate: 4
+ latent_width: 300
+ latent_height: 300
+ num_layers: 42
+ patch_size: [2, 2, 2]
+ in_channels: 32
+ out_channels: 16
+ hidden_size: 3072
+ adm_in_channels: 256
+ num_attention_heads: 48
+
+ transformer_args:
+ checkpoint_activations: True
+ vocab_size: 1
+ max_sequence_length: 64
+ layernorm_order: pre
+ skip_init: false
+ model_parallel_size: 1
+ is_decoder: false
+
+ modules:
+ pos_embed_config:
+ target: dit_video_concat.Rotary3DPositionEmbeddingMixin
+ params:
+ hidden_size_head: 64
+ text_length: 224
+
+ patch_embed_config:
+ target: dit_video_concat.ImagePatchEmbeddingMixin
+ params:
+ text_hidden_size: 4096
+
+
+ adaln_layer_config:
+ target: dit_video_concat.AdaLNMixin
+ params:
+ qk_ln: True
+
+ final_layer_config:
+ target: dit_video_concat.FinalLayerMixin
+
+ conditioner_config:
+ target: sgm.modules.GeneralConditioner
+ params:
+ emb_models:
+
+ - is_trainable: false
+ input_key: txt
+ ucg_rate: 0.1
+ target: sgm.modules.encoders.modules.FrozenT5Embedder
+ params:
+ model_dir: "google/t5-v1_1-xxl"
+ max_length: 224
+
+
+ first_stage_config:
+ target : vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
+ params:
+ cp_size: 1
+ ckpt_path: "cogvideox-5b-i2v-sat/vae/3d-vae.pt"
+ ignore_keys: ['loss']
+
+ loss_config:
+ target: torch.nn.Identity
+
+ regularizer_config:
+ target: vae_modules.regularizers.DiagonalGaussianRegularizer
+
+ encoder_config:
+ target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
+ params:
+ double_z: true
+ z_channels: 16
+ resolution: 256
+ in_channels: 3
+ out_ch: 3
+ ch: 128
+ ch_mult: [1, 2, 2, 4]
+ attn_resolutions: []
+ num_res_blocks: 3
+ dropout: 0.0
+ gather_norm: True
+
+ decoder_config:
+ target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
+ params:
+ double_z: True
+ z_channels: 16
+ resolution: 256
+ in_channels: 3
+ out_ch: 3
+ ch: 128
+ ch_mult: [1, 2, 2, 4]
+ attn_resolutions: []
+ num_res_blocks: 3
+ dropout: 0.0
+ gather_norm: True
+
+ loss_fn_config:
+ target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
+ params:
+ fixed_frames: 0
+ offset_noise_level: 0.0
+ sigma_sampler_config:
+ target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
+ params:
+ uniform_sampling: True
+ group_num: 40
+ num_idx: 1000
+ discretization_config:
+ target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+ sampler_config:
+ target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
+ params:
+ fixed_frames: 0
+ num_steps: 50
+ verbose: True
+
+ discretization_config:
+ target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
+
+ guider_config:
+ target: sgm.modules.diffusionmodules.guiders.DynamicCFG
+ params:
+ scale: 6
+ exp: 5
+ num_steps: 50
\ No newline at end of file
diff --git a/sat/configs/test.txt b/sat/configs/test.txt
index 8d035c0..94ad730 100644
--- a/sat/configs/test.txt
+++ b/sat/configs/test.txt
@@ -1,4 +1,4 @@
In the haunting backdrop of a war-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
-A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
\ No newline at end of file
+A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart, holding a can of spray paint, spray-painting a colorful bird on a mottled wall.-torn city, where ruins and crumbled walls tell a story of devastation, a poignant close-up frames a young girl. Her face is smudged with ash, a silent testament to the chaos around her. Her eyes glistening with a mix of sorrow and resilience, capturing the raw emotion of a world that has lost its innocence to the ravages of conflict.
\ No newline at end of file