diff --git a/README.md b/README.md
index 63ac86a..441f01c 100644
--- a/README.md
+++ b/README.md
@@ -21,9 +21,11 @@
## Update and News
-- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be performed
+- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be
+ performed
on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
-- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct
+- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can
+ reconstruct
the video almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**，the first model in the CogVideoX series of video
generation models.
@@ -55,9 +57,9 @@ Jump to a specific section:
### Prompt Optimization
-Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use the GLM-4 model to
-optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly affects
-the quality of the generated video.
+Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models like
+GLM-4 (or other comparable products, such as GPT-4) to optimize the prompt. This is crucial because the model is trained
+with long prompts, and a good prompt directly affects the quality of the generated video.
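For readers who want to see what this optimization step looks like in practice, here is a minimal sketch of an LLM-based prompt-upsampling call. It mirrors the GLM-4 request made in `inference/gradio_web_demo.py`; the `zhipuai` client setup, the environment variable, and the short system prompt are assumptions here, and GPT-4 or another LLM can be substituted by swapping the client.

```python
# Minimal sketch of prompt upsampling with GLM-4 (assumes the zhipuai SDK is installed
# and ZHIPUAI_API_KEY is set; see inference/convert_demo.py for the full version).
from zhipuai import ZhipuAI

client = ZhipuAI()  # hypothetical setup; reads the API key from the environment

SYS_PROMPT = (
    "You are part of a team of bots that creates videos. "
    "Rewrite the user's short prompt into one detailed English video caption."
)  # placeholder system prompt, not the one shipped in this repo


def enhance_prompt(prompt: str) -> str:
    response = client.chat.completions.create(
        model="glm-4-0520",  # the model name used by the Gradio demo in this repo
        temperature=0.01,
        messages=[
            {"role": "system", "content": SYS_PROMPT},
            {
                "role": "user",
                "content": f'Create an imaginative video descriptive caption for the user input: "{prompt}"',
            },
        ],
    )
    return response.choices[0].message.content


print(enhance_prompt("a girl is on the beach"))
```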
### SAT
@@ -124,6 +126,15 @@ along with related basic information:
| Download Link (HF diffusers Model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [🟣 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
| Download Link (SAT Model) | [SAT](./sat/README.md) |
+## Friendly Links
+
+We warmly welcome contributions from the community and actively contribute to the open-source community. The following
+projects have already been adapted for CogVideoX, and we invite everyone to use them:
+
++ [Xorbits Inference](https://github.com/xorbitsai/inference): A powerful and comprehensive distributed inference
+ framework, allowing you to easily deploy your own models or the latest cutting-edge open-source models with just one
+ click.
+
## Project Structure
This open-source repository will guide developers to quickly get started with the basic usage and fine-tuning examples
diff --git a/README_ja.md b/README_ja.md
index f697e48..de7dabf 100644
--- a/README_ja.md
+++ b/README_ja.md
@@ -21,10 +21,13 @@
## Updates and News
-- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now run on a single 3090 GPU. See the [code](inference/cli_demo.py) for details.
+- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now run on a single 3090 GPU.
+  See the [code](inference/cli_demo.py) for details.
- 🔥 **News**: ```2024/8/6```: We have also open-sourced the **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct videos almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
-- 🌱 **Source**: ```2022/5/19```: We open-sourced **CogVideo** (now available on the `CogVideo` branch), the first open-source pretrained text-to-video generation model. For technical details, see the [ICLR'23 CogVideo paper](https://arxiv.org/abs/2205.15868).
+- 🌱 **Source**: ```2022/5/19```: We open-sourced **CogVideo** (now available on the `CogVideo` branch), the first open-source
+  pretrained text-to-video generation model. For technical details, see the [ICLR'23 CogVideo paper](https://arxiv.org/abs/2205.15868).
**More powerful models with larger parameter sizes are coming. Stay tuned!**
@@ -50,11 +53,13 @@
### Prompt Optimization
-Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use the GLM-4 model to optimize the prompt. This is important because the model is trained with long prompts, and a good prompt directly affects the quality of the generated video.
+Before running the model, please refer to [this guide](inference/convert_demo.py) to see how we use large models such as
+GLM-4 (or comparable products such as GPT-4) to optimize the prompt. This is very important: because the model is trained with long prompts, a good prompt directly affects the quality of the generated video.
### SAT
-Follow the instructions in [sat_demo](sat/README.md): it contains the inference code and fine-tuning code for SAT weights. We recommend improving on it based on the CogVideoX model structure. Innovative researchers can use this code for rapid prototyping and development.
+Follow the instructions in [sat_demo](sat/README.md):
+it contains the inference code and fine-tuning code for SAT weights. We recommend improving on it based on the CogVideoX model structure. Innovative researchers can use this code for rapid prototyping and development.
(Inference requires 18 GB of VRAM, and LoRA fine-tuning requires 40 GB.)
### Diffusers
@@ -94,19 +99,26 @@ CogVideoX is an open-source model of the same origin as [Qingying](https://chatglm.cn/video?fr=osm_cogvideox)
The table below lists the video generation models we currently offer, along with related basic information:
-| Model Name | CogVideoX-2B |
-|-------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Prompt Language | English |
-| Single-GPU Inference (FP16) | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)<br>23.9GB using diffusers |
+| Model Name | CogVideoX-2B |
+|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Prompt Language | English |
+| Single-GPU Inference (FP16) | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)<br>23.9GB using diffusers |
| Multi-GPU Inference (FP16) | 20GB minimum per GPU using diffusers |
-| GPU Memory Required for Fine-tuning (bs=1) | 40GB |
-| Maximum Prompt Length | 226 tokens |
-| Video Length | 6 seconds |
-| Frame Rate | 8 frames |
-| Resolution | 720 * 480 |
-| Quantized Inference | Not supported |
-| Download Link (HF diffusers Model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [🟣 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
-| Download Link (SAT Model) | [SAT](./sat/README.md) |
+| GPU Memory Required for Fine-tuning (bs=1) | 40GB |
+| Maximum Prompt Length | 226 tokens |
+| Video Length | 6 seconds |
+| Frame Rate | 8 frames |
+| Resolution | 720 * 480 |
+| Quantized Inference | Not supported |
+| Download Link (HF diffusers Model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) [🟣 WiseModel](https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b) |
+| Download Link (SAT Model) | [SAT](./sat/README.md) |
+
+## Friendly Links
+
+We warmly welcome community contributions and actively contribute to the open-source community. The following projects have already added support for CogVideoX, and we invite everyone to use them:
+
++ [Xorbits Inference](https://github.com/xorbitsai/inference):
+  A powerful and comprehensive distributed inference framework that lets you easily deploy your own models or the latest cutting-edge open-source models with a single click.
## Project Structure
@@ -116,14 +128,17 @@ CogVideoX is an open-source model of the same origin as [Qingying](https://chatglm.cn/video?fr=osm_cogvideox)
+ [diffusers_demo](inference/cli_demo.py): A more detailed explanation of the inference code, including the meaning of common parameters.
+ [diffusers_vae_demo](inference/cli_vae_demo.py): Running the VAE inference code currently requires 71GB of memory; it will be optimized in the future.
-+ [convert_demo](inference/convert_demo.py): How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long captions, the input text needs to be aligned with the training distribution using an LLM. GLM-4 is used by default, but it can be replaced with other LLMs such as GPT or Gemini.
-+ [gradio_web_demo](inference/gradio_web_demo.py): A simple gradio web UI that shows how to generate videos with the CogVideoX-2B model.
++ [convert_demo](inference/convert_demo.py):
+  How to convert user input into a format suitable for CogVideoX. Because CogVideoX is trained on long captions, the input text needs to be aligned with the training distribution using an LLM. GLM-4 is used by default, but it can be replaced with other LLMs such as GPT or Gemini.
++ [gradio_web_demo](inference/gradio_web_demo.py): A simple gradio web UI that shows how to generate videos with the
+  CogVideoX-2B model.
-+ [streamlit_web_demo](inference/streamlit_web_demo.py): A simple streamlit web application that shows how to generate videos with the CogVideoX-2B model.
++ [streamlit_web_demo](inference/streamlit_web_demo.py): A simple streamlit web application that shows how to generate
+  videos with the CogVideoX-2B model.
@@ -131,13 +146,14 @@ CogVideoX is an open-source model of the same origin as [Qingying](https://chatglm.cn/video?fr=osm_cogvideox)
### sat
-+ [sat_demo](sat/README.md): Contains the inference code and fine-tuning code for SAT weights. We recommend improving on it based on the CogVideoX model structure. Innovative researchers can use this code for rapid prototyping and development.
++ [sat_demo](sat/README.md):
+  Contains the inference code and fine-tuning code for SAT weights. We recommend improving on it based on the CogVideoX model structure. Innovative researchers can use this code for rapid prototyping and development.
### Tools
This folder contains tools for model conversion / caption generation, etc.
-+ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights to Huggingface model weights.
++ [convert_weight_sat2hf](tools/convert_weight_sat2hf.py): Converts SAT model weights to Huggingface model weights.
+ [caption_demo](tools/caption): A captioning tool, i.e. a model that understands videos and describes them in text.
## Project Plan
@@ -161,7 +177,9 @@ CogVideoX is an open-source model of the same origin as [Qingying](https://chatglm.cn/video?fr=osm_cogvideox)
The model weights and implementation code are released under the [CogVideoX LICENSE](MODEL_LICENSE).
## CogVideo(ICLR'23)
-The official repository for the paper [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868) is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo).
+
+The official repository for the paper [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers](https://arxiv.org/abs/2205.15868)
+is on the [CogVideo branch](https://github.com/THUDM/CogVideo/tree/CogVideo).
**CogVideo can generate relatively high-frame-rate videos.**
A 4-second clip of 32 frames is shown below.
@@ -174,8 +192,8 @@ CogVideoX is an open-source model of the same origin as [Qingying](https://chatglm.cn/video?fr=osm_cogvideox)
-The CogVideo demo can be tried at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/). *The original input is in Chinese.*
-
+The CogVideo demo can be tried at [https://models.aminer.cn/cogvideo](https://models.aminer.cn/cogvideo/).
+*The original input is in Chinese.*
## Citation
diff --git a/README_zh.md b/README_zh.md
index 410a443..33025d1 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -22,7 +22,8 @@
## Project Updates
-- 🔥 **News**: ```2024/8/7```: CogVideoX has been merged into `diffusers` 0.30.0; inference can run on a single 3090. See the [code](inference/cli_demo.py) for details.
+- 🔥 **News**: ```2024/8/7```: CogVideoX has been merged into `diffusers`
+  0.30.0; inference can run on a single 3090. See the [code](inference/cli_demo.py) for details.
- 🔥 **News**: ```2024/8/6```: We have open-sourced the **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct videos almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video generation models.
- 🌱 **Source**: ```2022/5/19```: We open-sourced the CogVideo video generation model (now available on the `CogVideo` branch), the first open-source text-to-video model based on
@@ -51,8 +52,8 @@
### Prompt Optimization
-Before running the model, please refer to [here](inference/convert_demo.py) to see how we use the GLM-4 large model to optimize the prompt. This is important: since the model is trained on long prompts, a good one directly affects the quality of the generated video.
+Before running the model, please refer to [here](inference/convert_demo.py) to see how we use large models such as GLM-4 (or comparable products such as GPT-4) to optimize the prompt. This is important:
+since the model is trained on long prompts, a good prompt directly affects the quality of the generated video.
### SAT
@@ -96,19 +97,25 @@ CogVideoX is the open-source counterpart of [Qingying](https://chatglm.cn/video?fr=osm_cogvideox)
The table below lists the video generation models we currently provide, along with related basic information:
-| Model Name | CogVideoX-2B |
-|---------------------|-------------------------------------------------------------------------------------------------------------------------------|
-| Prompt Language | English |
-| Single-GPU Inference (FP-16) VRAM Usage | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)<br>23.9GB using diffusers |
-| Multi-GPU Inference (FP-16) VRAM Usage | 20GB minimum per GPU using diffusers |
-| Fine-tuning VRAM Usage (bs=1) | 42GB |
-| Maximum Prompt Length | 226 Tokens |
-| Video Length | 6 seconds |
-| Frame Rate (per second) | 8 frames |
-| Video Resolution | 720 * 480 |
-| Quantized Inference | Not supported |
+| Model Name | CogVideoX-2B |
+|---------------------|---------------------------------------------------------------------------------------------------------------------------------|
+| Prompt Language | English |
+| Single-GPU Inference (FP-16) VRAM Usage | 18GB using [SAT](https://github.com/THUDM/SwissArmyTransformer)<br>23.9GB using diffusers |
+| Multi-GPU Inference (FP-16) VRAM Usage | 20GB minimum per GPU using diffusers |
+| Fine-tuning VRAM Usage (bs=1) | 42GB |
+| Maximum Prompt Length | 226 Tokens |
+| Video Length | 6 seconds |
+| Frame Rate (per second) | 8 frames |
+| Video Resolution | 720 * 480 |
+| Quantized Inference | Not supported |
| Download Link (Diffusers Model) | 🤗 [Huggingface](https://huggingface.co/THUDM/CogVideoX-2B) [🤖 ModelScope](https://modelscope.cn/models/ZhipuAI/CogVideoX-2b) |
-| Download Link (SAT Model) | [SAT](./sat/README_zh.md) |
+| Download Link (SAT Model) | [SAT](./sat/README_zh.md) |
+
+## Friendly Links
+
+We warmly welcome contributions from the community and actively contribute to the open-source community. The following projects have already been adapted for CogVideoX, and everyone is welcome to use them:
+
++ [Xorbits Inference](https://github.com/xorbitsai/inference): A powerful, full-featured distributed inference framework that lets you deploy your own models or the built-in cutting-edge open-source models with one click.
## Full Project Code Structure
diff --git a/inference/cli_demo.py b/inference/cli_demo.py
index d069f02..8b0813e 100644
--- a/inference/cli_demo.py
+++ b/inference/cli_demo.py
@@ -22,7 +22,7 @@ from diffusers import CogVideoXPipeline
def export_to_video_imageio(
- video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
+ video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
) -> str:
"""
Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX)
@@ -38,14 +38,14 @@ def export_to_video_imageio(
def generate_video(
- prompt: str,
- model_path: str,
- output_path: str = "./output.mp4",
- num_inference_steps: int = 50,
- guidance_scale: float = 6.0,
- num_videos_per_prompt: int = 1,
- device: str = "cuda",
- dtype: torch.dtype = torch.float16,
+ prompt: str,
+ model_path: str,
+ output_path: str = "./output.mp4",
+ num_inference_steps: int = 50,
+ guidance_scale: float = 6.0,
+ num_videos_per_prompt: int = 1,
+ device: str = "cuda",
+ dtype: torch.dtype = torch.float16,
):
"""
Generates a video based on the given prompt and saves it to the specified path.
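As a quick orientation for readers of this hunk, calling the function above directly might look roughly like the following; the parameter names come from the signature shown here, while the prompt text and the assumption that you run it from the `inference/` directory are illustrative only.

```python
# Hypothetical direct use of generate_video from inference/cli_demo.py
# (run from the inference/ directory so the module import resolves).
import torch

from cli_demo import generate_video

generate_video(
    prompt="A golden retriever runs along a sunlit beach, slow motion, cinematic lighting.",
    model_path="THUDM/CogVideoX-2b",
    output_path="./output.mp4",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_videos_per_prompt=1,
    device="cuda",
    dtype=torch.float16,
)
```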
diff --git a/inference/cli_vae_demo.py b/inference/cli_vae_demo.py
index b133f20..18b9a95 100644
--- a/inference/cli_vae_demo.py
+++ b/inference/cli_vae_demo.py
@@ -1,14 +1,24 @@
"""
-This script demonstrates how to encode video frames using a pre-trained CogVideoX model with ð€ Huggingface Diffusers.
+This script is designed to demonstrate how to use the CogVideoX-2b VAE model for video encoding and decoding.
+It allows you to encode a video into a latent representation, decode it back into a video, or perform both operations sequentially.
+Before running the script, make sure to clone the CogVideoX Hugging Face model repository and set the `{your local diffusers path}` argument to the path of the cloned repository.
-Note:
- This script requires the `diffusers>=0.30.0` library to be installed.
- If the video appears “completely green” and cannot be viewed, please switch to a different player to watch it. This is a normal phenomenon.
- Cost 71GB of GPU memory for encoding a 6s video at 720p resolution.
+Command 1: Encoding Video
+Encodes the video located at ../resources/videos/1.mp4 using the CogVideoX-2b VAE model.
+Memory Usage: ~34GB of GPU memory for encoding.
+If you do not have enough GPU memory, we provide a pre-encoded tensor file (encoded.pt) in the resources folder and you can still run the decoding command.
+$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --video_path ../resources/videos/1.mp4 --mode encode
-Run the script:
- $ python cli_demo.py --model_path THUDM/CogVideoX-2b --video_path path/to/video.mp4 --output_path path/to/output
+Command 2: Decoding Video
+Decodes the latent representation stored in encoded.pt back into a video.
+Memory Usage: ~19GB of GPU memory for decoding.
+$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --encoded_path ./encoded.pt --mode decode
+
+Command 3: Encoding and Decoding Video
+Encodes the video located at ../resources/videos/1.mp4 and then immediately decodes it.
+Memory Usage: 34GB for encoding + 19GB for decoding (sequentially).
+$ python cli_vae_demo.py --model_path {your local diffusers path}/CogVideoX-2b/vae/ --video_path ../resources/videos/1.mp4 --mode both
"""
import argparse
@@ -19,7 +29,7 @@ from diffusers import AutoencoderKLCogVideoX
from torchvision import transforms
-def vae_demo(model_path, video_path, dtype, device):
+def encode_video(model_path, video_path, dtype, device):
"""
Loads a pre-trained AutoencoderKLCogVideoX model and encodes the video frames.
@@ -32,50 +42,58 @@ def vae_demo(model_path, video_path, dtype, device):
Returns:
- torch.Tensor: The encoded video frames.
"""
- # Load the pre-trained model
model = AutoencoderKLCogVideoX.from_pretrained(model_path, torch_dtype=dtype).to(device)
-
- # Load video frames
video_reader = imageio.get_reader(video_path, "ffmpeg")
- frames = []
- for frame in video_reader:
- frames.append(frame)
+
+ frames = [transforms.ToTensor()(frame) for frame in video_reader]
video_reader.close()
- # Transform frames to Tensor
- transform = transforms.Compose(
- [
- transforms.ToTensor(),
- ]
- )
- frames_tensor = torch.stack([transform(frame) for frame in frames]).to(device)
+ frames_tensor = torch.stack(frames).to(device).permute(1, 0, 2, 3).unsqueeze(0).to(dtype)
- # Add batch dimension and reshape to [1, 3, 49, 480, 720]
- frames_tensor = frames_tensor.permute(1, 0, 2, 3).unsqueeze(0).to(dtype).to(device)
-
- # Run the model with Encoder and Decoder
with torch.no_grad():
- output = model(frames_tensor)
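+        # Encode into the VAE latent space and draw a sample from the returned
+        # posterior distribution (model.encode(...)[0] is the latent distribution).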
+ encoded_frames = model.encode(frames_tensor)[0].sample()
+ return encoded_frames
- return output
+
+def decode_video(model_path, encoded_tensor_path, dtype, device):
+ """
+ Loads a pre-trained AutoencoderKLCogVideoX model and decodes the encoded video frames.
+
+ Parameters:
+ - model_path (str): The path to the pre-trained model.
+ - encoded_tensor_path (str): The path to the encoded tensor file.
+ - dtype (torch.dtype): The data type for computation.
+ - device (str): The device to use for computation (e.g., "cuda" or "cpu").
+
+ Returns:
+ - torch.Tensor: The decoded video frames.
+ """
+ model = AutoencoderKLCogVideoX.from_pretrained(model_path, torch_dtype=dtype).to(device)
+ encoded_frames = torch.load(encoded_tensor_path, weights_only=True).to(device).to(dtype)
+ with torch.no_grad():
+ decoded_frames = []
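+        # Decode the latent clip in short temporal chunks (3 latent frames first, then
+        # 2 at a time) rather than all at once, clearing the fake context-parallel cache
+        # after each chunk to keep peak GPU memory down.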
+ for i in range(6): # 6 seconds
+ start_frame, end_frame = (0, 3) if i == 0 else (2 * i + 1, 2 * i + 3)
+ current_frames = model.decode(encoded_frames[:, :, start_frame:end_frame]).sample
+ decoded_frames.append(current_frames)
+ model.clear_fake_context_parallel_cache()
+
+ decoded_frames = torch.cat(decoded_frames, dim=2)
+ return decoded_frames
def save_video(tensor, output_path):
"""
- Saves the encoded video frames to a video file.
+ Saves the video frames to a video file.
Parameters:
- - tensor (torch.Tensor): The encoded video frames.
+ - tensor (torch.Tensor): The video frames tensor.
- output_path (str): The path to save the output video.
"""
- # Remove batch dimension and permute back to [49, 480, 720, 3]
frames = tensor[0].squeeze(0).permute(1, 2, 3, 0).cpu().numpy()
+ frames = np.clip(frames, 0, 1) * 255
+ frames = frames.astype(np.uint8)
- # Clip values to [0, 1] and convert to uint8
- frames = np.clip(frames, 0, 1)
- frames = (frames * 255).astype(np.uint8)
-
- # Save frames to video
writer = imageio.get_writer(output_path + "/output.mp4", fps=30)
for frame in frames:
writer.append_data(frame)
@@ -83,10 +101,14 @@ def save_video(tensor, output_path):
if __name__ == "__main__":
- parser = argparse.ArgumentParser(description="Convert a CogVideoX model to Diffusers")
+ parser = argparse.ArgumentParser(description="CogVideoX encode/decode demo")
parser.add_argument("--model_path", type=str, required=True, help="The path to the CogVideoX model")
- parser.add_argument("--video_path", type=str, required=True, help="The path to the video file")
- parser.add_argument("--output_path", type=str, default="./", help="The path to save the output video")
+ parser.add_argument("--video_path", type=str, help="The path to the video file (for encoding)")
+ parser.add_argument("--encoded_path", type=str, help="The path to the encoded tensor file (for decoding)")
+ parser.add_argument("--output_path", type=str, default=".", help="The path to save the output file")
+ parser.add_argument(
+ "--mode", type=str, choices=["encode", "decode", "both"], required=True, help="Mode: encode, decode, or both"
+ )
parser.add_argument(
"--dtype", type=str, default="float16", help="The data type for computation (e.g., 'float16' or 'float32')"
)
@@ -95,9 +117,21 @@ if __name__ == "__main__":
)
args = parser.parse_args()
- # Set device and dtype
device = torch.device(args.device)
dtype = torch.float16 if args.dtype == "float16" else torch.float32
- output = vae_demo(args.model_path, args.video_path, dtype, device)
- save_video(output, args.output_path)
+ if args.mode == "encode":
+ assert args.video_path, "Video path must be provided for encoding."
+ encoded_output = encode_video(args.model_path, args.video_path, dtype, device)
+ torch.save(encoded_output, args.output_path + "/encoded.pt")
+ print(f"Finished encoding the video to a tensor, save it to a file at {encoded_output}/encoded.pt")
+ elif args.mode == "decode":
+ assert args.encoded_path, "Encoded tensor path must be provided for decoding."
+ decoded_output = decode_video(args.model_path, args.encoded_path, dtype, device)
+ save_video(decoded_output, args.output_path)
+ print(f"Finished decoding the video and saved it to a file at {args.output_path}/output.mp4")
+ elif args.mode == "both":
+ assert args.video_path, "Video path must be provided for encoding."
+        encoded_output = encode_video(args.model_path, args.video_path, dtype, device)
+        torch.save(encoded_output, args.output_path + "/encoded.pt")
+        decoded_output = decode_video(args.model_path, args.output_path + "/encoded.pt", dtype, device)
+ save_video(decoded_output, args.output_path)
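If it helps to sanity-check the intermediate `encoded.pt` produced by `--mode encode`, the saved latent can be inspected directly; the shape noted in the comment is indicative only and depends on the clip length and resolution.

```python
# Quick check of the latent tensor written by --mode encode (run in the same directory).
import torch

latent = torch.load("encoded.pt", weights_only=True)
# Expect a 5-D tensor [batch, latent_channels, latent_frames, height / 8, width / 8];
# the exact sizes depend on the input video.
print(latent.shape, latent.dtype)
```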
diff --git a/inference/encoded.pt b/inference/encoded.pt
new file mode 100644
index 0000000..4367fec
Binary files /dev/null and b/inference/encoded.pt differ
diff --git a/inference/gradio_web_demo.py b/inference/gradio_web_demo.py
index 4b4cad0..b81c0ba 100644
--- a/inference/gradio_web_demo.py
+++ b/inference/gradio_web_demo.py
@@ -34,7 +34,7 @@ Video descriptions must have the same num of words as examples below. Extra word
def export_to_video_imageio(
- video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
+ video_frames: Union[List[np.ndarray], List[PIL.Image.Image]], output_video_path: str = None, fps: int = 8
) -> str:
"""
Export the video frames to a video file using imageio lib to Avoid "green screen" issue (for example CogVideoX)
@@ -62,20 +62,34 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
response = client.chat.completions.create(
messages=[
{"role": "system", "content": sys_prompt},
- {"role": "user",
- "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "a girl is on the beach"'},
- {"role": "assistant",
- "content": "A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat, white blouse, light blue jeans, and chic boots, against a backdrop of soft sky and sea. Moments later, she is seen mid-twirl, arms exuberant, with the lighting suggesting dawn or dusk. Then, she runs along the beach, her attire complemented by an off-white scarf and black ankle boots, the tranquil sea behind her. Finally, she holds a paper airplane, her pose reflecting joy and freedom, with the ocean's gentle waves and the sky's soft pastel hues enhancing the serene ambiance."},
- {"role": "user",
- "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "A man jogging on a football field"'},
- {"role": "assistant",
- "content": "A determined man in athletic attire, including a blue long-sleeve shirt, black shorts, and blue socks, jogs around a snow-covered soccer field, showcasing his solitary exercise in a quiet, overcast setting. His long dreadlocks, focused expression, and the serene winter backdrop highlight his dedication to fitness. As he moves, his attire, consisting of a blue sports sweatshirt, black athletic pants, gloves, and sneakers, grips the snowy ground. He is seen running past a chain-link fence enclosing the playground area, with a basketball hoop and children's slide, suggesting a moment of solitary exercise amidst the empty field."},
- {"role": "user",
- "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : " A woman is dancing, HD footage, close-up"'},
- {"role": "assistant",
- "content": "A young woman with her hair in an updo and wearing a teal hoodie stands against a light backdrop, initially looking over her shoulder with a contemplative expression. She then confidently makes a subtle dance move, suggesting rhythm and movement. Next, she appears poised and focused, looking directly at the camera. Her expression shifts to one of introspection as she gazes downward slightly. Finally, she dances with confidence, her left hand over her heart, symbolizing a poignant moment, all while dressed in the same teal hoodie against a plain, light-colored background."},
- {"role": "user",
- "content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"'},
+ {
+ "role": "user",
+ "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "a girl is on the beach"',
+ },
+ {
+ "role": "assistant",
+ "content": "A radiant woman stands on a deserted beach, arms outstretched, wearing a beige trench coat, white blouse, light blue jeans, and chic boots, against a backdrop of soft sky and sea. Moments later, she is seen mid-twirl, arms exuberant, with the lighting suggesting dawn or dusk. Then, she runs along the beach, her attire complemented by an off-white scarf and black ankle boots, the tranquil sea behind her. Finally, she holds a paper airplane, her pose reflecting joy and freedom, with the ocean's gentle waves and the sky's soft pastel hues enhancing the serene ambiance.",
+ },
+ {
+ "role": "user",
+ "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : "A man jogging on a football field"',
+ },
+ {
+ "role": "assistant",
+ "content": "A determined man in athletic attire, including a blue long-sleeve shirt, black shorts, and blue socks, jogs around a snow-covered soccer field, showcasing his solitary exercise in a quiet, overcast setting. His long dreadlocks, focused expression, and the serene winter backdrop highlight his dedication to fitness. As he moves, his attire, consisting of a blue sports sweatshirt, black athletic pants, gloves, and sneakers, grips the snowy ground. He is seen running past a chain-link fence enclosing the playground area, with a basketball hoop and children's slide, suggesting a moment of solitary exercise amidst the empty field.",
+ },
+ {
+ "role": "user",
+ "content": 'Create an imaginative video descriptive caption or modify an earlier caption for the user input : " A woman is dancing, HD footage, close-up"',
+ },
+ {
+ "role": "assistant",
+ "content": "A young woman with her hair in an updo and wearing a teal hoodie stands against a light backdrop, initially looking over her shoulder with a contemplative expression. She then confidently makes a subtle dance move, suggesting rhythm and movement. Next, she appears poised and focused, looking directly at the camera. Her expression shifts to one of introspection as she gazes downward slightly. Finally, she dances with confidence, her left hand over her heart, symbolizing a poignant moment, all while dressed in the same teal hoodie against a plain, light-colored background.",
+ },
+ {
+ "role": "user",
+ "content": f'Create an imaginative video descriptive caption or modify an earlier caption in ENGLISH for the user input: "{text}"',
+ },
],
model="glm-4-0520",
temperature=0.01,
@@ -88,12 +102,7 @@ def convert_prompt(prompt: str, retry_times: int = 3) -> str:
return prompt
-def infer(
- prompt: str,
- num_inference_steps: int,
- guidance_scale: float,
- progress=gr.Progress(track_tqdm=True)
-):
+def infer(prompt: str, num_inference_steps: int, guidance_scale: float, progress=gr.Progress(track_tqdm=True)):
torch.cuda.empty_cache()
prompt_embeds, _ = pipe.encode_prompt(
@@ -113,7 +122,6 @@ def infer(
negative_prompt_embeds=torch.zeros_like(prompt_embeds),
).frames[0]
-
return video
@@ -124,11 +132,12 @@ def save_video(tensor):
export_to_video_imageio(tensor[1:], video_path)
return video_path
+
def convert_to_gif(video_path):
clip = mp.VideoFileClip(video_path)
clip = clip.set_fps(8)
clip = clip.resize(height=240)
- gif_path = video_path.replace('.mp4', '.gif')
+ gif_path = video_path.replace(".mp4", ".gif")
clip.write_gif(gif_path, fps=8)
return gif_path
@@ -137,7 +146,7 @@ def delete_old_files():
while True:
now = datetime.now()
cutoff = now - timedelta(minutes=10)
- output_dir = './output'
+ output_dir = "./output"
for filename in os.listdir(output_dir):
file_path = os.path.join(output_dir, filename)
if os.path.isfile(file_path):
@@ -169,13 +178,16 @@ with gr.Blocks() as demo:
prompt = gr.Textbox(label="Prompt (Less than 200 Words)", placeholder="Enter your prompt here", lines=5)
with gr.Row():
gr.Markdown(
-                "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one.")
+                "✨Upon pressing the enhanced prompt button, we will use [GLM-4 Model](https://github.com/THUDM/GLM-4) to polish the prompt and overwrite the original one."
+            )
enhance_button = gr.Button("✨ Enhance Prompt(Optional)")
with gr.Column():
-            gr.Markdown("**Optional Parameters** (default values are recommended)<br>"
-                        "Turn Inference Steps larger if you want more detailed video, but it will be slower.<br>"
-                        "50 steps are recommended for most cases. will cause 120 seconds for inference.<br>")
+            gr.Markdown(
+                "**Optional Parameters** (default values are recommended)<br>"
+                "Turn Inference Steps larger if you want more detailed video, but it will be slower.<br>"
+                "50 steps are recommended for most cases; inference will take about 120 seconds.<br>"
+            )
with gr.Row():
num_inference_steps = gr.Number(label="Inference Steps", value=50)
guidance_scale = gr.Number(label="Guidance Scale", value=6.0)
@@ -222,7 +234,6 @@ with gr.Blocks() as demo:
""")
-
def generate(prompt, num_inference_steps, guidance_scale, progress=gr.Progress(track_tqdm=True)):
tensor = infer(prompt, num_inference_steps, guidance_scale, progress=progress)
video_path = save_video(tensor)
@@ -232,22 +243,16 @@ with gr.Blocks() as demo:
return video_path, video_update, gif_update
-
def enhance_prompt_func(prompt):
return convert_prompt(prompt, retry_times=1)
-
generate_button.click(
generate,
inputs=[prompt, num_inference_steps, guidance_scale],
- outputs=[video_output, download_video_button, download_gif_button]
+ outputs=[video_output, download_video_button, download_gif_button],
)
- enhance_button.click(
- enhance_prompt_func,
- inputs=[prompt],
- outputs=[prompt]
- )
+ enhance_button.click(enhance_prompt_func, inputs=[prompt], outputs=[prompt])
if __name__ == "__main__":
demo.launch(server_name="127.0.0.1", server_port=7870, share=True)
diff --git a/sat/README_ja.md b/sat/README_ja.md
index de5def7..deb830f 100644
--- a/sat/README_ja.md
+++ b/sat/README_ja.md
@@ -1,6 +1,7 @@
# SAT CogVideoX-2B
-This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights, as well as the fine-tuning code for SAT weights.
+This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights, as well as the
+fine-tuning code for SAT weights.
This code is the framework the team used to train the model. It has few comments, so it needs to be studied carefully.
@@ -86,7 +87,9 @@ first_stage_config:
ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE model path
```
-+ If you want to use a txt file that holds multiple prompts, refer to and modify `configs/test.txt`, with one prompt per line. If you are unsure how to write prompts, you can first use [this code](../inference/convert_demo.py) to have an LLM refine them.
++ If you want to use a txt file that holds multiple prompts, refer to and modify `configs/test.txt`,
+  with one prompt per line. If you are unsure how to write prompts, you can first use [this code](../inference/convert_demo.py)
+  to have an LLM refine them.
+ To use command-line input instead, modify it as follows:
```yaml
@@ -113,7 +116,8 @@ bash inference.sh
### Preparing the Environment
-Currently, SAT needs to be installed from source, which is required for fine-tuning to work properly. This issue will be resolved in a future stable release.
+Please note that, at the moment, SAT must be installed from source in order to fine-tune properly.
+This is because you need the latest features, which have not yet been released in the pip package version. We plan to resolve this issue in a future stable release.
```
git clone https://github.com/THUDM/SwissArmyTransformer.git
@@ -143,7 +147,9 @@ pip install -e .
### Modifying the Configuration Files
-We support two fine-tuning methods, `Lora` and full-parameter fine-tuning. Both apply only to the `transformer` part; the `VAE` part is not modified, and `T5` is used only as an encoder.
+We support two fine-tuning methods, `Lora` and
+full-parameter fine-tuning. Both apply only to the `transformer`
+part; the `VAE` part is not modified, and `T5` is used only as an encoder.
Modify `configs/cogvideox_2b_sft.yaml` (for full-parameter fine-tuning) as follows.
@@ -190,7 +196,8 @@ model:
1. Run the fine-tuning code to start fine-tuning.
```shell
-bash finetune.sh
+bash finetune_single_gpu.sh # Single GPU
+bash finetune_multi_gpus.sh # Multi GPUs
```
### Converting to Huggingface Diffusers-Supported Weights
diff --git a/sat/README_zh.md b/sat/README_zh.md
index 8566313..3335e52 100644
--- a/sat/README_zh.md
+++ b/sat/README_zh.md
@@ -112,7 +112,6 @@ bash inference.sh
### Preparing the Environment
-
Please note that, at present, SAT must be installed from source in order to fine-tune properly.
This is because you need features supported by the latest code, which has not yet been released as a pip package.
We will resolve this issue in a future stable release.