Merge pull request #143 from THUDM/CogVideoX_dev

finetune yaml update
This commit is contained in:
zR 2024-08-20 21:21:10 +08:00 committed by GitHub
commit 6a93ac614c
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
17 changed files with 588 additions and 230 deletions

View File

@ -14,7 +14,7 @@
📚 Check here to view <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
@ -22,18 +22,22 @@
## Update and News
- 🔥🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`. Fine-tuning
- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by
CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following
the [tutorial](tools/venhancer/README_zh.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`.
Fine-tuning
no longer requires installing `SwissArmyTransformer` from source. Additionally, the `Tied VAE` technique has been
applied in the implementation within the `diffusers` library. Please install `diffusers` and `accelerate` libraries
from source. Inference for CogVideoX now requires only 12GB of VRAM.
from source. Inference for CogVideoX now requires only 12GB of VRAM. The inference code needs to be modified. Please
check [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arxiv. Feel free to check out
the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be
performed
on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can
reconstruct
the video almost losslessly.
reconstruct the video almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**the first model in the CogVideoX series of video
generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now you can see in `CogVideo` branch)the **first**

View File

@ -14,15 +14,24 @@
📚 <a href="https://arxiv.org/abs/2408.06072" target="_blank">論文</a> をチェック
</p>
<p align="center">
👋 <a href="resources/WECHAT.md" target="_blank">WeChat</a><a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a> に参加
👋 <a href="resources/WECHAT.md" target="_blank">WeChat</a><a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a> に参加
</p>
<p align="center">
📍 <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a><a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">APIプラットフォーム</a> を訪問して、より大規模な商用ビデオ生成モデルを体験
</p>
## 更新とニュース
- 🔥🔥 **ニュース**: 2024/8/15: CogVideoX の依存関係である`SwissArmyTransformer`の依存が`0.4.12`にアップグレードされました。これにより、微調整の際に`SwissArmyTransformer`をソースコードからインストールする必要がなくなりました。同時に、`Tied VAE` 技術が `diffusers` ライブラリの実装に適用されました。`diffusers``accelerate` ライブラリをソースコードからインストールしてください。CogVdideoX の推論には 12GB の VRAM だけが必要です。
- 🔥 **ニュース**: ```2024/8/12```: CogVideoX 論文がarxivにアップロードされました。ぜひ[論文](https://arxiv.org/abs/2408.06072)をご覧ください。
- 🔥🔥 **ニュース**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) は CogVideoX
が生成したビデオの強化をサポートしました。より高い解像度とより高品質なビデオレンダリングを実現します。[チュートリアル](tools/venhancer/README_ja.md)
に従って、ぜひお試しください。
- 🔥**ニュース**: 2024/8/15: CogVideoX の依存関係である`SwissArmyTransformer`の依存が`0.4.12`
にアップグレードされました。これにより、微調整の際に`SwissArmyTransformer`
をソースコードからインストールする必要がなくなりました。同時に、`Tied VAE` 技術が `diffusers`
ライブラリの実装に適用されました。`diffusers``accelerate` ライブラリをソースコードからインストールしてください。CogVdideoX
の推論には 12GB の VRAM だけが必要です。 推論コードの修正が必要です。[cli_demo](inference/cli_demo.py)をご確認ください。
- 🔥 **ニュース**: ```2024/8/12```: CogVideoX
論文がarxivにアップロードされました。ぜひ[論文](https://arxiv.org/abs/2408.06072)をご覧ください。
- 🔥 **ニュース**: ```2024/8/7```: CogVideoX は `diffusers` バージョン 0.30.0 に統合されました。単一の 3090 GPU
で推論を実行できます。詳細については [コード](inference/cli_demo.py) を参照してください。
- 🔥 **ニュース**: ```2024/8/6```: **CogVideoX-2B** で使用される **3D Causal VAE** もオープンソース化しました。これにより、ビデオをほぼ無損失で再構築できます。

View File

@ -15,7 +15,7 @@
📚 查看 <a href="https://arxiv.org/abs/2408.06072" target="_blank">论文</a>
</p>
<p align="center">
👋 加入我们的 <a href="resources/WECHAT.md" target="_blank">微信</a><a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 加入我们的 <a href="resources/WECHAT.md" target="_blank">微信</a><a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 前往<a href="https://chatglm.cn/video?fr=osm_cogvideox"> 清影</a><a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9"> API平台</a> 体验更大规模的商业版视频生成模型。
@ -23,9 +23,12 @@
## 项目更新
- 🔥🔥 **News**: ```2024/8/15```: CogVideoX 依赖中`SwissArmyTransformer`依赖升级到`0.4.12`,
- 🔥🔥**News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) 已经支持对 CogVideoX
生成的视频进行增强,实现更高分辨率,更高质量的视频渲染。欢迎大家按照[教程](tools/venhancer/README_zh.md)体验使用。
- 🔥**News**: ```2024/8/15```: CogVideoX 依赖中`SwissArmyTransformer`依赖升级到`0.4.12`,
微调不再需要从源代码安装`SwissArmyTransformer`。同时,`Tied VAE` 技术已经被应用到 `diffusers`
库中的实现,请从源代码安装 `diffusers``accelerate` 库,推理 CogVdideoX 仅需 12GB显存。
库中的实现,请从源代码安装 `diffusers``accelerate` 库,推理 CogVdideoX 仅需
12GB显存。推理代码需要修改请查看 [cli_demo](inference/cli_demo.py)
- 🔥 **News**: ```2024/8/12```: CogVideoX 论文已上传到arxiv欢迎查看[论文](https://arxiv.org/abs/2408.06072)。
- 🔥 **News**: ```2024/8/7```: CogVideoX 已经合并入 `diffusers`
0.30.0版本单张3090可以推理详情请见[代码](inference/cli_demo.py)。

View File

@ -1,6 +1,6 @@
diffusers==0.30.0
git+https://github.com/huggingface/diffusers.git@main#egg=diffusers
transformers==4.44.0
accelerate==0.33.0
git+https://github.com/huggingface/accelerate.git@main#egg=accelerate
sentencepiece==0.2.0 # T5
SwissArmyTransformer==0.4.12 # Inference
torch==2.4.0 # Tested in 2.2 2.3 2.4 and 2.5

View File

@ -4,7 +4,6 @@
[日本語で読む](./README_ja.md)
This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
fine-tuning code for SAT weights.
@ -69,110 +68,49 @@ loading it into Deepspeed in Finetune.
0 directories, 8 files
```
3. Modify the file `configs/cogvideox_2b_infer.yaml`.
Each text file shares the same name as its corresponding video, serving as the label for that video. Videos and labels
should be matched one-to-one. Generally, a single video should not be associated with multiple labels.
```yaml
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path
For style fine-tuning, please prepare at least 50 videos and labels with similar styles to ensure proper fitting.
conditioner_config:
target: sgm.modules.GeneralConditioner
params:
emb_models:
- is_trainable: false
input_key: txt
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
model_dir: "google/t5-v1_1-xxl" ## T5 model path
max_length: 226
### Modifying Configuration Files
first_stage_config:
target: sgm.models.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE model path
```
+ If using txt to save multiple prompts, please refer to `configs/test.txt` for modification. One prompt per line. If
you don't know how to write prompts, you can first use [this code](../inference/convert_demo.py) to call LLM for
refinement.
+ If using the command line as input, modify
```yaml
input_type: cli
```
so that prompts can be entered from the command line.
If you want to change the output video directory, you can modify:
```yaml
output_dir: outputs/
```
The default is saved in the `.outputs/` folder.
4. Run the inference code to start inference
```shell
bash inference.sh
```
## Fine-Tuning the Model
### Preparing the Dataset
The dataset format should be as follows:
We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Please note that both methods only fine-tune
the `transformer` part and do not modify the `VAE` section. `T5` is used solely as an Encoder. Please modify
the `configs/sft.yaml` (for full-parameter fine-tuning) file as follows:
```
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   ├── ...
└── videos
├── 1.mp4
├── 2.mp4
├── ...
```
Each txt file should have the same name as its corresponding video file and contain the labels for that video. Each
video should have a one-to-one correspondence with a label. Typically, a video should not have multiple labels.
For style fine-tuning, please prepare at least 50 videos and labels with similar styles to facilitate fitting.
### Modifying the Configuration File
We support both `Lora` and `full-parameter fine-tuning` methods. Please note that both fine-tuning methods only apply to
the `transformer` part. The `VAE part` is not modified. `T5` is only used as an Encoder.
the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows.
```yaml
# checkpoint_activations: True ## using gradient checkpointing (both checkpoint_activations in the configuration file need to be set to True)
# checkpoint_activations: True ## Using gradient checkpointing (Both checkpoint_activations in the config file need to be set to True)
model_parallel_size: 1 # Model parallel size
experiment_name: lora-disney # Experiment name (do not change)
mode: finetune # Mode (do not change)
load: "{your_CogVideoX-2b-sat_path}/transformer" # Transformer model path
no_load_rng: True # Whether to load the random seed
train_iters: 1000 # Number of training iterations
eval_iters: 1 # Number of evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Batch size for evaluation
experiment_name: lora-disney # Experiment name (do not modify)
mode: finetune # Mode (do not modify)
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path
no_load_rng: True # Whether to load random seed
train_iters: 1000 # Training iterations
eval_iters: 1 # Evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Evaluation batch size
save: ckpts # Model save path
save_interval: 100 # Model save interval
log_interval: 20 # Log output interval
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ] # Training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for data loading
force_train: True # Allow missing keys when loading ckpt (refer to T5 and VAE which are loaded independently)
only_log_video_latents: True # Avoid using VAE decoder when eval to save memory
valid_data: [ "your val data path" ] # Training and validation datasets can be the same
split: 1,0,0 # Training, validation, and test set ratio
num_workers: 8 # Number of worker threads for data loader
force_train: True # Allow missing keys when loading checkpoint (T5 and VAE are loaded separately)
only_log_video_latents: True # Avoid memory overhead caused by VAE decode
deepspeed:
bf16:
enabled: False # For CogVideoX-2B set to False and for CogVideoX-5B set to True
fp16:
enabled: True # For CogVideoX-2B set to True and for CogVideoX-5B set to False
```
If you wish to use Lora fine-tuning, you also need to modify:
If you wish to use Lora fine-tuning, you also need to modify the `cogvideox_<model_parameters>_lora` file:
```yaml
Here, take `CogVideoX-2B` as a reference:
```
model:
scale_factor: 1.15258426
disable_first_stage_autocast: true
@ -186,15 +124,47 @@ model:
r: 256
```
### Fine-Tuning and Validation
### Modifying Run Scripts
1. Run the inference code to start fine-tuning.
Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` to select the configuration file. Below are two examples:
```shell
1. If you want to use the `CogVideoX-2B` model and the `Lora` method, you need to modify `finetune_single_gpu.sh`
or `finetune_multi_gpus.sh`:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```
2. If you want to use the `CogVideoX-2B` model and the `full-parameter fine-tuning` method, you need to
modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```
### Fine-Tuning and Evaluation
Run the inference code to start fine-tuning.
```
bash finetune_single_gpu.sh # Single GPU
bash finetune_multi_gpus.sh # Multi GPUs
```
### Using the Fine-Tuned Model
The fine-tuned model cannot be merged; here is how to modify the inference configuration file `inference.sh`:
```
run_cmd="$environs python sample_video.py --base configs/cogvideox_<model_parameters>_lora.yaml configs/inference.yaml --seed 42"
```
Then, execute the code:
```
bash inference.sh
```
### Converting to Huggingface Diffusers Supported Weights
The SAT weight format is different from Huggingface's weight format and needs to be converted. Please run:

View File

@ -140,57 +140,94 @@ bash inference.sh
### 設定ファイルの変更
`Lora`
全パラメータファインチューニングの2つの方法をサポートしています。これらのファインチューニング方法は `transformer`
部分にのみ適用されます。`VAE` 部分は変更されません。`T5` はエンコーダーとしてのみ使用されます
`Lora`フルパラメータ微調整の2つの方法をサポートしています。両方の微調整方法は、`transformer` 部分のみを微調整し、`VAE`
部分には変更を加えないことに注意してください。`T5` はエンコーダーとしてのみ使用されます。以下のように `configs/sft.yaml` (
フルパラメータ微調整用) ファイルを変更してください
`configs/cogvideox_2b_sft.yaml` (全量ファインチューニング用) を次のように変更します。
```yaml
# checkpoint_activations: True ## using gradient checkpointing (設定ファイル内の2つのcheckpoint_activationsを両方Trueに設定する必要があります)
```
# checkpoint_activations: True ## 勾配チェックポイントを使用する場合 (設定ファイル内の2つの checkpoint_activations を True に設定する必要があります)
model_parallel_size: 1 # モデル並列サイズ
experiment_name: lora-disney # 実験名 (変更しないでください)
mode: finetune # モード (変更しないでください)
load: "{your_CogVideoX-2b-sat_path}/transformer" # Transformer モデルパス
no_load_rng: True # ランダムシードをロードするかどうか
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer モデルパス
no_load_rng: True # 乱数シードを読み込むかどうか
train_iters: 1000 # トレーニングイテレーション数
eval_iters: 1 # 評価イテレーション数
eval_interval: 100 # 評価間隔
eval_batch_size: 1 # 評価バッチサイズ
eval_interval: 100 # 評価間隔
eval_batch_size: 1 # 評価バッチサイズ
save: ckpts # モデル保存パス
save_interval: 100 # モデル保存間隔
log_interval: 20 # ログ出力間隔
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ] # トレーニングセットと検証セットは同じでもかまいません
split: 1,0,0 # トレーニングセット、検証セット、テストセットの比率
valid_data: [ "your val data path" ] # トレーニングデータと評価データは同じでも構いません
split: 1,0,0 # トレーニングセット、評価セット、テストセットの割合
num_workers: 8 # データローダーのワーカースレッド数
force_train: True # ckpt をロードする際に missing keys を許可するかどうか (T5 と VAE は独立してロードされます)
only_log_video_latents: True # VAE デコーダーを使用しないようにしてメモリを節約します
force_train: True # チェックポイントをロードするときに欠落したキーを許可 (T5 と VAE は別々にロードされます)
only_log_video_latents: True # VAE のデコードによるメモリオーバーヘッドを回避
deepspeed:
bf16:
enabled: False # CogVideoX-2B の場合は False に設定し、CogVideoX-5B の場合は True に設定
fp16:
enabled: True # CogVideoX-2B の場合は True に設定し、CogVideoX-5B の場合は False に設定
```
Lora ファインチューニングを使用する場合は、次のように変更する必要があります:
Lora 微調整を使用したい場合は、`cogvideox_<model_parameters>_lora` ファイルも変更する必要があります。
```yaml
ここでは、`CogVideoX-2B` を参考にします。
```
model:
scale_factor: 1.15258426
disable_first_stage_autocast: true
not_trainable_prefixes: [ 'all' ] ## コメント解除
not_trainable_prefixes: [ 'all' ] ## コメント解除
log_keys:
- txt'
lora_config: ## コメント解除
lora_config: ## コメント解除
target: sat.model.finetune.lora2.LoraMixin
params:
r: 256
```
### ファインチューニングと検証
### 実行スクリプトの変更
1. 推論コードを実行してファインチューニングを開始します。
設定ファイルを選択するために `finetune_single_gpu.sh` または `finetune_multi_gpus.sh` を編集します。以下に2つの例を示します。
```shell
bash finetune_single_gpu.sh # Single GPU
bash finetune_multi_gpus.sh # Multi GPUs
1. `CogVideoX-2B` モデルを使用し、`Lora` 手法を利用する場合は、`finetune_single_gpu.sh` または `finetune_multi_gpus.sh`
を変更する必要があります。
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```
2. `CogVideoX-2B` モデルを使用し、`フルパラメータ微調整` 手法を利用する場合は、`finetune_single_gpu.sh`
または `finetune_multi_gpus.sh` を変更する必要があります。
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```
### 微調整と評価
推論コードを実行して微調整を開始します。
```
bash finetune_single_gpu.sh # シングルGPU
bash finetune_multi_gpus.sh # マルチGPU
```
### 微調整後のモデルの使用
微調整されたモデルは統合できません。ここでは、推論設定ファイル `inference.sh` を変更する方法を示します。
```
run_cmd="$environs python sample_video.py --base configs/cogvideox_<model_parameters>_lora.yaml configs/inference.yaml --seed 42"
```
その後、次のコードを実行します。
```
bash inference.sh
```
### Huggingface Diffusers サポートのウェイトに変換

View File

@ -50,7 +50,9 @@ git clone https://huggingface.co/THUDM/CogVideoX-2b.git
mkdir t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
```
通过上述方案,你将会得到一个 safetensor 格式的T5文件确保在 Deepspeed微调过程中读入的时候不会报错。
```
├── added_tokens.json
├── config.json
@ -63,6 +65,7 @@ mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* t5-v1_1-xxl
0 directories, 8 files
```
3. 修改`configs/cogvideox_2b_infer.yaml`中的文件。
```yaml
@ -138,7 +141,7 @@ bash inference.sh
我们支持 `Lora` 和 全参数微调两种方式。请注意,两种微调方式都仅仅对 `transformer` 部分进行微调。不改动 `VAE` 部分。`T5`仅作为
Encoder 使用。
部分。 请按照以下方式修改`configs/cogvideox_2b_sft.yaml`(全量微调) 中的文件。
部分。 请按照以下方式修改`configs/sft.yaml`(全量微调) 中的文件。
```yaml
# checkpoint_activations: True ## using gradient checkpointing (配置文件中的两个checkpoint_activations都需要设置为True)
@ -160,9 +163,16 @@ Encoder 使用。
num_workers: 8 # 数据加载器的工作线程数
force_train: True # 在加载checkpoint时允许missing keys (T5 和 VAE 单独加载)
only_log_video_latents: True # 避免VAE decode带来的显存开销
deepspeed:
bf16:
enabled: False # For CogVideoX-2B Turn to False and For CogVideoX-5B Turn to True
fp16:
enabled: True # For CogVideoX-2B Turn to True and For CogVideoX-5B Turn to False
```
如果你希望使用 Lora 微调,你还需要修改:
如果你希望使用 Lora 微调,你还需要修改`cogvideox_<模型参数>_lora` 文件:
这里以 `CogVideoX-2B` 为参考:
```yaml
model:
@ -178,15 +188,46 @@ model:
r: 256
```
### 修改运行脚本
编辑`finetune_single_gpu.sh` 或者 `finetune_multi_gpus.sh`,选择配置文件。下面是两个例子:
1. 如果您想使用 `CogVideoX-2B` 模型并使用`Lora`方案,您需要修改`finetune_single_gpu.sh` 或者 `finetune_multi_gpus.sh`:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```
2. 如果您想使用 `CogVideoX-2B` 模型并使用`全量微调`方案,您需要修改`finetune_single_gpu.sh`
或者 `finetune_multi_gpus.sh`:
```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```
### 微调和验证
1. 运行推理代码,即可开始微调。
运行推理代码,即可开始微调。
```shell
bash finetune_single_gpu.sh # Single GPU
bash finetune_multi_gpus.sh # Multi GPUs
```
### 使用微调后的模型
微调后的模型无法合并,这里展现了如何修改推理配置文件 `inference.sh`
```
run_cmd="$environs python sample_video.py --base configs/cogvideox_<模型参数>_lora.yaml configs/inference.yaml --seed 42"
```
然后,执行代码:
```
bash inference.sh
```
### 转换到 Huggingface Diffusers 库支持的权重
SAT 权重格式与 Huggingface 的权重格式不同,需要转换。请运行

View File

@ -1,75 +1,9 @@
args:
checkpoint_activations: True ## using gradient checkpointing
model_parallel_size: 1
experiment_name: lora-disney
mode: finetune
load: "CogVideoX-2b-sat/transformer"
no_load_rng: True
train_iters: 1000
eval_iters: 1
eval_interval: 100
eval_batch_size: 1
save: ckpts
save_interval: 100
log_interval: 20
train_data: ["disney"]
valid_data: ["disney"]
split: 1,0,0
num_workers: 8
force_train: True
only_log_video_latents: True
data:
target: data_video.SFTDataset
params:
video_size: [480, 720]
fps: 8
max_num_frames: 49
skip_frms_num: 3.
deepspeed:
train_micro_batch_size_per_gpu: 1
gradient_accumulation_steps: 1
steps_per_print: 50
gradient_clipping: 0.1
zero_optimization:
stage: 2
cpu_offload: false
contiguous_gradients: false
overlap_comm: true
reduce_scatter: true
reduce_bucket_size: 1000000000
allgather_bucket_size: 1000000000
load_from_fp32_weights: false
zero_allow_untested_optimizer: true
bf16:
enabled: False
fp16:
enabled: True
loss_scale: 0
loss_scale_window: 400
hysteresis: 2
min_loss_scale: 1
optimizer:
type: sat.ops.FusedEmaAdam
params:
lr: 0.0002
betas: [0.9, 0.95]
eps: 1e-8
weight_decay: 1e-4
activation_checkpointing:
partition_activations: false
contiguous_memory_optimization: false
wall_clock_breakdown: false
model:
scale_factor: 1.15258426
disable_first_stage_autocast: true
not_trainable_prefixes: ['all'] ## Using Lora
log_keys:
- txt
denoiser_config:
target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
params:
@ -119,11 +53,6 @@ model:
height_interpolation: 1.875
width_interpolation: 1.875
lora_config: ## Using Lora
target: sat.model.finetune.lora2.LoraMixin
params:
r: 128
patch_embed_config:
target: dit_video_concat.ImagePatchEmbeddingMixin
params:
@ -146,14 +75,14 @@ model:
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
model_dir: "google/t5-v1_1-xxl"
model_dir: "t5-v1_1-xxl"
max_length: 226
first_stage_config:
target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt"
ckpt_path: "cogvideox-2b-sat/vae/3d-vae.pt"
ignore_keys: [ 'loss' ]
loss_config:
@ -190,7 +119,7 @@ model:
attn_resolutions: [ ]
num_res_blocks: 3
dropout: 0.0
gather_norm: false
gather_norm: False
loss_fn_config:
target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss

View File

@ -1,19 +1,7 @@
args:
latent_channels: 16
mode: inference
load: "CogVideoX-2b-sat/transformer"
batch_size: 1
input_type: txt
input_file: test.txt
sampling_num_frames: 13 # Must be 13, 11 or 9
sampling_fps: 8
fp16: True
output_dir: outputs/
force_inference: True
model:
scale_factor: 1.15258426
disable_first_stage_autocast: true
not_trainable_prefixes: ['all'] ## Using Lora
log_keys:
- txt
@ -50,6 +38,7 @@ model:
num_attention_heads: 30
transformer_args:
checkpoint_activations: True ## using gradient checkpointing
vocab_size: 1
max_sequence_length: 64
layernorm_order: pre
@ -65,6 +54,11 @@ model:
height_interpolation: 1.875
width_interpolation: 1.875
lora_config:
target: sat.model.finetune.lora2.LoraMixin
params:
r: 128
patch_embed_config:
target: dit_video_concat.ImagePatchEmbeddingMixin
params:
@ -87,14 +81,14 @@ model:
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
model_dir: "google/t5-v1_1-xxl"
model_dir: "t5-v1_1-xxl"
max_length: 226
first_stage_config:
target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
ckpt_path: "CogVideoX-2b-sat/vae/3d-vae.pt"
ckpt_path: "cogvideox-2b-sat/vae/3d-vae.pt"
ignore_keys: [ 'loss' ]
loss_config:
@ -131,7 +125,7 @@ model:
attn_resolutions: [ ]
num_res_blocks: 3
dropout: 0.0
gather_norm: false
gather_norm: False
loss_fn_config:
target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss

View File

@ -0,0 +1,15 @@
args:
latent_channels: 16
mode: inference
# load: "{your_CogVideoX-2b-sat_path}/transformer" # This is for Full model without lora adapter
# load: "{your lora folder} such as zRzRzRzRzRzRzR/lora-disney-08-20-13-28" # This is for Full model without lora adapter
batch_size: 1
input_type: txt
input_file: configs/test.txt
sampling_num_frames: 13 # Must be 13, 11 or 9
sampling_fps: 8
fp16: True # For CogVideoX-2B
# bf16: True # For CogVideoX-5B
output_dir: outputs/
force_inference: True

65
sat/configs/sft.yaml Normal file
View File

@ -0,0 +1,65 @@
args:
checkpoint_activations: True ## using gradient checkpointing
model_parallel_size: 1
experiment_name: lora-disney
mode: finetune
load: "cogvideox-2b-sat/transformer"
no_load_rng: True
train_iters: 1000 # Suggest more than 1000 For Lora and SFT For 500 is enough
eval_iters: 1
eval_interval: 100
eval_batch_size: 1
save: ckpts_2b_lora
save_interval: 500
log_interval: 20
train_data: [ "disney" ] # Train data path
valid_data: [ "disney" ] # Validation data path, can be the same as train_data(not recommended)
split: 1,0,0
num_workers: 8
force_train: True
only_log_video_latents: True
data:
target: data_video.SFTDataset
params:
video_size: [ 480, 720 ]
fps: 8
max_num_frames: 49
skip_frms_num: 3.
deepspeed:
# Minimun for 16 videos per batch for ALL GPUs, This setting is for 8 x A100 GPUs
train_micro_batch_size_per_gpu: 2
gradient_accumulation_steps: 1
steps_per_print: 50
gradient_clipping: 0.1
zero_optimization:
stage: 2
cpu_offload: false
contiguous_gradients: false
overlap_comm: true
reduce_scatter: true
reduce_bucket_size: 1000000000
allgather_bucket_size: 1000000000
load_from_fp32_weights: false
zero_allow_untested_optimizer: true
bf16:
enabled: False # For CogVideoX-2B Turn to False and For CogVideoX-5B Turn to True
fp16:
enabled: True # For CogVideoX-2B Turn to True and For CogVideoX-5B Turn to False
loss_scale: 0
loss_scale_window: 400
hysteresis: 2
min_loss_scale: 1
optimizer:
type: sat.ops.FusedEmaAdam
params:
lr: 0.001 # Between 1E-3 and 5E-4 For Lora and 1E-5 For SFT
betas: [ 0.9, 0.95 ]
eps: 1e-8
weight_decay: 1e-4
activation_checkpointing:
partition_activations: false
contiguous_memory_optimization: false
wall_clock_breakdown: false

View File

@ -1,8 +1,8 @@
#! /bin/bash
echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo "RUN on $(hostname), CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
run_cmd="torchrun --standalone --nproc_per_node=4 train_video.py --base configs/cogvideox_2b_sft.yaml --seed $RANDOM
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
echo ${run_cmd}
eval ${run_cmd}

View File

@ -4,7 +4,7 @@ echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
run_cmd="$environs python train_video.py --base configs/cogvideox_2b_sft.yaml --seed $RANDOM"
run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
echo ${run_cmd}
eval ${run_cmd}

View File

@ -4,7 +4,7 @@ echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
run_cmd="$environs python sample_video.py --base configs/cogvideox_2b_infer.yaml"
run_cmd="$environs python sample_video.py --base configs/cogvideox_2b.yaml configs/inference.yaml --seed $RANDOM"
echo ${run_cmd}
eval ${run_cmd}

98
tools/venhancer/README.md Normal file
View File

@ -0,0 +1,98 @@
# Enhance CogVideoX Generated Videos with VEnhancer
This tutorial will guide you through using the VEnhancer tool to enhance videos generated by CogVideoX, including
achieving higher frame rates and higher resolutions.
## Model Introduction
VEnhancer implements spatial super-resolution, temporal super-resolution (frame interpolation), and video refinement in
a unified framework. It can flexibly adapt to different upsampling factors (e.g., 1x~8x) for spatial or temporal
super-resolution. Additionally, it provides flexible control to modify the refinement strength, enabling it to handle
diverse video artifacts.
VEnhancer follows the design of ControlNet, copying the architecture and weights of the multi-frame encoder and middle
block from a pre-trained video diffusion model to build a trainable conditional network. This video ControlNet accepts
low-resolution keyframes and noisy full-frame latents as inputs. In addition to the time step t and prompt, our proposed
video-aware conditioning also includes noise augmentation level σ and downscaling factor s as additional network
conditioning inputs.
## Hardware Requirements
+ Operating System: Linux (requires xformers dependency)
+ Hardware: NVIDIA GPU with at least 60GB of VRAM per card. Machines such as H100, A100 are recommended.
## Quick Start
1. Clone the repository and install dependencies as per the official instructions:
```shell
git clone https://github.com/Vchitect/VEnhancer.git
cd VEnhancer
## Torch and other dependencies can use those from CogVideoX. If you need to create a new environment, use the following commands:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
## Install required dependencies
pip install -r requirements.txt
```
Where:
- `input_path` is the path to the input video
- `prompt` is the description of the video content. The prompt used by this tool should be shorter, not exceeding 77
words. You may need to simplify the prompt used for generating the CogVideoX video.
- `target_fps` is the target frame rate for the video. Typically, 16 fps is already smooth, with 24 fps as the default
value.
- `up_scale` is recommend to be set to 2,3,4. The target resolution is limited to be around 2k and below.
- `noise_aug` value depends on the input video quality. Lower quality needs higher noise levels, which corresponds to
stronger refinement. 250~300 is for very low-quality videos. good videos: <= 200.
- `steps` if you want fewer steps, please change solver_mode to "normal" first, then decline the number of steps. "
fast" solver_mode has fixed steps (15).
The code will automatically download the required models from Hugging Face during execution.
Typical runtime logs are as follows:
```shell
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
2024-08-20 13:25:17,553 - video_to_video - INFO - checkpoint_path: ./ckpts/venhancer_paper.pt
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/open_clip/factory.py:88: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(checkpoint_path, map_location=map_location)
2024-08-20 13:25:37,486 - video_to_video - INFO - Build encoder with FrozenOpenCLIPEmbedder
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:35: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
load_dict = torch.load(cfg.model_path, map_location='cpu')
2024-08-20 13:25:55,391 - video_to_video - INFO - Load model path ./ckpts/venhancer_paper.pt, with local status <All keys matched successfully>
2024-08-20 13:25:55,392 - video_to_video - INFO - Build diffusion with GaussianDiffusion
2024-08-20 13:26:16,092 - video_to_video - INFO - input video path: inputs/000000.mp4
2024-08-20 13:26:16,093 - video_to_video - INFO - text: Wide-angle aerial shot at dawn,soft morning light casting long shadows,an elderly man walking his dog through a quiet,foggy park,trees and benches in the background,peaceful and serene atmosphere
2024-08-20 13:26:16,156 - video_to_video - INFO - input frames length: 49
2024-08-20 13:26:16,156 - video_to_video - INFO - input fps: 8.0
2024-08-20 13:26:16,156 - video_to_video - INFO - target_fps: 24.0
2024-08-20 13:26:16,311 - video_to_video - INFO - input resolution: (480, 720)
2024-08-20 13:26:16,312 - video_to_video - INFO - target resolution: (1320, 1982)
2024-08-20 13:26:16,312 - video_to_video - INFO - noise augmentation: 250
2024-08-20 13:26:16,312 - video_to_video - INFO - scale s is set to: 8
2024-08-20 13:26:16,399 - video_to_video - INFO - video_data shape: torch.Size([145, 3, 1320, 1982])
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:108: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with amp.autocast(enabled=True):
2024-08-20 13:27:19,605 - video_to_video - INFO - step: 0
2024-08-20 13:30:12,020 - video_to_video - INFO - step: 1
2024-08-20 13:33:04,956 - video_to_video - INFO - step: 2
2024-08-20 13:35:58,691 - video_to_video - INFO - step: 3
2024-08-20 13:38:51,254 - video_to_video - INFO - step: 4
2024-08-20 13:41:44,150 - video_to_video - INFO - step: 5
2024-08-20 13:44:37,017 - video_to_video - INFO - step: 6
2024-08-20 13:47:30,037 - video_to_video - INFO - step: 7
2024-08-20 13:50:22,838 - video_to_video - INFO - step: 8
2024-08-20 13:53:15,844 - video_to_video - INFO - step: 9
2024-08-20 13:56:08,657 - video_to_video - INFO - step: 10
2024-08-20 13:59:01,648 - video_to_video - INFO - step: 11
2024-08-20 14:01:54,541 - video_to_video - INFO - step: 12
2024-08-20 14:04:47,488 - video_to_video - INFO - step: 13
2024-08-20 14:10:13,637 - video_to_video - INFO - sampling, finished.
```
Running on a single A100 GPU, enhancing each 6-second CogVideoX generated video with default settings will consume 60GB
of VRAM and take 40-50 minutes.

View File

@ -0,0 +1,92 @@
# VEnhancer で CogVideoX によって生成されたビデオを強化する
このチュートリアルでは、VEnhancer ツールを使用して、CogVideoX で生成されたビデオを強化し、より高いフレームレートと高い解像度を実現する方法を説明します。
## モデルの紹介
VEnhancer は、空間超解像、時間超解像フレーム補間、およびビデオのリファインメントを統一されたフレームワークで実現します。空間または時間の超解像のために、さまざまなアップサンプリング係数1x〜8xに柔軟に対応できます。さらに、多様なビデオアーティファクトを処理するために、リファインメント強度を変更する柔軟な制御を提供します。
VEnhancer は ControlNet の設計に従い、事前訓練されたビデオ拡散モデルのマルチフレームエンコーダーとミドルブロックのアーキテクチャとウェイトをコピーして、トレーニング可能な条件ネットワークを構築します。このビデオ ControlNet は、低解像度のキーフレームとノイズを含む完全なフレームを入力として受け取ります。さらに、タイムステップ t とプロンプトに加えて、提案されたビデオ対応条件により、ノイズ増幅レベル σ およびダウンスケーリングファクター s が追加のネットワーク条件として使用されます。
## ハードウェア要件
+ オペレーティングシステム: Linux (xformers 依存関係が必要)
+ ハードウェア: 単一カードあたり少なくとも 60GB の VRAM を持つ NVIDIA GPU。H100、A100 などのマシンを推奨します。
## クイックスタート
1. 公式の指示に従ってリポジトリをクローンし、依存関係をインストールします。
```shell
git clone https://github.com/Vchitect/VEnhancer.git
cd VEnhancer
## Torch などの依存関係は CogVideoX の依存関係を使用できます。新しい環境を作成する必要がある場合は、以下のコマンドを使用してください。
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
## 必須の依存関係をインストールします。
pip install -r requirements.txt
```
2. コードを実行します。
```shell
python enhance_a_video.py --up_scale 4 --target_fps 24 --noise_aug 250 --solver_mode 'fast' --steps 15 --input_path inputs/000000.mp4 --prompt 'Wide-angle aerial shot at dawn, soft morning light casting long shadows, an elderly man walking his dog through a quiet, foggy park, trees and benches in the background, peaceful and serene atmosphere' --save_dir 'results/'
```
次の設定を行います:
- `input_path` 是输入视频的路径
- `prompt` 是视频内容的描述。此工具使用的提示词应更短不超过77个字。您可能需要简化用于生成CogVideoX视频的提示词。
- `target_fps` 是视频的目标帧率。通常16 fps已经很流畅默认值为24 fps。
- `up_scale` 推荐设置为2、3或4。目标分辨率限制在2k左右及以下。
- `noise_aug` 的值取决于输入视频的质量。质量较低的视频需要更高的噪声级别这对应于更强的优化。250~300适用于非常低质量的视频。对于高质量视频设置为≤200。
- `steps` 如果想减少步数请先将solver_mode改为“normal”然后减少步数。“fast”模式的步数是固定的15步
代码在执行过程中会自动从Hugging Face下载所需的模型。
コードの実行中に、必要なモデルは Hugging Face から自動的にダウンロードされます。
```shell
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
2024-08-20 13:25:17,553 - video_to_video - INFO - checkpoint_path: ./ckpts/venhancer_paper.pt
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/open_clip/factory.py:88: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(checkpoint_path, map_location=map_location)
2024-08-20 13:25:37,486 - video_to_video - INFO - Build encoder with FrozenOpenCLIPEmbedder
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:35: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
load_dict = torch.load(cfg.model_path, map_location='cpu')
2024-08-20 13:25:55,391 - video_to_video - INFO - Load model path ./ckpts/venhancer_paper.pt, with local status <All keys matched successfully>
2024-08-20 13:25:55,392 - video_to_video - INFO - Build diffusion with GaussianDiffusion
2024-08-20 13:26:16,092 - video_to_video - INFO - input video path: inputs/000000.mp4
2024-08-20 13:26:16,093 - video_to_video - INFO - text: Wide-angle aerial shot at dawn,soft morning light casting long shadows,an elderly man walking his dog through a quiet,foggy park,trees and benches in the background,peaceful and serene atmosphere
2024-08-20 13:26:16,156 - video_to_video - INFO - input frames length: 49
2024-08-20 13:26:16,156 - video_to_video - INFO - input fps: 8.0
2024-08-20 13:26:16,156 - video_to_video - INFO - target_fps: 24.0
2024-08-20 13:26:16,311 - video_to_video - INFO - input resolution: (480, 720)
2024-08-20 13:26:16,312 - video_to_video - INFO - target resolution: (1320, 1982)
2024-08-20 13:26:16,312 - video_to_video - INFO - noise augmentation: 250
2024-08-20 13:26:16,312 - video_to_video - INFO - scale s is set to: 8
2024-08-20 13:26:16,399 - video_to_video - INFO - video_data shape: torch.Size([145, 3, 1320, 1982])
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:108: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with amp.autocast(enabled=True):
2024-08-20 13:27:19,605 - video_to_video - INFO - step: 0
2024-08-20 13:30:12,020 - video_to_video - INFO - step: 1
2024-08-20 13:33:04,956 - video_to_video - INFO - step: 2
2024-08-20 13:35:58,691 - video_to_video - INFO - step: 3
2024-08-20 13:38:51,254 - video_to_video - INFO - step: 4
2024-08-20 13:41:44,150 - video_to_video - INFO - step: 5
2024-08-20 13:44:37,017 - video_to_video - INFO - step: 6
2024-08-20 13:47:30,037 - video_to_video - INFO - step: 7
2024-08-20 13:50:22,838 - video_to_video - INFO - step: 8
2024-08-20 13:53:15,844 - video_to_video - INFO - step: 9
2024-08-20 13:56:08,657 - video_to_video - INFO - step: 10
2024-08-20 13:59:01,648 - video_to_video - INFO - step: 11
2024-08-20 14:01:54,541 - video_to_video - INFO - step: 12
2024-08-20 14:04:47,488 - video_to_video - INFO - step: 13
2024-08-20 14:10:13,637 - video_to_video - INFO - sampling, finished.
```
A100 GPU を単一で使用している場合、CogVideoX によって生成された 6 秒間のビデオを強化するには、デフォルト設定で 60GB の VRAM を消費し、40〜50 分かかります。

View File

@ -0,0 +1,101 @@
# 使用 VEnhancer 对 CogVdieoX 生成视频进行增强
本教程将要使用 VEnhancer 工具 对 CogVdieoX 生成视频进行增强, 包括更高的帧率和更高的分辨率
## 模型介绍
VEnhancer 在一个统一的框架中实现了空间超分辨率、时间超分辨率帧插值和视频优化。它可以灵活地适应不同的上采样因子例如1x~
8x用于空间或时间超分辨率。此外它提供了灵活的控制以修改优化强度从而处理多样化的视频伪影。
VEnhancer 遵循 ControlNet 的设计,复制了预训练的视频扩散模型的多帧编码器和中间块的架构和权重,构建了一个可训练的条件网络。这个视频
ControlNet 接受低分辨率关键帧和包含噪声的完整帧作为输入。此外,除了时间步 t 和提示词外,我们提出的视频感知条件还将噪声增强的噪声级别
σ 和降尺度因子 s 作为附加的网络条件输入。
## 硬件需求
+ 操作系统: Linux (需要依赖xformers)
+ 硬件: NVIDIA GPU 并至少保证单卡显存超过60G推荐使用 H100A100等机器。
## 快速上手
1. 按照官方指引克隆仓库并安装依赖
```shell
git clone https://github.com/Vchitect/VEnhancer.git
cd VEnhancer
## torch等依赖可以使用CogVideoX的依赖如果你需要创建一个新的环境可以使用以下命令
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
## 安装必须的依赖
pip install -r requirements.txt
```
2. 运行代码
```shell
python enhance_a_video.py \
--up_scale 4 --target_fps 24 --noise_aug 250 \
--solver_mode 'fast' --steps 15 \
--input_path inputs/000000.mp4 \
--prompt 'Wide-angle aerial shot at dawn,soft morning light casting long shadows,an elderly man walking his dog through a quiet,foggy park,trees and benches in the background,peaceful and serene atmosphere' \
--save_dir 'results/'
```
其中:
- `input_path` 是输入视频的路径
- `prompt` 是视频内容的描述。此工具使用的提示词应更短不超过77个字。您可能需要简化用于生成CogVideoX视频的提示词。
- `target_fps` 是视频的目标帧率。通常16 fps已经很流畅默认值为24 fps。
- `up_scale` 推荐设置为2、3或4。目标分辨率限制在2k左右及以下。
- `noise_aug` 的值取决于输入视频的质量。质量较低的视频需要更高的噪声级别这对应于更强的优化。250~300适用于非常低质量的视频。对于高质量视频设置为≤200。
- `steps` 如果想减少步数请先将solver_mode改为“normal”然后减少步数。“fast”模式的步数是固定的15步
代码在执行过程中会自动从Hugging Face下载所需的模型。
代码运行过程中会自动从Huggingface拉取需要的模型
运行日志通常如下:
```shell
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
2024-08-20 13:25:17,553 - video_to_video - INFO - checkpoint_path: ./ckpts/venhancer_paper.pt
/share/home/zyx/.conda/envs/cogvideox/lib/python3.10/site-packages/open_clip/factory.py:88: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(checkpoint_path, map_location=map_location)
2024-08-20 13:25:37,486 - video_to_video - INFO - Build encoder with FrozenOpenCLIPEmbedder
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:35: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
load_dict = torch.load(cfg.model_path, map_location='cpu')
2024-08-20 13:25:55,391 - video_to_video - INFO - Load model path ./ckpts/venhancer_paper.pt, with local status <All keys matched successfully>
2024-08-20 13:25:55,392 - video_to_video - INFO - Build diffusion with GaussianDiffusion
2024-08-20 13:26:16,092 - video_to_video - INFO - input video path: inputs/000000.mp4
2024-08-20 13:26:16,093 - video_to_video - INFO - text: Wide-angle aerial shot at dawn,soft morning light casting long shadows,an elderly man walking his dog through a quiet,foggy park,trees and benches in the background,peaceful and serene atmosphere
2024-08-20 13:26:16,156 - video_to_video - INFO - input frames length: 49
2024-08-20 13:26:16,156 - video_to_video - INFO - input fps: 8.0
2024-08-20 13:26:16,156 - video_to_video - INFO - target_fps: 24.0
2024-08-20 13:26:16,311 - video_to_video - INFO - input resolution: (480, 720)
2024-08-20 13:26:16,312 - video_to_video - INFO - target resolution: (1320, 1982)
2024-08-20 13:26:16,312 - video_to_video - INFO - noise augmentation: 250
2024-08-20 13:26:16,312 - video_to_video - INFO - scale s is set to: 8
2024-08-20 13:26:16,399 - video_to_video - INFO - video_data shape: torch.Size([145, 3, 1320, 1982])
/share/home/zyx/Code/VEnhancer/video_to_video/video_to_video_model.py:108: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with amp.autocast(enabled=True):
2024-08-20 13:27:19,605 - video_to_video - INFO - step: 0
2024-08-20 13:30:12,020 - video_to_video - INFO - step: 1
2024-08-20 13:33:04,956 - video_to_video - INFO - step: 2
2024-08-20 13:35:58,691 - video_to_video - INFO - step: 3
2024-08-20 13:38:51,254 - video_to_video - INFO - step: 4
2024-08-20 13:41:44,150 - video_to_video - INFO - step: 5
2024-08-20 13:44:37,017 - video_to_video - INFO - step: 6
2024-08-20 13:47:30,037 - video_to_video - INFO - step: 7
2024-08-20 13:50:22,838 - video_to_video - INFO - step: 8
2024-08-20 13:53:15,844 - video_to_video - INFO - step: 9
2024-08-20 13:56:08,657 - video_to_video - INFO - step: 10
2024-08-20 13:59:01,648 - video_to_video - INFO - step: 11
2024-08-20 14:01:54,541 - video_to_video - INFO - step: 12
2024-08-20 14:04:47,488 - video_to_video - INFO - step: 13
2024-08-20 14:10:13,637 - video_to_video - INFO - sampling, finished.
```
使用A100单卡运行对于每个CogVideoX生产的6秒视频按照默认配置会消耗60G显存并用时40-50分钟。