Merge pull request #297 from THUDM/CogVideoX_dev

release cogvlm-llama3-caption
2025-06-22 17:09:16 +08:00 · 2024-09-19 00:41:46 +08:00 · 2024-09-19 00:41:46 +08:00 · 6a2efb844b
commit 6a2efb844b
parent f6b1ff42a7 ee9c0b38ab
5 changed files with 215 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -26,6 +26,9 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac
  This model allows inputting an image as a background combined with prompts to generate videos, providing greater
  controllability. With this release, the CogVideoX series now supports three tasks: text-to-video, video extension, and
  image-to-video generation. Feel free to try it out [online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
+- 🔥🔥 **News**: ```2024/9/19```: The caption model used in the CogVideoX training process to convert video data into text
+  descriptions, [cogvlm2-llama3-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), is now open-source. Feel
+  free to download and use it.
 - 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool! You can now use local open-source
  models + FLUX + CogVideoX to automatically generate high-quality videos. Feel free
  to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
@@ -319,7 +322,8 @@ Here provide three projects that can be run directly on free Colab T4 instances:
  CogVideoX-5B Quantized Text-to-Video Inference Colab code, which takes about 30 minutes per run.
 + [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B Image-to-Video Colab code.
-
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing):
+  CogVideoX-5B Video-to-Video Colab code.

 ### Inference

--- a/README_ja.md
+++ b/README_ja.md
@@ -24,6 +24,9 @@

 - 🔥🔥 **ニュース**: ```2024/9/19```: CogVideoXシリーズの画像生成ビデオモデル **CogVideoX-5B-I2V**
  をオープンソース化しました。このモデルでは、背景として画像を入力し、プロンプトと組み合わせてビデオを生成でき、より強力なコントロール性を提供します。これで、CogVideoXシリーズは、テキスト生成ビデオ、ビデオ拡張、画像生成ビデオの3つのタスクをサポートしています。ぜひ [オンラインでお試しください](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space)。
+- 🔥🔥 **ニュース**: ```2024/9/19```：CogVideoX
+  のトレーニングプロセスで、ビデオデータをテキストに変換するためのキャプションモデル [cogvlm2-llama3-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption)
+  がオープンソース化されました。ぜひダウンロードしてご利用ください。
 - 🔥 **ニュース**: ```2024/9/16```: 自動動画生成ツールを追加しました！オープンソースのローカルモデル + FLUX + CogVideoX
  を使用して、高品質な動画を自動生成できます。ぜひ[お試しください](tools/llm_flux_cogvideox/llm_flux_cogvideox.py)。
 - 🔥 **ニュース**: ```2024/9/15```: CogVideoXのLoRAファインチューニングの重みがエクスポートされ、`diffusers`
@@ -286,6 +289,8 @@ pipe.vae.enable_tiling()
  CogVideoX-5B テキストからビデオへの量子化推論用Colabコード。1回の実行に約30分かかります。
 + [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B 画像からビデオへの生成用Colabコード。
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing):
+  CogVideoX-5B ビデオからビデオへの生成用Colabコード。

 ### Inference

--- a/README_zh.md
+++ b/README_zh.md
@@ -26,6 +26,9 @@
 - 🔥🔥 **News**: ```2024/9/19```: 我们开源 CogVideoX 系列图生视频模型 **CogVideoX-5B-I2V**
  。该模型可以将一张图像作为背景输入，结合提示词一起生成视频，具有更强的可控性。
  至此，CogVideoX系列模型已经支持文本生成视频，视频续写，图片生成视频三种任务。欢迎前往在线[体验](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space)。
+- 🔥🔥 **News**: ```2024/9/19```: CogVideoX 训练过程中用于将视频数据转换为文本描述的 Caption
+  模型 [cogvlm2-llama3-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption)
+  已经开源。欢迎前往下载并使用。
 - 🔥 **News**: ```2024/9/16```: 我们添加自动化生成视频工具，你可以使用本地开源模型 + FLUX + CogVideoX
  实现自动生成优质视频，欢迎[体验](tools/llm_flux_cogvideox/llm_flux_cogvideox.py)
 - 🔥 **News**: ```2024/9/15```: CogVideoX LoRA微调权重导出并在`diffusers`库中测试通过，请查看[教程](sat/README_zh.md)。
@@ -276,6 +279,8 @@ pipe.vae.enable_tiling()
  CogVideoX-5B 文字生成视频量化推理 Colab 代码，运行一次大约需要30分钟。
 + [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B 图片生成视频 Colab 代码。
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing):
+  CogVideoX-5B 视频生成视频 Colab 代码。

 ### inference

--- a/inference/gradio_composite_demo/app.py
+++ b/inference/gradio_composite_demo/app.py
@@ -40,10 +40,10 @@ device = "cuda" if torch.cuda.is_available() else "cpu"
 hf_hub_download(repo_id="ai-forever/Real-ESRGAN", filename="RealESRGAN_x4.pth", local_dir="model_real_esran")
 snapshot_download(repo_id="AlexWortega/RIFE", local_dir="model_rife")

-pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to(device)
+pipe = CogVideoXPipeline.from_pretrained("/share/official_pretrains/hf_home/CogVideoX-5b", torch_dtype=torch.bfloat16).to(device)
 pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
 pipe_video = CogVideoXVideoToVideoPipeline.from_pretrained(
-    "THUDM/CogVideoX-5b",
+    "/share/official_pretrains/hf_home/CogVideoX-5b",
    transformer=pipe.transformer,
    vae=pipe.vae,
    scheduler=pipe.scheduler,
@@ -53,9 +53,9 @@ pipe_video = CogVideoXVideoToVideoPipeline.from_pretrained(
 ).to(device)

 pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
-    "THUDM/CogVideoX-5b-I2V",
+    "/share/official_pretrains/hf_home/CogVideoX-5b-I2V",
    transformer=CogVideoXTransformer3DModel.from_pretrained(
-        "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
+        "/share/official_pretrains/hf_home/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
    ),
    vae=pipe.vae,
    scheduler=pipe.scheduler,
@@ -322,11 +322,11 @@ with gr.Blocks() as demo:
        with gr.Column():
            with gr.Accordion("I2V: Image Input (cannot be used simultaneously with video input)", open=False):
                image_input = gr.Image(label="Input Image (will be cropped to 720 * 480)")
-                examples_component_images = gr.Examples(examples_images, inputs=[examples_images], cache_examples=False)
+                examples_component_images = gr.Examples(examples_images, inputs=[image_input], cache_examples=False)
            with gr.Accordion("V2V: Video Input (cannot be used simultaneously with image input)", open=False):
                video_input = gr.Video(label="Input Video (will be cropped to 49 frames, 6 seconds at 8fps)")
                strength = gr.Slider(0.1, 1.0, value=0.8, step=0.01, label="Strength")
-                examples_component_videos = gr.Examples(examples_videos, inputs=[examples_videos], cache_examples=False)
+                examples_component_videos = gr.Examples(examples_videos, inputs=[video_input], cache_examples=False)
            prompt = gr.Textbox(label="Prompt (Less than 200 Words)", placeholder="Enter your prompt here", lines=5)

            with gr.Row():
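
Note on the `app.py` hunks above: they replace the Hugging Face Hub IDs (`THUDM/CogVideoX-5b`, `THUDM/CogVideoX-5b-I2V`) with absolute paths under `/share/official_pretrains/hf_home`, which presumably exist only on the author's machine. One possible way to keep both setups working is to resolve the model location at startup and fall back to the Hub ID. This is a minimal sketch, not part of this PR; the helper name `resolve_model` and the environment variable `COGVIDEOX_MODEL_ROOT` are assumptions made for illustration.

```python
import os


def resolve_model(repo_id: str, env_var: str = "COGVIDEOX_MODEL_ROOT") -> str:
    """Return a local snapshot path if one exists, otherwise the Hub repo ID.

    Hypothetical helper: `env_var` names an assumed environment variable that
    points at a directory holding one sub-directory per model.
    """
    root = os.environ.get(env_var)
    if root:
        # "THUDM/CogVideoX-5b" -> "<root>/CogVideoX-5b"
        local = os.path.join(root, repo_id.split("/")[-1])
        if os.path.isdir(local):
            return local
    # No local copy configured or found: let from_pretrained download from the Hub.
    return repo_id
```

With such a helper, a call like `CogVideoXPipeline.from_pretrained(resolve_model("THUDM/CogVideoX-5b"), torch_dtype=torch.bfloat16)` would behave as before when the variable is unset and would pick up the shared directory when it is set.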
--- a/tools/llm_flux_cogvideox/gradio_page.py
+++ b/tools/llm_flux_cogvideox/gradio_page.py
@@ -0,0 +1,194 @@
+import os
+import gradio as gr
+import gc
+import random
+import torch
+import numpy as np
+from PIL import Image
+import transformers
+from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler, DiffusionPipeline
+from diffusers.utils import export_to_video
+from transformers import AutoTokenizer
+from datetime import datetime, timedelta
+import threading
+import time
+import moviepy.editor as mp
+
+torch.set_float32_matmul_precision("high")
+
+# Set default values
+caption_generator_model_id = "/share/home/zyx/Models/Meta-Llama-3.1-8B-Instruct"
+image_generator_model_id = "/share/home/zyx/Models/FLUX.1-dev"
+video_generator_model_id = "/share/official_pretrains/hf_home/CogVideoX-5b-I2V"
+seed = 1337
+
+os.makedirs("./output", exist_ok=True)
+os.makedirs("./gradio_tmp", exist_ok=True)
+
+tokenizer = AutoTokenizer.from_pretrained(caption_generator_model_id, trust_remote_code=True)
+caption_generator = transformers.pipeline(
+    "text-generation",
+    model=caption_generator_model_id,
+    device_map="balanced",
+    model_kwargs={
+        "local_files_only": True,
+        "torch_dtype": torch.bfloat16,
+    },
+    trust_remote_code=True,
+    tokenizer=tokenizer
+)
+
+image_generator = DiffusionPipeline.from_pretrained(
+    image_generator_model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="balanced"
+)
+# image_generator.to("cuda")
+
+video_generator = CogVideoXImageToVideoPipeline.from_pretrained(
+    video_generator_model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="balanced"
+)
+
+video_generator.vae.enable_slicing()
+video_generator.vae.enable_tiling()
+
+video_generator.scheduler = CogVideoXDPMScheduler.from_config(
+    video_generator.scheduler.config, timestep_spacing="trailing"
+)
+
+# Define prompts
+SYSTEM_PROMPT = """
+You are part of a team of people that create videos using generative models. You use a video-generation model that can generate a video about anything you describe.
+
+For example, if you respond with "A beautiful morning in the woods with the sun peaking through the trees", the video generation model will create a video of exactly as described. Your task is to summarize the descriptions of videos provided by users and create detailed prompts to feed into the generative model.
+
+There are a few rules to follow:
+- You will only ever output a single video description per request.
+- If the user mentions to summarize the prompt in [X] words, make sure not to exceed the limit.
+
+Your responses should just be the video generation prompt. Here are examples:
+- "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
+- "A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart of the city, holding a can of spray paint, spray-painting a colorful bird on a mottled wall."
+""".strip()
+
+USER_PROMPT = """
+Could you generate a prompt for a video generation model? Please limit the prompt to [{0}] words.
+""".strip()
+
+
+def generate_caption(prompt):
+    num_words = random.choice([25, 50, 75, 100])
+    user_prompt = USER_PROMPT.format(num_words)
+
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": prompt + "\n" + user_prompt},
+    ]
+
+    response = caption_generator(
+        messages,
+        max_new_tokens=226,
+        return_full_text=False
+    )
+    caption = response[0]["generated_text"]
+    if caption.startswith("\"") and caption.endswith("\""):
+        caption = caption[1:-1]
+    return caption
+
+
+def generate_image(caption, progress=gr.Progress(track_tqdm=True)):
+    image = image_generator(
+        prompt=caption,
+        height=480,
+        width=720,
+        num_inference_steps=30,
+        guidance_scale=3.5,
+    ).images[0]
+    return image, image  # One for output One for State
+
+
+def generate_video(
+        caption,
+        image,
+        progress=gr.Progress(track_tqdm=True)
+):
+    generator = torch.Generator().manual_seed(seed)
+    video_frames = video_generator(
+        image=image,
+        prompt=caption,
+        height=480,
+        width=720,
+        num_frames=49,
+        num_inference_steps=50,
+        guidance_scale=6,
+        use_dynamic_cfg=True,
+        generator=generator,
+    ).frames[0]
+    video_path = save_video(video_frames)
+    gif_path = convert_to_gif(video_path)
+    return video_path, gif_path
+
+
+def save_video(tensor):
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    video_path = f"./output/{timestamp}.mp4"
+    os.makedirs(os.path.dirname(video_path), exist_ok=True)
+    export_to_video(tensor, video_path, fps=8)
+    return video_path
+
+
+def convert_to_gif(video_path):
+    clip = mp.VideoFileClip(video_path)
+    clip = clip.set_fps(8)
+    clip = clip.resize(height=240)
+    gif_path = video_path.replace(".mp4", ".gif")
+    clip.write_gif(gif_path, fps=8)
+    return gif_path
+
+
+def delete_old_files():
+    while True:
+        now = datetime.now()
+        cutoff = now - timedelta(minutes=10)
+        directories = ["./output", "./gradio_tmp"]
+
+        for directory in directories:
+            for filename in os.listdir(directory):
+                file_path = os.path.join(directory, filename)
+                if os.path.isfile(file_path):
+                    file_mtime = datetime.fromtimestamp(os.path.getmtime(file_path))
+                    if file_mtime < cutoff:
+                        os.remove(file_path)
+        time.sleep(600)
+
+
+threading.Thread(target=delete_old_files, daemon=True).start()
+
+with gr.Blocks() as demo:
+    gr.Markdown("""
+           <div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
+               LLM + FLUX + CogVideoX-I2V Space 🤗
+            </div>
+    """)
+    with gr.Row():
+        with gr.Column():
+            prompt = gr.Textbox(label="Prompt", placeholder="Enter your prompt here", lines=5)
+            generate_caption_button = gr.Button("Generate Caption")
+            caption = gr.Textbox(label="Caption", placeholder="Caption will appear here", lines=5)
+            generate_image_button = gr.Button("Generate Image")
+            image_output = gr.Image(label="Generated Image")
+            state_image = gr.State()
+            generate_caption_button.click(fn=generate_caption, inputs=prompt, outputs=caption)
+            generate_image_button.click(fn=generate_image, inputs=caption, outputs=[image_output, state_image])
+        with gr.Column():
+            video_output = gr.Video(label="Generated Video", width=720, height=480)
+            download_video_button = gr.File(label="📥 Download Video", visible=False)
+            download_gif_button = gr.File(label="📥 Download GIF", visible=False)
+            generate_video_button = gr.Button("Generate Video from Image")
+            generate_video_button.click(fn=generate_video, inputs=[caption, state_image],
+                                        outputs=[video_output, download_gif_button])
+
+if __name__ == "__main__":
+    demo.launch()
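
Note on `delete_old_files` in `gradio_page.py`: the function combines the staleness check with an endless `while True` / `time.sleep(600)` loop, which makes the cleanup rule hard to exercise in isolation. A single sweep could be pulled out into its own function and called from the loop. The sketch below is illustrative only; the name `sweep_old_files` and its parameters are assumptions, and it keeps the same rule as the diff (remove regular files whose modification time is more than 10 minutes old).

```python
import os
import time


def sweep_old_files(directories, max_age_seconds=600, now=None):
    """Delete regular files older than `max_age_seconds`; return removed paths.

    Hypothetical refactoring of one pass of delete_old_files. `now` is an
    optional Unix timestamp so the cutoff can be controlled by the caller.
    """
    now = time.time() if now is None else now
    removed = []
    for directory in directories:
        if not os.path.isdir(directory):
            continue  # skip directories that have not been created yet
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            # Same rule as the diff: file mtime earlier than (now - max age).
            if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                removed.append(path)
    return removed
```

The background thread would then reduce to a loop that calls `sweep_old_files(["./output", "./gradio_tmp"])` and sleeps between passes.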