Mirror of https://github.com/THUDM/CogVideo.git (synced 2025-04-06 03:57:56 +08:00)

Merge pull request #297 from THUDM/CogVideoX_dev: release cogvlm-llama3-caption

Commit 6a2efb844b
@@ -26,6 +26,9 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac
  This model allows inputting an image as a background combined with prompts to generate videos, providing greater
  controllability. With this release, the CogVideoX series now supports three tasks: text-to-video, video extension, and
  image-to-video generation. Feel free to try it out [online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥🔥 **News**: ```2024/9/19```: The caption model used in the CogVideoX training process to convert video data into text
  descriptions, [cogvlm2-llama3-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), is now open-source. Feel
  free to download and use it.
- 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool! You can now use local open-source
  models + FLUX + CogVideoX to automatically generate high-quality videos. Feel free
  to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
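As a quick orientation for the image-to-video model referenced above, a minimal `diffusers` sketch might look like the following. The pipeline class and parameter values mirror the demo code added later in this diff; the input file name is a placeholder, and the exact settings are illustrative rather than prescribed by this PR.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Load the public I2V checkpoint in bf16, as the demo code in this PR does.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("first_frame.png")  # placeholder path; ideally a 720x480 image
video = pipe(
    prompt="A street artist spray-paints a colorful bird on a mottled concrete wall.",
    image=image,
    num_frames=49,           # ~6 seconds at 8 fps
    num_inference_steps=50,
    guidance_scale=6,
    use_dynamic_cfg=True,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```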
@@ -319,7 +322,8 @@ Here provide three projects that can be run directly on free Colab T4 instances:
  CogVideoX-5B Quantized Text-to-Video Inference Colab code, which takes about 30 minutes per run.
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B Image-to-Video Colab code.
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing):
  CogVideoX-5B Video-to-Video Colab code.

### Inference
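The notebooks above target free Colab T4 GPUs, so the usual `diffusers` memory savers are relevant. A minimal sketch, reusing the same calls that appear in the demo code later in this diff (assuming `pipe` is a CogVideoX pipeline freshly created with `from_pretrained` and not yet moved to the GPU):

```python
# Trade speed for memory so CogVideoX-5B can fit on a 16 GB T4.
pipe.enable_sequential_cpu_offload()  # keep weights on CPU, stream layers to the GPU
pipe.vae.enable_slicing()             # decode the latent video in slices
pipe.vae.enable_tiling()              # tile VAE decoding to cap peak activation memory
```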
@@ -24,6 +24,9 @@
- 🔥🔥 **News**: ```2024/9/19```: We have open-sourced **CogVideoX-5B-I2V**, the image-to-video model of the CogVideoX
  series. It takes an image as a background input and, combined with a prompt, generates a video, offering stronger
  controllability. With this, the CogVideoX series supports three tasks: text-to-video, video extension, and
  image-to-video generation. Feel free to [try it online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥🔥 **News**: ```2024/9/19```: The caption model used in the CogVideoX training process to convert video data into
  text, [cogvlm2-llama3-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), has been open-sourced. Feel free
  to download and use it.
- 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool! Using local open-source models + FLUX
  + CogVideoX, you can automatically generate high-quality videos. Feel free
  to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
- 🔥 **News**: ```2024/9/15```: The CogVideoX LoRA fine-tuning weights have been exported and, in the `diffusers`
@@ -286,6 +289,8 @@ pipe.vae.enable_tiling()
  CogVideoX-5B quantized text-to-video inference Colab code; one run takes about 30 minutes.
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B image-to-video Colab code.
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing):
  CogVideoX-5B video-to-video Colab code.

### Inference
@@ -26,6 +26,9 @@
- 🔥🔥 **News**: ```2024/9/19```: We have open-sourced **CogVideoX-5B-I2V**, the image-to-video model of the CogVideoX
  series. It takes an image as a background input and, together with a prompt, generates a video, giving stronger
  controllability. With this, the CogVideoX series supports three tasks: text-to-video, video extension, and
  image-to-video generation. Feel free to [try it online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
- 🔥🔥 **News**: ```2024/9/19```: The caption model used during CogVideoX training to convert video data into text
  descriptions, [cogvlm2-llama3-caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption), has been open-sourced.
  Feel free to download and use it.
- 🔥 **News**: ```2024/9/16```: We have added an automated video generation tool: with local open-source models + FLUX
  + CogVideoX you can automatically generate high-quality videos. Feel free
  to [try it out](tools/llm_flux_cogvideox/llm_flux_cogvideox.py).
- 🔥 **News**: ```2024/9/15```: The CogVideoX LoRA fine-tuning weights have been exported and pass testing in the
  `diffusers` library; see the [tutorial](sat/README_zh.md).
@@ -276,6 +279,8 @@ pipe.vae.enable_tiling()
  CogVideoX-5B quantized text-to-video inference Colab code; one run takes about 30 minutes.
+ [CogVideoX-5B-I2V-Colab.ipynb](https://colab.research.google.com/drive/17CqYCqSwz39nZAX2YyonDxosVKUZGzcX?usp=sharing):
  CogVideoX-5B image-to-video Colab code.
+ [CogVideoX-5B-V2V-Colab.ipynb](https://colab.research.google.com/drive/1comfGAUJnChl5NwPuO8Ox5_6WCy4kbNN?usp=sharing):
  CogVideoX-5B video-to-video Colab code.

### Inference
@@ -40,10 +40,10 @@ device = "cuda" if torch.cuda.is_available() else "cpu"
hf_hub_download(repo_id="ai-forever/Real-ESRGAN", filename="RealESRGAN_x4.pth", local_dir="model_real_esran")
snapshot_download(repo_id="AlexWortega/RIFE", local_dir="model_rife")

-pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to(device)
+pipe = CogVideoXPipeline.from_pretrained("/share/official_pretrains/hf_home/CogVideoX-5b", torch_dtype=torch.bfloat16).to(device)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
pipe_video = CogVideoXVideoToVideoPipeline.from_pretrained(
-    "THUDM/CogVideoX-5b",
+    "/share/official_pretrains/hf_home/CogVideoX-5b",
    transformer=pipe.transformer,
    vae=pipe.vae,
    scheduler=pipe.scheduler,
@@ -53,9 +53,9 @@ pipe_video = CogVideoXVideoToVideoPipeline.from_pretrained(
).to(device)

pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
-    "THUDM/CogVideoX-5b-I2V",
+    "/share/official_pretrains/hf_home/CogVideoX-5b-I2V",
    transformer=CogVideoXTransformer3DModel.from_pretrained(
-        "THUDM/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
+        "/share/official_pretrains/hf_home/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
    ),
    vae=pipe.vae,
    scheduler=pipe.scheduler,
@@ -322,11 +322,11 @@ with gr.Blocks() as demo:
        with gr.Column():
            with gr.Accordion("I2V: Image Input (cannot be used simultaneously with video input)", open=False):
                image_input = gr.Image(label="Input Image (will be cropped to 720 * 480)")
-                examples_component_images = gr.Examples(examples_images, inputs=[examples_images], cache_examples=False)
+                examples_component_images = gr.Examples(examples_images, inputs=[image_input], cache_examples=False)
            with gr.Accordion("V2V: Video Input (cannot be used simultaneously with image input)", open=False):
                video_input = gr.Video(label="Input Video (will be cropped to 49 frames, 6 seconds at 8fps)")
                strength = gr.Slider(0.1, 1.0, value=0.8, step=0.01, label="Strength")
-                examples_component_videos = gr.Examples(examples_videos, inputs=[examples_videos], cache_examples=False)
+                examples_component_videos = gr.Examples(examples_videos, inputs=[video_input], cache_examples=False)
            prompt = gr.Textbox(label="Prompt (Less than 200 Words)", placeholder="Enter your prompt here", lines=5)

            with gr.Row():
tools/llm_flux_cogvideox/gradio_page.py (new file, 194 lines)
@@ -0,0 +1,194 @@
import os
import gradio as gr
import gc
import random
import torch
import numpy as np
from PIL import Image
import transformers
from diffusers import CogVideoXImageToVideoPipeline, CogVideoXDPMScheduler, DiffusionPipeline
from diffusers.utils import export_to_video
from transformers import AutoTokenizer
from datetime import datetime, timedelta
import threading
import time
import moviepy.editor as mp

torch.set_float32_matmul_precision("high")
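# Three-stage flow: (1) an instruction-tuned LLM expands the user's idea into a
# detailed caption, (2) FLUX renders that caption as a single image, and (3)
# CogVideoX-5B-I2V animates the image into a 49-frame clip. Each model is loaded
# with device_map="balanced" so accelerate can spread the weights across all
# visible GPUs; together the three models are generally too large for one card.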
# Set default values
caption_generator_model_id = "/share/home/zyx/Models/Meta-Llama-3.1-8B-Instruct"
image_generator_model_id = "/share/home/zyx/Models/FLUX.1-dev"
video_generator_model_id = "/share/official_pretrains/hf_home/CogVideoX-5b-I2V"
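# NOTE: the three paths above point at checkpoints on the authors' cluster. On other
# machines they would typically be the corresponding Hugging Face hub ids instead,
# e.g. "meta-llama/Meta-Llama-3.1-8B-Instruct", "black-forest-labs/FLUX.1-dev" and
# "THUDM/CogVideoX-5b-I2V" (a suggested substitution, not part of this commit).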
seed = 1337

os.makedirs("./output", exist_ok=True)
os.makedirs("./gradio_tmp", exist_ok=True)

tokenizer = AutoTokenizer.from_pretrained(caption_generator_model_id, trust_remote_code=True)
caption_generator = transformers.pipeline(
    "text-generation",
    model=caption_generator_model_id,
    device_map="balanced",
    model_kwargs={
        "local_files_only": True,
        "torch_dtype": torch.bfloat16,
    },
    trust_remote_code=True,
    tokenizer=tokenizer
)

image_generator = DiffusionPipeline.from_pretrained(
    image_generator_model_id,
    torch_dtype=torch.bfloat16,
    device_map="balanced"
)
# image_generator.to("cuda")

video_generator = CogVideoXImageToVideoPipeline.from_pretrained(
    video_generator_model_id,
    torch_dtype=torch.bfloat16,
    device_map="balanced"
)

video_generator.vae.enable_slicing()
video_generator.vae.enable_tiling()

video_generator.scheduler = CogVideoXDPMScheduler.from_config(
    video_generator.scheduler.config, timestep_spacing="trailing"
)

# Define prompts
SYSTEM_PROMPT = """
You are part of a team of people that create videos using generative models. You use a video-generation model that can generate a video about anything you describe.

For example, if you respond with "A beautiful morning in the woods with the sun peaking through the trees", the video generation model will create a video of exactly as described. Your task is to summarize the descriptions of videos provided by users and create detailed prompts to feed into the generative model.

There are a few rules to follow:
- You will only ever output a single video description per request.
- If the user mentions to summarize the prompt in [X] words, make sure not to exceed the limit.

Your responses should just be the video generation prompt. Here are examples:
- "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
- "A street artist, clad in a worn-out denim jacket and a colorful bandana, stands before a vast concrete wall in the heart of the city, holding a can of spray paint, spray-painting a colorful bird on a mottled wall."
""".strip()

USER_PROMPT = """
Could you generate a prompt for a video generation model? Please limit the prompt to [{0}] words.
""".strip()


def generate_caption(prompt):
    num_words = random.choice([25, 50, 75, 100])
    user_prompt = USER_PROMPT.format(num_words)

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt + "\n" + user_prompt},
    ]
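    # max_new_tokens=226 mirrors the ~226-token prompt budget of CogVideoX's T5 text
    # encoder, so the generated caption should fit the video model without truncation.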
    response = caption_generator(
        messages,
        max_new_tokens=226,
        return_full_text=False
    )
    caption = response[0]["generated_text"]
    if caption.startswith("\"") and caption.endswith("\""):
        caption = caption[1:-1]
    return caption


def generate_image(caption, progress=gr.Progress(track_tqdm=True)):
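    # FLUX renders a single 720x480 frame (CogVideoX-5B's native resolution) that
    # will seed the image-to-video stage.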
    image = image_generator(
        prompt=caption,
        height=480,
        width=720,
        num_inference_steps=30,
        guidance_scale=3.5,
    ).images[0]
    return image, image  # One for output One for State


def generate_video(
        caption,
        image,
        progress=gr.Progress(track_tqdm=True)
):
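    # CogVideoX-I2V animates the FLUX frame: 49 frames at 8 fps is roughly a
    # 6-second clip, matching the num_frames/fps used throughout the CogVideoX demos.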
    generator = torch.Generator().manual_seed(seed)
    video_frames = video_generator(
        image=image,
        prompt=caption,
        height=480,
        width=720,
        num_frames=49,
        num_inference_steps=50,
        guidance_scale=6,
        use_dynamic_cfg=True,
        generator=generator,
    ).frames[0]
    video_path = save_video(video_frames)
    gif_path = convert_to_gif(video_path)
    return video_path, gif_path


def save_video(tensor):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    video_path = f"./output/{timestamp}.mp4"
    os.makedirs(os.path.dirname(video_path), exist_ok=True)
    export_to_video(tensor, video_path, fps=8)
    return video_path


def convert_to_gif(video_path):
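    # Produce a lightweight GIF preview: cap the frame rate at 8 fps and downscale to
    # 240 px height. Note that `set_fps`/`resize` are the moviepy 1.x method names.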
    clip = mp.VideoFileClip(video_path)
    clip = clip.set_fps(8)
    clip = clip.resize(height=240)
    gif_path = video_path.replace(".mp4", ".gif")
    clip.write_gif(gif_path, fps=8)
    return gif_path


def delete_old_files():
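    # Background janitor: every 10 minutes, remove generated files older than 10 minutes
    # from the output and Gradio temp directories so long-running demos don't fill the disk.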
    while True:
        now = datetime.now()
        cutoff = now - timedelta(minutes=10)
        directories = ["./output", "./gradio_tmp"]

        for directory in directories:
            for filename in os.listdir(directory):
                file_path = os.path.join(directory, filename)
                if os.path.isfile(file_path):
                    file_mtime = datetime.fromtimestamp(os.path.getmtime(file_path))
                    if file_mtime < cutoff:
                        os.remove(file_path)
        time.sleep(600)


threading.Thread(target=delete_old_files, daemon=True).start()

with gr.Blocks() as demo:
    gr.Markdown("""
    <div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
        LLM + FLUX + CogVideoX-I2V Space 🤗
    </div>
    """)
    with gr.Row():
        with gr.Column():
            prompt = gr.Textbox(label="Prompt", placeholder="Enter your prompt here", lines=5)
            generate_caption_button = gr.Button("Generate Caption")
            caption = gr.Textbox(label="Caption", placeholder="Caption will appear here", lines=5)
            generate_image_button = gr.Button("Generate Image")
            image_output = gr.Image(label="Generated Image")
            state_image = gr.State()
            generate_caption_button.click(fn=generate_caption, inputs=prompt, outputs=caption)
            generate_image_button.click(fn=generate_image, inputs=caption, outputs=[image_output, state_image])
        with gr.Column():
            video_output = gr.Video(label="Generated Video", width=720, height=480)
            download_video_button = gr.File(label="📥 Download Video", visible=False)
            download_gif_button = gr.File(label="📥 Download GIF", visible=False)
            generate_video_button = gr.Button("Generate Video from Image")
            generate_video_button.click(fn=generate_video, inputs=[caption, state_image],
                                        outputs=[video_output, download_gif_button])

if __name__ == "__main__":
    demo.launch()