diff --git a/README.md b/README.md
index 0c66b77..a159daf 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,7 @@ Experience the CogVideoX-5B model online at
 ## Video Caption via CogVLM2-Video
-<div align="center">
-🤗 Hugging Face&nbsp;&nbsp; | &nbsp;&nbsp;🤖 ModelScope&nbsp;&nbsp; | &nbsp;&nbsp;📑 Blog&nbsp;&nbsp; | 💬 Online Demo&nbsp;&nbsp;
-</div>
+[Code](https://github.com/THUDM/CogVLM2/tree/main/video_demo) | 🤗 [Hugging Face](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat) | 🤖 [ModelScope](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat) | 📑 [Blog](https://cogvlm2-video.github.io/) | [💬 Online Demo](http://cogvlm2-online.cogviewai.cn:7868/)
 
 CogVLM2-Video is a versatile video understanding model equipped with timestamp-based question answering capabilities. Users can input prompts such as `Please describe this video in detail.` to the model to obtain a detailed video caption:
@@ -15,4 +41,27 @@ Users can input prompts such as `Please describe this video in detail.` to the m
-Users can use the provided [code](https://github.com/THUDM/CogVLM2/tree/main/video_demo) to load the model or configure a RESTful API to generate video captions.
\ No newline at end of file
+Users can use the provided [code](https://github.com/THUDM/CogVLM2/tree/main/video_demo) to load the model or configure a RESTful API to generate video captions.
+
+## Citation
+
+🌟 If you find our work helpful, please leave us a star and cite our paper.
+
+CogVLM2-Caption:
+```
+@article{yang2024cogvideox,
+  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+  journal={arXiv preprint arXiv:2408.06072},
+  year={2024}
+}
+```
+CogVLM2-Video:
+```
+@article{hong2024cogvlm2,
+  title={CogVLM2: Visual Language Models for Image and Video Understanding},
+  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
+  journal={arXiv preprint arXiv:2408.16500},
+  year={2024}
+}
+```
\ No newline at end of file
diff --git a/tools/caption/README_ja.md b/tools/caption/README_ja.md
index 4fe096f..25c6cce 100644
--- a/tools/caption/README_ja.md
+++ b/tools/caption/README_ja.md
@@ -2,11 +2,37 @@
 
 通常、ほとんどのビデオデータには対応する説明文が付いていないため、ビデオデータをテキストの説明に変換して、テキストからビデオへのモデルに必要なトレーニングデータを提供する必要があります。
 
+## 更新とニュース
+- 🔥🔥 **ニュース**: ```2024/9/19```: CogVideoX のトレーニングプロセスで、ビデオデータをテキストに変換するためのキャプションモデル [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption) がオープンソース化されました。ぜひダウンロードしてご利用ください。
+
+## CogVLM2-Captionによるビデオキャプション
+
+🤗 [Hugging Face](https://huggingface.co/THUDM/cogvlm2-llama3-caption) | 🤖 [ModelScope](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-caption/)
+
+CogVLM2-Captionは、CogVideoXモデルのトレーニングデータを生成するために使用されるビデオキャプションモデルです。
+
+### インストール
+```shell
+pip install -r requirements.txt
+```
+
+### 使用方法
+```shell
+python video_caption.py
+```
+
+例:
+<div align="center">
+    <img src="assests/CogVLM2-Caption-example.png">
+</div>
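The quick start above ends at a single `python video_caption.py` run. Since CogVLM2-Caption exists to turn raw clips into training text for CogVideoX, a natural next step is to loop the script's `predict()` helper over a folder of videos. The sketch below is illustrative only: it assumes `video_caption.py` (added later in this diff) is importable from the working directory and that the clips live in a hypothetical `videos/` folder; importing the module loads the model once up front.

```python
# Hedged sketch: batch-caption a folder of clips with the predict() helper
# from video_caption.py (added later in this diff). The videos/ folder and
# output layout are assumptions, not part of the project.
from pathlib import Path

from video_caption import predict  # importing this loads the model once

PROMPT = "Please describe this video in detail."

for clip in sorted(Path("videos").glob("*.mp4")):
    caption = predict(PROMPT, clip.read_bytes(), temperature=0.1)
    # Write the caption next to the clip so video/text pairs stay together.
    clip.with_suffix(".txt").write_text(caption, encoding="utf-8")
    print(f"{clip.name}: {caption[:80]}")
```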
 ## CogVLM2-Video を使用したビデオキャプション
-<div align="center">
-🤗 Hugging Face&nbsp;&nbsp; | &nbsp;&nbsp;🤖 ModelScope&nbsp;&nbsp; | &nbsp;&nbsp;📑 ブログ&nbsp;&nbsp; | 💬 オンラインデモ&nbsp;&nbsp;
-</div>
+[Code](https://github.com/THUDM/CogVLM2/tree/main/video_demo) | 🤗 [Hugging Face](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat) | 🤖 [ModelScope](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat) | 📑 [Blog](https://cogvlm2-video.github.io/) | [💬 Online Demo](http://cogvlm2-online.cogviewai.cn:7868/)
+
 CogVLM2-Video は、タイムスタンプベースの質問応答機能を備えた多機能なビデオ理解モデルです。ユーザーは `このビデオを詳細に説明してください。` などのプロンプトをモデルに入力して、詳細なビデオキャプションを取得できます:
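As a concrete illustration of the prompt interface just described: prompts are plain strings passed to the model together with the sampled frames. The snippet below reuses `predict()` from the `video_caption.py` script added later in this diff; that script loads the CogVLM2-Caption checkpoint, so for genuine timestamp-based question answering you would point its `MODEL_PATH` at `THUDM/cogvlm2-video-llama3-chat` instead, an assumption based on the linked video_demo code rather than something verified here.

```python
# Illustrative only: predict() comes from video_caption.py later in this diff.
# Swapping MODEL_PATH in that script to "THUDM/cogvlm2-video-llama3-chat" is
# assumed (not verified here) to give the timestamp-QA chat model.
from video_caption import predict

video_bytes = open("test.mp4", "rb").read()

# Detailed caption, the prompt suggested in this README.
print(predict("Please describe this video in detail.", video_bytes, 0.1))

# A timestamp-style question of the kind CogVLM2-Video is tuned for.
print(predict("What happens around the 5 second mark?", video_bytes, 0.1))
```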
@@ -14,3 +40,26 @@ CogVLM2-Video は、タイムスタンプベースの質問応答機能を備え
 ユーザーは提供された[コード](https://github.com/THUDM/CogVLM2/tree/main/video_demo)を使用してモデルをロードするか、RESTful API を構成してビデオキャプションを生成できます。
+
+## Citation
+
+🌟 If you find our work helpful, please leave us a star and cite our paper.
+
+CogVLM2-Caption:
+```
+@article{yang2024cogvideox,
+  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+  journal={arXiv preprint arXiv:2408.06072},
+  year={2024}
+}
+```
+CogVLM2-Video:
+```
+@article{hong2024cogvlm2,
+  title={CogVLM2: Visual Language Models for Image and Video Understanding},
+  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
+  journal={arXiv preprint arXiv:2408.16500},
+  year={2024}
+}
+```
diff --git a/tools/caption/README_zh.md b/tools/caption/README_zh.md
index dc905e4..f6da7a6 100644
--- a/tools/caption/README_zh.md
+++ b/tools/caption/README_zh.md
@@ -2,15 +2,66 @@
 
 通常,大多数视频数据不带有相应的描述性文本,因此需要将视频数据转换为文本描述,以提供必要的训练数据用于文本到视频模型。
 
+## 项目更新
+- 🔥🔥 **News**: ```2024/9/19```: CogVideoX 训练过程中用于将视频数据转换为文本描述的 Caption 模型 [CogVLM2-Caption](https://huggingface.co/THUDM/cogvlm2-llama3-caption) 已经开源。欢迎前往下载并使用。
+
+## 通过 CogVLM2-Caption 模型生成视频Caption
+
+🤗 [Hugging Face](https://huggingface.co/THUDM/cogvlm2-llama3-caption) | 🤖 [ModelScope](https://modelscope.cn/models/ZhipuAI/cogvlm2-llama3-caption/)
+
+CogVLM2-Caption是用于生成CogVideoX模型训练数据的视频caption模型。
+
+### 安装依赖
+```shell
+pip install -r requirements.txt
+```
+
+### 运行caption模型
+
+```shell
+python video_caption.py
+```
+
+示例:
+<div align="center">
+    <img src="assests/CogVLM2-Caption-example.png">
+</div>
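All three READMEs also mention configuring a RESTful API as an alternative to running the script directly, and `flask`/`gunicorn` are already listed in `requirements.txt`. Below is a minimal, hypothetical sketch of such a service built on the `predict()` helper from `video_caption.py` (added later in this diff); the `/caption` route and payload fields are illustrative, not the project's actual API.

```python
# Hypothetical REST wrapper around the captioner; the route and payload shape
# are assumptions for illustration, not part of the project.
from flask import Flask, jsonify, request

from video_caption import predict  # loads the model once at import time

app = Flask(__name__)


@app.route("/caption", methods=["POST"])
def caption():
    video_bytes = request.files["video"].read()  # multipart file upload
    prompt = request.form.get("prompt", "Please describe this video in detail.")
    temperature = float(request.form.get("temperature", 0.1))
    return jsonify({"caption": predict(prompt, video_bytes, temperature)})


if __name__ == "__main__":
    # requirements.txt also ships gunicorn/gevent for serving beyond this dev server.
    app.run(host="0.0.0.0", port=8000)
```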
 ## 通过 CogVLM2-Video 模型生成视频Caption
 
-🤗 [Hugging Face](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat) | 🤖 [ModelScope](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat) | 📑 [Blog](https://cogvlm2-video.github.io/) | [💬 Online Demo](http://cogvlm2-online.cogviewai.cn:7868/)
+[Code](https://github.com/THUDM/CogVLM2/tree/main/video_demo) | 🤗 [Hugging Face](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat) | 🤖 [ModelScope](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-chat) | 📑 [Blog](https://cogvlm2-video.github.io/) | [💬 Online Demo](http://cogvlm2-online.cogviewai.cn:7868/)
 
-CogVLM2-Video 是一个多功能的视频理解模型,具备基于时间戳的问题回答能力。用户可以输入诸如 `请详细描述这个视频` 的提示语给模型,以获得详细的视频Caption:
+CogVLM2-Video 是一个多功能的视频理解模型,具备基于时间戳的问题回答能力。用户可以输入诸如 `Describe this video in detail.` 的提示语给模型,以获得详细的视频Caption:
-用户可以使用提供的[代码](https://github.com/THUDM/CogVLM2/tree/main/video_demo)加载模型或配置 RESTful API 来生成视频Caption。
\ No newline at end of file
+用户可以使用提供的[代码](https://github.com/THUDM/CogVLM2/tree/main/video_demo)加载模型或配置 RESTful API 来生成视频Caption。
+
+
+## Citation
+
+🌟 If you find our work helpful, please leave us a star and cite our paper.
+
+CogVLM2-Caption:
+```
+@article{yang2024cogvideox,
+  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+  journal={arXiv preprint arXiv:2408.06072},
+  year={2024}
+}
+```
+CogVLM2-Video:
+```
+@article{hong2024cogvlm2,
+  title={CogVLM2: Visual Language Models for Image and Video Understanding},
+  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
+  journal={arXiv preprint arXiv:2408.16500},
+  year={2024}
+}
+```
\ No newline at end of file
diff --git a/tools/caption/assests/CogVLM2-Caption-example.png b/tools/caption/assests/CogVLM2-Caption-example.png
new file mode 100644
index 0000000..8612bfe
Binary files /dev/null and b/tools/caption/assests/CogVLM2-Caption-example.png differ
diff --git a/tools/caption/requirements.txt b/tools/caption/requirements.txt
new file mode 100644
index 0000000..ce2e17e
--- /dev/null
+++ b/tools/caption/requirements.txt
@@ -0,0 +1,23 @@
+decord>=0.6.0
+# Per https://download.pytorch.org/whl/torch/, these pinned wheels require Python 3.8-3.11
+torch==2.1.0
+torchvision==0.16.0
+pytorchvideo==0.1.5
+xformers
+transformers==4.42.4
+#git+https://github.com/huggingface/transformers.git
+huggingface-hub>=0.23.0
+pillow
+chainlit>=1.0
+pydantic>=2.7.1
+timm>=0.9.16
+openai>=1.30.1
+loguru>=0.7.2
+einops
+sse-starlette>=2.1.0
+flask
+gunicorn
+gevent
+requests
+gradio
\ No newline at end of file
diff --git a/tools/caption/video_caption.py b/tools/caption/video_caption.py
new file mode 100644
index 0000000..1110fca
--- /dev/null
+++ b/tools/caption/video_caption.py
@@ -0,0 +1,108 @@
+import argparse
+import io
+
+import numpy as np
+import torch
+from decord import cpu, VideoReader, bridge
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
+
+DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
+TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
+
+parser = argparse.ArgumentParser(description="CogVLM2-Video CLI Demo")
+parser.add_argument('--quant', type=int, choices=[4, 8], help='Enable 4-bit or 8-bit precision loading', default=0)
+# Note: parse_args([]) ignores the real command line; --quant is accepted but not applied when loading the model below.
+args = parser.parse_args([])
+
+
+def load_video(video_data, strategy='chat'):
+    """Decode an in-memory MP4 stream and sample up to 24 frames as a (C, T, H, W) tensor."""
+    bridge.set_bridge('torch')
+    mp4_stream = video_data
+    num_frames = 24
+    decord_vr = VideoReader(io.BytesIO(mp4_stream), ctx=cpu(0))
+
+    frame_id_list = None
+    total_frames = len(decord_vr)
+    if strategy == 'base':
+        # Sample num_frames evenly from the first 60 seconds of the clip.
+        clip_end_sec = 60
+        clip_start_sec = 0
+        start_frame = int(clip_start_sec * decord_vr.get_avg_fps())
+        end_frame = min(total_frames, int(clip_end_sec * decord_vr.get_avg_fps())) if clip_end_sec is not None else total_frames
+        frame_id_list = np.linspace(start_frame, end_frame - 1, num_frames, dtype=int)
+    elif strategy == 'chat':
+        # Sample roughly one frame per second of video, capped at num_frames.
+        timestamps = decord_vr.get_frame_timestamp(np.arange(total_frames))
+        timestamps = [i[0] for i in timestamps]
+        max_second = round(max(timestamps)) + 1
+        frame_id_list = []
+        for second in range(max_second):
+            closest_num = min(timestamps, key=lambda x: abs(x - second))
+            index = timestamps.index(closest_num)
+            frame_id_list.append(index)
+            if len(frame_id_list) >= num_frames:
+                break
+
+    video_data = decord_vr.get_batch(frame_id_list)
+    video_data = video_data.permute(3, 0, 1, 2)
+    return video_data
+
+
+tokenizer = AutoTokenizer.from_pretrained(
+    MODEL_PATH,
+    trust_remote_code=True,
+)
+
+model = AutoModelForCausalLM.from_pretrained(
+    MODEL_PATH,
+    torch_dtype=TORCH_TYPE,
+    trust_remote_code=True
+).eval().to(DEVICE)
+
+
+def predict(prompt, video_data, temperature):
+    strategy = 'chat'
+
+    video = load_video(video_data, strategy=strategy)
+
+    history = []
+    query = prompt
+    inputs = model.build_conversation_input_ids(
+        tokenizer=tokenizer,
+        query=query,
+        images=[video],
+        history=history,
+        template_version=strategy
+    )
+    # Move tensors to DEVICE (instead of a hard-coded 'cuda') so the script also runs on CPU-only machines.
+    inputs = {
+        'input_ids': inputs['input_ids'].unsqueeze(0).to(DEVICE),
+        'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to(DEVICE),
+        'attention_mask': inputs['attention_mask'].unsqueeze(0).to(DEVICE),
+        'images': [[inputs['images'][0].to(DEVICE).to(TORCH_TYPE)]],
+    }
+    gen_kwargs = {
+        "max_new_tokens": 2048,
+        "pad_token_id": 128002,
+        # Greedy decoding: with do_sample=False, top_p and temperature have no effect.
+        "top_k": 1,
+        "do_sample": False,
+        "top_p": 0.1,
+        "temperature": temperature,
+    }
+    with torch.no_grad():
+        outputs = model.generate(**inputs, **gen_kwargs)
+        outputs = outputs[:, inputs['input_ids'].shape[1]:]
+        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    return response
+
+
+def test():
+    prompt = "Please describe this video in detail."
+    temperature = 0.1
+    video_data = open('test.mp4', 'rb').read()
+    response = predict(prompt, video_data, temperature)
+    print(response)
+
+
+if __name__ == '__main__':
+    test()
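For completeness, a client for a self-hosted captioning service such as the hypothetical `/caption` endpoint sketched earlier could look as follows; `requests` is already in `requirements.txt`, while the URL and field names are assumptions made for illustration.

```python
# Client-side sketch for a self-hosted caption endpoint; the URL and field
# names are assumptions matching the hypothetical Flask wrapper above.
import requests

with open("test.mp4", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/caption",
        files={"video": f},
        data={"prompt": "Please describe this video in detail."},
        timeout=600,  # captioning a long clip can take a while
    )
resp.raise_for_status()
print(resp.json()["caption"])
```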