GPT-SoVITS - Voice Conversion and Text-to-Speech WebUI
Demo Video and Features
Check out our demo video in Chinese: Bilibili Demo
https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb
Features:
- Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
- Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
- Cross-lingual Support: Inference in languages different from the training dataset, currently supporting English, Japanese, and Chinese.
- WebUI Tools: Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models.
Todo List
- High Priority:
  - Localization in Japanese and English.
  - User guide.
- Features:
  - Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
  - TTS speaking speed control.
  - Enhanced TTS emotion control.
  - Experiment with changing SoVITS token inputs to a probability distribution of vocabs.
  - Improve English and Japanese text frontend.
  - Develop tiny and larger-sized TTS models.
  - Colab scripts.
  - Expand training dataset (2k -> 10k).
Requirements (How to Install)
Visual Studio Enterprise 2017 (Windows)
Before installing this project, please check whether you have Visual Studio Enterprise 2017 installed, as version 2022 causes issues with pyopenjtalk. If you don't have it installed, you can subscribe to Visual Studio Dev Essentials (free) by clicking here.
Then, install Visual Studio Enterprise 2017 by clicking here: choose the top entry labeled Visual Studio Enterprise 2017 and click Download. Finally, follow the installer's instructions to set up Visual Studio Enterprise 2017 on your Windows computer.
Add cmake and hostx64 into Path in System Environment Variables
Please add the following two directories to Path in System Environment Variables (type "environment" in the Windows search bar, click 'Edit the system environment variables', then click 'Environment Variables').
{Your path for VS 2017}\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin
{Your path for VS 2017}\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64
Special thanks to YulKe on CSDN for providing this tutorial.
Python and PyTorch Version
Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.
Pip Packages
pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en
Additional Requirements
If you need Chinese ASR (supported by FunASR), install:
pip install modelscope torchaudio sentencepiece funasr
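Since the ASR packages are optional, a minimal sketch like the following can check at runtime whether they are importable before enabling the Chinese ASR tools. The package names are taken from the pip command above; the variable names are illustrative, not part of the project's API.

```python
import importlib.util

# Optional Chinese ASR (FunASR) dependencies, per the pip command above
asr_packages = ["modelscope", "torchaudio", "sentencepiece", "funasr"]

# find_spec returns None when a package is not installed
missing = [p for p in asr_packages if importlib.util.find_spec(p) is None]

if missing:
    print("Chinese ASR disabled; missing packages:", ", ".join(missing))
else:
    print("All Chinese ASR dependencies found")
```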
FFmpeg
Ubuntu/Debian Users
sudo apt install ffmpeg
macOS Users
brew install ffmpeg
Windows Users
Download and place ffmpeg.exe and ffprobe.exe in the GPT-SoVITS root.
Pretrained Models
Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS\pretrained_models.
For Chinese ASR, download models from Damo ASR Models and place them in tools/damo_asr/models.
For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal), download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights.
Dataset Format
The TTS annotation .list file format:
vocal_path|speaker_name|language|text
Example:
D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
Language dictionary:
- 'zh': Chinese
- 'ja': Japanese
- 'en': English
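The pipe-separated annotation format above can be parsed with a short sketch like this. The field names follow the format description; the helper function name is hypothetical, not part of the project.

```python
def parse_annotation(line):
    """Parse one TTS annotation line: vocal_path|speaker_name|language|text."""
    # Split on at most 3 pipes so that '|' characters inside the text survive
    vocal_path, speaker_name, language, text = line.rstrip("\n").split("|", 3)
    if language not in {"zh", "ja", "en"}:
        raise ValueError(f"unknown language code: {language!r}")
    return {
        "vocal_path": vocal_path,
        "speaker_name": speaker_name,
        "language": language,
        "text": text,
    }

entry = parse_annotation("D:\\GPT-SoVITS\\xxx/xxx.wav|xxx|en|I like playing Genshin.")
print(entry["language"], entry["text"])
```

Note the `split("|", 3)` cap: it keeps any literal `|` characters in the transcript text intact rather than splitting them into extra fields.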
Credits
Special thanks to the following projects and contributors: