I am organizing and uploading the code. It will be public within a day.
demo video and features
https://www.bilibili.com/video/BV12g4y1m7Uw/
features:
1、Zero-shot TTS: input a 5-second vocal sample.
2、Few-shot TTS: fine-tune on a 1-minute training dataset. The few-shot TTS model exhibits significantly better similarity and realism in the speaker's voice than the zero-shot one.
3、Cross-lingual inference (synthesizing a language different from the language of the training dataset). English, Japanese, and Chinese are currently supported.
4、This WebUI integrates tools such as voice accompaniment separation, automatic segmentation of training sets, Chinese ASR, text labeling, etc., to help beginners quickly create their own training datasets and GPT/SoVITS models.
Todo list
1、zero-shot voice conversion (5s) / few-shot voice conversion (1min)
2、TTS speaking speed control
3、more TTS emotion control
4、experiment with changing the SoVITS token inputs to a probability distribution over the vocabulary
5、better English and Japanese text frontend
6、tiny version and larger-sized TTS models
7、colab scripts
Requirements (How to install)
python and pytorch version
Tested with Python 3.9 + PyTorch 2.0.1 + CUDA 11.
pip packages
pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en
additionally
If you need the Chinese ASR feature (provided by funasr), also run:
pip install modelscope torchaudio sentencepiece funasr
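Several packages in the pip command above are pinned to exact versions. A minimal sketch of how one might check an existing environment against those pins is shown below. The helper name `check_pins` and the report strings are hypothetical and not part of GPT-SoVITS; only the four pinned versions come from the command above.

```python
from importlib import metadata

# Pins copied from the pip command above; unpinned packages are not checked.
PINNED = {
    "librosa": "0.9.2",
    "numba": "0.56.4",
    "gradio": "3.14.0",
    "tqdm": "4.59.0",
}


def check_pins(pins=PINNED, get_version=metadata.version):
    """Return {package: status} comparing installed versions with the pins.

    get_version is injectable so the function can be exercised without
    the packages being installed (hypothetical convenience, not project API).
    """
    report = {}
    for package, wanted in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        if installed == wanted:
            report[package] = "ok"
        else:
            report[package] = f"installed {installed}, expected {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_pins().items():
        print(f"{package}: {status}")
```

Running the script prints one line per pinned package, which makes a version mismatch easy to spot before launching the WebUI.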
You need ffmpeg.
Ubuntu/Debian users
sudo apt install ffmpeg
macOS users
brew install ffmpeg
Windows users
Download the following files and put them in the GPT-SoVITS root:
- ffmpeg.exe
- ffprobe.exe
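To confirm that ffmpeg and ffprobe are reachable after following the steps for your platform, something like the following sketch could be used. The function `find_tool` is a hypothetical helper, not part of GPT-SoVITS. It assumes the tools are either on `PATH` (the usual result of apt or brew) or sit in the project root (the Windows instructions).

```python
import shutil
from pathlib import Path


def find_tool(name, project_root="."):
    """Return the location of an executable as a string, or None if not found."""
    # Case 1: the tool is on PATH (typical after `apt install` or `brew install`).
    found = shutil.which(name)
    if found:
        return found
    # Case 2: the binary was copied into the project root (Windows instructions).
    root = Path(project_root)
    for candidate in (root / name, root / f"{name}.exe"):
        if candidate.is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    for tool in ("ffmpeg", "ffprobe"):
        print(f"{tool}: {find_tool(tool) or 'NOT FOUND'}")
```

Run it from the GPT-SoVITS root so that the default `project_root="."` points at the right directory.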
You need to download some pretrained models
pretrained GPT-SoVITS models/SSL feature model/Chinese BERT model
Download the files from
https://huggingface.co/lj1995/GPT-SoVITS
and put them in
GPT_SoVITS/pretrained_models
Chinese ASR (optional)
Download the files from
https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/files
https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files
and put them in
tools/damo_asr/models
UVR5 (vocals/accompaniment separation and reverberation removal, optional)
Put the models you need from
https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights
to
tools/uvr5/uvr5_weights
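After copying the model files, a quick check that each target directory exists and is non-empty can catch a misplaced download. The sketch below is only an illustration: `check_model_dirs` and the status strings are hypothetical, and it does not verify which specific files each directory should contain, since that is not stated here.

```python
from pathlib import Path

# Target directories named in the instructions above, relative to the project root.
MODEL_DIRS = (
    "GPT_SoVITS/pretrained_models",
    "tools/damo_asr/models",      # only needed for Chinese ASR
    "tools/uvr5/uvr5_weights",    # only needed for UVR5
)


def check_model_dirs(project_root=".", model_dirs=MODEL_DIRS):
    """Return {relative_dir: 'ok' | 'empty' | 'missing'} for each directory."""
    report = {}
    for rel_dir in model_dirs:
        path = Path(project_root) / rel_dir
        if not path.is_dir():
            report[rel_dir] = "missing"
        elif not any(path.iterdir()):
            report[rel_dir] = "empty"
        else:
            report[rel_dir] = "ok"
    return report


if __name__ == "__main__":
    for rel_dir, status in check_model_dirs().items():
        print(f"{rel_dir}: {status}")
```

An "empty" or "missing" status for the two optional directories is fine if you do not plan to use Chinese ASR or UVR5.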
dataset format
The format of the TTS annotation .list file:
vocal path|speaker_name|language|text
e.g. D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
language dictionary:
'zh': Chinese
'ja': Japanese
'en': English
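The annotation format above can be read and written with a few lines of Python. The following is a minimal sketch, not the loader GPT-SoVITS itself uses: the names `Annotation`, `parse_list_line`, and `format_list_line` are hypothetical. It assumes that the path and speaker name never contain `|`, and it accepts only the three language codes listed above.

```python
from typing import NamedTuple

# Language codes from the dictionary above.
LANGUAGES = {"zh": "Chinese", "ja": "Japanese", "en": "English"}


class Annotation(NamedTuple):
    wav_path: str
    speaker: str
    language: str
    text: str


def parse_list_line(line):
    """Parse one 'vocal path|speaker_name|language|text' line."""
    # Split on the first three separators only, so a '|' inside the text survives.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    wav_path, speaker, language, text = (part.strip() for part in parts)
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code: {language!r}")
    return Annotation(wav_path, speaker, language, text)


def format_list_line(annotation):
    """Serialize an Annotation back into a .list line (without a newline)."""
    return "|".join(annotation)


if __name__ == "__main__":
    example = r"D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin."
    print(parse_list_line(example))
```

Validating every line of a .list file this way before training surfaces missing fields or mistyped language codes early.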
Credits
https://github.com/innnky/ar-vits
https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR
https://github.com/jaywalnut310/vits
https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556
https://github.com/TencentGameMate/chinese_speech_pretrain
https://github.com/auspicious3000/contentvec/
https://github.com/jik876/hifi-gan
https://huggingface.co/hfl/chinese-roberta-wwm-ext-large
https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41
https://github.com/Anjok07/ultimatevocalremovergui
https://github.com/openvpi/audio-slicer
https://github.com/cronrpc/SubFix