GPT-SoVITS - Voice Conversion and Text-to-Speech WebUI

Demo Video and Features

Check out our demo video in Chinese: Bilibili Demo

https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb

Features:

  1. Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.

  2. Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.

  3. Cross-lingual Support: Run inference in a language different from that of the training dataset. English, Japanese, and Chinese are currently supported.

  4. WebUI Tools: Integrated tools include vocal/accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, which help beginners create training datasets and GPT/SoVITS models.

Todo List

  1. High Priority:

    • Localization in Japanese and English.
    • User guide.
  2. Features:

    • Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
    • TTS speaking speed control.
    • Enhanced TTS emotion control.
    • Experiment with changing SoVITS token inputs to a probability distribution over the vocabulary.
    • Improve English and Japanese text frontend.
    • Develop tiny and larger-sized TTS models.
    • Colab scripts.
    • Expand training dataset (2k -> 10k).

Requirements (How to Install)

Visual Studio Enterprise 2017 (for Windows)

Before installing this project, check whether you have Visual Studio Enterprise 2017 installed, as version 2022 causes issues with pyopenjtalk. If you don't have it, you can subscribe to Visual Studio Dev Essentials (free) by clicking here.

Then install Visual Studio Enterprise 2017 by clicking here: choose the top entry, labeled Visual Studio Enterprise 2017, and click Download. Finally, follow the instructions to install it on your Windows computer.

Add cmake and hostx64 into Path in System Environment Variables

Add the following two directories to Path in System Environment Variables (type "environment" in the Windows search bar, click 'Edit the system environment variables', then click Environment Variables):

{Your path for VS 2017}\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin
{Your path for VS 2017}\VC\Tools\MSVC\14.16.27023\bin\Hostx64\x64

Special thanks to YulKe on CSDN for providing this tutorial.

Python and PyTorch Version

Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.

Pip Packages

pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en
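To confirm that an environment matches the pinned versions in the command above, a small check along these lines may help. This is only a sketch: the helper names `parse_requirement` and `check_pins` are hypothetical and not part of GPT-SoVITS, and it relies solely on `importlib.metadata` from the standard library.

```python
from importlib import metadata

# Pinned packages copied from the pip command above.
PINNED = ["librosa==0.9.2", "numba==0.56.4", "gradio==3.14.0", "tqdm==4.59.0"]


def parse_requirement(spec):
    """Split 'name==version' into (name, version); version is None if unpinned."""
    name, sep, version = spec.partition("==")
    return name.strip(), (version.strip() if sep else None)


def check_pins(specs, get_version=metadata.version):
    """Return {name: (wanted, installed)} for every package that does not match.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for spec in specs:
        name, wanted = parse_requirement(spec)
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or (wanted is not None and installed != wanted):
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(PINNED).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

Running the script prints one line per package that is missing or installed at a different version, and prints nothing when every pin matches.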

Additional Requirements

If you need Chinese ASR (supported by FunASR), install:

pip install modelscope torchaudio sentencepiece funasr

FFmpeg

Ubuntu/Debian Users

sudo apt install ffmpeg

MacOS Users

brew install ffmpeg

Windows Users

Download and place ffmpeg.exe and ffprobe.exe in the GPT-SoVITS root.
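Whichever platform you use, it can be worth verifying that both binaries are actually discoverable before launching the WebUI. The following is a minimal sketch, assuming the tools live either in the project root (the Windows case above) or on PATH (the Ubuntu/Debian and MacOS cases). `find_tool` and `missing_tools` are hypothetical helper names, not part of the project.

```python
import shutil
from pathlib import Path


def find_tool(name, root="."):
    """Return the path to `name`, looking in `root` first and then on PATH."""
    for candidate in (Path(root) / name, Path(root) / f"{name}.exe"):
        if candidate.is_file():
            return str(candidate)
    return shutil.which(name)  # None when the tool is not on PATH


def missing_tools(root=".", tools=("ffmpeg", "ffprobe")):
    """List the tools that could not be found in `root` or on PATH."""
    return [tool for tool in tools if find_tool(tool, root) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("All tools found." if not missing else f"Missing: {', '.join(missing)}")
```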

Pretrained Models

Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS/pretrained_models.

For Chinese ASR, download models from Damo ASR Models and place them in tools/damo_asr/models.

For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal), download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights.
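A quick way to see which of these downloads are still outstanding is to check whether each target directory exists and contains anything. The sketch below assumes it is run from the GPT-SoVITS root. It does not validate the model files themselves, and `empty_model_dirs` is a hypothetical helper name.

```python
from pathlib import Path

# Directories named in the text above, relative to the GPT-SoVITS root.
MODEL_DIRS = {
    "GPT-SoVITS pretrained models": "GPT_SoVITS/pretrained_models",
    "Chinese ASR models": "tools/damo_asr/models",
    "UVR5 weights": "tools/uvr5/uvr5_weights",
}


def empty_model_dirs(root="."):
    """Return the labels whose directory is missing or contains no entries."""
    empty = []
    for label, rel in MODEL_DIRS.items():
        folder = Path(root) / rel
        if not folder.is_dir() or not any(folder.iterdir()):
            empty.append(label)
    return empty


if __name__ == "__main__":
    for label in empty_model_dirs():
        print(f"Nothing found for: {label}")
```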

Dataset Format

The TTS annotation .list file format:

vocal_path|speaker_name|language|text

Example:

D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.

Language dictionary:

  • 'zh': Chinese
  • 'ja': Japanese
  • 'en': English
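To illustrate the format, here is a minimal sketch of a parser for such a .list file. It is not the project's own loader: `Entry`, `parse_line`, and `parse_list_file` are hypothetical names, and both the UTF-8 encoding and the choice to keep any extra `|` characters as part of the text are assumptions.

```python
from typing import NamedTuple

# Language codes from the dictionary above.
LANGUAGES = {"zh": "Chinese", "ja": "Japanese", "en": "English"}


class Entry(NamedTuple):
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_line(line):
    """Parse one 'vocal_path|speaker_name|language|text' line into an Entry."""
    # Split at most three times so a '|' inside the text is kept (an assumption).
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}")
    entry = Entry(*parts)
    if entry.language not in LANGUAGES:
        raise ValueError(f"unknown language code {entry.language!r}")
    return entry


def parse_list_file(path, encoding="utf-8"):
    """Parse every non-blank line of a .list file (encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return [parse_line(line) for line in handle if line.strip()]
```

Validating the annotation file this way before training surfaces malformed lines and unsupported language codes early.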

Credits

Special thanks to the following projects and contributors:
