diff --git a/README.md b/README.md
index 2adecf06..db8d449a 100644
--- a/README.md
+++ b/README.md
@@ -1,160 +1,163 @@
-
- -

GPT-SoVITS-WebUI

-A Powerful Few-shot Voice Conversion and Text-to-Speech WebUI.

- -[![madewithlove](https://img.shields.io/badge/made_with-%E2%9D%A4-red?style=for-the-badge&labelColor=orange -)](https://github.com/RVC-Boss/GPT-SoVITS) - -
- -[![Licence](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/RVC-Boss/GPT-SoVITS/blob/main/LICENSE) -[![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-yellow.svg?style=for-the-badge)](https://huggingface.co/lj1995/GPT-SoVITS/tree/main) - -[**English**](./README.md) | [**中文简体**](./docs/cn/README.md) - -
- ------- - - - -> Check out our [demo video](https://www.bilibili.com/video/BV12g4y1m7Uw) here! - -https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb - -## Features: -1. **Zero-shot TTS:** Input a 5-second vocal sample and experience instant text-to-speech conversion. - -2. **Few-shot TTS:** Fine-tune the model with just 1 minute of training data for improved voice similarity and realism. - -3. **Cross-lingual Support:** Inference in languages different from the training dataset, currently supporting English, Japanese, and Chinese. - -4. **WebUI Tools:** Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models. - -## Environment Preparation - -If you are a Windows user (tested with win>=10) you can install directly via the prezip. Just download the [prezip](https://huggingface.co/lj1995/GPT-SoVITS-windows-package/resolve/main/GPT-SoVITS-beta.7z?download=true), unzip it and double-click go-webui.bat to start GPT-SoVITS-WebUI. - -### Python and PyTorch Version - -Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11. 
- -### Quick Install with Conda - -```bash -conda create -n GPTSoVits python=3.9 -conda activate GPTSoVits -bash install.sh -``` -### Install Manually -#### Pip Packages - -```bash -pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm cn2an pypinyin pyopenjtalk g2p_en chardet -``` - -#### Additional Requirements - -If you need Chinese ASR (supported by FunASR), install: - -```bash -pip install modelscope torchaudio sentencepiece funasr -``` - -#### FFmpeg - -##### Conda Users -```bash -conda install ffmpeg -``` - -##### Ubuntu/Debian Users - -```bash -sudo apt install ffmpeg -sudo apt install libsox-dev -conda install -c conda-forge 'ffmpeg<7' -``` - -##### MacOS Users - -```bash -brew install ffmpeg -``` - -##### Windows Users - -Download and place [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe) and [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe) in the GPT-SoVITS root. - -### Pretrained Models - - -Download pretrained models from [GPT-SoVITS Models](https://huggingface.co/lj1995/GPT-SoVITS) and place them in `GPT_SoVITS\pretrained_models`. - -For Chinese ASR (additionally), download models from [Damo ASR Model](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/files), [Damo VAD Model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/files), and [Damo Punc Model](https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files) and place them in `tools/damo_asr/models`. - -For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal, additionally), download models from [UVR5 Weights](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights) and place them in `tools/uvr5/uvr5_weights`. 
- - -## Dataset Format - -The TTS annotation .list file format: - -``` -vocal_path|speaker_name|language|text -``` - -Language dictionary: - -- 'zh': Chinese -- 'ja': Japanese -- 'en': English - -Example: - -``` -D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin. -``` -## Todo List - -- [ ] **High Priority:** - - [ ] Localization in Japanese and English. - - [ ] User guide. - - [ ] Japanese and English dataset fine tune training. - -- [ ] **Features:** - - [ ] Zero-shot voice conversion (5s) / few-shot voice conversion (1min). - - [ ] TTS speaking speed control. - - [ ] Enhanced TTS emotion control. - - [ ] Experiment with changing SoVITS token inputs to probability distribution of vocabs. - - [ ] Improve English and Japanese text frontend. - - [ ] Develop tiny and larger-sized TTS models. - - [ ] Colab scripts. - - [ ] Expand training dataset (2k -> 10k). - - [ ] better sovits base model (enhanced audio quality) - - [ ] model mix - -## Credits - -Special thanks to the following projects and contributors: - -- [ar-vits](https://github.com/innnky/ar-vits) -- [SoundStorm](https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR) -- [vits](https://github.com/jaywalnut310/vits) -- [TransferTTS](https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556) -- [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain) -- [contentvec](https://github.com/auspicious3000/contentvec/) -- [hifi-gan](https://github.com/jik876/hifi-gan) -- [Chinese-Roberta-WWM-Ext-Large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large) -- [fish-speech](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41) -- [ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui) -- [audio-slicer](https://github.com/openvpi/audio-slicer) -- [SubFix](https://github.com/cronrpc/SubFix) -- [FFmpeg](https://github.com/FFmpeg/FFmpeg) -- [gradio](https://github.com/gradio-app/gradio) - -## Thanks to all 
contributors for their efforts - - - +
+ +

GPT-SoVITS-WebUI

+A Powerful Few-shot Voice Conversion and Text-to-Speech WebUI.

+ +[![madewithlove](https://img.shields.io/badge/made_with-%E2%9D%A4-red?style=for-the-badge&labelColor=orange +)](https://github.com/RVC-Boss/GPT-SoVITS) + +
+ +[![Licence](https://img.shields.io/badge/LICENSE-MIT-green.svg?style=for-the-badge)](https://github.com/RVC-Boss/GPT-SoVITS/blob/main/LICENSE) +[![Huggingface](https://img.shields.io/badge/🤗%20-Spaces-yellow.svg?style=for-the-badge)](https://huggingface.co/lj1995/GPT-SoVITS/tree/main) + +[**English**](./README.md) | [**中文简体**](./docs/cn/README.md) + +
+ +------ + + + +> Check out our [demo video](https://www.bilibili.com/video/BV12g4y1m7Uw) here! + +https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb + +## Features: +1. **Zero-shot TTS:** Input a 5-second vocal sample and experience instant text-to-speech conversion. + +2. **Few-shot TTS:** Fine-tune the model with just 1 minute of training data for improved voice similarity and realism. + +3. **Cross-lingual Support:** Inference in languages different from the training dataset, currently supporting English, Japanese, and Chinese. + +4. **WebUI Tools:** Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models. + +## Environment Preparation + +If you are a Windows user (tested with win>=10) you can install directly via the prezip. Just download the [prezip](https://huggingface.co/lj1995/GPT-SoVITS-windows-package/resolve/main/GPT-SoVITS-beta.7z?download=true), unzip it and double-click go-webui.bat to start GPT-SoVITS-WebUI. 
+ +### Tested Environments + +- Python 3.9, PyTorch 2.0.1, CUDA 11 +- Python 3.10.13, PyTorch 2.1.2, CUDA 12.3 (Windows) + +_NOTE: numba==0.56.4 requires Python<3.11_ + +### Quick Install with Conda + +```bash +conda create -n GPTSoVits python=3.9 +conda activate GPTSoVits +bash install.sh +``` +### Install Manually +#### Pip Packages + +```bash +pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm cn2an pypinyin pyopenjtalk g2p_en chardet +``` + +#### Additional Requirements + +If you need Chinese ASR (supported by FunASR), install: + +```bash +pip install modelscope torchaudio sentencepiece funasr +``` + +#### FFmpeg + +##### Conda Users +```bash +conda install ffmpeg +``` + +##### Ubuntu/Debian Users + +```bash +sudo apt install ffmpeg +sudo apt install libsox-dev +conda install -c conda-forge 'ffmpeg<7' +``` + +##### MacOS Users + +```bash +brew install ffmpeg +``` + +##### Windows Users + +Download and place [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe) and [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe) in the GPT-SoVITS root. + +### Pretrained Models + + +Download pretrained models from [GPT-SoVITS Models](https://huggingface.co/lj1995/GPT-SoVITS) and place them in `GPT_SoVITS/pretrained_models`. + +For Chinese ASR (additionally), download models from [Damo ASR Model](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/files), [Damo VAD Model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/files), and [Damo Punc Model](https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files) and place them in `tools/damo_asr/models`.
+ +For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal, additionally), download models from [UVR5 Weights](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights) and place them in `tools/uvr5/uvr5_weights`. + + +## Dataset Format + +The TTS annotation .list file format: + +``` +vocal_path|speaker_name|language|text +``` + +Language dictionary: + +- 'zh': Chinese +- 'ja': Japanese +- 'en': English + +Example: + +``` +D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin. +``` +## Todo List + +- [ ] **High Priority:** + - [ ] Localization in Japanese and English. + - [ ] User guide. + - [ ] Japanese and English dataset fine tune training. + +- [ ] **Features:** + - [ ] Zero-shot voice conversion (5s) / few-shot voice conversion (1min). + - [ ] TTS speaking speed control. + - [ ] Enhanced TTS emotion control. + - [ ] Experiment with changing SoVITS token inputs to probability distribution of vocabs. + - [ ] Improve English and Japanese text frontend. + - [ ] Develop tiny and larger-sized TTS models. + - [ ] Colab scripts. + - [ ] Expand training dataset (2k -> 10k). 
+ - [ ] better sovits base model (enhanced audio quality) + - [ ] model mix + +## Credits + +Special thanks to the following projects and contributors: + +- [ar-vits](https://github.com/innnky/ar-vits) +- [SoundStorm](https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR) +- [vits](https://github.com/jaywalnut310/vits) +- [TransferTTS](https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556) +- [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain) +- [contentvec](https://github.com/auspicious3000/contentvec/) +- [hifi-gan](https://github.com/jik876/hifi-gan) +- [Chinese-Roberta-WWM-Ext-Large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large) +- [fish-speech](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41) +- [ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui) +- [audio-slicer](https://github.com/openvpi/audio-slicer) +- [SubFix](https://github.com/cronrpc/SubFix) +- [FFmpeg](https://github.com/FFmpeg/FFmpeg) +- [gradio](https://github.com/gradio-app/gradio) + +## Thanks to all contributors for their efforts + + +
diff --git a/requirements.txt b/requirements.txt
index 2e640334..2b89e6a9 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -20,3 +20,6 @@ transformers
 chardet
 PyYAML
 psutil
+soundfile
+fastapi
+uvicorn
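
The README in the diff above describes the TTS annotation `.list` format as `vocal_path|speaker_name|language|text`, with language codes `zh`, `ja`, and `en`. As a rough illustration of how such a row could be read, here is a minimal Python sketch. The names `Utterance`, `SUPPORTED_LANGUAGES`, and `parse_list_line` are hypothetical and not part of the repository, and the choice to treat only the first three `|` characters as separators is an assumption, not documented behavior.

```python
from typing import NamedTuple

# Language codes listed in the README's "Language dictionary".
SUPPORTED_LANGUAGES = {"zh", "ja", "en"}


class Utterance(NamedTuple):
    """One row of a .list annotation file (hypothetical helper type)."""
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_list_line(line: str) -> Utterance:
    """Parse one `vocal_path|speaker_name|language|text` row.

    Assumption: only the first three '|' characters are separators,
    so a '|' inside the text field stays part of the text.
    """
    fields = line.rstrip("\r\n").split("|", 3)
    if len(fields) != 4:
        raise ValueError(
            f"expected 4 '|'-separated fields, got {len(fields)}: {line!r}"
        )
    row = Utterance(*fields)
    if row.language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {row.language!r}")
    return row


# The example row from the README.
example = r"D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin."
row = parse_list_line(example)
print(row.language, "->", row.text)
```

Rejecting unknown language codes early is one possible design choice; a real loader might instead skip or log such rows.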