14 Commits

Author SHA1 Message Date
lsh
551d3dc281 Fix s1_train DDP crash on Windows single-GPU (sm_120 / Blackwell)
On Windows with a single GPU running CUDA 12.8 + PyTorch 2.7+ on Blackwell
(sm_120) hardware, s1_train.py crashes with an access violation (exit code
3221225477) shortly after pytorch_lightning's Trainer initialization, before
the first batch runs.

Root cause: DDPStrategy with the gloo backend is forced on Windows even
when there's only one GPU. The gloo + sm_120 + CUDA 12.8 combination has a
known incompatibility (see PyTorch forum "[Solved] RTX 5090 sm_120 Training
Segfault - DDP Was the Cause") that produces a native crash inside the
Lightning training loop.

Two changes, scoped to Windows + CUDA only:

  * GPT_SoVITS/s1_train.py: on Windows, use Lightning's "auto" strategy,
    which picks `single_device` for one GPU and skips DDP entirely. Also
    pin devices=1 on Windows so multi-GPU users don't accidentally enable
    DDP. Non-Windows behaviour is unchanged (NCCL DDP, all available GPUs).
  * GPT_SoVITS/AR/data/bucket_sampler.py: when the distributed process
    group isn't initialized (i.e. running under single_device strategy),
    fall back to a single-replica configuration instead of crashing in
    dist.get_world_size(). Defensive change — behaviour is unchanged when
    DDP is properly initialized.
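
A minimal sketch of the two changes described above, not the code from the commit itself. The function names `choose_trainer_kwargs` and `resolve_replicas` are hypothetical, the non-Windows branch is simplified to the plain "ddp" strategy string, and `dist` stands in for `torch.distributed` so the sketch runs without PyTorch installed.

```python
def choose_trainer_kwargs(system_name, cuda_available):
    """Return illustrative Trainer keyword arguments for a platform.

    Hypothetical helper: the real s1_train.py may build these differently.
    """
    if system_name == "Windows" and cuda_available:
        # "auto" lets Lightning pick single_device for one GPU, so no
        # process group (and therefore no gloo backend) is created.
        return {"strategy": "auto", "devices": 1}
    # Everything else stays on the DDP path (simplified here);
    # devices=-1 asks Lightning to use all available devices.
    return {"strategy": "ddp", "devices": -1}


def resolve_replicas(dist):
    """Return (num_replicas, rank), falling back to (1, 0) without DDP.

    `dist` must expose is_available, is_initialized, get_world_size and
    get_rank, which torch.distributed provides.
    """
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size(), dist.get_rank()
    # No process group: act as a single replica instead of raising.
    return 1, 0
```

In a training script this would presumably be called as `choose_trainer_kwargs(platform.system(), torch.cuda.is_available())` and `resolve_replicas(torch.distributed)`. Checking `is_initialized()` first keeps the sampler usable under both the single-device and DDP strategies.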

Tested on:
  * Windows 11 + RTX 5090 (sm_120) + CUDA 12.8 + PyTorch 2.11+cu128:
    15-epoch s1 training completes cleanly, weights saved as expected.

Closes #2626.
2026-05-16 19:10:20 -07:00
XXXXRT666
53cac93589
Refactor: Format Code with Ruff and Update Deprecated G2PW Link (#2255)
* ruff check --fix

* ruff format --line-length 120 --target-version py39

* Change the link for G2PW Model

* update pytorch version and colab
2025-04-07 16:42:47 +08:00
RVC-Boss
e937b625e4
Support SoVITS v3 LoRA training; 8 GB of GPU memory is enough
2025-02-23 00:37:14 +08:00
RVC-Boss
fa42d26d0e
gpt_sovits_v3
2025-02-11 21:07:03 +08:00
huangxu1991
4f8e1660af
Add use_distributed_sampler=False in Trainer (#756)
If you define your own sampler, you must set use_distributed_sampler to False.
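
A minimal sketch of why the flag matters, under stated assumptions: `RankShardedSampler` is a hypothetical stand-in for a custom sampler that already splits indices across replicas, not the project's bucket sampler. As far as I understand, Lightning by default replaces or wraps a dataloader's sampler under distributed strategies, which would interfere with a sampler that does its own sharding.

```python
class RankShardedSampler:
    """Hypothetical sampler that already splits indices across replicas."""

    def __init__(self, dataset_size, num_replicas=1, rank=0):
        if not 0 <= rank < num_replicas:
            raise ValueError("rank must be in [0, num_replicas)")
        self.dataset_size = dataset_size
        self.num_replicas = num_replicas
        self.rank = rank

    def __iter__(self):
        # Each replica takes every num_replicas-th index, starting at its rank.
        return iter(range(self.rank, self.dataset_size, self.num_replicas))

    def __len__(self):
        return len(range(self.rank, self.dataset_size, self.num_replicas))


# Keyword argument for the Trainer so the sampler above is left untouched.
trainer_kwargs = {"use_distributed_sampler": False}
```

With two replicas, rank 0 sees the even indices and rank 1 the odd ones, so every sample is visited exactly once per epoch without help from the framework.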
2024-07-19 10:33:24 +08:00
RVC-Boss
a208698e77
Update s1_train.py 2024-06-29 22:54:05 +08:00
Lion
1963eb01cc support cpu training, use cpu training on mac 2024-03-13 22:24:32 +08:00
RVC-Boss
e97cc3346a
Model experiment names can now be set in Chinese.
fix https://github.com/RVC-Boss/GPT-SoVITS/issues/500
2024-02-17 16:45:31 +08:00
RVC-Boss
59f35adad8
Fix the GPT training hang and the "unmatched '}' in format string" error
2024-02-08 21:53:31 +08:00
RVC-Boss
f0cfe39708
Fix GPT model not being saved. 2024-01-28 19:34:03 +08:00
Wu Zichen
07a5339691 mps support 2024-01-24 19:37:47 +08:00
Wu Zichen
8069264e64 mps support 2024-01-24 17:30:49 +08:00
Blaise
0d92575115 Code refactor + remove unused imports 2024-01-16 17:10:27 +01:00
RVC-Boss
41ca6028d6
Add files via upload 2024-01-16 17:38:48 +08:00