lsh 551d3dc281 Fix s1_train DDP crash on Windows single-GPU (sm_120 / Blackwell)
On Windows with a single GPU running CUDA 12.8 + PyTorch 2.7+ on Blackwell
(sm_120) hardware, s1_train.py crashes with an access violation (exit code
3221225477) shortly after pytorch_lightning's Trainer initialization, before
the first batch runs.
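
For reference, the exit code can be decoded with a couple of lines of Python. This is only an illustrative check, not part of the change; the constant name `STATUS_ACCESS_VIOLATION` follows the Windows NTSTATUS naming and is defined locally here.

```python
# Decode the process exit code reported by Windows.
exit_code = 3221225477

# NTSTATUS value for an access violation, defined locally for illustration.
STATUS_ACCESS_VIOLATION = 0xC0000005

exit_code_hex = hex(exit_code)
is_access_violation = exit_code == STATUS_ACCESS_VIOLATION
print(exit_code_hex, is_access_violation)
```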

Root cause: DDPStrategy with the gloo backend is forced on Windows even
when there's only one GPU. The gloo + sm_120 + CUDA 12.8 combination has a
known incompatibility (see PyTorch forum "[Solved] RTX 5090 sm_120 Training
Segfault - DDP Was the Cause") that produces a native crash inside the
Lightning training loop.

Two changes, scoped to Windows + CUDA only:

  * GPT_SoVITS/s1_train.py: on Windows, use Lightning's "auto" strategy,
    which picks `single_device` for one GPU and skips DDP entirely. Also
    pin devices=1 on Windows so multi-GPU users don't accidentally enable
    DDP. Non-Windows behaviour is unchanged (NCCL DDP, all available GPUs).
  * GPT_SoVITS/AR/data/bucket_sampler.py: when the distributed process
    group isn't initialized (i.e. running under single_device strategy),
    fall back to a single-replica configuration instead of crashing in
    dist.get_world_size(). This is a defensive change: behaviour is
    unchanged when DDP is properly initialized.
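
A minimal sketch of the two changes, assuming nothing about the actual code beyond what is described above. The helper names `pick_trainer_settings` and `resolve_replicas` are hypothetical and do not appear in the repository; the settings dict is a plain description of the choice, not literal Trainer arguments. `resolve_replicas` takes the distributed module as a parameter (in practice `torch.distributed`) so the fallback can be exercised without a process group.

```python
def pick_trainer_settings(system: str) -> dict:
    """Hypothetical helper: describe the strategy choice per platform."""
    if system == "Windows":
        # "auto" resolves to single_device for one GPU, so DDP/gloo is skipped.
        return {"strategy": "auto", "devices": 1}
    # Non-Windows behaviour is unchanged: NCCL DDP on all available GPUs.
    return {"strategy": "ddp", "backend": "nccl", "devices": -1}


def resolve_replicas(dist=None) -> tuple:
    """Hypothetical helper: return (num_replicas, rank).

    Falls back to a single replica when no process group is initialized,
    instead of calling get_world_size() and crashing.
    """
    if dist is None or not dist.is_available() or not dist.is_initialized():
        return 1, 0
    return dist.get_world_size(), dist.get_rank()
```

With PyTorch installed, the second helper would be called as `resolve_replicas(torch.distributed)`; under the single_device strategy it returns `(1, 0)`.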

Tested on:
  * Windows 11 + RTX 5090 (sm_120) + CUDA 12.8 + PyTorch 2.11+cu128:
    15-epoch s1 training completes cleanly, weights saved as expected.

Closes #2626.
2026-05-16 19:10:20 -07:00