On Windows with a single GPU running CUDA 12.8 + PyTorch 2.7+ on Blackwell
(sm_120) hardware, s1_train.py crashes with an access violation (exit code
3221225477) shortly after pytorch_lightning's Trainer initialization, before
the first batch runs.
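
For reference, the decimal exit code maps to the Windows NTSTATUS value 0xC0000005 (STATUS_ACCESS_VIOLATION). A minimal sketch of the conversion; the helper name `describe_exit_code` is hypothetical and exists only for illustration:

```python
def describe_exit_code(code: int) -> tuple[str, int]:
    """Return the unsigned hex form and the signed 32-bit form of an exit code.

    Hypothetical helper for illustration; not part of this PR.
    """
    unsigned = code & 0xFFFFFFFF
    # Reinterpret the same 32 bits as a signed integer, which is how some
    # tools report the process exit code.
    signed = unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned
    return f"0x{unsigned:08X}", signed


hex_form, signed_form = describe_exit_code(3221225477)
print(hex_form, signed_form)
```

Some launchers report the same crash with the signed value instead of the unsigned one.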
Root cause: DDPStrategy with the gloo backend is forced on Windows even
when there's only one GPU. The gloo + sm_120 + CUDA 12.8 combination has a
known incompatibility (see PyTorch forum "[Solved] RTX 5090 sm_120 Training
Segfault - DDP Was the Cause") that produces a native crash inside the
Lightning training loop.
Two changes, scoped to Windows + CUDA only:
* GPT_SoVITS/s1_train.py: on Windows, use Lightning's "auto" strategy,
which picks `single_device` for one GPU and skips DDP entirely. Also
pin devices=1 on Windows so multi-GPU users don't accidentally enable
DDP. Non-Windows behaviour is unchanged (NCCL DDP, all available GPUs).
* GPT_SoVITS/AR/data/bucket_sampler.py: when the distributed process
group isn't initialized (i.e. running under single_device strategy),
fall back to a single-replica configuration instead of crashing in
dist.get_world_size(). This is a defensive change: behaviour is unchanged
when DDP is properly initialized.
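
The two changes can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the actual diff: `pick_trainer_settings` and `resolve_replicas` are hypothetical names, and the plain `"ddp"` / `-1` values only stand in for the existing NCCL DDP configuration that the real script keeps on other platforms.

```python
import platform


def pick_trainer_settings(system: str, cuda_available: bool) -> dict:
    """Choose Trainer strategy/devices (hypothetical helper, illustration only)."""
    if system == "Windows" and cuda_available:
        # "auto" lets Lightning pick single_device for one GPU, so DDP and
        # the gloo backend are never set up. devices=1 keeps it that way
        # even on machines with several GPUs.
        return {"strategy": "auto", "devices": 1}
    # Stand-in for the unchanged path (assumption: the real script builds
    # its own DDP strategy object here and uses all available GPUs).
    return {"strategy": "ddp", "devices": -1}


def resolve_replicas(dist=None) -> tuple[int, int]:
    """Return (num_replicas, rank), falling back to (1, 0) without a process group.

    `dist` is expected to be the torch.distributed module (or None).
    """
    if dist is not None and dist.is_available() and dist.is_initialized():
        return dist.get_world_size(), dist.get_rank()
    # No process group (e.g. single_device strategy): act as one replica.
    return 1, 0


settings = pick_trainer_settings(platform.system(), cuda_available=True)
num_replicas, rank = resolve_replicas()  # no process group passed: (1, 0)
```

In the sampler, the fallback would be called as `resolve_replicas(torch.distributed)`. When DDP has initialized the process group, the result is the same as calling `dist.get_world_size()` and `dist.get_rank()` directly.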
Tested on:
* Windows 11 + RTX 5090 (sm_120) + CUDA 12.8 + PyTorch 2.11+cu128
15-epoch s1 training completes cleanly, weights saved as expected.
Closes #2626.