1. Error message:

raise RuntimeError("Distributed package doesn't have NCCL " "built in")

RuntimeError: Distributed package doesn't have NCCL built in

2. Cause:

  Windows builds of PyTorch do not support the NCCL backend; use gloo instead.

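3. Fix:

  Add the following at the top of your code (the PL_ prefix indicates that this variable is read by PyTorch Lightning):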

import os 
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

Or, if you call init_process_group yourself:

init_process_group(backend="nccl", rank=rank, world_size=world_size)
# change to
init_process_group(backend="gloo", rank=rank, world_size=world_size)

# Windows users may have to use "gloo" instead of "nccl" as the backend

# NCCL: NVIDIA Collective Communications Library
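If the same script has to run on both Windows and Linux, the backend can be picked at runtime instead of hard-coded. The following is only a minimal sketch: the helper name choose_backend and its parameters are hypothetical (not part of PyTorch), and it assumes that "gloo" is an acceptable fallback whenever NCCL is missing.

```python
import sys


def choose_backend(platform: str = sys.platform, nccl_available: bool = True) -> str:
    """Hypothetical helper: return "gloo" on Windows or when NCCL is missing, else "nccl"."""
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


# Usage with PyTorch (requires torch; rank and world_size come from your launcher):
# import torch.distributed as dist
# backend = choose_backend(nccl_available=dist.is_nccl_available())
# dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
```

torch.distributed.is_nccl_available() reports whether the installed PyTorch build includes NCCL, so the check also covers Linux builds compiled without it.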
