问题

安装环境运行模型时报错:CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED at “…/aten/src/ATen/cuda/CUDAContext.cpp”:50, please report a bug to PyTorch.

解决办法

修改文件 /usr/local/lib/python3.8/dist-packages/torch/cuda# vim __init__.py
修该

_cached_device_count: Optional[int] = None
def device_count() -> int:
    r"""Return the number of GPUs available."""
    global _cached_device_count
    if not _is_compiled():
        return 0
    if _cached_device_count is not None:
        return _cached_device_count
    # bypass _device_count_nvml() if rocm (not supported)
    nvml_count = -1 if torch.version.hip else _device_count_nvml()
    r = torch._C._cuda_getDeviceCount() if nvml_count < 0 else nvml_count
    # NB: Do not cache the device count prior to CUDA initialization, because
    # the number of devices can change due to changes to CUDA_VISIBLE_DEVICES
    # setting prior to CUDA initialization.
    if _cached_device_count is None and _initialized:
         _cached_device_count = r
    return r

执行代码的时指定设备

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2" 
os.environ["WORLD_SIZE"] = "1"

在这里插入图片描述

Logo

一站式 AI 云服务平台

更多推荐