pytorch runtime_error ai_generated true

RuntimeError: CUDA error: invalid device ordinal

ID: pytorch/cuda-error-invalid-device-ordinal

Fix Rate: 85%
Confidence: 88%
Evidence: 1
First Seen: 2024-03-15

Version Compatibility

Version          Status   Introduced   Deprecated   Notes
pytorch>=2.0.0   active
cuda>=11.7       active

Root Cause

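The requested GPU device index does not exist in the current process: the ordinal falls outside the range 0 to `torch.cuda.device_count() - 1` (e.g., cuda:1 on a single-GPU machine). Either the system has fewer GPUs than the code assumes, or the CUDA_VISIBLE_DEVICES environment variable hides some devices and renumbers the remaining ones starting from 0.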
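
The renumbering behind this error can be illustrated with a small model. The sketch below is a simplification, not CUDA's actual implementation: the function names `visible_physical_ids` and `is_valid_ordinal` are hypothetical, only integer entries are handled (UUID and MIG forms, and duplicate entries, are not modeled), and the number of physical GPUs is passed in rather than queried.

```python
from typing import List, Optional


def visible_physical_ids(env_value: Optional[str], physical_count: int) -> List[int]:
    """Physical GPU ids in the order the process numbers them 0, 1, 2, ...

    env_value models CUDA_VISIBLE_DEVICES; None means the variable is unset.
    """
    if env_value is None:
        # Variable unset: every physical GPU is visible, in its original order.
        return list(range(physical_count))

    visible: List[int] = []
    for token in env_value.split(","):
        try:
            index = int(token)
        except ValueError:
            break  # non-integer entry: treated here as invalid (simplification)
        if not 0 <= index < physical_count:
            break  # an invalid index hides itself and every entry after it
        visible.append(index)
    return visible


def is_valid_ordinal(ordinal: int, env_value: Optional[str], physical_count: int) -> bool:
    """True if cuda:<ordinal> would refer to a visible device in this model."""
    return 0 <= ordinal < len(visible_physical_ids(env_value, physical_count))
```

For instance, on a machine with four GPUs and `CUDA_VISIBLE_DEVICES=2,3`, this model exposes two devices, so `cuda:0` and `cuda:1` are valid while `cuda:2` is an invalid ordinal even though a physical GPU 2 exists.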

generic

Official Documentation

https://pytorch.org/docs/stable/notes/cuda.html#device-handling

Workarounds

  1. 90% success: Check how many GPUs PyTorch can see with `torch.cuda.device_count()` and compare the result with the devices listed by `nvidia-smi`. Then select an index between 0 and `device_count() - 1`, e.g., `torch.device('cuda:0')` if at least one GPU exists.
  2. 85% success: Verify the CUDA_VISIBLE_DEVICES environment variable. In bash, run `echo $CUDA_VISIBLE_DEVICES`. If it is set, make sure every entry is the index of a GPU that exists on the machine, or unset it with `unset CUDA_VISIBLE_DEVICES`.
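
The first workaround can be sketched in Python as a guard that runs before any tensor is moved to a device. This is a minimal sketch under stated assumptions: `pick_device_string` is a hypothetical helper, and falling back to `cuda:0` (or to `cpu` when no GPU is visible) is one possible policy, not a PyTorch default. The `torch` import is wrapped so the index check itself can be exercised without PyTorch installed.

```python
try:
    import torch
except ImportError:  # allows the pure index check below to run without PyTorch
    torch = None


def pick_device_string(requested_index: int, device_count: int) -> str:
    """Return a device string that is valid for the given number of visible GPUs."""
    if device_count <= 0:
        return "cpu"  # no GPU visible: no CUDA ordinal is valid
    if 0 <= requested_index < device_count:
        return f"cuda:{requested_index}"
    return "cuda:0"  # requested ordinal out of range: fall back to the first GPU


if torch is not None:
    # device_count() reflects CUDA_VISIBLE_DEVICES, so it is the right upper bound.
    count = torch.cuda.device_count() if torch.cuda.is_available() else 0
    device = torch.device(pick_device_string(1, count))
    print(f"visible GPUs: {count}, using {device}")
```

A stricter variant could raise an exception instead of falling back, which makes a misconfigured job fail early rather than silently run on a different GPU.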

Dead Ends

Common approaches that don't work:

  1. 70% fail: The issue is configuration (the device index), not installation. Reinstalling does not fix the index mismatch.
  2. 50% fail: The environment variable is still incorrect after the change; users may set it to a non-existent device.
  3. 60% fail: This still fails if no GPU is available; the root cause is the ordinal, not the device type.