GID tensorflow config_error ai_generated true

内部错误:CUDA_ERROR_INVALID_DEVICE:无效的设备序号

InternalError: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal

ID: tensorflow/gpu-visible-devices-invalid-id

其他格式: JSON · Markdown 中文 · English
90%修复率
85%置信度
1证据数
2023-05-10首次发现

版本兼容性

版本状态引入弃用备注
tensorflow 2.12 active
tensorflow 2.13 active
tensorflow 2.14 active
cuda 11.8 active
cuda 12.0 active

根因分析

CUDA_VISIBLE_DEVICES 环境变量引用了系统中不存在的 GPU 索引。

English

CUDA_VISIBLE_DEVICES environment variable references a GPU index that does not exist on the system.

generic

官方文档

https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

解决方案

  1. List available GPUs with nvidia-smi, then set CUDA_VISIBLE_DEVICES to a valid index. For example: export CUDA_VISIBLE_DEVICES=0 (if only one GPU exists). In Python: import os; os.environ['CUDA_VISIBLE_DEVICES'] = '0'; import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))
  2. Remove CUDA_VISIBLE_DEVICES entirely to let TensorFlow auto-detect all GPUs: unset CUDA_VISIBLE_DEVICES

无效尝试

常见但无效的做法:

  1. Reinstalling CUDA drivers 95% 失败

    The issue is not driver installation but environment variable misconfiguration; reinstalling drivers does not fix the ordinal mapping.

  2. Setting CUDA_VISIBLE_DEVICES to all GPUs (e.g., '0,1,2,3') blindly 70% 失败

    If the system has fewer GPUs than specified, the error persists; the correct approach is to query available devices first.