GPA tensorflow system_error ai_generated partial

InternalError: Peer access from GPU:0 to GPU:1 is not supported by the current CUDA driver or device topology

ID: tensorflow/gpu-peer-access-error


Fix Rate: 80%

Confidence: 84%

Evidence: 1

First Seen: 2024-02-14

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
tensorflow>=2.13.0	active	—	—	—
cuda>=11.8	active	—	—	—
nvidia-driver>=525	active	—	—	—

Root Cause

TensorFlow's multi-GPU distribution strategy tried to enable peer-to-peer (P2P) memory access between GPU:0 and GPU:1, but that pair of GPUs does not support it. Whether P2P is available depends on the hardware (for example, an NVLink connection), the driver version, and the PCIe topology.

generic

Official Documentation

https://www.tensorflow.org/guide/gpu#multi-gpu_setup

Workarounds

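80% success Use `tf.distribute.MirroredStrategy` with `cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()`, which reduces gradients by copying tensors between devices instead of using the default NCCL all-reduce. Setting the environment variable `TF_GPU_ALLOCATOR=cuda_malloc_async` or calling `tf.config.experimental.set_memory_growth` per GPU has also been suggested, but both settings change how GPU memory is allocated and neither is documented as a switch for peer access, so treat them as unverified for this error.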
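75% success Check the GPU topology with `nvidia-smi topo -m`. If the affected pair has no peer-to-peer path, move the GPUs onto the same PCIe switch if possible, or use a distribution strategy that avoids peer access, e.g. `tf.distribute.MultiWorkerMirroredStrategy` (formerly `tf.distribute.experimental.MultiWorkerMirroredStrategy`) with the gRPC-based `RING` communication implementation instead of NCCL.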
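The topology check can be scripted. The sketch below parses the matrix printed by `nvidia-smi topo -m` into a dictionary of link types per GPU pair. The column layout it expects (a header row of `GPU<n>` names followed by one row per GPU) is an assumption based on typical output and may differ between driver versions. The function names are hypothetical. A link type such as `SYS` or `PHB` describes the path between two GPUs and does not by itself say whether peer access is enabled.

```python
import re
import subprocess

_ANSI = re.compile(r"\x1b\[[0-9;]*m")   # some versions underline the header row
_GPU_NAME = re.compile(r"GPU\d+")


def parse_topo_matrix(text):
    """Return {(src, dst): link_type} for every ordered pair of distinct GPUs."""
    rows = []
    for line in _ANSI.sub("", text).splitlines():
        tokens = line.split()
        if tokens and _GPU_NAME.fullmatch(tokens[0]):
            rows.append(tokens)
    if not rows:
        return {}

    # The first matching line is the header; the rest are one row per GPU.
    header, body = rows[0], rows[1:]
    gpus = [tok for tok in header if _GPU_NAME.fullmatch(tok)]

    links = {}
    for row in body:
        src, cells = row[0], row[1:1 + len(gpus)]
        for dst, link in zip(gpus, cells):
            if src != dst:
                links[(src, dst)] = link
    return links


def read_topology():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_topo_matrix(out)
```

On a two-GPU machine, `read_topology()[("GPU0", "GPU1")]` would give the link type for the pair named in the error message.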

Dead Ends

Common approaches that don't work:

Upgrading to the latest CUDA toolkit without checking driver compatibility. 65% fail
Peer access support depends on both hardware (e.g., NVLink) and driver version; a newer CUDA toolkit may not help if the driver is outdated or hardware lacks NVLink.
Setting CUDA_VISIBLE_DEVICES to a single GPU to avoid multi-GPU errors. 50% fail
This bypasses the error but reduces the effective GPU count to 1, defeating the purpose of multi-GPU training; the error is not fixed, just avoided.
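Before upgrading the CUDA toolkit, the installed driver can be compared with the minimum listed in the version table. The sketch below assumes `nvidia-smi` is on the PATH. The threshold of 525 comes from that table, and the function names are hypothetical. Passing this check does not imply that peer access is supported, since the hardware and topology constraints still apply.

```python
import subprocess

MIN_DRIVER_MAJOR = 525  # minimum from the version table in this entry


def driver_major(version):
    """Return the major component of a version string such as '535.104.05'."""
    return int(version.strip().split(".")[0])


def meets_minimum(version, minimum=MIN_DRIVER_MAJOR):
    """True if the driver's major version is at least the given minimum."""
    return driver_major(version) >= minimum


def installed_driver_versions():
    """Query one driver version per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

For example, `all(meets_minimum(v) for v in installed_driver_versions())` reports whether every GPU's driver meets the listed minimum.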