# InternalError: The current CUDA driver or device topology does not support direct peer access from GPU:0 to GPU:1

- **ID:** `tensorflow/gpu-peer-access-error`
- **Domain:** tensorflow
- **Category:** system_error
- **Error code:** `GPA`
- **Verification level:** ai_generated
- **Fix rate:** 80%

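## Root Cause

The GPUs in the system cannot access each other's memory peer-to-peer (for example over NVLink) because of hardware limitations, the driver version, or PCIe topology constraints, yet TensorFlow's multi-GPU distribution strategy still tries to enable peer access.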
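
To see which link type connects each GPU pair, the matrix printed by `nvidia-smi topo -m` can be parsed with a few lines of Python. The following is a minimal sketch, not a definitive tool: `SAMPLE_TOPO` is made-up illustrative output, and the helper names `parse_topology`, `pairs_without_nvlink`, and `read_topology` are hypothetical. A link type alone does not prove that peer access is or is not available; the driver makes that decision, so treat the result as a hint.

```python
import re
import subprocess

# Illustrative sample of `nvidia-smi topo -m` output for a two-GPU machine.
# This text is made up for the example, not captured from a real system.
SAMPLE_TOPO = """\
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-15            0
GPU1    SYS      X      16-31           1

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes
"""

# Some nvidia-smi versions wrap the header in terminal color/underline codes.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
GPU_NAME = re.compile(r"GPU\d+")


def parse_topology(text):
    """Return {(row_gpu, col_gpu): link_type} for every distinct GPU pair."""
    columns = []
    links = {}
    for raw_line in text.splitlines():
        tokens = ANSI_ESCAPE.sub("", raw_line).split()
        if not tokens or not GPU_NAME.fullmatch(tokens[0]):
            continue
        if not columns:
            # The first line that starts with a GPU name is the header row.
            columns = [t for t in tokens if GPU_NAME.fullmatch(t)]
            continue
        row = tokens[0]
        for col, link in zip(columns, tokens[1:]):
            if col != row:
                links[(row, col)] = link
    return links


def pairs_without_nvlink(links):
    """List unordered GPU pairs whose link type is not NVLink (NV1, NV2, ...)."""
    return sorted(
        {tuple(sorted(pair)) for pair, link in links.items()
         if not link.startswith("NV")}
    )


def read_topology():
    """Run nvidia-smi on the local machine and return its topology matrix text."""
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `pairs_without_nvlink(parse_topology(read_topology()))` on the affected machine lists the GPU pairs that are connected only through PCIe or the CPU interconnect, which are the pairs most likely to trigger this error.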

## Version Compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| tensorflow>=2.13.0 | active | — | — |
| cuda>=11.8 | active | — | — |
| nvidia-driver>=525 | active | — | — |

## Solutions

1. Adjust TensorFlow's GPU configuration: set the environment variable `TF_GPU_ALLOCATOR=cuda_malloc_async`, or call `tf.config.experimental.set_memory_growth` for each GPU. Both settings change how GPU memory is allocated rather than switching peer access off directly, so confirm that the error is actually gone afterwards. Alternatively, use `tf.distribute.MirroredStrategy` with `cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()`, which does not require peer access.
2. Check the GPU topology with `nvidia-smi topo -m`. If peer access is unsupported, place the GPUs on the same PCIe switch if possible, or use a distribution strategy that avoids peer access (e.g., `tf.distribute.MultiWorkerMirroredStrategy`, formerly `tf.distribute.experimental.MultiWorkerMirroredStrategy`, with RPC-based communication).
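
The `MirroredStrategy` option from step 1 might look like the sketch below. It assumes two visible GPUs; `gpu_device_names`, `build_strategy`, and `build_model` are hypothetical helper names, and the one-layer model is only a placeholder. TensorFlow is imported inside the functions so that the device-name helper can be used without it.

```python
def gpu_device_names(count):
    """Build the device strings passed to MirroredStrategy, e.g. '/gpu:0'."""
    return [f"/gpu:{index}" for index in range(count)]


def build_strategy(gpu_count=2):
    """Create a MirroredStrategy that reduces with HierarchicalCopyAllReduce."""
    # Imported lazily so the helper above works without TensorFlow installed.
    import tensorflow as tf

    return tf.distribute.MirroredStrategy(
        devices=gpu_device_names(gpu_count),
        cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(),
    )


def build_model(strategy):
    """Placeholder model; variables must be created inside strategy.scope()."""
    import tensorflow as tf

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
        model.compile(optimizer="sgd", loss="mse")
    return model
```

On the affected machine, `model = build_model(build_strategy(2))` followed by the usual `model.fit(...)` call is enough to check whether the error still appears with this reduction method.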

## Failed Attempts

- **Upgrading to the latest CUDA toolkit without checking driver compatibility.** Peer access support depends on both the hardware (e.g., NVLink) and the driver version, so a newer CUDA toolkit may not help if the driver is outdated or the hardware lacks NVLink. (65% failure rate)
- **Setting `CUDA_VISIBLE_DEVICES` to a single GPU to avoid multi-GPU errors.** This sidesteps the error but leaves only one usable GPU, which defeats the purpose of multi-GPU training; the error is avoided, not fixed. (50% failure rate)
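
Before upgrading the CUDA toolkit, it can help to check the installed driver version first. The sketch below queries `nvidia-smi` and compares only the major version number, which is a simplification. The default threshold of 525 is borrowed from the compatibility table in this entry as an illustrative value, not as NVIDIA's official minimum for any particular toolkit; the function names are hypothetical.

```python
import subprocess


def driver_major(version):
    """Extract the major component from a driver string such as '535.104.05'."""
    return int(version.strip().split(".")[0])


def driver_meets_minimum(version, minimum_major=525):
    """Return True if the driver's major version is at least minimum_major.

    525 is an illustrative default taken from this entry's compatibility
    table; look up the real minimum for your toolkit in its release notes.
    """
    return driver_major(version) >= minimum_major


def installed_driver_versions():
    """Ask nvidia-smi for the driver version reported by each GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

For example, `all(driver_meets_minimum(v) for v in installed_driver_versions())` reports whether every GPU's driver clears the chosen threshold.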
