CUDA_ERROR_ILLEGAL_ADDRESS
pytorch
runtime_error
ai_generated
partial
RuntimeError: CUDA error: an illegal memory access was encountered
ID: pytorch/cuda-error-illegal-memory-access
Fix rate: 75%
Confidence: 85%
Evidence count: 1
First seen: 2023-03-15
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| pytorch>=1.10 | active | — | — | — |
| cuda>=11.0 | active | — | — | — |
| cudnn>=8.0 | active | — | — | — |
Root cause analysis
A kernel attempted to read or write memory outside its allocated region, often because of out-of-bounds tensor indexing or corrupted pointers.
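To test the out-of-bounds hypothesis, you can validate index tensors before they reach an indexing or gather operation. The sketch below is illustrative only: `check_index_bounds` is a hypothetical helper, not a PyTorch API, and it assumes negative indices are never intended (plain tensor indexing does accept them, so relax the check if your code relies on that).

```python
import torch


def check_index_bounds(idx: torch.Tensor, size: int, name: str = "idx") -> None:
    """Hypothetical helper: raise IndexError if any entry of idx is outside [0, size)."""
    if idx.numel() == 0:
        return  # nothing to validate
    # int(...) copies the scalar to the host, which also forces a sync on CUDA tensors.
    lo, hi = int(idx.min()), int(idx.max())
    if lo < 0 or hi >= size:
        raise IndexError(
            f"{name} has range [{lo}, {hi}], expected values in [0, {size})"
        )


table = torch.randn(10, 4)        # 10 rows, so valid row indices are 0..9
good = torch.tensor([0, 3, 9])
bad = torch.tensor([0, 3, 10])    # 10 is one past the last valid row

check_index_bounds(good, table.size(0))   # passes silently
rows = table[good]

try:
    check_index_bounds(bad, table.size(0), name="bad")
except IndexError as exc:
    message = str(exc)            # readable report instead of a CUDA failure
```

The check costs a reduction and a host copy per call, so it is better suited to debugging runs than to hot training loops.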
Official documentation
https://pytorch.org/docs/stable/notes/cuda.html#cuda-error-handling
Solutions
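- Enable synchronous kernel launches to pinpoint the failing line: set the environment variable CUDA_LAUNCH_BLOCKING=1 before running the script, then rerun it and read the traceback. Kernels launch asynchronously by default, so without this setting the traceback often points at a later, unrelated call.
- Keep dynamically computed indices within bounds with torch.clamp or torch.where. For example: `idx = torch.clamp(idx, 0, tensor.size(0) - 1)`. Clamping prevents the invalid access but can hide the bug that produced the bad index, so also check where the out-of-range values come from.
- Call torch.cuda.synchronize() after suspicious operations to force pending kernels to finish, so the error surfaces close to the operation that caused it.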
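The first and third suggestions above can be combined in a small debugging harness. This is a minimal sketch, not an official pattern: `run_checked` and the step label are hypothetical names introduced for illustration, and the example falls back to the CPU when no GPU is present so that it still runs.

```python
import os

# Must be set before the first CUDA call; setting it before importing torch is simplest.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch


def run_checked(step_name, fn, *args, **kwargs):
    """Hypothetical wrapper: run fn, then synchronize so that a pending
    asynchronous CUDA error is raised here and labeled with step_name."""
    try:
        out = fn(*args, **kwargs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for all queued kernels to finish
    except RuntimeError as exc:
        raise RuntimeError(f"failure surfaced at step '{step_name}'") from exc
    return out


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)
y = run_checked("matmul", torch.matmul, x, x.T)
```

Both settings slow execution noticeably, so they are usually removed once the failing operation is found. After an illegal memory access the CUDA context is typically unusable, so expect to restart the process rather than catch the error and continue.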
Ineffective attempts
Common but ineffective approaches:
- Increasing GPU memory or adding more GPUs (90% failure rate). The error concerns invalid access, not memory capacity; more memory does not fix an invalid pointer.
- Rebooting the machine or resetting the CUDA context (70% failure rate). The root cause is in the code logic; a reboot may temporarily mask the issue, but it recurs.
- Switching to CPU mode entirely (50% failure rate). This avoids the error but defeats the purpose of using GPU acceleration.