CUDA_ERROR_ILLEGAL_ADDRESS pytorch runtime_error ai_generated partial

RuntimeError: CUDA error: an illegal memory access was encountered

ID: pytorch/cuda-error-illegal-memory-access

Fix rate: 75%
Confidence: 85%
Evidence count: 1
First seen: 2023-03-15

Version compatibility

Version          Status   Introduced   Deprecated   Notes
pytorch>=1.10    active
cuda>=11.0       active
cudnn>=8.0       active

Root cause analysis

A kernel attempted to read or write memory outside its allocated region, often caused by out-of-bounds tensor indexing or corrupted pointers.
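
To make the out-of-bounds case concrete, the following is a minimal sketch of a host-side index check that runs before a lookup. The helper name `check_indices` is hypothetical (it is not a PyTorch API), and the sketch assumes indices are meant to be non-negative, as in an embedding lookup.

```python
import torch


def check_indices(idx: torch.Tensor, size: int) -> None:
    """Raise IndexError if any entry of idx falls outside [0, size).

    Hypothetical helper for illustration; assumes negative indices are invalid.
    """
    if idx.numel() == 0:
        return
    # int(...) copies the value to the host, which also synchronizes on GPU tensors.
    lo, hi = int(idx.min()), int(idx.max())
    if lo < 0 or hi >= size:
        raise IndexError(f"index range [{lo}, {hi}] is outside [0, {size})")


table = torch.arange(10.0)       # 10 rows, valid indices are 0..9
good = torch.tensor([0, 3, 9])
bad = torch.tensor([0, 3, 10])   # 10 is one past the end

check_indices(good, table.size(0))
values = table[good]

try:
    check_indices(bad, table.size(0))
except IndexError as exc:
    print(f"caught before launching a kernel: {exc}")
```

A check like this costs a device-to-host copy on each call, so it is probably best kept to debugging runs rather than left in a hot loop.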

generic

Official documentation

https://pytorch.org/docs/stable/notes/cuda.html#cuda-error-handling

Solutions

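  1. Enable synchronous kernel launches to pinpoint the failing line: set the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the script, then rerun it and read the traceback. Without this setting, kernels launch asynchronously and the error is often reported at a later, unrelated call.
  2. Keep dynamic indices within bounds before indexing, using torch.clamp or torch.where. For example: `idx = torch.clamp(idx, 0, tensor.size(0)-1)`. Clamping hides a bad index rather than fixing its source, so also check where the out-of-range values come from.
  3. Call torch.cuda.synchronize() after suspicious operations so that pending kernels finish and the error surfaces closer to the operation that caused it.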
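
The three steps above can be combined roughly as follows. This is a sketch under stated assumptions, not a definitive recipe: the tensor names and values are made up for illustration, and the code falls back to the CPU when no GPU is available so that it still runs.

```python
import os

# Step 1: request synchronous kernel launches. The variable has to be set
# before CUDA is initialized, so setting it before importing torch is the
# safest place (setting it in the shell works as well).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

table = torch.arange(10.0, device=device)        # illustrative data, 10 rows
idx = torch.tensor([-2, 4, 12], device=device)   # two entries are out of range

# Step 2: clamp indices into [0, size - 1] before indexing.
safe_idx = torch.clamp(idx, 0, table.size(0) - 1)
picked = table[safe_idx]

# Step 3: wait for pending kernels so that any error is raised here,
# close to the operation that caused it.
if device.type == "cuda":
    torch.cuda.synchronize()

print(safe_idx.tolist(), picked.tolist())
```

Synchronous launches and explicit synchronization both slow execution, so they are usually removed again once the faulty operation has been found.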

Ineffective attempts

Common but ineffective approaches:

  1. Increasing GPU memory or adding more GPUs (90% failure rate)

    The error concerns an invalid access, not memory capacity; more memory does not fix an invalid pointer or index.

  2. Rebooting the machine or resetting the CUDA context (70% failure rate)

    The root cause is in the code logic; a reboot may temporarily mask the issue, but it recurs.

  3. Switching to CPU mode entirely (50% failure rate)

    Avoids the error but defeats the purpose of using GPU acceleration.