cudaErrorIllegalAddress
cuda
runtime_error
ai_generated
true
RuntimeError: CUDA error: an illegal memory access was encountered after a cudaFree call on a tensor still in use
ID: cuda/illegal-memory-access-after-free
Fix rate: 79%
Confidence: 82%
Evidence count: 1
First seen: 2025-01-20
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 12.2 | active | — | — | — |
| PyTorch 2.2.0 | active | — | — | — |
| NVIDIA Driver 550.54.14 | active | — | — | — |
Root cause analysis
A tensor or buffer was freed via cudaFree or torch.cuda.empty_cache() while a kernel or asynchronous operation still held a reference to it, producing a use-after-free on the GPU.
Official documentation
https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html
Solutions
- Synchronize every CUDA stream that may still be using a tensor before freeing it. For example, call torch.cuda.synchronize() before del tensor or torch.cuda.empty_cache(). For custom kernels, call cudaStreamSynchronize on the relevant stream.
- Track tensor lifetimes with reference counting or weak references. In PyTorch, keep a strong reference to the tensor until the kernel completes, e.g., by storing it in a list until the next iteration.
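The two approaches above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the names `DeferredRelease`, `hold_until_done`, and `sync_then_free` are hypothetical helpers introduced here, and the sketch assumes that a `torch.cuda.Event` recorded right after the last kernel launch is an adequate completion marker for the tensor. `torch` is imported inside the functions that need it so the bookkeeping class can be exercised without a GPU.

```python
from collections import deque


class DeferredRelease:
    """Keep strong references alive until a completion marker reports done.

    Hypothetical helper: a marker is any object with a query() -> bool
    method. torch.cuda.Event has that shape.
    """

    def __init__(self):
        self._pending = deque()

    def hold(self, obj, marker):
        # Storing obj here is what keeps it from being freed early.
        self._pending.append((marker, obj))

    def release_completed(self):
        """Drop references whose marker has completed; return the count."""
        still_pending = deque()
        released = 0
        for marker, obj in self._pending:
            if marker.query():
                released += 1  # dropped by not re-queuing it
            else:
                still_pending.append((marker, obj))
        self._pending = still_pending
        return released

    def __len__(self):
        return len(self._pending)


def hold_until_done(keeper, tensor, stream=None):
    """Record an event after the work that uses `tensor`, then hold the tensor.

    Assumes PyTorch with CUDA available; stream=None means the current stream.
    """
    import torch

    event = torch.cuda.Event()
    event.record(stream)
    keeper.hold(tensor, event)


def sync_then_free(tensors, stream=None):
    """Wait for GPU work to finish, then drop the references in `tensors`.

    Assumes PyTorch with CUDA available. Only the references stored in the
    list are dropped; any other reference held by the caller stays alive.
    """
    import torch

    if stream is None:
        torch.cuda.synchronize()  # waits on all streams of the current device
    else:
        stream.synchronize()      # waits on this stream only
    tensors.clear()
    torch.cuda.empty_cache()
```

In a training loop, one might call `hold_until_done(keeper, tmp)` after launching the kernel that reads `tmp`, and `keeper.release_completed()` at the start of each iteration, so that no reference is dropped while GPU work is still pending.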
Ineffective attempts
Common but ineffective approaches:
- 70% failure rate: Synchronization may hide the bug but does not fix the root cause; the free still happens before all uses complete.
- 95% failure rate: Memory size is unrelated; the error is about lifetime management, not capacity.