CUDA_ERROR_ILLEGAL_ADDRESS pytorch runtime_error ai_generated partial

RuntimeError: CUDA error: an illegal memory access was encountered

ID: pytorch/cuda-error-illegal-memory-access


Fix Rate: 75%

Confidence: 85%

Evidence: 1

First Seen: 2023-03-15

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
pytorch>=1.10	active	—	—	—
cuda>=11.0	active	—	—	—
cudnn>=8.0	active	—	—	—

Root Cause

A kernel attempted to read or write memory outside its allocated region, often caused by out-of-bounds tensor indexing or corrupted pointers.

generic


Official Documentation

https://pytorch.org/docs/stable/notes/cuda.html#cuda-error-handling

Workarounds

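80% success: Enable synchronous CUDA kernel launches to locate the failing call. CUDA kernels run asynchronously, so the Python traceback often points at a later, unrelated operation. Set the environment variable CUDA_LAUNCH_BLOCKING=1 before starting the script, rerun it, and read the traceback, which should now end at the call that launched the faulty kernel.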
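As a minimal sketch of this workaround, the snippet below starts a script in a child process with CUDA_LAUNCH_BLOCKING=1 set, which ensures the variable is present before CUDA is initialized. The helper names `blocking_env` and `run_with_blocking_launches` and the script name `train.py` are hypothetical and used only for illustration.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(script_path, *script_args):
    """Run a Python script in a child process with CUDA_LAUNCH_BLOCKING=1.

    Returns the exit code of the child process.
    """
    command = [sys.executable, script_path, *script_args]
    return subprocess.run(command, env=blocking_env()).returncode


# Hypothetical usage ("train.py" is a placeholder for the failing script):
# exit_code = run_with_blocking_launches("train.py", "--epochs", "1")
```

Synchronous launches slow execution noticeably, so this setting is probably best kept to debugging runs.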
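75% success: Constrain dynamic indices with torch.clamp or torch.where before indexing so they stay within bounds. For example: `idx = torch.clamp(idx, 0, tensor.size(0)-1)`. Clamping remaps invalid indices silently, so also check where the out-of-range values come from.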
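The following sketch shows one way to combine a bounds check with the clamp described above, assuming indexing along dimension 0 with an integer index tensor. The helper names `index_range_error` and `checked_index` are hypothetical; only `torch.clamp`, `Tensor.size`, `Tensor.min`, `Tensor.max`, and `Tensor.numel` are PyTorch calls.

```python
def index_range_error(lo, hi, size):
    """Return a message if the index range [lo, hi] leaves [0, size - 1], else None."""
    if lo < 0 or hi >= size:
        return f"index range [{lo}, {hi}] is outside [0, {size - 1}]"
    return None


def checked_index(table, idx, clamp=False):
    """Index `table` along dim 0 with `idx`, validating bounds first.

    With clamp=False an out-of-range index raises IndexError on the host,
    before any kernel is launched. With clamp=True the indices are forced
    into range, which avoids the crash but can hide the underlying bug.
    """
    import torch  # imported lazily so the bounds helper has no dependencies

    size = table.size(0)
    if idx.numel() == 0:
        return table[idx]
    # int(...) copies the min/max to the host, which synchronizes with the GPU.
    error = index_range_error(int(idx.min()), int(idx.max()), size)
    if error is not None:
        if not clamp:
            raise IndexError(error)
        idx = torch.clamp(idx, 0, size - 1)
    return table[idx]


# Hypothetical usage:
# rows = checked_index(embedding_table, token_ids)              # fail loudly
# rows = checked_index(embedding_table, token_ids, clamp=True)  # force in range
```

The check costs a device-to-host copy on every call, so it may be worth enabling only while hunting the bug.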
70% success: Call torch.cuda.synchronize() after suspicious operations. It blocks until all queued kernels finish, so a pending error is raised at that point rather than at a later, unrelated call.
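One possible way to apply this is a small context manager that synchronizes when a suspicious block ends and labels any error that surfaces there. This is a sketch, not an established PyTorch utility: `sync_checkpoint` and its `synchronize` parameter are hypothetical names, and the default simply calls `torch.cuda.synchronize()` when CUDA is available.

```python
import contextlib


def _default_synchronize():
    """Wait for all queued CUDA work to finish, if a GPU is present."""
    import torch  # imported lazily; only needed when the default is used

    if torch.cuda.is_available():
        torch.cuda.synchronize()


@contextlib.contextmanager
def sync_checkpoint(label, synchronize=_default_synchronize):
    """Synchronize after the wrapped block and tag any error with `label`."""
    yield
    try:
        synchronize()
    except RuntimeError as exc:
        # Re-raise with the label so the failing region is easy to spot.
        raise RuntimeError(
            f"CUDA error surfaced by checkpoint '{label}'"
        ) from exc


# Hypothetical usage:
# with sync_checkpoint("embedding lookup"):
#     out = table[idx]
# with sync_checkpoint("loss"):
#     loss = criterion(out, target)
```

Wrapping consecutive regions this way narrows the fault down to the first checkpoint that raises.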


Dead Ends

Common approaches that don't work:

Increasing GPU memory or adding more GPUs (90% fail)
The error concerns an invalid access, not memory capacity; more memory does not fix an invalid pointer or index.
Rebooting the machine or resetting the CUDA context (70% fail)
The root cause is in the code logic; a reboot may temporarily mask the issue, but it recurs.
Switching to CPU mode entirely (50% fail)
Avoids the error but defeats the purpose of using GPU acceleration.