CUDA_ERROR_ILLEGAL_ADDRESS pytorch runtime_error ai_generated partial

RuntimeError: CUDA error: an illegal memory access was encountered

ID: pytorch/cuda-error-illegal-memory-access

Fix Rate: 75%
Confidence: 85%
Evidence: 1
First Seen: 2023-03-15

Version Compatibility

Version          Status   Introduced   Deprecated   Notes
pytorch>=1.10    active
cuda>=11.0       active
cudnn>=8.0       active

Root Cause

A kernel attempted to read or write memory outside its allocated region, often caused by out-of-bounds tensor indexing or corrupted pointers.
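
As a minimal sketch of the out-of-bounds case, the helper below validates an index tensor against the size of the dimension it indexes, so a bad index surfaces as an ordinary Python exception instead of an asynchronous CUDA failure. The name `check_indices` is hypothetical and not part of PyTorch, and rejecting negative indices is an assumption that suits operations such as embedding lookups. Reading the minimum and maximum back to Python forces a device synchronization, so a check like this is probably best kept for debugging runs.

```python
import torch


def check_indices(idx: torch.Tensor, size: int) -> None:
    """Raise IndexError if any entry of idx falls outside [0, size).

    Hypothetical helper for illustration; not a PyTorch API.
    """
    if idx.numel() == 0:
        return  # nothing to validate
    lo, hi = int(idx.min()), int(idx.max())
    if lo < 0 or hi >= size:
        raise IndexError(f"index range [{lo}, {hi}] is outside [0, {size})")


table = torch.arange(10.0)
good = torch.tensor([0, 3, 9])
bad = torch.tensor([0, 3, 10])  # 10 is one past the last valid position

check_indices(good, table.size(0))  # passes silently

try:
    check_indices(bad, table.size(0))
except IndexError as err:
    message = str(err)
    print(message)
```

The same call works unchanged on CUDA tensors, since `min`, `max`, and `int()` all accept tensors on any device.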

generic

Official Documentation

https://pytorch.org/docs/stable/notes/cuda.html#cuda-error-handling

Workarounds

  1. 80% success Enable synchronous CUDA execution so the traceback points at the operation that failed: set the environment variable CUDA_LAUNCH_BLOCKING=1 before running the script, then run it and check the traceback.
  2. 75% success Guard dynamic indices with torch.clamp or torch.where so they stay within bounds before they are used. For example: `idx = torch.clamp(idx, 0, tensor.size(0)-1)`
  3. 70% success Call torch.cuda.synchronize() after suspicious operations to force synchronization and surface the error closer to the operation that caused it.
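
The three workarounds can be combined in one short script. The sketch below makes a few assumptions: the tensor names `table` and `idx` are made up for illustration, the environment variable is set from Python before `torch` is imported (setting it in the shell before launching the script is equivalent), and the code falls back to the CPU when no GPU is present so that it still runs.

```python
import os

# Workaround 1: request synchronous kernel launches. Assumption: this has to
# take effect before CUDA is initialized, so it is set before importing torch.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

table = torch.arange(10.0, device=device)       # illustrative data
idx = torch.tensor([-2, 4, 12], device=device)  # two entries are out of range

# Workaround 2: clamp indices into the valid range [0, size - 1].
safe_idx = torch.clamp(idx, 0, table.size(0) - 1)
rows = table[safe_idx]

# Workaround 3: wait for queued kernels to finish so that any failure is
# reported here rather than at some later, unrelated call.
if device == "cuda":
    torch.cuda.synchronize()

print(safe_idx.tolist())  # [0, 4, 9]
```

Clamping keeps the kernel inside its allocation but silently changes which elements are read, so it may be worth logging whenever `safe_idx` differs from `idx` and tracing where the invalid indices come from.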

Dead Ends

Common approaches that don't work:

  1. Increasing GPU memory or adding more GPUs 90% fail

    The error is not about memory capacity but invalid access; more memory doesn't fix invalid pointers.

  2. Rebooting the machine or resetting CUDA context 70% fail

    The root cause is in the code logic; a reboot may temporarily mask the issue, but the error recurs once the same code path runs again.

  3. Switching to CPU mode entirely 50% fail

    Avoids the error but defeats the purpose of using GPU acceleration.