pytorch assertion_error ai_generated true

RuntimeError: CUDA error: device-side assert triggered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions

ID: pytorch/cuda-assert-triggered

Fix Rate: 90%
Confidence: 88%
Evidence: 1
First Seen: 2023-03-10

Version Compatibility

Version        Status   Introduced   Deprecated   Notes
torch 1.13.1   active
torch 2.0.0    active
cuda 11.7      active
cuda 12.0      active

Root Cause

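A CUDA kernel hit a device-side assertion, typically because it received invalid input such as an out-of-bounds index or a NaN value (for example, a NaN in the loss). Unless PyTorch was compiled with TORCH_USE_CUDA_DSA, the error message does not say which assertion failed or why.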
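To make the out-of-bounds case concrete, here is a minimal sketch that feeds an invalid index to an embedding lookup on the CPU, where the failure surfaces as an ordinary Python exception rather than a device-side assert. The layer, sizes, and index values are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size = 10
embedding = nn.Embedding(vocab_size, 4)

# Valid indices are 0..9, so 10 is out of range.
bad_indices = torch.tensor([1, 5, 10])

try:
    embedding(bad_indices)
    cpu_error = None
except (IndexError, RuntimeError) as exc:
    # Recent PyTorch releases raise IndexError here on CPU.
    cpu_error = exc

print(type(cpu_error).__name__, cpu_error)
```

On the GPU the same lookup is one common way to end up with the error above. Because CUDA kernels run asynchronously, the Python traceback may point at a later line than the one that launched the failing kernel. Running with the environment variable CUDA_LAUNCH_BLOCKING=1 makes launches synchronous, which usually brings the traceback closer to the real call site.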

generic


Official Documentation

https://pytorch.org/docs/stable/notes/cuda.html#cuda-errors

Workarounds

  1. 90% success
    Rebuild PyTorch from source with TORCH_USE_CUDA_DSA=1 to get detailed error messages:
    export TORCH_USE_CUDA_DSA=1
    pip install --no-cache-dir --verbose torch --no-binary torch
    TORCH_USE_CUDA_DSA is a compile-time setting, so it has no effect on a prebuilt wheel. If pip cannot find a source distribution for torch, build from a checkout of the PyTorch repository with the variable set, following its build instructions.
    Then rerun and check which assertion failed and on which line.
  2. 85% success
    Add assertions in your code before CUDA operations, e.g., check index bounds:
    assert (indices >= 0).all() and (indices < tensor.size(0)).all(), "Index out of bounds"
    Also check for NaN/Inf in loss (torch.isnan alone would miss Inf):
    assert torch.isfinite(loss).all(), "Loss contains NaN or Inf"
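The checks in workaround 2 can be wrapped in small helpers that report the offending range instead of a bare assertion failure. This is only a sketch: the names check_indices and check_finite are hypothetical, not part of PyTorch.

```python
import torch

def check_indices(indices, size, name="indices"):
    """Raise ValueError if any index falls outside [0, size)."""
    if indices.numel() == 0:
        return
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= size:
        raise ValueError(
            f"{name} out of bounds: min={lo}, max={hi}, valid range is [0, {size})"
        )

def check_finite(value, name="loss"):
    """Raise ValueError if value contains NaN or +/-Inf."""
    if not torch.isfinite(value).all():
        raise ValueError(f"{name} contains NaN or Inf")

# Example values, chosen only for illustration.
ok_indices = torch.tensor([0, 3, 9])
bad_indices = torch.tensor([1, 5, 10])
check_indices(ok_indices, 10)  # passes silently

try:
    check_indices(bad_indices, 10)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

When the tensors live on the GPU, each check forces a device-to-host synchronization, so it may be worth enabling them only while debugging.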


Dead Ends

Common approaches that don't work:

  1. 95% fail

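    Simply catching the exception and retrying may mask the root cause (e.g., invalid index) and cause silent data corruption. After a device-side assert the CUDA context is also left in an error state, so later CUDA calls in the same process generally keep failing until the process is restarted.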

  2. 90% fail

    Changing the batch size or the learning rate does not fix an illegal memory access or an index error; the invalid input is still there.

  3. 80% fail

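    Disabling CUDA and falling back to CPU is not a real fix and may be impractically slow. A one-off CPU run can still help with diagnosis, since the same invalid input often raises a readable Python error there.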
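As an alternative to catch-and-retry, one option is to fail fast and keep the batch that triggered the error so it can be replayed later in a fresh process. The sketch below assumes the caller still holds a CPU copy of the batch. guarded_step, failing_step, and the file name are hypothetical, and the failure is simulated so the example runs without a GPU.

```python
import os
import tempfile
import torch

def guarded_step(step_fn, cpu_batch, dump_path):
    """Run step_fn(cpu_batch); on a RuntimeError, save the CPU batch and re-raise."""
    try:
        return step_fn(cpu_batch)
    except RuntimeError:
        # Save the CPU copy: after a device-side assert, tensors already on
        # the GPU generally cannot be read back in this process.
        torch.save(cpu_batch, dump_path)
        raise  # fail fast instead of retrying

# Stand-in step that fails the way a CUDA assert would surface in Python.
def failing_step(batch):
    raise RuntimeError("CUDA error: device-side assert triggered")

dump_path = os.path.join(tempfile.mkdtemp(), "failing_batch.pt")
batch = {"indices": torch.tensor([1, 5, 10])}

try:
    guarded_step(failing_step, batch, dump_path)
    reraised = False
except RuntimeError:
    reraised = True

# In a fresh process, the saved batch can be reloaded and inspected.
restored = torch.load(dump_path)
print(reraised, restored["indices"])
```

The saved batch can then be inspected or replayed on the CPU to find the invalid value without rerunning the whole job.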