pytorch assertion_error ai_generated true

RuntimeError: CUDA error: device-side assert triggered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions

ID: pytorch/cuda-assert-triggered


90% Fix Rate

88% Confidence

1 Evidence

2023-03-10 First Seen

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
torch 1.13.1	active	—	—	—
torch 2.0.0	active	—	—	—
cuda 11.7	active	—	—	—
cuda 12.0	active	—	—	—

Root Cause

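A CUDA kernel hit a device-side assertion, typically because of an out-of-bounds index (e.g., a class label or embedding index outside the valid range) or because invalid values such as NaN reached an operation that checks its inputs. CUDA kernels run asynchronously, so the Python stack trace may point at a later, unrelated line, and the message carries no detail about the failing assertion unless PyTorch was built with TORCH_USE_CUDA_DSA.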

generic
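
To illustrate the out-of-bounds case, here is a minimal sketch that runs on CPU, where the same mistake surfaces as an ordinary Python exception rather than a device-side assert. The helper name `try_cross_entropy` and the tensor shapes are made up for this example and do not come from the original report.

```python
import torch
import torch.nn.functional as F


def try_cross_entropy(num_classes, labels):
    """Run cross-entropy on CPU; return the error text, or None if it succeeds."""
    # Dummy scores: one row per sample, one column per class (assumed shapes).
    logits = torch.zeros(len(labels), num_classes)
    target = torch.tensor(labels, dtype=torch.long)
    try:
        F.cross_entropy(logits, target)
    except (IndexError, RuntimeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Valid labels for 3 classes are 0, 1 and 2; the label 3 is out of range.
print(try_cross_entropy(3, [0, 2]))
print(try_cross_entropy(3, [0, 3]))
```

On a GPU, the second call is the kind of input that would be expected to end in the device-side assert described here, which is why a label range that does not match the model's output size is a common first thing to check.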


Official Documentation

https://pytorch.org/docs/stable/notes/cuda.html#cuda-errors

Workarounds

90% success
Rebuild PyTorch from source with TORCH_USE_CUDA_DSA=1 to get detailed error messages, then rerun and check the exact line causing the assertion. The pip command below only builds from source if a source distribution of torch is available to pip; if it is not, follow the build-from-source instructions in the PyTorch repository with the same variable exported.
```
# Enable device-side assertion details for the build
export TORCH_USE_CUDA_DSA=1
# Force pip to build torch from source instead of installing a prebuilt wheel
pip install --no-cache-dir --verbose torch --no-binary torch
```
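
A full rebuild is not always practical. As a lighter complement (not a replacement for the DSA build), the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the Python stack trace points at the operation that actually failed. The sketch below launches a script in a child process with that variable set; `debug_env`, `run_blocking`, and `train.py` are hypothetical names chosen for the example.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(script_path):
    """Run a Python script in a fresh process using the debug environment."""
    return subprocess.run([sys.executable, script_path], env=debug_env())


# Example (assumes a script named train.py exists in the working directory):
# run_blocking("train.py")
```

Launching a fresh process also guarantees that the variable is set before CUDA is initialized. Synchronous launches slow training down, so this is meant for debugging runs only.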
85% success
Add assertions in your own code before CUDA operations so that bad inputs are caught before they reach a kernel. Check index bounds, and check the loss for both NaN and Inf.
```
import torch

# Index bounds: every index must lie in [0, tensor.size(0)).
assert (indices >= 0).all() and (indices < tensor.size(0)).all(), "Index out of bounds"
# Loss must be finite (torch.isfinite catches both NaN and Inf).
assert torch.isfinite(loss).all(), "Loss contains NaN or Inf"
```
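
The checks above can be wrapped in small helpers that report the offending values instead of a bare assertion message. This is a sketch under assumptions: `check_indices` and `check_finite` are hypothetical names, and the valid index range is taken to be `[0, size)` as in the assertion above.

```python
import torch


def check_indices(indices, size, name="indices"):
    """Raise ValueError if any index falls outside [0, size)."""
    if indices.numel() == 0:
        return  # nothing to validate
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= size:
        raise ValueError(
            f"{name} out of bounds: min={lo}, max={hi}, valid range is [0, {size})"
        )


def check_finite(tensor, name="tensor"):
    """Raise ValueError if the tensor contains NaN or Inf."""
    finite = torch.isfinite(tensor)
    if not finite.all():
        bad = int((~finite).sum())
        raise ValueError(f"{name} contains {bad} non-finite value(s)")


# Example usage with made-up data:
check_indices(torch.tensor([0, 4]), size=5, name="token_ids")
check_finite(torch.tensor([0.25, 1.5]), name="loss")
```

Each helper reads a value back from the device, which forces a synchronization when the tensors live on a GPU, so it may be worth enabling them only while debugging.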


Dead Ends

Common approaches that don't work:

95% fail
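Catching the exception and retrying does not help: the root cause (e.g., an invalid index) is still in the data, and after a device-side assert the CUDA context is left in an error state, so later CUDA calls in the same process fail until it is restarted. Swallowing the error only hides the bug.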
90% fail
Increasing batch size or changing learning rate does not fix illegal memory access or index errors.
80% fail
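Disabling CUDA and falling back to CPU is not a real fix and may be impractically slow. It can still help with diagnosis, because on CPU the same bug usually raises a readable Python exception such as IndexError.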
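
A permanent CPU fallback is a dead end, but a one-off CPU replay of the failing step can be a useful diagnostic. The sketch below copies a module and its inputs to CPU and runs a single forward pass; `replay_on_cpu` is a hypothetical helper, and it assumes the module takes its tensors as positional arguments.

```python
import copy

import torch


def replay_on_cpu(module, *inputs):
    """Run one forward pass on CPU copies of the module and its inputs."""
    cpu_module = copy.deepcopy(module).cpu()  # leave the original module untouched
    cpu_inputs = [
        x.detach().cpu() if isinstance(x, torch.Tensor) else x for x in inputs
    ]
    with torch.no_grad():
        return cpu_module(*cpu_inputs)


# Example with made-up data: index 4 is invalid for an embedding with 4 rows.
embedding = torch.nn.Embedding(4, 2)
try:
    replay_on_cpu(embedding, torch.tensor([0, 4]))
except IndexError as exc:
    print(f"CPU replay failed with IndexError: {exc}")
```

Since the CUDA context is unusable once the assert has fired, the replay is best done in a fresh process, on a batch saved to disk or captured before the failing operation.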