pytorch
assertion_error
ai_generated: true
RuntimeError: CUDA error: device-side assert triggered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions
ID: pytorch/cuda-assert-triggered
Fix Rate: 90%
Confidence: 88%
Evidence: 1
First Seen: 2023-03-10
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| torch 1.13.1 | active | — | — | — |
| torch 2.0.0 | active | — | — | — |
| cuda 11.7 | active | — | — | — |
| cuda 12.0 | active | — | — | — |
Root Cause
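A CUDA kernel hit a failed device-side assertion, typically because of invalid input such as an out-of-bounds index (e.g., a class label or embedding ID outside the valid range) or NaN/Inf values reaching an operation that checks its inputs. Because the installed PyTorch build was not compiled with TORCH_USE_CUDA_DSA, the message does not say which assertion failed.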
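The out-of-bounds case can be reproduced safely on CPU, where the same mistake raises a readable Python exception instead of a device-side assert. A minimal sketch, assuming a hypothetical embedding table with 10 rows (the sizes and variable names are illustrative only):

```python
import torch

# Hypothetical sizes chosen for illustration only.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_indices = torch.tensor([3, 10])  # 10 is out of range: valid IDs are 0..9

try:
    embedding(bad_indices)  # runs on CPU, so the failure is a Python exception
    cpu_error = None
except IndexError as err:
    cpu_error = err

print(type(cpu_error).__name__, cpu_error)
```

On a CUDA device the same lookup would typically surface as the device-side assert described here, without naming the offending index.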
Official Documentation
https://pytorch.org/docs/stable/notes/cuda.html#cuda-errors
Workarounds
-
90% success
Rebuild PyTorch from source with TORCH_USE_CUDA_DSA=1 so that device-side assertions report detailed messages:
export TORCH_USE_CUDA_DSA=1
pip install --no-cache-dir --verbose torch --no-binary torch
If pip cannot find a source distribution for torch, build from a checkout of the PyTorch repository with the same variable set. Then rerun the failing script and check which assertion and line triggered the error.
-
85% success
Add explicit checks before the CUDA operations that consume indices or losses. For index bounds:
assert (indices >= 0).all() and (indices < tensor.size(0)).all(), "Index out of bounds"
For NaN/Inf in the loss:
assert torch.isfinite(loss).all(), "Loss contains NaN or Inf"
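The checks above can be wrapped in small helpers that report the offending values. This is a sketch, assuming hypothetical helper names `check_indices` and `check_finite` (neither is part of PyTorch). Unlike bare `assert` statements, these checks still run when Python is started with `-O`.

```python
import torch


def check_indices(indices: torch.Tensor, size: int, name: str = "indices") -> None:
    """Raise ValueError if any index falls outside [0, size)."""
    if indices.numel() == 0:
        return
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= size:
        raise ValueError(
            f"{name} out of bounds: min={lo}, max={hi}, valid range is [0, {size})"
        )


def check_finite(value: torch.Tensor, name: str = "loss") -> None:
    """Raise ValueError if the tensor contains NaN or Inf."""
    if not torch.isfinite(value).all():
        raise ValueError(f"{name} contains NaN or Inf")


# Usage: both calls pass silently for valid inputs.
check_indices(torch.tensor([0, 2, 4]), size=5)
check_finite(torch.tensor(0.25))
```

Calling `int(...)` on a CUDA tensor forces a host synchronization, so these checks are best enabled while debugging and removed or sampled once the data is known to be clean.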
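When rebuilding PyTorch is impractical, a lighter first step is to make CUDA kernel launches synchronous. CUDA normally reports kernel errors asynchronously, so the Python stack trace can point at a later call than the one that failed. The sketch below relies on the `CUDA_LAUNCH_BLOCKING` environment variable, which is an addition to the workarounds listed here. It assumes the variable is set before CUDA is initialized, and the helper name is hypothetical.

```python
import os


def enable_blocking_launches() -> None:
    """Make CUDA kernel launches synchronous for this process (debugging only)."""
    # Must take effect before the first CUDA call, so set it before importing torch.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_launches()
# import torch  # import and run the failing code only after the variable is set
```

Synchronous launches slow training down, so this is a debugging aid rather than a permanent setting.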
Dead Ends
Common approaches that don't work:
-
95% fail
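Catching the exception and retrying does not help. A device-side assert leaves the CUDA context unusable, so later CUDA calls in the same process keep failing until the process is restarted. Retrying also masks the root cause (e.g., an invalid index), which remains in the data.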
-
90% fail
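Changing the batch size or learning rate does not fix an illegal memory access or an out-of-bounds index: the invalid values come from the data or the model configuration, not from the optimization settings.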
-
80% fail
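Disabling CUDA and falling back to CPU permanently is not a real fix: the same invalid input usually raises an error on CPU as well (e.g., an IndexError), and training may become impractically slow. A short CPU run is still useful for diagnosis, because the CPU error message is typically more specific.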
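As an alternative to catch-and-retry, a training loop can stop at the first failure and record which batch triggered it. A minimal sketch, assuming a hypothetical `step_fn` callable and an iterable of batches (all names here are illustrative and not part of PyTorch):

```python
from typing import Any, Callable, Iterable


class BatchFailure(RuntimeError):
    """Raised when a training step fails; carries the offending batch position."""

    def __init__(self, step: int, cause: BaseException) -> None:
        super().__init__(f"step {step} failed: {cause}")
        self.step = step


def run_fail_fast(step_fn: Callable[[Any], None], batches: Iterable[Any]) -> int:
    """Run step_fn on each batch; stop at the first RuntimeError instead of retrying."""
    completed = 0
    for step, batch in enumerate(batches):
        try:
            step_fn(batch)
        except RuntimeError as err:
            # Do not retry: surface the batch position and chain the original error.
            raise BatchFailure(step, err) from err
        completed += 1
    return completed


def fake_step(batch: str) -> None:
    # Stand-in for a real training step; simulates the CUDA error on one batch.
    if batch == "bad":
        raise RuntimeError("CUDA error: device-side assert triggered")


try:
    run_fail_fast(fake_step, ["ok", "ok", "bad", "ok"])
    failed_step = None
except BatchFailure as failure:
    failed_step = failure.step
```

Knowing the failing step makes it possible to reload that batch in a fresh process and inspect its indices and values directly.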