pytorch
assertion_error
ai_generated
true
RuntimeError: CUDA error: device-side assert triggered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions
ID: pytorch/cuda-assert-triggered
Fix rate: 90%
Confidence: 88%
Evidence count: 1
First seen: 2023-03-10
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| torch 1.13.1 | active | — | — | — |
| torch 2.0.0 | active | — | — | — |
| cuda 11.7 | active | — | — | — |
| cuda 12.0 | active | — | — | — |
Root cause analysis
A CUDA kernel hit a device-side assertion, typically because of invalid input such as an out-of-bounds index or class label, or NaN values reaching an operation that validates its input. Without a PyTorch build compiled with TORCH_USE_CUDA_DSA, the detailed assertion information is suppressed.
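Because CUDA kernels are launched asynchronously, the Python stack trace attached to this error may point at a later call than the one that actually failed. The full error text PyTorch prints usually also suggests CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal sketch of enabling it from Python, assuming the variable is set before the process initializes CUDA (setting it before importing torch is the safe choice):

```python
import os

# Force synchronous kernel launches so the failing call is the one
# that raises. This must be set before CUDA is initialized, so do it
# at the very top of the script, ahead of `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
```

The same effect can be had from the shell by prefixing the launch command, e.g. CUDA_LAUNCH_BLOCKING=1 python train.py (train.py is a hypothetical script name). Synchronous launches slow execution, so this is a debugging aid rather than a permanent setting.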
Official documentation
https://pytorch.org/docs/stable/notes/cuda.html#cuda-errors
Solutions
-
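Rebuild PyTorch from source with TORCH_USE_CUDA_DSA=1 exported in the build environment so that device-side assertions report detailed messages. The build should come from a checkout of the PyTorch source repository: PyPI ships torch as prebuilt wheels, so pip install --no-binary torch is unlikely to produce a DSA-enabled build. Then rerun and check which line triggers the assertion.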
-
Add assertions in your code before the CUDA operations that consume the data. For example, check index bounds:
    assert (indices >= 0).all() and (indices < tensor.size(0)).all(), "Index out of bounds"
Also check the loss for NaN/Inf (torch.isnan alone does not catch Inf):
    assert torch.isfinite(loss).all(), "Loss contains NaN or Inf"
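The checks above can be wrapped in small helpers that report the offending values. This is a minimal sketch, not a PyTorch API: check_indices and check_finite are hypothetical names, and the [0, size) range assumes indices used for embedding lookups or class labels, where negative values are not valid.

```python
import torch


def check_indices(indices: torch.Tensor, size: int) -> None:
    """Raise ValueError if any index falls outside [0, size)."""
    if indices.numel() == 0:
        return  # nothing to validate
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= size:
        raise ValueError(
            f"index out of bounds: min={lo}, max={hi}, valid range is [0, {size})"
        )


def check_finite(name: str, value: torch.Tensor) -> None:
    """Raise ValueError if the tensor contains NaN or Inf."""
    if not torch.isfinite(value).all():
        raise ValueError(f"{name} contains NaN or Inf")


# Example usage with CPU tensors; the same calls work on CUDA tensors.
check_indices(torch.tensor([0, 3, 9]), size=10)
check_finite("loss", torch.tensor(0.25))
```

Calling int() on a CUDA tensor forces a device synchronization, so checks like these are best limited to debugging runs.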
Ineffective attempts
Common approaches that do not work:
-
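95% failure rate
Catching the exception and retrying does not help. After a device-side assert the CUDA context is left in an unrecoverable state, so later CUDA calls in the same process fail as well, and the root cause (e.g., an invalid index) is still there.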
-
90% failure rate
Increasing the batch size or changing the learning rate does not fix illegal memory accesses or index errors.
-
80% failure rate
Disabling CUDA and falling back to CPU may avoid the device-side assert, but it is not a real fix: the invalid input is still present, and CPU execution may be impractically slow.
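Although a CPU fallback is not a fix, running the failing operation once on CPU can be a useful diagnostic, because CPU kernels typically raise an ordinary Python exception with a readable message. A minimal sketch, assuming the failure comes from an out-of-range index into an embedding table (the layer sizes and ids here are illustrative, not taken from the original report):

```python
import torch

# A 10-row embedding table accepts indices 0..9 only.
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_ids = torch.tensor([1, 5, 10])  # 10 is one past the last valid row

cpu_error = None
try:
    embedding(bad_ids)  # on CPU this raises right at the failing call
except IndexError as exc:
    cpu_error = exc
    print(f"CPU error: {exc}")
```

Once the CPU run identifies the invalid input, fix the data or the layer size and return to the GPU.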