# RuntimeError: CUDA error: device-side assert triggered. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

- **ID:** `pytorch/cuda-assert-triggered`
- **Domain:** pytorch
- **Category:** assertion_error
- **Verification level:** ai_generated
- **Fix rate:** 90%

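## Root cause

A CUDA kernel performed an illegal operation (for example, an out-of-bounds index or NaN in the loss), which triggered a device-side assertion. Without a DSA-enabled build, the details of the failure are suppressed.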

## Version compatibility

| Version | Status | Introduced | Deprecated |
|------|------|------|------|
| torch 1.13.1 | active | — | — |
| torch 2.0.0 | active | — | — |
| cuda 11.7 | active | — | — |
| cuda 12.0 | active | — | — |
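
To compare an environment against the table above, the installed versions can be read from PyTorch itself. This is a minimal sketch; the helper name `runtime_versions` is hypothetical and only introduced here for illustration.

```python
import torch


def runtime_versions() -> dict:
    """Collect the torch version and the CUDA version torch was built against."""
    return {
        "torch": torch.__version__,
        # torch.version.cuda is None on CPU-only builds.
        "cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(runtime_versions())
```

Note that `torch.version.cuda` reports the CUDA version PyTorch was built with, which may differ from the toolkit or driver version installed on the machine.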

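## Solutions

1. Rebuild PyTorch from source with `TORCH_USE_CUDA_DSA=1` to get detailed error messages, then rerun and check the exact line that triggers the assertion:

   ```
   export TORCH_USE_CUDA_DSA=1
   pip install --no-cache-dir --verbose torch --no-binary torch
   ```

   The `pip` command assumes that a source distribution of `torch` is available to pip. If it is not, build from a checkout of the PyTorch source repository with the same variable exported.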
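2. Add assertions in your code before CUDA operations. For example, check index bounds and check the loss for NaN/Inf:

   ```
   import torch

   # Every index must lie in [0, tensor.size(0)).
   assert (indices >= 0).all() and (indices < tensor.size(0)).all(), "Index out of bounds"
   # torch.isfinite rejects both NaN and Inf; torch.isnan alone would miss Inf.
   assert torch.isfinite(loss).all(), "Loss contains NaN or Inf"
   ```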
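
The checks in step 2 can be wrapped in small helpers that report which values are at fault, so the problem is caught in Python before any kernel launches. The following is a sketch only: `check_indices` and `check_finite` are hypothetical names, not PyTorch APIs, and the choice to raise `ValueError` is an assumption made for illustration.

```python
import torch


def check_indices(indices: torch.Tensor, size: int, name: str = "indices") -> None:
    """Raise ValueError if any index falls outside [0, size)."""
    bad = (indices < 0) | (indices >= size)
    if bad.any():
        # Show at most five offending values to keep the message short.
        offenders = indices[bad].flatten()[:5].tolist()
        raise ValueError(
            f"{name}: {int(bad.sum())} value(s) outside [0, {size}); "
            f"first offenders: {offenders}"
        )


def check_finite(value: torch.Tensor, name: str = "loss") -> None:
    """Raise ValueError if the tensor contains NaN or Inf."""
    if not torch.isfinite(value).all():
        raise ValueError(f"{name} contains NaN or Inf")


if __name__ == "__main__":
    table = torch.zeros(3, 4)
    check_indices(torch.tensor([0, 1, 2]), table.size(0))  # passes silently
    check_finite(torch.tensor(0.25))                        # passes silently
    try:
        check_indices(torch.tensor([0, 3, -1]), table.size(0))
    except ValueError as err:
        print(err)
```

Unlike a bare `assert`, these checks still run when Python is started with `-O`. On CUDA tensors each check forces a host-device synchronization, so they are probably best enabled only while debugging.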

## Failed attempts

- Simply catching the exception and retrying may mask the root cause (e.g., an invalid index) and cause silent data corruption. (95% failure rate)
- Increasing the batch size or changing the learning rate does not fix illegal memory accesses or index errors. (90% failure rate)
- Disabling CUDA and falling back to CPU may work, but it is not a real fix and may be impractically slow. (80% failure rate)
