pytorch assertion_error ai_generated true

RuntimeError: CUDA error: device-side assert triggered. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions

ID: pytorch/cuda-assert-triggered

Fix rate: 90%
Confidence: 88%
Evidence count: 1
First seen: 2023-03-10

Version compatibility

Version        Status   Introduced   Deprecated   Notes
torch 1.13.1   active
torch 2.0.0    active
cuda 11.7      active
cuda 12.0      active

Root cause analysis

A CUDA kernel performed an illegal operation (e.g., an out-of-bounds index or a NaN in the loss) that triggered a device-side assertion, but the detailed information is suppressed unless PyTorch was built with device-side assertions (DSA) enabled.
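
To illustrate the out-of-bounds case, the following minimal sketch runs an embedding lookup on CPU, where the same kind of invalid index raises an ordinary Python exception rather than a suppressed device-side assert. The helper name `cpu_lookup_error` and the table sizes are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn


def cpu_lookup_error(num_embeddings, indices):
    """Run an embedding lookup on CPU; return the error text, or None on success.

    Hypothetical helper for illustration only.
    """
    embedding = nn.Embedding(num_embeddings, 4)  # modules live on CPU by default
    try:
        embedding(torch.tensor(indices, dtype=torch.long))
    except (IndexError, RuntimeError) as exc:
        # On CPU the failure is synchronous and the message names the problem.
        return str(exc)
    return None


print(cpu_lookup_error(10, [0, 3, 9]))   # valid indices: prints None
print(cpu_lookup_error(10, [0, 3, 10]))  # index 10 is out of range for 10 rows
```

This is a diagnostic step only. On GPU, running the script with the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the stack trace is more likely to point at the call that actually failed.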

generic

Official documentation

https://pytorch.org/docs/stable/notes/cuda.html#cuda-errors

Solutions

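  1. Rebuild PyTorch from source with TORCH_USE_CUDA_DSA=1 set in the build environment to get detailed error messages:
    export TORCH_USE_CUDA_DSA=1
    pip install --no-cache-dir --verbose torch --no-binary torch
    Note that torch is normally published on PyPI as prebuilt wheels, so the pip command above may not find a source distribution; in that case, build from a checkout of the PyTorch source repository with the same variable set.
    Then rerun the failing script and check the exact line causing the assertion.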
  2. Add assertions in your code before the CUDA operations, e.g., check index bounds:
    assert (indices >= 0).all() and (indices < tensor.size(0)).all(), "Index out of bounds"
    Also check for NaN/Inf in the loss:
    assert torch.isfinite(loss).all(), "Loss contains NaN or Inf"
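
The checks in step 2 can be wrapped in small helpers that report the offending values. This is a minimal sketch, not part of PyTorch: the names `check_index_bounds` and `check_finite` are hypothetical, and the tensors are placeholder data on CPU.

```python
import torch


def check_index_bounds(indices, size, name="indices"):
    """Raise ValueError if any index falls outside [0, size)."""
    if indices.numel() == 0:
        return
    lowest = int(indices.min())
    highest = int(indices.max())
    if lowest < 0 or highest >= size:
        raise ValueError(
            f"{name} out of bounds: min={lowest}, max={highest}, "
            f"valid range is [0, {size})"
        )


def check_finite(value, name="loss"):
    """Raise ValueError if the tensor contains NaN or Inf."""
    if not bool(torch.isfinite(value).all()):
        raise ValueError(f"{name} contains NaN or Inf")


table = torch.randn(5, 3)            # placeholder data with 5 rows
indices = torch.tensor([0, 2, 4])
check_index_bounds(indices, table.size(0))  # passes silently
rows = table[indices]
check_finite(rows.sum(), name="loss")       # passes silently
```

When the tensors live on the GPU, these checks copy values back to the host and force a synchronization, so they are best enabled only while debugging.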

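Ineffective attempts

Common but ineffective approaches:

  1. 95% failure rate

    Simply catching the exception and retrying may mask the root cause (e.g., an invalid index) and cause silent data corruption. After a device-side assert the CUDA context is no longer usable, so later CUDA calls in the same process fail as well until the process is restarted.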

  2. 90% failure rate

    Increasing the batch size or changing the learning rate does not fix illegal memory accesses or index errors.

  3. 80% failure rate

    Disabling CUDA and falling back to CPU may work in some cases, but it is not a real fix and may be impractically slow.