cudaErrorIllegalAddress (77) cuda runtime_error ai_generated partial

terminate called after throwing an instance of 'thrust::system::system_error' what(): after cudaGetLastError: an illegal memory access was encountered

ID: cuda/thrust-vector-alloc-failure

Fix rate: 80%
Confidence: 84%
Evidence count: 1
First seen: 2023-08-22

Version compatibility

Version        Status   Introduced   Deprecated   Notes
CUDA 11.8      active
CUDA 12.1      active
Thrust 1.17.2  active
Thrust 2.1.0   active

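Root cause analysis

Thrust algorithms (e.g., thrust::sort, thrust::reduce) launch CUDA kernels internally. When one of those kernels reads or writes outside a device vector's bounds, or dereferences a host pointer that was passed to a device-side operation, the GPU raises an illegal memory access and the kernel aborts. Thrust checks the CUDA error state after the launch and throws thrust::system::system_error; if the exception is not caught, the program terminates with the message above. Because kernel errors are reported asynchronously, the exception can surface at a later CUDA or Thrust call than the one that caused the fault.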

generic

Official documentation

https://thrust.github.io/doc/group__cuda.html

Solutions

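  1. Run the program under compute-sanitizer to pinpoint the exact location of the illegal access (cuda-memcheck is the older equivalent and was removed in CUDA 12). Example: compute-sanitizer --tool memcheck ./my_program. Building with nvcc -lineinfo lets the report point to source lines. Then fix the out-of-bounds index it identifies.
  2. Ensure every device vector is sized for the full range the algorithm reads and writes, and that host pointers never reach an algorithm that runs on the device. Use thrust::device_vector for device data and thrust::host_vector for host data. Example: change thrust::sort(thrust::device, host_ptr, host_ptr + N) to thrust::sort(d_vec.begin(), d_vec.end()), where d_vec is a thrust::device_vector holding a copy of the data.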

Ineffective attempts

Common but ineffective approaches:

  1. 90% failure rate

    Resetting the device (cudaDeviceReset) does not fix the root cause; the illegal access will recur on the next Thrust call.

  2. 75% failure rate

    Increasing the device vector size arbitrarily may mask the out-of-bounds access but does not guarantee correctness; the real bug is in the algorithm logic.