cuda build_error ai_generated partial

RuntimeError: Triton Error [CUDA]: PTX assembly failed: ptxas fatal : Ptx assembly aborted due to errors

ID: cuda/triton-ptx-assembly-failed

Fix rate: 75%

Confidence: 85%

Evidence count: 1

First seen: 2024-01-15

Version compatibility

Version	Status	Introduced	Deprecated	Notes
Triton 2.1.0	active	—	—	—
Triton 2.2.0	active	—	—	—
CUDA 12.1	active	—	—	—
PyTorch 2.2.0	active	—	—	—
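
To compare a local environment against the versions in the table above, a small report like the following may help. It is a minimal sketch that assumes Triton and PyTorch were installed with pip under the distribution names `triton` and `torch`, and that a `ptxas` binary may or may not be on `PATH`. Triton wheels generally bundle their own `ptxas`, so the binary found on `PATH` is not necessarily the one Triton invokes. The helper names are hypothetical.

```python
import shutil
import subprocess
from importlib import metadata
from typing import Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def ptxas_version_banner() -> Optional[str]:
    """Return the output of `ptxas --version`, or None if ptxas is not on PATH."""
    ptxas = shutil.which("ptxas")
    if ptxas is None:
        return None
    result = subprocess.run(
        [ptxas, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


def environment_report() -> Dict[str, Optional[str]]:
    """Collect the versions that matter for the compatibility table."""
    return {
        "triton": installed_version("triton"),
        "torch": installed_version("torch"),
        "ptxas": ptxas_version_banner(),
    }


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value if value is not None else 'not found'}")
```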

Root cause analysis

The Triton JIT compiler generated PTX that ptxas could not assemble. Common causes are register pressure beyond what ptxas can resolve by spilling, PTX instructions that the target architecture does not support, or a bug in the Triton compiler's IR generation.

generic
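
The generic "aborted due to errors" line rarely says which of these causes applies. One way to narrow it down is to run `ptxas` by hand on the PTX file and read its verbose output. The sketch below assumes a PTX file is already available on disk (the path `kernel.ptx` and the architecture `sm_80` are placeholders) and that `ptxas` is on `PATH`. The function names are hypothetical, and the parser only handles the usual `ptxas -v` resource summary lines.

```python
import os
import re
import shutil
import subprocess
from typing import Dict, List, Optional

# Patterns for the resource summary lines that `ptxas -v` normally prints.
REGS_RE = re.compile(r"Used (\d+) registers")
SPILL_RE = re.compile(
    r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads"
)


def build_ptxas_command(ptx_path: str, arch: str, ptxas: str = "ptxas") -> List[str]:
    """Build a verbose ptxas invocation that discards the assembled output."""
    return [ptxas, "-v", f"--gpu-name={arch}", ptx_path, "-o", os.devnull]


def summarize_ptxas_log(log: str) -> Dict[str, object]:
    """Extract register usage, spill sizes, and error lines from ptxas output."""
    registers = [int(n) for n in REGS_RE.findall(log)]
    spills = SPILL_RE.findall(log)
    errors = [
        line.strip()
        for line in log.splitlines()
        if line.lstrip().startswith("ptxas") and ("error" in line or "fatal" in line)
    ]
    return {
        "max_registers": max(registers, default=0),
        "spill_store_bytes": sum(int(stores) for _, stores, _ in spills),
        "spill_load_bytes": sum(int(loads) for _, _, loads in spills),
        "errors": errors,
    }


def reassemble(ptx_path: str, arch: str) -> Optional[Dict[str, object]]:
    """Run ptxas on a PTX file and summarize the result; None if ptxas is missing."""
    ptxas = shutil.which("ptxas")
    if ptxas is None:
        return None
    result = subprocess.run(
        build_ptxas_command(ptx_path, arch, ptxas),
        capture_output=True,
        text=True,
        check=False,
    )
    return summarize_ptxas_log(result.stdout + result.stderr)


if __name__ == "__main__":
    # Placeholder inputs: substitute the real PTX file and GPU architecture.
    print(reassemble("kernel.ptx", "sm_80"))
```

Error lines that name an unsupported feature or target point toward the architecture cause, while large spill sizes with no other errors point toward register pressure.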

Official documentation

https://triton-lang.org/main/getting-started/troubleshooting.html

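Solutions

Simplify the Triton kernel by reducing the number of operations each program performs, in particular by avoiding large loops and heavy use of `tl.where` and `tl.sum`. If that is not enough, split the kernel into several smaller kernels and chain them by hand instead of relying on a single fused kernel.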

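Set the environment variable `TRITON_MAX_REGISTERS=0` to disable the register allocation hint so that ptxas manages registers on its own, which can reduce spilling. For example, run `export TRITON_MAX_REGISTERS=0` before launching the script. Check that the installed Triton version actually reads this variable before relying on it, since an unrecognized environment variable is silently ignored.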

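Upgrade Triton to the latest pre-release build (`pip install -U --pre triton`), which may include fixes for PTX generation bugs. If you use PyTorch, make sure the installed Triton version is one that your PyTorch build is compatible with.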
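
Before or after upgrading, it can be useful to check whether the installed Triton matches what the installed PyTorch declares as its dependency. The following is a rough sketch that assumes both packages were installed with pip and that PyTorch lists Triton as an exact `==` pin in its package metadata, which is common for Linux wheels but not guaranteed. The helper names are hypothetical.

```python
import re
from importlib import metadata
from typing import Iterable, Optional

# Matches requirement strings such as 'triton==2.2.0; platform_system == "Linux"'
# or the older 'triton (==2.0.0) ; ...' form. Also accepts 'pytorch-triton'.
TRITON_REQ_RE = re.compile(r"^\s*(?:pytorch-)?triton(?![\w.-])\s*\(?\s*([^;)]*)")


def triton_requirement(requirements: Iterable[str]) -> Optional[str]:
    """Return the version specifier declared for triton, or None if not listed."""
    for requirement in requirements:
        match = TRITON_REQ_RE.match(requirement)
        if match:
            return match.group(1).strip()
    return None


def pin_satisfied(specifier: str, installed: str) -> Optional[bool]:
    """Compare an exact '==' pin with an installed version.

    Returns None when the specifier is not an exact pin, because this sketch
    does not evaluate ranges such as '>=2.0'.
    """
    if not specifier.startswith("==") or specifier.startswith("==="):
        return None
    return installed == specifier[2:].strip()


if __name__ == "__main__":
    try:
        declared = triton_requirement(metadata.requires("torch") or [])
        installed = metadata.version("triton")
    except metadata.PackageNotFoundError as exc:
        print(f"package not installed: {exc}")
    else:
        print(f"torch declares triton {declared!r}, installed triton is {installed}")
        if declared is not None:
            print(f"exact pin satisfied: {pin_satisfied(declared, installed)}")
```

A mismatch here does not prove that the PTX failure comes from the version gap, but it is a cheap thing to rule out before changing the kernel itself.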

Ineffective attempts

Common but ineffective approaches:

Reinstalling Triton from source without changing compiler flags (95% failure rate)
The error does not come from a missing or broken Triton installation but from the PTX generated for one specific kernel; reinstalling leaves the kernel code unchanged.
Setting `TRITON_PTXAS_PATH` to a ptxas binary from a newer CUDA version (70% failure rate)
A newer ptxas may support more instructions, but the root cause is often register spilling or an IR bug, so it may still fail on the same PTX.
Reducing the number of blocks per grid arbitrarily (90% failure rate)
The error occurs during PTX assembly, not at kernel launch; changing the grid size does not change the generated PTX.