Tags: cudaErrorStreamOrderViolation, cuda, runtime_error, ai_generated: true

CUDA error: stream-order violation during graph launch (cudaErrorStreamOrderViolation)

ID: cuda/stream-order-violation-cuda-graph

Fix Rate: 85%
Confidence: 88%
Evidence: 1
First Seen: 2024-01-10

Version Compatibility

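| Version     | Status | Introduced | Deprecated | Notes |
|-------------|--------|------------|------------|-------|
| CUDA 11.7   | active |            |            |       |
| CUDA 12.2   | active |            |            |       |
| PyTorch 2.1 | active |            |            |       |
| PyTorch 2.3 | active |            |            |       |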

Root Cause

A CUDA graph is launched on a stream that still has pending operations from a different stream or graph, which violates the implicit ordering constraints that apply when CUDA graph capture is used.

Official Documentation

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graph-stream-order

Workarounds

  1. 85% success: Make sure all pending work on the target stream has completed before launching the graph. Call `torch.cuda.synchronize()` (device-wide) or a stream-level primitive such as `cudaStreamSynchronize` before `cudaGraphLaunch`.
  2. 90% success: Re-capture the graph on a dedicated stream that is not used for any other operations, so that no cross-stream dependencies exist.
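
The two workarounds above can be combined in PyTorch roughly as follows. This is a minimal sketch, not a verified fix for this specific error: it assumes PyTorch 2.x with a CUDA device, the function name `capture_and_replay` is hypothetical, and the arithmetic `x * 2 + 1` is a stand-in for a real workload. It captures the graph on a stream reserved for capture, then drains the launch stream before replaying.

```python
import torch


def capture_and_replay(n: int = 8):
    """Capture y = 2 * x + 1 on a dedicated stream, then replay after a sync.

    Hypothetical helper for illustration. Returns the replayed output as a
    CPU tensor, or None when no CUDA device is available.
    """
    if not torch.cuda.is_available():
        return None

    # Static input buffer: the graph always reads from this memory.
    static_x = torch.zeros(n, device="cuda")

    # Workaround 2: a stream used only for warm-up and capture.
    capture_stream = torch.cuda.Stream()
    capture_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(capture_stream):
        for _ in range(3):  # warm-up iterations before capture
            static_y = static_x * 2 + 1
    torch.cuda.current_stream().wait_stream(capture_stream)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph, stream=capture_stream):
        static_y = static_x * 2 + 1  # recorded into the graph, not run eagerly

    # Fill the static input with new data on the current (launch) stream.
    static_x.copy_(torch.arange(n, device="cuda", dtype=torch.float32))

    # Workaround 1: wait for pending work on the launch stream before replay.
    # torch.cuda.synchronize() would wait on the whole device instead.
    torch.cuda.current_stream().synchronize()
    graph.replay()

    torch.cuda.synchronize()  # wait for the replay before reading the result
    return static_y.cpu()


if __name__ == "__main__":
    print(capture_and_replay(4))
```

Synchronizing only the launch stream keeps the wait narrower than a device-wide `torch.cuda.synchronize()`; the device-wide call is the blunter option if it is unclear which stream still has pending work.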

Dead Ends

Common approaches that don't work:

  1. 90% fail: Adding more workers. The error is about stream synchronization, not parallelism; adding workers can introduce more streams and worsen the violation.
  2. 60% fail: Disabling CUDA graphs. This removes the performance benefit of graphs but does not fix the underlying stream management; the error may reappear if graphs are re-enabled.