cudaErrorStreamCaptureInvalidated cuda runtime_error ai_generated true

RuntimeError: CUDA error: operation not permitted when stream is capturing (streamCaptureInvalidated)

ID: cuda/stream-capture-invalid-scope

Fix Rate: 81%
Confidence: 87%
Evidence: 1
First Seen: 2024-09-05

Version Compatibility

Version                          Status   Introduced   Deprecated   Notes
CUDA 12.0                        active
PyTorch 2.1.0                    active
NVIDIA Driver 535.129.03         active

Root Cause

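A CUDA graph capture is in progress on a stream, and an operation that is not permitted during capture (for example, a memory allocation or a host-side synchronization) was issued. The illegal call invalidates the capture: ending the capture then returns an error instead of a graph, and the work must be captured again from the start.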

generic


Official Documentation

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html

Workarounds

  1. 88% success Move all memory allocations and host-device synchronization outside the capture scope. For example, pre-allocate tensors before calling capture_begin() on a torch.cuda.CUDAGraph instance (or before entering the torch.cuda.graph context manager), and call torch.cuda.synchronize() only after capture ends.
  2. 80% success Call cudaStreamBeginCapture with a less restrictive capture mode. cudaStreamCaptureModeGlobal is the strictest mode, not the most permissive one: cudaStreamCaptureModeThreadLocal only restricts the capturing thread, and cudaStreamCaptureModeRelaxed lifts the restriction on potentially unsafe calls such as cudaMalloc. Synchronizing the capturing stream remains illegal in every mode, so ensure no host-side blocking calls occur during capture. In PyTorch, defer printing of CUDA tensor values (which forces a device-to-host copy) and similar blocking calls until after the capture block exits.
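
The first workaround can be sketched in PyTorch roughly as follows. This is a minimal illustration, not a definitive implementation: the function name capture_and_replay and the toy computation x * 2 + 1 are hypothetical, and the warm-up on a side stream follows the pattern commonly recommended for torch.cuda.graph. The function returns None when PyTorch or a CUDA device is not available, so it can be run safely on any machine.

```python
import importlib.util


def capture_and_replay():
    """Capture a tiny CUDA graph and replay it once.

    Returns the replayed output as a Python list, or None when PyTorch
    or a CUDA device is unavailable (hypothetical helper for illustration).
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None

    # Allocate the static input BEFORE capture begins.
    static_x = torch.zeros(4, device="cuda")

    # Warm up on a side stream so lazy initialization happens outside capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            static_y = static_x * 2 + 1
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        # Only device work here: no .item(), .cpu(), print of CUDA tensors,
        # or torch.cuda.synchronize() inside the capture block.
        static_y = static_x * 2 + 1

    # After capture: refill the static input in place, replay, then sync.
    static_x.copy_(torch.arange(4.0, device="cuda"))
    graph.replay()
    torch.cuda.synchronize()  # host-side sync is safe once capture has ended
    return static_y.tolist()


if __name__ == "__main__":
    print(capture_and_replay())
```

The key design point is that the input buffer is allocated once and refilled with copy_(), so each replay reuses the same memory addresses that were recorded during capture.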

Dead Ends

Common approaches that don't work:

  1. 92% fail This disables cuDNN heuristics but does not fix the capture violation; the error will recur if capture is attempted again.
  2. 98% fail Thread configuration is unrelated to capture validity; the error is about which operations are allowed during capture.