cudaErrorGraphUpdateViolation cuda runtime_error ai_generated true

RuntimeError: CUDA error: the graph update was not performed because it included changes which violated constraints specific to instantiated CUDA graphs (cudaErrorGraphUpdateViolation) - memory pool mismatch

ID: cuda/cudagraph-memory-pool-mismatch

Fix Rate: 80%
Confidence: 86%
Evidence: 1
First Seen: 2024-03-10

Version Compatibility

Version | Status | Introduced | Deprecated | Notes
CUDA 12.0 | active | - | - | -
CUDA 12.3 | active | - | - | -
PyTorch 2.3.0 | active | - | - | -
NVIDIA Driver 545.23 | active | - | - | -

Root Cause

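The error occurs when an executable CUDA graph that was instantiated via `cudaGraphInstantiate` with `cudaGraphInstantiateFlagAutoFreeOnLaunch` is updated in place (in the CUDA runtime, in-place updates go through `cudaGraphExecUpdate`; `cudaGraphInstantiate` itself only creates a new executable graph). If the new graph's nodes reference a different memory pool than the one used at the original instantiation, the update is rejected: the memory pool of an instantiated graph is fixed, so an update cannot switch pools.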
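
To make the pool constraint concrete in PyTorch terms, here is a minimal sketch in which two captures are tied to one pool from the start, so neither graph ever needs to move to a different pool later. It assumes PyTorch with a CUDA device; the function name `capture_with_shared_pool` and the toy arithmetic are hypothetical and only for illustration. The pool token comes from `torch.cuda.graph_pool_handle()` and is passed through the `pool=` argument of `torch.cuda.graph`.

```python
try:
    import torch
except ImportError:  # allows importing this sketch where PyTorch is absent
    torch = None


def capture_with_shared_pool():
    """Capture two graphs that allocate from one pool; return the final values."""
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    x = torch.ones(4, device="cuda")  # static input, allocated before capture

    # Warm up on a side stream so one-time initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        _ = x * 2 + 1
    torch.cuda.current_stream().wait_stream(side)

    pool = torch.cuda.graph_pool_handle()  # one token shared by both captures

    g1 = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g1, pool=pool):
        y = x * 2  # y is allocated from the shared pool

    g2 = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g2, pool=pool):
        z = y + 1  # z is allocated from the same pool

    # Graphs that share a pool are replayed in the order they were captured.
    g1.replay()
    g2.replay()
    torch.cuda.synchronize()
    return z.tolist()
```

Because the pool is chosen at capture time, the design choice here is to decide on the pool once and reuse the same token for every capture that must share memory, rather than trying to change it afterwards.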

generic

Official Documentation

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1a5b9a2b8c3f4e5d6a7b8c9d0e1f2a3b

Workarounds

  1. (85% success) Ensure that all allocations made during graph capture come from the same memory pool. In PyTorch, obtain a pool token with `torch.cuda.graph_pool_handle()` (or `graph.pool()` from an earlier capture) and pass it as `pool=` to every capture that must share memory. At the CUDA runtime level, pools are created with `cudaMemPoolCreate` and configured with `cudaMemPoolSetAttribute`. Alternatively, avoid `cudaGraphInstantiateFlagAutoFreeOnLaunch` and manage graph memory manually.
  2. (90% success) Instead of updating the graph, capture a new graph each time the memory pool changes. Use `torch.cuda.CUDAGraph` and call `graph.replay()` only if the input shapes and memory pools are unchanged. If they change, recapture by calling `capture_begin()` on a fresh `torch.cuda.CUDAGraph` (or by using the `torch.cuda.graph` context manager).
  3. (75% success) Set the environment variable `CUDA_GRAPH_DEBUG=1`, which is reported to enable verbose logging from the CUDA graph runtime, printing the memory pool addresses and helping identify which node causes the mismatch. If no extra output appears on your CUDA version, `cudaGraphDebugDotPrint` can dump the graph's nodes to a DOT file for inspection.
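
The recapture approach in workaround 2 can be sketched as a small cache that keeps one captured graph per input shape and captures a new graph whenever an unseen shape arrives, instead of updating an existing one. This is a hedged illustration, not PyTorch's own mechanism: the class `ShapeKeyedGraphs` and the function `demo` are hypothetical names, and the sketch assumes PyTorch with a CUDA device. Each capture uses its own private pool, which is the default when no `pool=` argument is given.

```python
try:
    import torch
except ImportError:  # allows importing this sketch where PyTorch is absent
    torch = None


class ShapeKeyedGraphs:
    """Hypothetical helper: one captured CUDA graph per input shape."""

    def __init__(self, fn):
        self.fn = fn
        self.entries = {}  # shape -> (graph, static_input, static_output)

    def _capture(self, x):
        static_in = x.clone()  # the graph always reads from this buffer
        # Warm up on a side stream before capturing.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            self.fn(static_in)
        torch.cuda.current_stream().wait_stream(side)
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_out = self.fn(static_in)
        return graph, static_in, static_out

    def __call__(self, x):
        key = tuple(x.shape)
        if key not in self.entries:
            # Unseen shape: capture a fresh graph rather than updating one.
            self.entries[key] = self._capture(x)
        graph, static_in, static_out = self.entries[key]
        static_in.copy_(x)  # refresh the static input, then replay
        graph.replay()
        return static_out.clone()


def demo():
    if torch is None or not torch.cuda.is_available():
        return None  # requires a CUDA device
    runner = ShapeKeyedGraphs(lambda t: t * 2 + 1)
    a = runner(torch.ones(4, device="cuda"))           # captures graph 1
    runner(torch.ones(8, device="cuda"))               # new shape: graph 2
    c = runner(torch.full((4,), 3.0, device="cuda"))   # replays graph 1
    return len(runner.entries), a.tolist(), c.tolist()
```

Calling `demo()` on a CUDA machine captures two graphs for the two distinct shapes and replays the first one for the third call. The trade-off is memory: every cached shape keeps its own static buffers alive, so a real implementation would likely bound the cache size.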

Dead Ends

Common approaches that don't work:

  1. Calling `torch.cuda.empty_cache()` before graph capture to free memory (90% fail)

    Emptying the cache releases unused cached blocks but does not change which memory pool a capture allocates from; the graph still captures from the default pool, and the update still fails if the pool changes.

  2. Using `cudaGraphInstantiate` without the `cudaGraphInstantiateFlagAutoFreeOnLaunch` flag (70% fail)

    While this avoids the pool mismatch error, it disables automatic freeing of graph allocations on relaunch, so memory must be managed manually; otherwise it may lead to memory leaks or performance degradation.

  3. Rebuilding the entire graph from scratch on every call instead of updating (60% fail)

    Rebuilding works but is inefficient; the error is about update constraints, not about graph creation itself. Recapture only when the memory pool or input shapes actually change (see workaround 2).