RuntimeError: CUDA error: the graph update was not performed because it included changes which violated constraints specific to instantiated CUDA graphs (cudaErrorGraphUpdateViolation) - memory pool mismatch
ID: cuda/cudagraph-memory-pool-mismatch
Version Compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| CUDA 12.0 | active | — | — | — |
| CUDA 12.3 | active | — | — | — |
| PyTorch 2.3.0 | active | — | — | — |
| NVIDIA Driver 545.23 | active | — | — | — |
Root Cause
When an instantiated CUDA graph that was created with `cudaGraphInstantiate` and `cudaGraphInstantiateFlagAutoFreeOnLaunch` is updated, the update fails if the new graph's nodes reference a different memory pool than the one used at the original instantiation. The graph's memory pool is fixed after instantiation, so a pool change cannot be applied as an in-place update.
Official Documentation
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__GRAPH.html#group__CUDART__GRAPH_1g1a5b9a2b8c3f4e5d6a7b8c9d0e1f2a3b
Workarounds
-
85% success
Ensure that all tensors used in the graph are allocated from the same memory pool: set up one pool before graph capture and allocate every captured tensor from it. At the CUDA runtime level, pools are created with `cudaMemPoolCreate` and tuned with `cudaMemPoolSetAttribute`; in PyTorch, `torch.cuda.graph_pool_handle()` returns a pool handle that can be passed as the `pool` argument of `torch.cuda.graph` so that captures share it. Alternatively, avoid using `cudaGraphInstantiateFlagAutoFreeOnLaunch` and manage memory manually.
-
90% success
Instead of updating the graph, capture a new graph each time the memory pool changes. Use `torch.cuda.CUDAGraph` and call `graph.replay()` only if the input shapes and memory pools are unchanged. If either changes, create a new `torch.cuda.CUDAGraph` and capture again (with `capture_begin()` and `capture_end()`, or the `torch.cuda.graph` context manager) rather than reusing the old instantiation.
-
75% success
Set the environment variable `CUDA_GRAPH_DEBUG=1` to enable verbose logging from the CUDA graph runtime, which prints the memory pool addresses and helps identify which node causes the mismatch.
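The first two workarounds can be combined into a "replay or recapture" policy. The sketch below is illustrative only: `ReplayOrRecapture`, `make_torch_capture`, and the `(shape, dtype)` key are hypothetical names and choices, not part of PyTorch or CUDA. The policy class is plain Python. The PyTorch wiring assumes a CUDA device and uses `torch.cuda.graph_pool_handle()` so that every capture draws from one pool.

```python
from typing import Any, Callable, Hashable, Optional


class ReplayOrRecapture:
    """Replay while the capture key is unchanged; recapture when it changes."""

    def __init__(self, capture: Callable[[Hashable], Callable[..., Any]]) -> None:
        self._capture = capture  # builds a replay callable for a given key
        self._key: Optional[Hashable] = None
        self._replay: Optional[Callable[..., Any]] = None
        self.captures = 0  # number of captures performed so far

    def run(self, key: Hashable, *args: Any) -> Any:
        if self._replay is None or key != self._key:
            # Shapes (or pool) changed: capture a fresh graph instead of updating.
            self._replay = self._capture(key)
            self._key = key
            self.captures += 1
        return self._replay(*args)


def make_torch_capture(fn: Callable[[Any], Any]) -> Callable[[Hashable], Callable[[Any], Any]]:
    """Hypothetical PyTorch wiring; the key is assumed to be (shape, dtype)."""
    import torch  # imported lazily so the policy class works without PyTorch

    pool = torch.cuda.graph_pool_handle()  # one pool shared by every capture

    def capture(key):
        shape, dtype = key
        static_in = torch.zeros(shape, dtype=dtype, device="cuda")

        # Warm up on a side stream before capturing.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            fn(static_in)
        torch.cuda.current_stream().wait_stream(side)

        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph, pool=pool):
            static_out = fn(static_in)

        def replay(x):
            static_in.copy_(x)  # a graph replays on fixed input addresses
            graph.replay()
            return static_out  # overwritten by the next replay

        return replay

    return capture
```

With PyTorch, a call might look like `runner = ReplayOrRecapture(make_torch_capture(model))` followed by `runner.run((tuple(x.shape), x.dtype), x)`. Keeping the policy separate from the capture function is a design choice of this sketch: it lets the recapture logic be exercised without a GPU.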
Dead Ends
Common approaches that don't work:
-
Calling `torch.cuda.empty_cache()` before graph capture to free memory
90% fail
Emptying the cache does not change the memory pool assignment; the graph will still capture from the default pool, and the update will still fail if the pool changes.
-
Using `cudaGraphInstantiate` without the `AutoFreeOnLaunch` flag
70% fail
While this avoids the pool mismatch error, it disables automatic memory management and may lead to memory leaks or performance degradation.
-
Rebuilding the entire graph from scratch instead of updating
60% fail
Rebuilding works but is inefficient; the error is about update constraints, not about graph creation itself.