RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED when calling cudnnRNNBackwardData_v8 with training mode enabled and double backward
ID: cuda/cudnn-rnn-double-backward
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| cuDNN 8.9.0 | active | — | — | — |
| cuDNN 8.9.5 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
| PyTorch 2.2.0 | active | — | — | — |
Root cause analysis
cuDNN RNN backward operations, in particular the backward data routine when it is reached through double backward, are not supported for certain RNN modes (for example, LSTM with projection), or when the input tensor requires grad and the graph is retained. cuDNN v8 restricts double backward support to specific configurations.
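The double backward pattern described above usually appears in gradient-penalty style code. The following is a minimal sketch of that pattern, not a guaranteed reproduction: the helper name `gradient_penalty`, the tensor shapes, and the layer sizes are assumptions chosen for illustration, and whether the second backward pass succeeds depends on the device and backend in use.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"


def gradient_penalty(rnn: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Squared norm of d(sum(output))/dx, kept differentiable (hypothetical helper)."""
    out, _ = rnn(x)
    # create_graph=True records the first backward pass so it can be
    # differentiated again; this is what makes the next backward "double".
    (grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
    return grad_x.pow(2).sum()


rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)
rnn.train()
x = torch.randn(4, 5, 8, device=device, requires_grad=True)

try:
    penalty = gradient_penalty(rnn, x)
    penalty.backward()  # second backward pass through the RNN backward
    status = "ok"
except RuntimeError as err:
    # On backends whose RNN backward is not itself differentiable,
    # the failure surfaces here as a RuntimeError.
    status = f"failed: {err}"

print(status)
```

The try/except only records the outcome; it does not work around the limitation.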
Official documentation
https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnRNNBackwardData
Solutions
- Switch to a non-projected LSTM (that is, drop the proj_size argument) or use a GRU instead, which has broader double backward support. Example: change nn.LSTM(input_size, hidden_size, proj_size=hidden_size // 2) to nn.LSTM(input_size, hidden_size). PyTorch requires proj_size to be smaller than hidden_size, and removing it changes the output feature size from proj_size to hidden_size, so downstream layers must be adjusted.
- Compute the first-order gradients with torch.autograd.grad and create_graph=False, and implement the double backward step manually in a torch.autograd.Function whose custom backward does not rely on the cuDNN RNN backward data routine.
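One way to avoid the cuDNN RNN backward data routine entirely is to unroll the recurrence with ordinary tensor operations, so that autograd differentiates every step itself. The sketch below takes that simpler route rather than a custom torch.autograd.Function; the class name `ManualLSTM`, its layer names, and the shapes are hypothetical, it omits initial states, multiple layers, and bidirectionality, and it will typically be slower than the fused kernel.

```python
import torch
import torch.nn as nn


class ManualLSTM(nn.Module):
    """Hypothetical single-layer LSTM unrolled with elementary ops only."""

    def __init__(self, input_size: int, hidden_size: int) -> None:
        super().__init__()
        self.hidden_size = hidden_size
        # Input and recurrent projections for the four gates
        # (input, forget, cell candidate, output).
        self.x2g = nn.Linear(input_size, 4 * hidden_size)
        self.h2g = nn.Linear(hidden_size, 4 * hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, time, input_size); states start at zero.
        batch = x.size(0)
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)
        outputs = []
        for t in range(x.size(1)):
            gates = self.x2g(x[:, t]) + self.h2g(h)
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs, dim=1)  # (batch, time, hidden_size)


torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = ManualLSTM(input_size=8, hidden_size=16).to(device)
x = torch.randn(4, 5, 8, device=device, requires_grad=True)

out = model(x)
(grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
penalty = grad_x.pow(2).sum()
penalty.backward()  # double backward through plain autograd ops
```

After the final call, every parameter of `model` holds a gradient of the penalty term, which is the quantity a gradient-penalty loss needs.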
Ineffective attempts
Common approaches that do not work:
- 80% failure rate. Upgrading cuDNN does not add double backward support for all RNN modes; the limitation is architectural in cuDNN v8.
- 70% failure rate. Setting torch.backends.cudnn.enabled=False forces a fallback to the non-cuDNN RNN implementation, which may cause a performance regression or different numerical behavior; double backward still fails if the custom RNN does not support it.
- 90% failure rate. Using retain_graph=True without detaching intermediate activations does not prevent the error; the double backward path still triggers the unsupported cuDNN routine.
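Since the attempts above fail for some configurations and not others, it can help to test a specific module before committing to a training run. The probe below is a rough sketch under stated assumptions: `supports_double_backward` is a hypothetical helper, the shapes are arbitrary, and it relies on the `torch.backends.cudnn.flags` context manager to disable cuDNN only for the duration of one call instead of globally.

```python
import contextlib

import torch
import torch.nn as nn


def supports_double_backward(module: nn.Module, x: torch.Tensor, use_cudnn: bool = True) -> bool:
    """Hypothetical probe: True if grad-of-grad runs without a RuntimeError."""
    x = x.detach().clone().requires_grad_(True)
    # Only touch the cuDNN flags when asked to disable cuDNN for this call.
    ctx = contextlib.nullcontext() if use_cudnn else torch.backends.cudnn.flags(enabled=False)
    try:
        with ctx:
            out = module(x)
            out = out[0] if isinstance(out, tuple) else out  # RNNs return (output, state)
            (grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
            grad_x.pow(2).sum().backward()
        return True
    except RuntimeError:
        return False
    finally:
        module.zero_grad(set_to_none=True)  # discard gradients left by the probe


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 5, 8, device=device)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True).to(device)

report = {
    "lstm, cuDNN allowed": supports_double_backward(lstm, x, use_cudnn=True),
    "lstm, cuDNN disabled": supports_double_backward(lstm, x, use_cudnn=False),
}
print(report)
```

The result is specific to the installed PyTorch and cuDNN builds, so the probe is best run in the same environment as the training job.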