RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM when calling cudnnBatchNormalizationForwardTraining with epsilon < 0
ID: cuda/cudnn-bn-epsilon-negative
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| cuDNN 8.9.0 | active | — | — | — |
| cuDNN 9.0.0 | active | — | — | — |
| PyTorch 2.0.0 | active | — | — | — |
| PyTorch 2.1.0 | active | — | — | — |
Root cause analysis
cuDNN batch normalization routines require epsilon >= 0 (typically a small positive value such as 1e-5). Epsilon is added to the variance before the square root is taken, so a negative value can make var + epsilon negative and leave the normalization undefined. cuDNN therefore rejects a negative epsilon with CUDNN_STATUS_BAD_PARAM.
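The effect of a negative epsilon can be reproduced without a GPU. The following is a minimal pure-Python sketch of the normalization step only (no scale, shift, or running statistics); the function name batch_norm_1d is hypothetical and is not part of cuDNN or PyTorch.

```python
import math

def batch_norm_1d(values, epsilon):
    """Normalize a list of floats with the batch-norm formula (no scale or shift)."""
    mean = sum(values) / len(values)
    # Biased (population) variance of the batch.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # math.sqrt raises ValueError when var + epsilon is negative.
    denom = math.sqrt(var + epsilon)
    return [(v - mean) / denom for v in values]

batch = [1.0, 1.0, 1.0, 1.0]  # constant batch, so the variance is 0

# Positive epsilon: the denominator is sqrt(1e-5), every output is 0.0.
print(batch_norm_1d(batch, 1e-5))

# Negative epsilon: var + epsilon = -1e-5, so the square root is undefined.
try:
    batch_norm_1d(batch, -1e-5)
except ValueError as exc:
    print("negative epsilon rejected:", exc)
```

Here the failure surfaces as a Python ValueError; cuDNN instead checks the parameter up front and returns CUDNN_STATUS_BAD_PARAM.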
Official documentation
https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnBatchNormalizationForwardTraining
Solutions
- Ensure epsilon is a small positive float, typically 1e-5. Example: if (epsilon < 0) epsilon = 1e-5;
- Add a validation check before the cuDNN call that clamps epsilon to a minimum positive value. Example: epsilon = max(epsilon, 1e-7);
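The two fixes above can be combined into one helper. This is a sketch under stated assumptions: sanitize_epsilon, DEFAULT_EPSILON, MIN_EPSILON, and the strict flag are hypothetical names, and the constants simply reuse the 1e-5 and 1e-7 values from the examples above.

```python
import math

DEFAULT_EPSILON = 1e-5  # replacement for negative values (assumed default)
MIN_EPSILON = 1e-7      # illustrative floor for non-negative values

def sanitize_epsilon(epsilon, strict=False):
    """Return an epsilon that is safe to pass to a batch-norm call.

    With strict=True a negative value raises instead of being repaired,
    which makes the source of the bad value easier to track down.
    """
    if math.isnan(epsilon):
        raise ValueError("epsilon is NaN")
    if epsilon < 0:
        if strict:
            raise ValueError(f"epsilon must be >= 0, got {epsilon}")
        return DEFAULT_EPSILON
    # Non-negative values are clamped up to the floor.
    return max(float(epsilon), MIN_EPSILON)

print(sanitize_epsilon(-0.5))   # negative, replaced by the default
print(sanitize_epsilon(0.0))    # zero, clamped up to the floor
print(sanitize_epsilon(1e-3))   # already valid, returned unchanged
```

In PyTorch the sanitized value would typically be passed as the eps argument of a layer such as torch.nn.BatchNorm2d. Silent repair hides configuration bugs, so strict=True may be the better choice during development.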
Ineffective attempts
Common but ineffective approaches:
- 70% failure rate: Setting epsilon to a very large value (e.g., 1.0) avoids the cuDNN error but distorts the normalization. When the variance is small relative to 1.0, the divisor sqrt(var + 1.0) is close to 1, so activations are barely rescaled and training accuracy suffers. Because cuDNN no longer reports an error, this masks the real issue.
- 60% failure rate: Disabling cuDNN (torch.backends.cudnn.enabled = False) turns cuDNN off globally and forces a fallback to PyTorch's own batch normalization implementation, which may accept a negative epsilon but produces incorrect gradients.
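The large-epsilon item above can be checked with a few lines of arithmetic. This sketch compares the factor 1 / sqrt(var + epsilon) for a small and a large epsilon; the variance of 0.01 is an assumed illustrative value, and bn_scale is a hypothetical helper name.

```python
import math

def bn_scale(var, epsilon):
    """Factor by which batch norm multiplies a centered activation."""
    return 1.0 / math.sqrt(var + epsilon)

var = 0.01  # hypothetical per-channel variance

small = bn_scale(var, 1e-5)  # divisor dominated by the variance
large = bn_scale(var, 1.0)   # divisor dominated by epsilon

print(f"epsilon=1e-5: scale = {small:.3f}")
print(f"epsilon=1.0:  scale = {large:.3f}")
```

With the small epsilon the activations are rescaled by roughly a factor of 10, bringing them to about unit variance; with epsilon = 1.0 the factor is close to 1, so the layer leaves the low-variance channel almost unnormalized.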