CUDNN_STATUS_BAD_PARAM cuda runtime_error ai_generated true

RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM when calling cudnnBatchNormalizationForwardTraining with epsilon=1e-06

ID: cuda/cudnn-bn-epsilon-too-small

92% Fix Rate

84% Confidence

1 Evidence

2023-11-05 First Seen

Version Compatibility

Version	Status	Introduced	Deprecated	Notes
cuDNN 8.9.5	active	—	—	—
cuDNN 9.0	active	—	—	—
PyTorch 2.0	active	—	—	—
PyTorch 2.1	active	—	—	—

Root Cause

cuDNN batch normalization requires epsilon to be at least 1e-5, and some data types such as float16 need a higher value, to avoid numerical instability. An epsilon of 1e-6 falls below that minimum, so cudnnBatchNormalizationForwardTraining rejects the call with CUDNN_STATUS_BAD_PARAM.
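
The rule above can be sketched as a small pre-flight check that runs before a layer is built. This is a minimal illustration, not part of cuDNN or PyTorch: the names `MIN_BN_EPS` and `check_bn_eps` are hypothetical, and the thresholds are the ones quoted in this entry (1e-5 in general, 1e-4 as the suggested float16 value), not values read from the installed cuDNN build.

```python
# Thresholds as reported in this entry; they are assumptions, not values
# queried from the installed cuDNN build.
MIN_BN_EPS = {"float32": 1e-5, "float16": 1e-4}


def check_bn_eps(eps: float, dtype: str = "float32") -> float:
    """Return eps unchanged if it meets the threshold, else the threshold."""
    floor = MIN_BN_EPS.get(dtype, 1e-5)
    return eps if eps >= floor else floor


print(check_bn_eps(1e-6))             # raised to the float32 floor
print(check_bn_eps(1e-5, "float16"))  # raised to the float16 floor
print(check_bn_eps(1e-3, "float16"))  # already large enough, unchanged
```

Raising the value silently, as done here, is one possible design; raising a `ValueError` instead would make the misconfiguration more visible.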

generic

Official Documentation

https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnBatchNormalizationForwardTraining

Workarounds

95% success Set epsilon to a value >= 1e-5. For float16 models, use eps=1e-4 or higher. This is the recommended fix. In PyTorch:
```
from torch import nn
bn = nn.BatchNorm2d(num_features, eps=1e-5)  # num_features: channel count of the input
```
90% success If using a pre-trained model with a hardcoded epsilon, override it after loading, then reinitialize the batch norm statistics if needed.
```
model.bn_layer.eps = 1e-5  # bn_layer is a placeholder for the model's batch norm attribute
```
70% success Convert only the batch normalization layers to float32. This allows smaller epsilon values but may increase memory usage.
```
model.bn_layer = model.bn_layer.float()  # leaves the rest of the model in its original dtype
```
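
When a model has many batch norm layers, the second and third workarounds can be applied in one pass rather than attribute by attribute. The sketch below assumes a standard `torch.nn` model. The helper names `raise_bn_eps` and `bn_to_float32` are hypothetical, and the toy `nn.Sequential` only stands in for a real pre-trained network. Whether a float32 batch norm layer accepts half-precision input depends on the backend, so the float32 helper is a starting point rather than a guaranteed fix.

```python
import torch
from torch import nn

# Public batch norm classes to look for; extend this tuple for custom layers.
BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d, nn.SyncBatchNorm)


def raise_bn_eps(model: nn.Module, min_eps: float = 1e-5) -> int:
    """Raise eps on every batch norm layer below min_eps; return how many changed."""
    changed = 0
    for module in model.modules():
        if isinstance(module, BN_TYPES) and module.eps < min_eps:
            module.eps = min_eps
            changed += 1
    return changed


def bn_to_float32(model: nn.Module) -> nn.Module:
    """Cast only the batch norm layers to float32, leaving other layers as they are."""
    for module in model.modules():
        if isinstance(module, BN_TYPES):
            module.float()  # casts parameters and buffers in place
    return model


# Toy model standing in for a pre-trained network with a hardcoded epsilon.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8, eps=1e-6),
    nn.ReLU(),
)
print(raise_bn_eps(model))  # number of layers whose eps was raised
print(model[1].eps)         # epsilon of the batch norm layer after the override
```

For a float16 model, passing `min_eps=1e-4` follows the first workaround's recommendation.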

Dead Ends

Common approaches that don't work:

30% fail
Setting a much larger epsilon avoids the BAD_PARAM error, but it reduces the effectiveness of batch normalization and can degrade model accuracy.
10% fail
Disabling cuDNN works, but it gives up all cuDNN optimizations and significantly slows training. It is an overreaction when only the epsilon is wrong.
100% fail
Ignoring the error is not possible: it halts execution immediately unless the source code is modified to catch the exception.