CUDNN_STATUS_BAD_PARAM cuda runtime_error ai_generated true

RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM when calling cudnnBatchNormalizationForwardTraining with epsilon=1e-06

ID: cuda/cudnn-bn-epsilon-too-small

Fix rate: 92%
Confidence: 84%
Evidence count: 1
First seen: 2023-11-05

Version compatibility

Version        Status   Introduced   Deprecated   Notes
cuDNN 8.9.5    active   -            -            -
cuDNN 9.0      active   -            -            -
PyTorch 2.0    active   -            -            -
PyTorch 2.1    active   -            -            -

Root cause analysis

cuDNN batch normalization rejects an epsilon below its minimum (CUDNN_BN_MIN_EPSILON in cudnn.h), reported here as 1e-5 and higher for some data types such as float16. The floor exists to avoid numerical instability; a value of 1e-6 falls below it and triggers CUDNN_STATUS_BAD_PARAM.
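
For reference, epsilon enters the standard batch normalization transform as shown below. This is the textbook definition, not an excerpt from cuDNN; the symbols x, \mu, \sigma^2, \gamma, and \beta (input, batch mean, batch variance, scale, shift) are introduced here for illustration.

```latex
y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
```

When the batch variance is close to zero, epsilon alone bounds the denominator, which is why a very small value can amplify rounding noise, especially in reduced precision.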

generic

Official documentation

https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnBatchNormalizationForwardTraining

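Solutions

  1. Set epsilon to a value >= 1e-5. In PyTorch: nn.BatchNorm2d(num_features, eps=1e-5). For float16 models, use eps=1e-4 or higher. This is the recommended fix.
  2. If you are using a pretrained model with a hard-coded epsilon, override it after loading: model.bn_layer.eps = 1e-5. Then re-initialize the batch normalization statistics if needed.
  3. Convert only the batch normalization layers to float32: model.bn_layer = model.bn_layer.float(). This allows a smaller epsilon value but may increase memory usage.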
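
Solutions 1 and 2 can be applied to a whole model in one pass. The sketch below is a minimal, hypothetical helper (raise_bn_eps is not a PyTorch API). It assumes the model exposes named_modules() the way torch.nn.Module does, and that batch normalization layers can be recognized by a class name containing "BatchNorm" together with an eps attribute. The 1e-5 default is the minimum stated in the root cause analysis.

```python
def raise_bn_eps(model, min_eps=1e-5):
    """Raise eps on batch-norm layers whose eps is below min_eps.

    Hypothetical helper. Returns a list of (layer_name, old_eps) pairs
    for every layer that was changed, so the caller can log them.
    """
    changed = []
    for name, module in model.named_modules():
        # Assumption: a batch-norm layer has "BatchNorm" in its class name
        # and stores its epsilon in an attribute called eps.
        is_batch_norm = "BatchNorm" in type(module).__name__ and hasattr(module, "eps")
        if is_batch_norm and module.eps < min_eps:
            changed.append((name, module.eps))
            module.eps = min_eps
    return changed
```

With a real model this would be called as raise_bn_eps(model) right after loading the weights and before the first forward pass. Matching on the class name keeps the sketch free of imports; in real code an isinstance check against the batch normalization classes actually in use is stricter.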

Failed attempts

Common but ineffective approaches:

  1. 30% failure rate

    While it avoids the BAD_PARAM error, a large epsilon reduces the effectiveness of batch normalization, potentially degrading model accuracy.

  2. 10% failure rate

    This works but disables all cuDNN optimizations, significantly slowing down training. It's an overreaction if only the epsilon is wrong.

  3. 100% failure rate

    The error halts execution immediately; ignoring it is not possible without modifying the source code to catch the exception.