ERTL tensorflow resource_error ai_generated true

ResourceExhaustedError: The function 'train_step' has been retraced 1000 times. The tracing limit has been reached. This may be caused by passing Python literals or tensors with changing shapes.

ID: tensorflow/tf-function-recompilation-cache-limit

Fix rate: 88%
Confidence: 89%
Evidence count: 1
First seen: 2023-08-12

Version compatibility

Version             Status   Introduced   Deprecated   Notes
TensorFlow 2.6.0    active
TensorFlow 2.10.0   active

Root cause analysis

A tf.function-decorated function (e.g., train_step) is being retraced excessively because it is called with tensors of varying shapes or with changing Python values. Each new shape or value fails to match any cached input signature, so TensorFlow traces another graph until the tracing cache is exhausted.
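
The retracing behavior described above can be observed with a small sketch. The function scale and the list trace_log are hypothetical names introduced for illustration; the sketch relies only on the documented fact that Python side effects inside a tf.function run during tracing, not on every call, and it assumes the default tf.function settings (no shape relaxation).

```python
import tensorflow as tf

trace_log = []  # grows only while TensorFlow traces the Python body


@tf.function
def scale(x, factor):
    trace_log.append("traced")  # Python side effect: runs per trace, not per call
    return x * factor


values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = tf.ones([4])

# Python numbers are keyed by value, so each distinct value causes a trace.
start = len(trace_log)
for factor in values:
    scale(x, factor)
literal_traces = len(trace_log) - start

# Scalar tensors are keyed by dtype and shape, so one trace covers every value.
start = len(trace_log)
for factor in values:
    scale(x, tf.constant(factor))
tensor_traces = len(trace_log) - start

# Each new static input shape also causes a trace.
start = len(trace_log)
for n in [2, 3, 5]:
    scale(tf.ones([n]), tf.constant(1.0))
shape_traces = len(trace_log) - start

print(literal_traces, tensor_traces, shape_traces)
```

Under those assumptions the three counts should come out as 5, 1, and 3: only the tensor-valued argument with a stable shape reuses an existing graph.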

generic

Official documentation

https://www.tensorflow.org/guide/function#controlling_retracing

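Solutions

  1. Ensure that all tensor arguments passed to the tf.function have consistent shapes. Pad or resize inputs to a fixed shape before passing them in. For Python arguments, convert them to tensors, for example with tf.constant(), so that they are traced by dtype and shape rather than by value.
  2. Define the input signature explicitly with tf.TensorSpec to prevent retracing caused by varying shapes or dtypes. This tells TensorFlow to use a single graph for all calls that match the signature.
  3. If the function takes Python integer or boolean arguments that change between calls, convert them to tensors and use tf.cond() or tf.switch_case() for control flow, or move them out of the tf.function.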
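
The three solutions can be combined in one minimal sketch. This is an illustration under stated assumptions, not the original train_step: the body, the fixed length MAX_LEN, the helper pad_to_fixed, and the flag use_half_scale are all hypothetical. Only tf.pad, tf.TensorSpec, the input_signature argument of tf.function, and tf.cond are real TensorFlow APIs.

```python
import tensorflow as tf

MAX_LEN = 6      # assumed fixed sequence length for this sketch
trace_log = []   # grows only while TensorFlow traces the function body


def pad_to_fixed(seq):
    """Solution 1: right-pad a 1-D tensor with zeros up to MAX_LEN."""
    return tf.pad(seq, [[0, MAX_LEN - int(seq.shape[0])]])


# Solution 2: one explicit signature, so every matching call reuses one graph.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, MAX_LEN], dtype=tf.float32),  # any batch size
    tf.TensorSpec(shape=[], dtype=tf.bool),                  # flag as a tensor
])
def train_step(batch, use_half_scale):
    trace_log.append("traced")  # Python side effect: runs once per trace
    # Solution 3: branch inside the graph instead of on a Python bool.
    scaled = tf.cond(use_half_scale, lambda: batch * 0.5, lambda: batch)
    return tf.reduce_sum(scaled)


results = []
for length, rows, flag in [(2, 1, True), (4, 2, False), (6, 3, True)]:
    # Sequences of different lengths are padded to the same width first.
    batch = tf.stack([pad_to_fixed(tf.ones([length]))] * rows)
    results.append(float(train_step(batch, tf.constant(flag))))

print(results, len(trace_log))
```

The batch dimension is left as None so that batch size may vary without retracing, while the padded dimension is pinned to MAX_LEN. Whether padding is acceptable depends on the model; a real training step would typically also need a mask for the padded positions.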

Ineffective attempts

Common but ineffective approaches:

  1. 90% failure rate

    Running eagerly defeats the purpose of tf.function (performance), and increasing the limit only delays the error without fixing the root cause of shape variability.

  2. 70% failure rate

    While this can reduce retracing for shape changes, it does not address retracing caused by changing Python literal values (e.g., integer arguments), and may still hit the limit.

  3. 60% failure rate

    This eliminates retracing but also removes the performance benefits of graph compilation, potentially slowing training significantly.