huggingface runtime_error ai_generated true

Token indices sequence length is longer than the specified maximum sequence length for this model (2048 > 1024). Running this sequence through the model will result in indexing errors

ID: huggingface/tokenizer-decoder-max-length-overflow

Fix Rate: 85%
Confidence: 85%
Evidence: 1
First Seen: 2023-11-15

Version Compatibility

Version                Status   Introduced   Deprecated   Notes
transformers>=4.30.0   active
tokenizers>=0.13.0     active
python>=3.8            active
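
To see which of the listed packages are present in an environment, a small check like the following may help. It is only a sketch: `installed_version` and `python_ok` are hypothetical helper names, and it reports version strings without comparing them against the minimums above.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def python_ok() -> bool:
    """True when the interpreter meets the python>=3.8 requirement."""
    return sys.version_info >= (3, 8)
```

Calling `installed_version("transformers")` and `installed_version("tokenizers")` gives the strings to compare by eye with the table.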

Root Cause

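The tokenized input is longer than the tokenizer's configured maximum (model_max_length, which normally matches the model's max_position_embeddings). Truncation is off by default, so without explicit truncation settings the tokenizer only emits this warning and returns the full, overlong sequence; passing that sequence to the model then causes indexing errors.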
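
One way to confirm this diagnosis is to compare the token count with the tokenizer's advertised limit before calling the model. The sketch below is a minimal illustration, not the library's own check. `length_report` is a hypothetical helper name, and it assumes a Hugging Face style tokenizer object that is callable, returns `input_ids`, and exposes `model_max_length`.

```python
from typing import Any, Dict


def length_report(tokenizer: Any, text: str) -> Dict[str, Any]:
    """Compare the token count of `text` with the tokenizer's limit.

    Hypothetical helper: assumes `tokenizer` behaves like a Hugging Face
    tokenizer (callable, returns a mapping with "input_ids", and has a
    `model_max_length` attribute).
    """
    # No truncation here on purpose: we want the true, untruncated length.
    input_ids = tokenizer(text, truncation=False)["input_ids"]
    limit = tokenizer.model_max_length
    return {
        "n_tokens": len(input_ids),
        "limit": limit,
        "fits": len(input_ids) <= limit,
    }
```

If `fits` comes back `False`, the input is in the situation the warning describes (for the message above, 2048 tokens against a limit of 1024).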

generic

Official Documentation

https://huggingface.co/docs/transformers/main/en/llm_tutorial#truncation

Workarounds

  1. 90% success: Set truncation=True and max_length=512 (or any value up to the model's limit) when encoding inputs. Example: tokenizer(text, truncation=True, max_length=512, return_tensors='pt')
  2. 80% success: Use a model with a larger max_position_embeddings (e.g., 4096), or switch to a long-context model such as Longformer.
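
The first workaround can be wrapped in a small helper so the requested length never exceeds what the tokenizer supports. This is a sketch under assumptions: `encode_truncated` is a hypothetical name, and it assumes the tokenizer reports a meaningful `model_max_length`. Some tokenizers report a very large placeholder value there, in which case `max_length` should be passed explicitly.

```python
from typing import Any, Optional


def encode_truncated(tokenizer: Any, text: str, max_length: Optional[int] = None) -> Any:
    """Encode `text`, truncating so it never exceeds the tokenizer's limit.

    Hypothetical helper: `tokenizer` is assumed to be a Hugging Face
    tokenizer. If `max_length` is omitted, the tokenizer's own
    `model_max_length` is used; a larger request is clamped to that limit.
    """
    limit = tokenizer.model_max_length
    if max_length is None or max_length > limit:
        max_length = limit
    # truncation=True drops the tokens beyond max_length instead of
    # returning an overlong sequence.
    return tokenizer(text, truncation=True, max_length=max_length)
```

Calling `encode_truncated(tokenizer, text, max_length=512)` mirrors the example in workaround 1. Note that truncation discards the tail of the input, so it suits tasks where the beginning of the text carries the needed information.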

Dead Ends

Common approaches that don't work:

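  1. 60% fail: Setting truncation=False. This disables truncation entirely, so the overlong sequence is passed to the model and causes a hard crash instead of being handled gracefully.

  2. 80% fail: Running sequences longer than max_position_embeddings through the model. The learned positional embeddings only cover positions up to max_position_embeddings; going past that limit leads to out-of-range errors.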
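
The out-of-range failure can be illustrated with a toy lookup table in plain Python. This is not how transformers implements embeddings; `make_position_table` and `lookup_positions` are hypothetical names used only to show why a position past the end of the table cannot be resolved.

```python
from typing import List


def make_position_table(max_position_embeddings: int, dim: int) -> List[List[float]]:
    """Build a toy table with one embedding row per supported position."""
    return [[float(pos)] * dim for pos in range(max_position_embeddings)]


def lookup_positions(table: List[List[float]], seq_len: int) -> List[List[float]]:
    """Fetch one row per token position, as an embedding layer would."""
    # Positions run 0..seq_len-1; any position >= len(table) has no row,
    # so Python raises IndexError (the toy analogue of the out-of-range error).
    return [table[pos] for pos in range(seq_len)]
```

With a table built for 1024 positions, `lookup_positions(table, 1024)` succeeds, while `lookup_positions(table, 2048)` raises `IndexError`, matching the 2048 > 1024 case in the message above.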