huggingface config_error ai_generated true

UserWarning: You are using a decoder-only model with padding_side='right'. This may produce incorrect results. Consider setting `tokenizer.padding_side = 'left'`.

ID: huggingface/transformers-tokenizer-padding-side-mismatch

Fix Rate: 95%
Confidence: 90%
Evidence: 1
First Seen: 2023-04-20

Version Compatibility

Version               Status  Introduced  Deprecated  Notes
transformers>=4.30.0  active
torch>=1.12.0         active

Root Cause

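Decoder-only models (e.g., GPT, LLaMA) predict each new token from the last position in the sequence, so batched generation expects padding on the left. With right padding, shorter sequences in the batch end in padding tokens, and the model continues from those padding tokens instead of from the last real token.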
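
The difference can be seen without loading a model. The following is a toy sketch, not transformers code: the integer IDs, the `PAD` value, and the `pad_batch` helper are all made up for illustration. It pads the same batch on each side and prints the final column, which is where generation would continue from.

```python
# Toy illustration only: integer IDs stand in for real tokens.
PAD = 0  # hypothetical pad token ID


def pad_batch(sequences, side):
    """Pad lists of token IDs to a common length on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [PAD] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


batch = [[11, 12, 13, 14], [21, 22]]

right = pad_batch(batch, "right")
left = pad_batch(batch, "left")

# Generation appends after the final column of the batch.
last_right = [row[-1] for row in right]
last_left = [row[-1] for row in left]

print(right)       # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(left)        # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(last_right)  # [14, 0]  -> the second row would continue from a pad token
print(last_left)   # [14, 22] -> every row continues from a real token
```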


Official Documentation

https://huggingface.co/docs/transformers/en/pad_truncation#padding-and-truncation

Workarounds

  1. 95% success: Set `padding_side` before tokenizing, then tokenize the batch again. Example: `tokenizer.padding_side = 'left'; inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')`
  2. 90% success: For generation through a pipeline, set `tokenizer.padding_side = 'left'`, and pass `pad_token_id=tokenizer.eos_token_id` if no pad token is defined.
  3. 85% success: If using Trainer, set `tokenizer.padding_side = 'left'` before creating the dataset and the data collator so that every batch is left-padded.
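
Workarounds 1 and 2 can be combined into one small helper. This is a minimal sketch, assuming `transformers` and `torch` are installed; `tokenize_for_generation` and `demo` are hypothetical names rather than library functions, and the `"gpt2"` checkpoint is only a placeholder for whichever decoder-only model is in use.

```python
def tokenize_for_generation(tokenizer, texts):
    """Left-pad a batch so generation continues from real tokens.

    Hypothetical helper, not part of the transformers API.
    """
    # Must be set before the tokenizer call; it does not change existing tensors.
    tokenizer.padding_side = "left"
    # Some decoder-only checkpoints ship without a pad token; reuse EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer(texts, padding=True, truncation=True, return_tensors="pt")


def demo():
    """Example usage; downloads the placeholder checkpoint on first call."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tokenize_for_generation(tokenizer, ["Hello", "A much longer prompt"])
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.pad_token_id,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Calling `demo()` runs the full round trip. The helper mutates the tokenizer in place, so later calls on the same tokenizer object stay left-padded.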


Dead Ends

Common approaches that don't work:

  1. 80% fail

    Keeping right padding makes the model generate incorrect tokens, because generation continues from the padding tokens at the end of shorter sequences. This matters most for left-to-right generation tasks such as text completion.

  2. 60% fail

    Changing the tokenizer's `padding_side` after the input is already tokenized has no effect on that input. The setting only affects future calls to `tokenizer()`, so existing tensors keep the padding direction they were created with.

  3. 50% fail

    Setting only the pad token ID does not change the padding direction, so the warning and the incorrect behavior persist.
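
Because the padding direction is fixed at tokenization time, it can help to inspect an existing batch before calling `generate`. The following is a rough sketch; `detect_padding_side` is a hypothetical helper, and it assumes the attention mask is available as nested lists of 0/1 values (for example via `inputs["attention_mask"].tolist()`).

```python
def detect_padding_side(attention_mask):
    """Infer the padding side of a batch from its attention mask rows.

    Expects rows of 0/1 values, where 1 marks a real token.
    Returns "left", "right", "none", or "mixed".
    """
    sides = set()
    for row in attention_mask:
        if not row or all(row):
            continue  # an empty or unpadded row carries no information
        if row[0] == 0:
            sides.add("left")
        if row[-1] == 0:
            sides.add("right")
    if not sides:
        return "none"
    return sides.pop() if len(sides) == 1 else "mixed"


print(detect_padding_side([[1, 1, 1, 1], [1, 1, 0, 0]]))  # right
print(detect_padding_side([[1, 1, 1, 1], [0, 0, 1, 1]]))  # left
print(detect_padding_side([[1, 1], [1, 1]]))              # none
```

If the result is "right" for a batch headed to a decoder-only model, the batch needs to be tokenized again after setting `tokenizer.padding_side = 'left'`.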