huggingface data_error ai_generated true

ValueError: The features of the dataset do not match the expected schema. Missing columns: ['text', 'label']. Extra columns: ['input', 'target']

ID: huggingface/dataset-features-column-mismatch

Fix Rate: 90%
Confidence: 85%
Evidence: 1
First Seen: 2023-08-10

Version Compatibility

Version | Status | Introduced | Deprecated | Notes
datasets>=2.10.0 | active | | |
transformers>=4.25.0 | active | | |

Root Cause

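A dataset loaded with Hugging Face Datasets has different column names than those the training script or tokenizer expects. In the error above, the script expects 'text' and 'label', but the dataset provides 'input' and 'target'.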
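
A quick way to confirm the mismatch before changing anything is to compare the dataset's actual column names against the expected ones. The sketch below is illustrative only: `diff_columns` is a hypothetical helper, not part of the datasets library, and the column lists are copied from the error message. With a real dataset, the actual names would presumably come from `dataset.column_names`.

```python
def diff_columns(actual, expected):
    """Return (missing, extra) column names, preserving input order.

    Hypothetical helper for illustration; not a datasets API.
    """
    actual_set, expected_set = set(actual), set(expected)
    missing = [c for c in expected if c not in actual_set]  # expected but absent
    extra = [c for c in actual if c not in expected_set]    # present but unexpected
    return missing, extra


# Column names taken from the error message in this entry.
missing, extra = diff_columns(["input", "target"], ["text", "label"])
print("Missing columns:", missing)
print("Extra columns:", extra)
```

If both lists come back empty, the column names already match and the error likely has a different cause, such as a feature type mismatch.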

generic

Official Documentation

https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset

Workarounds

  1. 95% success Align columns using Dataset.rename_columns() and Dataset.remove_columns(): `dataset = dataset.rename_columns({'input': 'text', 'target': 'label'}).remove_columns(['unused_col'])` (here 'unused_col' stands for any extra column that is not needed; omit the remove_columns() call if there is none).
  2. 90% success Use Dataset.map() with a function that builds only the required columns and drops the originals: `dataset = dataset.map(lambda x: {'text': x['input'], 'label': x['target']}, remove_columns=dataset.column_names)`
  3. 85% success Load only the needed columns by passing a 'columns' argument to load_dataset() if the dataset's loader supports it, or create a new dataset with the correct schema. Where it is supported, this argument typically selects columns rather than renaming them, so a rename may still be needed afterwards.
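
Workaround 1 can be wrapped in a small function that renames, drops leftovers, and then checks the result. This is a minimal sketch, not an official recipe: `align_columns` is a hypothetical name, and it assumes `dataset` is a single split that behaves like `datasets.Dataset`, i.e. it exposes `column_names`, `rename_columns()` and `remove_columns()`, each returning a new dataset. A `DatasetDict` would need the function applied per split.

```python
def align_columns(dataset, rename_map, expected):
    """Rename columns per rename_map, then drop anything not in expected.

    Hypothetical helper; assumes `dataset` offers the Dataset-style
    attributes column_names, rename_columns() and remove_columns().
    """
    # Rename only the source columns that are actually present, so a
    # partially fixed dataset does not fail on an unknown column name.
    present = {old: new for old, new in rename_map.items()
               if old in dataset.column_names}
    if present:
        dataset = dataset.rename_columns(present)

    # Drop every column the training script does not expect.
    leftover = [c for c in dataset.column_names if c not in expected]
    if leftover:
        dataset = dataset.remove_columns(leftover)

    # Fail early, with a clear message, if something is still missing.
    still_missing = [c for c in expected if c not in dataset.column_names]
    if still_missing:
        raise ValueError(f"Columns still missing after alignment: {still_missing}")
    return dataset
```

For the error in this entry, a call might look like `dataset = align_columns(dataset, {'input': 'text', 'target': 'label'}, ['text', 'label'])`. Computing the leftover columns from `column_names` avoids hard-coding a name such as 'unused_col' that may not exist in the dataset.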

Dead Ends

Common approaches that don't work:

  1. 40% fail

    If there are more mismatches (e.g., 'target' vs 'label'), the error persists. Also, renaming may break other downstream code that expects 'input'.

  2. 50% fail

    Trainer does not have ignore_columns; dropping columns with dataset.remove_columns() is correct but users often drop the wrong ones or forget to add missing columns.

  3. 70% fail

    Model config does not control dataset schema; this is a data preprocessing issue, not a model architecture issue.