huggingface data_error ai_generated true

ValueError: The features of the dataset do not match the expected schema. Missing columns: ['text', 'label']. Extra columns: ['id', 'metadata']

ID: huggingface/dataset-features-mismatch

Fix Rate: 88%
Confidence: 86%
Evidence: 1
First Seen: 2024-01-05

Version Compatibility

Version                 Status  Introduced  Deprecated  Notes
datasets>=2.10.0        active
transformers>=4.30.0    active
python>=3.8             active

Root Cause

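The dataset loaded with Hugging Face Datasets has columns that do not match the schema expected by the model or training script. In the error above, the required columns 'text' and 'label' are absent, and the dataset instead provides 'id' and 'metadata'.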
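
To see the mismatch before training starts, the column lists can be compared directly. The sketch below is illustrative only: diff_columns is a hypothetical helper (not part of the datasets library), and the expected schema ['text', 'label'] is taken from the error message above.

```python
def diff_columns(actual, expected):
    """Return (missing, extra) column names, preserving input order.

    Hypothetical helper for illustration; not a datasets API.
    """
    missing = [name for name in expected if name not in actual]
    extra = [name for name in actual if name not in expected]
    return missing, extra


# Column names as reported in the error message above.
missing, extra = diff_columns(["id", "metadata"], ["text", "label"])
print("Missing columns:", missing)
print("Extra columns:", extra)
```

With a real dataset, the first argument would typically be dataset.column_names.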

generic

Official Documentation

https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.select_columns

Workarounds

  1. 90% success Keep only the required columns and add the missing ones with default values. dataset.select_columns(['text', 'label']) works only if those columns already exist; it raises an error for any name not in the dataset. When required columns are missing, add them first, for example dataset = dataset.add_column('label', [0]*len(dataset)), then drop the extras with dataset = dataset.remove_columns(['id', 'metadata']). Placeholder defaults make the schema match but carry no real labels.
  2. 85% success Derive the required columns from existing ones and drop the originals in the same call: dataset = dataset.map(lambda x: {'text': x['metadata'], 'label': 0}, remove_columns=['id', 'metadata']). This assumes 'metadata' holds the text the model should see.
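
The first workaround can be sketched as a small helper. This is a minimal sketch under stated assumptions, not a definitive implementation: align_to_schema and its defaults argument are hypothetical names, and the helper relies only on Dataset.column_names, Dataset.add_column, Dataset.remove_columns and len(dataset). Missing columns are added before extras are removed, so the row count is still known when every original column is an extra.

```python
def align_to_schema(dataset, required, defaults):
    """Return a dataset whose columns are exactly `required`.

    Hypothetical helper: `dataset` is assumed to behave like datasets.Dataset.
    `defaults` maps each possibly-missing column to a placeholder value.
    """
    original = list(dataset.column_names)
    num_rows = len(dataset)  # capture before any column is dropped

    # Add missing columns first, filled with the placeholder value.
    for name in required:
        if name not in original:
            if name not in defaults:
                raise KeyError(f"no default value given for missing column {name!r}")
            dataset = dataset.add_column(name, [defaults[name]] * num_rows)

    # Then drop every column the schema does not ask for.
    extra = [name for name in original if name not in required]
    if extra:
        dataset = dataset.remove_columns(extra)
    return dataset
```

A call might look like align_to_schema(dataset, ['text', 'label'], {'text': '', 'label': 0}). The placeholder values are assumptions chosen only to satisfy the schema; real text and labels still have to come from the data.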

Dead Ends

Common approaches that don't work:

  1. 50% fail

    Missing columns still need to be added or mapped from existing columns.

  2. 60% fail

    If the column name is misspelled, the error persists.
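
Because a misspelled column name leaves the error in place, it can help to look for near-miss names before renaming anything. The following is a rough sketch that uses difflib from the Python standard library; suggest_renames and its 0.6 similarity cutoff are assumptions for illustration, not part of the datasets library.

```python
from difflib import get_close_matches


def suggest_renames(actual, expected, cutoff=0.6):
    """Map near-miss actual column names to the expected names they resemble.

    Hypothetical helper; the cutoff is an arbitrary similarity threshold.
    """
    missing = [name for name in expected if name not in actual]
    extra = [name for name in actual if name not in expected]
    renames = {}
    for wanted in missing:
        # Only consider extras that have not already been paired up.
        candidates = [name for name in extra if name not in renames]
        match = get_close_matches(wanted, candidates, n=1, cutoff=cutoff)
        if match:
            renames[match[0]] = wanted
    return renames


# Illustrative column names, not taken from the error above.
print(suggest_renames(["txt", "labels", "id"], ["text", "label"]))
```

Each suggestion should be reviewed by hand before it is applied with dataset.rename_column(old_name, new_name). For the columns in the error above ('id' and 'metadata'), nothing is similar enough to 'text' or 'label', so the helper returns an empty mapping and one of the workarounds is needed instead.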