huggingface
data_error
ai_generated
true
RuntimeError: Dataset shuffling requires a deterministic seed for iterable datasets, but seed is None
ID: huggingface/dataset-shuffling-iterator-break
88% fix rate
82% confidence
1 piece of evidence
First seen 2023-11-20
Version compatibility
| Version | Status | Introduced | Deprecated | Notes |
|---|---|---|---|---|
| datasets>=2.10.0 | active | — | — | — |
| torch>=1.13.0 | active | — | — | — |
Root cause analysis
IterableDataset does not support random shuffling without a fixed seed: without one, the dataset iterator cannot be replayed deterministically to reproduce the shuffled order.
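To illustrate why a fixed seed matters for a stream, here is a minimal pure-Python sketch of buffer-based shuffling. It is not the library's implementation; `buffered_shuffle` and its parameters are hypothetical names chosen for this example. The point is only that the same seed replays the same order, while `seed=None` does not.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, seed: Optional[int]
) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer (illustrative only)."""
    # With seed=None, random.Random seeds itself from OS entropy or the clock,
    # so the emitted order differs from run to run.
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before emitting anything
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]  # emit a randomly chosen buffered element
        buffer[idx] = item  # and put the incoming element in its slot
    rng.shuffle(buffer)  # drain whatever is left, in random order
    yield from buffer


# Two passes over the same stream with the same seed give the same order.
first_pass = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
second_pass = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
```

Because a stream cannot be indexed, the only way to reproduce a shuffled order is to replay the same random draws, which is what the fixed seed provides.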
Official documentation
https://huggingface.co/docs/datasets/v2.10.0/en/stream#shuffling
Solutions
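-
Specify a seed when shuffling: dataset = dataset.shuffle(seed=42, buffer_size=1000). This ensures a deterministic shuffle order for the streaming dataset.
-
Disable shuffling for the IterableDataset and shuffle externally: train_loader = DataLoader(dataset, shuffle=False); then, if you are using a map-style dataset, shuffle the indices manually before each epoch.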
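The second option, shuffling indices manually before each epoch, can be sketched without any framework. This is a minimal illustration that assumes the data fits in memory and supports indexing. `epoch_indices`, `base_seed`, and the `examples` list are hypothetical names, not part of `datasets` or PyTorch.

```python
import random
from typing import List


def epoch_indices(num_examples: int, base_seed: int, epoch: int) -> List[int]:
    """Return a reproducible permutation of indices for the given epoch."""
    # Deriving the seed from the epoch number changes the order between
    # epochs while keeping every epoch reproducible across runs.
    rng = random.Random(base_seed + epoch)
    indices = list(range(num_examples))
    rng.shuffle(indices)
    return indices


examples = ["a", "b", "c", "d", "e"]  # stand-in for an indexable dataset

for epoch in range(2):
    order = epoch_indices(len(examples), base_seed=42, epoch=epoch)
    shuffled_view = [examples[i] for i in order]
    # Iterate over shuffled_view (or the indices) with shuffle=False downstream.
```

Deriving the per-epoch seed from a single base seed is a design choice for this sketch: the whole training run stays reproducible from one number.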
Failed attempts
Common but ineffective approaches:
-
Set `shuffle=True` on the DataLoader without fixing the seed
80% failure rate
The DataLoader's `shuffle` option is incompatible with IterableDataset: passing `shuffle=True` either raises an error or leaves the data unshuffled.
-
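Convert the IterableDataset to a map-style dataset by calling `.to_iterable_dataset()`
90% failure rate
`to_iterable_dataset()` is not available on IterableDataset; it converts in the opposite direction, from a map-style dataset to an iterable one. Converting a streamed dataset to a map-style dataset requires loading the entire dataset into memory, which defeats the purpose of streaming.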
-
Use `dataset.shuffle(buffer_size=1000)` without a seed
100% failure rate
The shuffle method on IterableDataset requires a seed parameter; omitting it raises the same error.