Fix dataset loading error by converting column data to list[str] #1072

llmadd · 2025-09-12T05:36:20Z

When tokenizing text directly from a dataset column with tokenizer(raw_train_dataset["sentence1"]), we encounter a ValueError indicating that the input must be of type str, list[str], or list[list[str]].

The current dataset column returns a format that isn't directly compatible with the tokenizer's expected input types. This PR fixes the issue by explicitly converting the column data to a list[str] before tokenization.

Changes Made:

1. Modified the data loading code to convert dataset columns to list[str] before tokenization

2. Updated the tokenization call to handle the converted format
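
For illustration, the conversion described above could look roughly like the sketch below. It is not the code from this PR: the helper name column_to_str_list is hypothetical, and it assumes the column object is iterable and yields plain strings. The tokenizer call is shown only as a comment because it depends on a tokenizer and dataset that are already loaded.

```python
from collections.abc import Iterable


def column_to_str_list(column: Iterable) -> list[str]:
    """Materialize a dataset column as a plain list[str].

    Hypothetical helper, assuming `column` is iterable and every item
    is already a str. Non-string items raise instead of being coerced,
    so bad rows are not silently tokenized as "None" or "123".
    """
    values = list(column)  # force any lazy column object into a real list
    for index, value in enumerate(values):
        if not isinstance(value, str):
            raise TypeError(
                f"item {index} is {type(value).__name__}, expected str"
            )
    return values


# Usage sketch (assumes `tokenizer` and `raw_train_dataset` already exist,
# as in the PR description):
#
#   sentences = column_to_str_list(raw_train_dataset["sentence1"])
#   encoded = tokenizer(sentences)
```

Calling list(raw_train_dataset["sentence1"]) directly would likely work as well; the helper only adds an explicit type check on the items.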


HuggingFaceDocBuilderDev · 2025-09-12T05:47:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

fix tokenizer 'ValueError: text input must be of type (single example…

bb265e0

…), (batch or single pretokenized example) or (batch of pretokenized examples).'

llmadd closed this Sep 16, 2025
