Skip to content

Conversation

llmadd
Copy link

@llmadd llmadd commented Sep 12, 2025

When trying to tokenize text data directly from a dataset column using tokenizer(raw_train_dataset["sentence1"]), we encounter a ValueErrorindicating the input must be of type str, list[str], or list[list[str]].

The current dataset column returns a format that isn't directly compatible with the tokenizer's expected input types. This PR fixes the issue by explicitly converting the column data to a list[str]before tokenization.

Changes Made:​​

1.Modified the data loading code to convert dataset columns to list[str]before tokenization

2.Updated the tokenization call to handle the converted format

…), (batch or single pretokenized example) or (batch of pretokenized examples).'
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@llmadd llmadd closed this Sep 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants