diff --git a/chapters/en/chapter3/2.mdx b/chapters/en/chapter3/2.mdx
index bc1b00179..b891841a1 100644
--- a/chapters/en/chapter3/2.mdx
+++ b/chapters/en/chapter3/2.mdx
@@ -129,8 +129,8 @@ from transformers import AutoTokenizer
 
 checkpoint = "bert-base-uncased"
 tokenizer = AutoTokenizer.from_pretrained(checkpoint)
-tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
-tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
+tokenized_sentences_1 = tokenizer(list(raw_datasets["train"]["sentence1"]))
+tokenized_sentences_2 = tokenizer(list(raw_datasets["train"]["sentence2"]))
 ```
 
@@ -195,8 +195,8 @@ Now that we have seen how our tokenizer can deal with one pair of sentences, we
 
 ```py
 tokenized_dataset = tokenizer(
-    raw_datasets["train"]["sentence1"],
-    raw_datasets["train"]["sentence2"],
+    list(raw_datasets["train"]["sentence1"]),
+    list(raw_datasets["train"]["sentence2"]),
     padding=True,
     truncation=True,
 )
@@ -208,7 +208,7 @@ To keep the data as a dataset, we will use the [`Dataset.map()`](https://hugging
 
 ```py
 def tokenize_function(example):
-    return tokenizer(list(example["sentence1"]), list(example["sentence2"]), truncation=True)
+    return tokenizer(list(example["sentence1"]), list(example["sentence2"]), truncation=True)
 ```
 
 This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. Note that it also works if the `example` dictionary contains several samples (each key as a list of sentences) since the `tokenizer` works on lists of pairs of sentences, as seen before. This will allow us to use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization. The `tokenizer` is backed by a tokenizer written in Rust from the [🤗 Tokenizers](https://github.com/huggingface/tokenizers) library. This tokenizer can be very fast, but only if we give it lots of inputs at once.
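
For context, here is a minimal sketch of how the patched `tokenize_function` would be applied with `Dataset.map(batched=True)`, as the surrounding prose describes. The `load_dataset("glue", "mrpc")` call and the `tokenized_datasets` name are assumed from earlier in the chapter, not part of this diff:

```py
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed setup from earlier in the chapter: GLUE MRPC plus the same checkpoint.
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    # With batched=True, `example` holds lists of sentences, so the values are
    # wrapped in list() exactly as in the patch above.
    return tokenizer(list(example["sentence1"]), list(example["sentence2"]), truncation=True)


# batched=True feeds many samples at once to the fast Rust-backed tokenizer.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```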