10 changes: 5 additions & 5 deletions chapters/en/chapter3/2.mdx
@@ -129,8 +129,8 @@ from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
-tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
-tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
+tokenized_sentences_1 = tokenizer(list(raw_datasets["train"]["sentence1"]))
+tokenized_sentences_2 = tokenizer(list(raw_datasets["train"]["sentence2"]))
```

<Tip>
@@ -195,8 +195,8 @@ Now that we have seen how our tokenizer can deal with one pair of sentences, we

```py
tokenized_dataset = tokenizer(
-    raw_datasets["train"]["sentence1"],
-    raw_datasets["train"]["sentence2"],
+    list(raw_datasets["train"]["sentence1"]),
+    list(raw_datasets["train"]["sentence2"]),
    padding=True,
    truncation=True,
)
@@ -208,7 +208,7 @@ To keep the data as a dataset, we will use the [`Dataset.map()`](https://hugging

```py
def tokenize_function(example):
-    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
+    return tokenizer(list(example["sentence1"]), list(example["sentence2"]), truncation=True)
```

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. Note that it also works if the `example` dictionary contains several samples (each key as a list of sentences) since the `tokenizer` works on lists of pairs of sentences, as seen before. This will allow us to use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization. The `tokenizer` is backed by a tokenizer written in Rust from the [🤗 Tokenizers](https://github.com/huggingface/tokenizers) library. This tokenizer can be very fast, but only if we give it lots of inputs at once.
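For context, the course then applies this function to the whole dataset with `Dataset.map()` and `batched=True`. A minimal, self-contained sketch of that usage, assuming the MRPC setup from earlier in the chapter, might look like this:

```py
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed setup from earlier in the chapter
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(example):
    # In batched mode, example["sentence1"] and example["sentence2"] hold several samples;
    # wrapping them in list() ensures the tokenizer receives plain Python lists of strings
    return tokenizer(list(example["sentence1"]), list(example["sentence2"]), truncation=True)


# batched=True hands whole batches of examples to tokenize_function,
# so the Rust-backed fast tokenizer can process many sentence pairs at once
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```

Padding is deliberately left out of `tokenize_function` here; as the chapter goes on to explain, it is applied later per batch with a data collator.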