10 changes: 5 additions & 5 deletions chapters/en/chapter3/2.mdx
@@ -129,8 +129,8 @@ from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
-tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
-tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
+tokenized_sentences_1 = tokenizer(list(raw_datasets["train"]["sentence1"]))
+tokenized_sentences_2 = tokenizer(list(raw_datasets["train"]["sentence2"]))
```

<Tip>
@@ -195,8 +195,8 @@ Now that we have seen how our tokenizer can deal with one pair of sentences, we

```py
tokenized_dataset = tokenizer(
-    raw_datasets["train"]["sentence1"],
-    raw_datasets["train"]["sentence2"],
+    list(raw_datasets["train"]["sentence1"]),
+    list(raw_datasets["train"]["sentence2"]),
    padding=True,
    truncation=True,
)
@@ -208,7 +208,7 @@ To keep the data as a dataset, we will use the [`Dataset.map()`](https://hugging

```py
def tokenize_function(example):
-    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
+    return tokenizer(list(example["sentence1"]), list(example["sentence2"]), truncation=True)
```

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. Note that it also works if the `example` dictionary contains several samples (each key as a list of sentences) since the `tokenizer` works on lists of pairs of sentences, as seen before. This will allow us to use the option `batched=True` in our call to `map()`, which will greatly speed up the tokenization. The `tokenizer` is backed by a tokenizer written in Rust from the [🤗 Tokenizers](https://github.com/huggingface/tokenizers) library. This tokenizer can be very fast, but only if we give it lots of inputs at once.
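For context, the course then applies this function to the whole dataset with `Dataset.map()` and `batched=True`. A minimal, self-contained sketch of that usage, assuming the MRPC setup from earlier in the chapter, might look like this:

```py
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed setup from earlier in the chapter
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(example):
    # In batched mode, example["sentence1"] and example["sentence2"] hold several samples;
    # wrapping them in list() ensures the tokenizer receives plain Python lists of strings
    return tokenizer(list(example["sentence1"]), list(example["sentence2"]), truncation=True)


# batched=True hands whole batches of examples to tokenize_function,
# so the Rust-backed fast tokenizer can process many sentence pairs at once
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```

Padding is deliberately left out of `tokenize_function` here; as the chapter goes on to explain, it is applied later per batch with a data collator.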