The full writeup can be found in our paper.
This module includes a real byte-level tokenizer for text, which encodes text into a sequence of bytes (0-255).
Unlike ByT5Tokenizer, for example, UTF8Tokenizer is implemented from scratch and is much more efficient.
Other "byte-level" tokenizers usually add various extra "special tokens" (e.g., <pad>, <unk>, etc.),
which complicates the encoding and decoding logic and pushes token IDs above 255.
Instead, we rely on the C0 control characters (0-31) as special tokens, since they do not occur in normal text.
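For intuition, plain UTF-8 encoding already yields token IDs in the 0-255 range. The snippet below is only an illustration of that fact, not the tokenizer's internals:

```python
# Illustration only: UTF-8 maps text directly to byte values 0-255.
text = "héllo"
ids = list(text.encode("utf-8"))
print(ids)  # [104, 195, 169, 108, 108, 111] -- every id fits in one byte

# C0 control characters (0-31) do not appear in ordinary text,
# which is what frees them up to act as special tokens.
print(all(i > 31 for i in ids))  # True
```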
```bash
pip install utf8-tokenizer
```

Tokenization:
```python
from utf8_tokenizer.tokenizer import UTF8Tokenizer

tokenizer = UTF8Tokenizer()

texts = ["word", "or multiple"]
print(tokenizer(texts))
```
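Assuming UTF8Tokenizer follows the standard Hugging Face tokenizer interface (as the calls above suggest), a padded batch and a decoding round trip would look roughly like this; the keyword arguments are the usual Transformers ones, not anything specific to this library:

```python
# Standard Hugging Face usage, assumed to apply here as well.
batch = tokenizer(texts, padding=True, return_tensors="pt")
print(batch["input_ids"])       # byte ids (0-255), padded to the longest text
print(batch["attention_mask"])  # 1 for real bytes, 0 for padding

decoded = tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True)
print(decoded)  # ["word", "or multiple"]
```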
Chat Template:

```python
from utf8_tokenizer.tokenizer import UTF8Tokenizer
from utf8_tokenizer.control import visualize_control_tokens

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hey, what's 1+1?"},
    {"role": "assistant", "content": "1+1 is 2."},
]

tokenizer = UTF8Tokenizer()
text = tokenizer.apply_chat_template(messages, tokenize=False)

# Visualize the text with special tokens
print(visualize_control_tokens(text))
```
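The templated string can then be tokenized like any other text for training; this is a sketch assuming the standard Hugging Face interface rather than a documented feature of this library:

```python
# Tokenize the rendered chat template into byte ids (standard HF interface assumed).
batch = tokenizer(text, return_tensors="pt")
print(batch["input_ids"].shape)  # (1, number_of_bytes_in_the_template)
```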
Bit-biased byte embeddings:

```python
from transformers import AutoModelForCausalLM

from utf8_tokenizer.embeddings import patch_embedding_layers, join_embedding_layers

# Load example model
model = AutoModelForCausalLM.from_pretrained("sbintuitions/tiny-lm")
model.resize_token_embeddings(256)

patch_embedding_layers(model)  # Apply bit-bias for training

#
# Train your model...
#

join_embedding_layers(model)  # Fold to a single embedding layer for inference
```
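A byte ID is exactly 8 bits, which is the structure the bit-biased embeddings build on. The sketch below only illustrates that bit decomposition, not what patch_embedding_layers actually does:

```python
import torch

# Illustration only: decompose byte ids into their 8 bits (least significant bit first).
byte_ids = torch.tensor([104, 195, 2])  # 'h', first byte of 'é', STX control byte
bits = (byte_ids.unsqueeze(-1) >> torch.arange(8)) & 1
print(bits)  # shape (3, 8): one row of 0/1 bits per byte
```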
Benchmark:

```bash
python experiments/benchmark.py
```

On a MacBook Pro with an Apple M4 Pro chip, just converting texts of 6 words in different languages to bytes, without wrapping them in tensors, creating attention masks, or padding, runs at 127.4k texts/sec.
Calling the ByT5 tokenizer runs at 6.2k texts/sec.
Calling our new tokenizer through the __call__ path yields 10.5k texts/sec, which is somewhat faster.
Our optimized zero-copy version reaches 86.7k texts/sec, a 14x speedup over the original tokenizer; the remaining gap to the raw byte conversion comes from padding the input IDs into a properly padded tensor.
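For context on what the raw-bytes baseline measures, a micro-benchmark of that conversion might look like the following; the texts, batch size, and timing loop are placeholders, and the numbers above come from experiments/benchmark.py, not from this snippet:

```python
import timeit

# Placeholder inputs; the real benchmark uses 6-word texts in different languages.
texts = ["a short sentence of six words"] * 100

raw_bytes = lambda: [list(t.encode("utf-8")) for t in texts]
seconds = timeit.timeit(raw_bytes, number=100)
print(f"{100 * len(texts) / seconds:,.0f} texts/sec")
```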
We train a small language model with and without bit-bias.
Our results show that bit-bias improves both loss and accuracy while increasing training time by about 1%. We hope that the bit-level embeddings module can be further optimized to minimize this training overhead.
If you use this code in your research, please consider citing the work:
```bibtex
@misc{moryossef2025utf8,
  title         = {Back to Bytes: Revisiting Tokenization Through {UTF-8}},
  author        = {Amit Moryossef and Clara Meister and Pavel Stepachev and Desmond Elliott},
  howpublished  = {\url{https://github.com/sign/utf8-tokenizer}},
  eprint        = {2510.16987},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.16987},
  year          = {2025}
}
```