Byte Pair Encoding (BPE) is a data compression and subword tokenization algorithm that segments text into subword units. It iteratively merges the most frequent pair of adjacent bytes until no further merges are possible, an iteration limit is reached, or the desired vocabulary size is reached (a short merge-step sketch follows below). The result is a vocabulary: a set of unique tokens that represents text data efficiently. This makes it particularly useful for:
- Neural language models
- Machine translation systems
- Text compression applications
- Balancing vocabulary size with the ability to represent rare words as sequences of subword units
Note: This is an educational implementation.
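To make the merge step concrete, here is a rough single-pass sketch in C++. It illustrates the general BPE idea rather than this repository's code, and the function name `merge_most_frequent_pair` is made up for the example:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One BPE merge step (illustrative sketch): count every adjacent pair of symbols,
// then fuse the most frequent pair wherever it occurs.
std::vector<std::string> merge_most_frequent_pair(const std::vector<std::string>& symbols) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (std::size_t i = 0; i + 1 < symbols.size(); ++i)
        ++counts[{symbols[i], symbols[i + 1]}];
    if (counts.empty()) return symbols;              // nothing left to merge

    auto best = std::max_element(counts.begin(), counts.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });

    std::vector<std::string> merged;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        if (i + 1 < symbols.size() && symbols[i] == best->first.first &&
            symbols[i + 1] == best->first.second) {
            merged.push_back(symbols[i] + symbols[i + 1]);   // e.g. "e" + "r" -> "er"
            ++i;                                             // skip the merged partner
        } else {
            merged.push_back(symbols[i]);
        }
    }
    return merged;
}
```

Repeating this step fuses frequent neighbours such as `e` and `r` into a single `er` token.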
- Vocabulary Generation: Builds a subword vocabulary by iteratively merging the most frequent character pairs
- Radix Trie: Efficiently stores the vocabulary for fast prefix matching
- Encoder: Converts raw text into token sequences using the vocabulary
- Decoder: Reconstructs original text from token sequences
- Radix Trie for Prefix Matching: Uses a space-efficient trie structure for fast longest-prefix lookups.
- Caching System: Implements a prefix cache to avoid redundant lookups during tokenization.
- Base94 Tokens: Converts base-10 token IDs into base-94 strings for more compact token output (see the sketch after this list).
- Strings/Maps: Minimizes string lookups and reconstruction. Skips `<>` during decoding instead of calling `str.replace()`, and uses `string_view` for string operations.
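To illustrate the Base94 idea, the sketch below converts numeric token IDs to and from base-94 strings. It assumes the 94 printable ASCII characters from `!` (0x21) to `~` (0x7E) as the digit alphabet and uses hypothetical function names; bpetok's actual alphabet and interfaces may differ:

```cpp
#include <string>

// Convert a non-negative token ID to a base-94 string and back (illustrative sketch).
// The digit alphabet assumed here is the 94 printable ASCII characters '!'..'~'.
std::string to_base94(unsigned long long id) {
    const char first = '!';                      // first printable ASCII character
    if (id == 0) return std::string(1, first);
    std::string out;
    while (id > 0) {
        out.insert(out.begin(), static_cast<char>(first + id % 94));
        id /= 94;
    }
    return out;
}

unsigned long long from_base94(const std::string& token) {
    unsigned long long id = 0;
    for (char c : token)
        id = id * 94 + static_cast<unsigned long long>(c - '!');
    return id;
}
```

Two base-94 characters cover any ID up to 8,835, so every ID in a 50,000-token vocabulary fits in at most three characters, versus up to five decimal digits.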
- C++17 compatible compiler
- CMake 3.30 or higher
- Standard C++ libraries only (no external dependencies)
This generates a vocabulary file (vocabulary.tokens) from your training corpus.

```
./bpetok vocab <corpus_file>
```

Converts plain text into a sequence of tokens based on the vocabulary.

```
./bpetok encode <input_file> <vocabulary_file> <output_file>
```

Reconstructs the original text from token sequences.

```
./bpetok decode <input_file> <vocabulary_file> <output_file>
```

To build:

```
mkdir build && cd build
cmake ..
make
```

Vocabulary generation works as follows:
- Processes the corpus into unique words, counts them, then splits each word into characters.
- Initializes vocabulary with all printable ASCII characters.
- Counts frequency of character bigrams in the corpus.
- Iteratively merges the most frequent pairs.
- Continues until the target vocabulary size (50,000 tokens) is reached or no more bigrams can be found.
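A minimal sketch of the per-round bigram count described above, assuming each unique word is stored as a sequence of current symbols together with its corpus count (the names and types are illustrative, not the project's own):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Per-round bigram count (illustrative sketch). Each unique word is a sequence of
// current symbols plus the number of times it appeared in the corpus, so a bigram
// inside a frequent word is weighted accordingly.
using Word = std::vector<std::string>;

std::pair<std::string, std::string>
most_frequent_bigram(const std::vector<std::pair<Word, long long>>& word_counts,
                     bool& found) {
    std::map<std::pair<std::string, std::string>, long long> bigrams;
    for (const auto& [symbols, count] : word_counts)
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i)
            bigrams[{symbols[i], symbols[i + 1]}] += count;

    found = !bigrams.empty();                    // false means no merge is possible
    if (!found) return {};
    auto best = bigrams.begin();
    for (auto it = bigrams.begin(); it != bigrams.end(); ++it)
        if (it->second > best->second) best = it;
    return best->first;
}
```

The outer loop would append the winning pair's concatenation to the vocabulary, re-segment every word with that merge applied, and stop once the vocabulary reaches 50,000 tokens or `found` comes back false.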
The encoder works as follows:
- Reads the input and vocabulary files.
- Normalizes some UTF-8 characters and handles special cases like smart quotes.
- Builds a Radix Trie from the vocabulary data.
- Splits the input text on whitespace.
- For each word:
  - Appends the end-of-word marker (`<>`)
  - Finds the longest matching subword in the vocabulary using the Radix Trie
  - Repeats until the entire word is tokenized
- Writes the tokenized data to the output file.
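The per-word loop above amounts to greedy longest-prefix matching. The sketch below shows the idea with a plain hash set for brevity; bpetok performs this lookup with the Radix Trie and prefix cache, and `tokenize_word` is a hypothetical name:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-prefix tokenization of one word (illustrative sketch). bpetok does
// this lookup with a Radix Trie plus a prefix cache; a plain hash set is used here
// only to keep the example short.
std::vector<std::string> tokenize_word(std::string word,
                                       const std::unordered_set<std::string>& vocab) {
    word += "<>";                                // append the end-of-word marker
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < word.size()) {
        // Try the longest remaining prefix first, then shrink until the vocabulary
        // contains it. A single character always matches because the vocabulary is
        // seeded with every printable ASCII character.
        std::size_t len = word.size() - pos;
        while (len > 1 && vocab.count(word.substr(pos, len)) == 0)
            --len;
        tokens.push_back(word.substr(pos, len));
        pos += len;
    }
    return tokens;
}
```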
The decoder works as follows:
- Reads the input and vocabulary files.
- Uses a stream to read whitespace-separated tokens.
- Looks up each token in the vocabulary map.
- Slices off the delimiters and writes the result to the output file.
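A compact sketch of that decoding loop, assuming the vocabulary map goes from token string to subword and that `<>` marks word ends; the stream handling and names are placeholders rather than bpetok's actual interface:

```cpp
#include <iostream>
#include <string>
#include <string_view>
#include <unordered_map>

// Decoding loop (illustrative sketch): read whitespace-separated tokens from a stream,
// map each one back to its subword, and skip the end-of-word marker "<>" while writing
// instead of erasing it with str.replace().
void decode_stream(std::istream& in, std::ostream& out,
                   const std::unordered_map<std::string, std::string>& vocab) {
    std::string token;
    while (in >> token) {
        auto it = vocab.find(token);
        if (it == vocab.end()) continue;         // unknown token: ignored in this sketch
        std::string_view subword = it->second;
        if (subword.size() >= 2 && subword.substr(subword.size() - 2) == "<>")
            out << subword.substr(0, subword.size() - 2) << ' ';   // word boundary
        else
            out << subword;
    }
}
```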
This codebase is licensed under the terms of the GNU General Public License v3.0 - see the LICENSE file for details.