Byte Pair Encoding (BPE) is a data compression and subword tokenization algorithm that segments text into subword units. It iteratively merges the most frequent pair of adjacent bytes until no further merges are possible, an iteration limit is reached, or the desired vocabulary size is reached (a short merge-step sketch follows below). The result is a vocabulary: a set of unique tokens that represents text data efficiently. This makes it particularly useful for:
- Neural language models
- Machine translation systems
- Text compression applications
- Balancing vocabulary size with the ability to represent rare words as sequences of subword units
Note: This is an educational implementation.
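To make the merge step concrete, here is a rough single-pass sketch in C++. It illustrates the general BPE idea rather than this repository's code, and the function name `merge_most_frequent_pair` is made up for the example:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One BPE merge step (illustrative sketch): count every adjacent pair of symbols,
// then fuse the most frequent pair wherever it occurs.
std::vector<std::string> merge_most_frequent_pair(const std::vector<std::string>& symbols) {
    std::map<std::pair<std::string, std::string>, int> counts;
    for (std::size_t i = 0; i + 1 < symbols.size(); ++i)
        ++counts[{symbols[i], symbols[i + 1]}];
    if (counts.empty()) return symbols;              // nothing left to merge

    auto best = std::max_element(counts.begin(), counts.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });

    std::vector<std::string> merged;
    for (std::size_t i = 0; i < symbols.size(); ++i) {
        if (i + 1 < symbols.size() && symbols[i] == best->first.first &&
            symbols[i + 1] == best->first.second) {
            merged.push_back(symbols[i] + symbols[i + 1]);   // e.g. "e" + "r" -> "er"
            ++i;                                             // skip the merged partner
        } else {
            merged.push_back(symbols[i]);
        }
    }
    return merged;
}
```

Repeating this step fuses frequent neighbours such as `e` and `r` into a single `er` token.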
- Vocabulary Generation: Builds a subword vocabulary by iteratively merging the most frequent character pairs
- Radix Trie: Efficiently stores the vocabulary for fast prefix matching
- Encoder: Converts raw text into token sequences using the vocabulary
- Decoder: Reconstructs original text from token sequences
- Radix Trie for Prefix Matching: Uses a space-efficient trie structure for fast longest-prefix lookups.
- Caching System: Implements a prefix cache to avoid redundant lookups during tokenization.
- Base94 Tokens: Converts base-10 token IDs into base-94 strings for more compact token output (see the sketch after this list).
- Strings/Maps: Minimizes string lookups and reconstruction. Skips `<>` during decoding instead of calling `str.replace()`, and uses `string_view` for string operations.
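To illustrate the Base94 idea, the sketch below converts numeric token IDs to and from base-94 strings. It assumes the 94 printable ASCII characters from `!` (0x21) to `~` (0x7E) as the digit alphabet and uses hypothetical function names; bpetok's actual alphabet and interfaces may differ:

```cpp
#include <string>

// Convert a non-negative token ID to a base-94 string and back (illustrative sketch).
// The digit alphabet assumed here is the 94 printable ASCII characters '!'..'~'.
std::string to_base94(unsigned long long id) {
    const char first = '!';                      // first printable ASCII character
    if (id == 0) return std::string(1, first);
    std::string out;
    while (id > 0) {
        out.insert(out.begin(), static_cast<char>(first + id % 94));
        id /= 94;
    }
    return out;
}

unsigned long long from_base94(const std::string& token) {
    unsigned long long id = 0;
    for (char c : token)
        id = id * 94 + static_cast<unsigned long long>(c - '!');
    return id;
}
```

Two base-94 characters cover any ID up to 8,835, so every ID in a 50,000-token vocabulary fits in at most three characters, versus up to five decimal digits.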
- C++17 compatible compiler
- CMake 3.30 or higher
- Standard C++ libraries only (no external dependencies)
This generates a vocabulary file (vocabulary.tokens) from your training corpus.

```
./bpetok vocab <corpus_file>
```

Converts plain text into a sequence of tokens based on the vocabulary.

```
./bpetok encode <input_file> <vocabulary_file> <output_file>
```

Reconstructs the original text from token sequences.

```
./bpetok decode <input_file> <vocabulary_file> <output_file>
```

To build:

```
mkdir build && cd build
cmake ..
make
```

Vocabulary generation works as follows:
- Processes the corpus into unique words, counts them, then splits each word into characters.
- Initializes vocabulary with all printable ASCII characters.
- Counts frequency of character bigrams in the corpus.
- Iteratively merges the most frequent pairs.
- Continues until the target vocabulary size (50,000 tokens) is reached or no more bigrams can be found.
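A minimal sketch of the per-round bigram count described above, assuming each unique word is stored as a sequence of current symbols together with its corpus count (the names and types are illustrative, not the project's own):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Per-round bigram count (illustrative sketch). Each unique word is a sequence of
// current symbols plus the number of times it appeared in the corpus, so a bigram
// inside a frequent word is weighted accordingly.
using Word = std::vector<std::string>;

std::pair<std::string, std::string>
most_frequent_bigram(const std::vector<std::pair<Word, long long>>& word_counts,
                     bool& found) {
    std::map<std::pair<std::string, std::string>, long long> bigrams;
    for (const auto& [symbols, count] : word_counts)
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i)
            bigrams[{symbols[i], symbols[i + 1]}] += count;

    found = !bigrams.empty();                    // false means no merge is possible
    if (!found) return {};
    auto best = bigrams.begin();
    for (auto it = bigrams.begin(); it != bigrams.end(); ++it)
        if (it->second > best->second) best = it;
    return best->first;
}
```

The outer loop would append the winning pair's concatenation to the vocabulary, re-segment every word with that merge applied, and stop once the vocabulary reaches 50,000 tokens or `found` comes back false.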
The encoder works as follows:
- Reads the input and vocabulary files.
- Normalizes some UTF-8 characters and handles special cases like smart quotes.
- Builds a Radix Trie from the vocabulary data.
- Splits the input text on whitespace.
- For each word:
  - Appends the end-of-word marker (`<>`)
  - Finds the longest matching subword in the vocabulary using the Radix Trie
  - Repeats until the entire word is tokenized
- Writes the tokenized data to the output file.
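The per-word loop above amounts to greedy longest-prefix matching. The sketch below shows the idea with a plain hash set for brevity; bpetok performs this lookup with the Radix Trie and prefix cache, and `tokenize_word` is a hypothetical name:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-prefix tokenization of one word (illustrative sketch). bpetok does
// this lookup with a Radix Trie plus a prefix cache; a plain hash set is used here
// only to keep the example short.
std::vector<std::string> tokenize_word(std::string word,
                                       const std::unordered_set<std::string>& vocab) {
    word += "<>";                                // append the end-of-word marker
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < word.size()) {
        // Try the longest remaining prefix first, then shrink until the vocabulary
        // contains it. A single character always matches because the vocabulary is
        // seeded with every printable ASCII character.
        std::size_t len = word.size() - pos;
        while (len > 1 && vocab.count(word.substr(pos, len)) == 0)
            --len;
        tokens.push_back(word.substr(pos, len));
        pos += len;
    }
    return tokens;
}
```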
The decoder works as follows:
- Reads the input and vocabulary files.
- Uses a stream to read whitespace-separated tokens.
- Looks up each token in the vocabulary map.
- Slices off the delimiters and writes the result to the output file.
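A compact sketch of that decoding loop, assuming the vocabulary map goes from token string to subword and that `<>` marks word ends; the stream handling and names are placeholders rather than bpetok's actual interface:

```cpp
#include <iostream>
#include <string>
#include <string_view>
#include <unordered_map>

// Decoding loop (illustrative sketch): read whitespace-separated tokens from a stream,
// map each one back to its subword, and skip the end-of-word marker "<>" while writing
// instead of erasing it with str.replace().
void decode_stream(std::istream& in, std::ostream& out,
                   const std::unordered_map<std::string, std::string>& vocab) {
    std::string token;
    while (in >> token) {
        auto it = vocab.find(token);
        if (it == vocab.end()) continue;         // unknown token: ignored in this sketch
        std::string_view subword = it->second;
        if (subword.size() >= 2 && subword.substr(subword.size() - 2) == "<>")
            out << subword.substr(0, subword.size() - 2) << ' ';   // word boundary
        else
            out << subword;
    }
}
```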
This codebase is licensed under the terms of the GNU General Public License v3.0 - see the LICENSE file for details.