Building a Tokenizer for Large Language Models

15 min read

This implementation follows Andrej Karpathy's tutorial on building a tokenizer. What follows is my understanding of how it works and what I learned from implementing it.


I always wondered how ChatGPT "reads" text. You type "hello world" and somehow it becomes numbers that a neural network can process. The magic happens in a tokenizer—a piece of code that converts text into tokens, which are really just integers.

So I built one. A simple implementation of Byte Pair Encoding (BPE), the algorithm that powers modern LLM tokenizers like GPT's tiktoken. No libraries, no shortcuts—just the raw algorithm working directly with bytes.

What I discovered surprised me: tokenization is just aggressive compression through pattern matching. And once you understand it, a lot of LLM behavior suddenly makes sense.

The Problem: Text Isn't Numbers

Neural networks work with numbers, not text. They need fixed-size vectors, mathematical operations, matrix multiplications. But text is... messy. Variable length, infinite possibilities, special characters, emojis, every language on Earth.

The naive approach is simple: assign each character a number. 'a' = 0, 'b' = 1, and so on. This works but has problems:

Too granular: The network has to learn that 'h', 'e', 'l', 'l', 'o' form a common pattern called "hello". That's inefficient.

Doesn't scale: What about "running"? Should that be separate from "run"? And "runner"? The network has to learn all these relationships from scratch.

Wastes context: LLMs have limited context windows. If every character is a token, "hello" uses 5 tokens. That's 5 slots in the context window for one simple word.
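
Here's what that naive scheme looks like, using each character's code point as its token ID:

const charTokens = [...'hello'].map((ch) => ch.codePointAt(0)!);
console.log(charTokens); // [ 104, 101, 108, 108, 111 ] -> five tokens for one word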

The better approach: find common patterns and assign them single tokens. "hello" becomes one token, "world" becomes another. Common words like "the" and common suffixes like "ing"? Also single tokens.

This is what Byte Pair Encoding does—it automatically finds the most frequent patterns and merges them.

Byte Pair Encoding: The Algorithm

BPE is beautifully simple:

  1. Start with bytes (every possible value 0-255 is a token)
  2. Find the most frequent pair of adjacent tokens
  3. Merge that pair into a new token
  4. Repeat until you have the vocabulary size you want

That's it. No machine learning, no complex math. Just counting and merging.

Let me show you what this looks like in practice.

The Implementation: Step by Step

Starting with Bytes

First, the code converts text to bytes. In TypeScript, Buffer.from() handles this:

import fs from 'fs';
import path from 'path';

const str = fs.readFileSync(path.resolve(__dirname, 'data.txt'), 'utf-8');
const bytes = [...Buffer.from(str, 'utf-8')];

UTF-8 encoding means "hello" becomes:

[104, 101, 108, 108, 111]

These are our initial tokens. Every character is represented by its UTF-8 byte values. This gives us 256 possible tokens to start with (0-255).
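
Note that one character isn't always one byte. Anything outside ASCII expands to several byte tokens before any merging happens:

const accented = [...Buffer.from('é', 'utf-8')];
console.log(accented); // [ 195, 169 ] -> one character, two byte tokens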

Counting Pairs

Next, the algorithm needs to find which pairs of tokens appear most frequently:

function getPairStats(data: number[]) {
    const stats: Record<string, number | undefined> = {};

    // count every adjacent pair, keyed as "num1-num2"
    for (let i = 0; i < data.length - 1; i++) {
        const num1 = data[i];
        const num2 = data[i + 1];
        const pair = `${num1}-${num2}`;

        stats[pair] = (stats[pair] || 0) + 1;
    }

    // convert to [count, [num1, num2]] entries so they can be sorted by frequency
    const finalValue: [number, [number, number]][] = [];

    for (const key in stats) {
        finalValue.push([
            stats[key] || 0,
            key.split('-').map((t) => parseInt(t, 10)) as [number, number]
        ]);
    }

    return finalValue.sort((a, b) => b[0] - a[0]);
}

This scans through the token array and counts every adjacent pair. For "hello":

104-101: 1 occurrence  (he)
101-108: 1 occurrence  (el)
108-108: 1 occurrence  (ll)
108-111: 1 occurrence  (lo)

Then it sorts by frequency. In a real corpus with lots of text, certain pairs will be much more common. Maybe "th" appears 5,000 times, "he" appears 3,000 times, and so on.

The return format is [count, [byte1, byte2]]—the count and the actual pair. Sorting by count descending gives the most frequent pair first.
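
For example, calling it on the "hello" bytes from earlier:

getPairStats([104, 101, 108, 108, 111]);
// [ [ 1, [ 104, 101 ] ], [ 1, [ 101, 108 ] ], [ 1, [ 108, 108 ] ], [ 1, [ 108, 111 ] ] ]
// every pair appears once here, so the sort leaves them in insertion order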

Merging the Most Frequent Pair

Once the algorithm knows the most frequent pair, it needs to merge every occurrence of it:

function performTokenSwapping({
    tokens,
    mergePair,
    newTokenId,
}: {
    tokens: number[]
    mergePair: [number, number]
    newTokenId: number
}): number[] {
    let tokensToOperate = [...tokens];

    for (let i = 0; i < tokensToOperate.length - 1; i++) {
        const num1 = tokensToOperate[i];
        const num2 = tokensToOperate[i + 1];

        if (num1 === mergePair[0] && num2 === mergePair[1]) {
            // found the pair
            tokensToOperate[i] = newTokenId;
            tokensToOperate[i + 1] = null as never; // we'll remove it later
        }
    }

    tokensToOperate = tokensToOperate.filter((t) => t != null);

    return tokensToOperate;
}

This walks through the tokens, finds every occurrence of the pair, and replaces it with a new token ID. The second element of the pair gets set to null and filtered out.

For example, if tokens [108, 108] (representing "ll") appear 1,000 times in the corpus and that's the most frequent pair, the algorithm replaces every [108, 108] with a new token 256 (the first token ID after the 256 base bytes).

After one merge, "hello" might become:

[104, 101, 256, 111]  // where 256 represents "ll"

The array is shorter now—4 tokens instead of 5.
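
In code, the merge above looks like this:

performTokenSwapping({
    tokens: [104, 101, 108, 108, 111], // "hello"
    mergePair: [108, 108],             // "ll"
    newTokenId: 256,
});
// [ 104, 101, 256, 111 ]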

Repeating the Process

The training loop is straightforward:

const sizeOfVocab = 300;
const iterationsRequired = sizeOfVocab - 256;

let tokensToOperateOn = [...bytes];

const mergeDictOrdered: [`${number}-${number}`, number][] = [];

for (let i = 0; i < iterationsRequired; i++) {
    const sortedPairStats = getPairStats(tokensToOperateOn);

    const newTokenId = 256 + i;

    tokensToOperateOn = performTokenSwapping({
        tokens: tokensToOperateOn,
        mergePair: sortedPairStats[0]![1],
        newTokenId,
    });

    mergeDictOrdered.push([
        `${sortedPairStats[0]![1][0]}-${sortedPairStats[0]![1][1]}`,
        newTokenId,
    ]);
}

This builds a vocabulary of 300 tokens. Since it starts with 256 (the byte values), it needs 44 merges (300 - 256 = 44).

Each iteration:

  1. Counts all pairs in the current token sequence
  2. Gets the most frequent pair
  3. Creates a new token ID (256, 257, 258, etc.)
  4. Merges all occurrences of that pair
  5. Records the merge in mergeDictOrdered

The mergeDictOrdered array is crucial—it's the "learned vocabulary" of the tokenizer. It stores which pairs got merged and in what order. This is what gets used to encode new text later.

The Result: Compression Through Pattern Recognition

After training on a small text file, the result might look like:

Original: 5,000 bytes
Final: 2,100 tokens

The tokenizer compressed the text to about 42% of its original size in terms of token count. Not by using fancy compression algorithms, but by finding and merging common patterns.
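
With the variables from the training loop above, measuring this is a one-liner (your exact numbers will depend on the training file):

const ratio = tokensToOperateOn.length / bytes.length;
console.log(`${bytes.length} bytes -> ${tokensToOperateOn.length} tokens (${(ratio * 100).toFixed(0)}%)`);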

The mergeDictOrdered array might look like:

108-108 -> 256  // "ll"
101-32  -> 257  // "e "
116-104 -> 258  // "th"
...

Each entry represents a pattern the tokenizer learned from the training data.
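
The raw byte pairs are hard to read, so it helps to expand each learned token back into the text it stands for. This helper isn't part of the implementation above, just a small addition for inspecting the vocabulary:

function buildReadableVocab(
    merges: [`${number}-${number}`, number][]
): Record<number, string> {
    // every token starts as the byte sequence it expands to
    const byteSeqs: Record<number, number[]> = {};
    for (let b = 0; b < 256; b++) byteSeqs[b] = [b];

    // merges were recorded in order, so both halves are always already defined
    for (const [pairKey, tokenId] of merges) {
        const [n1, n2] = pairKey.split('-').map((t) => parseInt(t, 10)) as [number, number];
        byteSeqs[tokenId] = [...byteSeqs[n1]!, ...byteSeqs[n2]!];
    }

    const vocab: Record<number, string> = {};
    for (const [tokenId, seq] of Object.entries(byteSeqs)) {
        // note: a token that splits a multi-byte character will render as a replacement character
        vocab[Number(tokenId)] = Buffer.from(seq).toString('utf-8');
    }
    return vocab;
}

// e.g. buildReadableVocab(mergeDictOrdered)[256] would give "ll" for the vocabulary above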

Encoding: Applying the Learned Patterns

Once the tokenizer is trained, it can encode new text:

function encode(str: string) {
    let bytes = [...Buffer.from(str, 'utf-8')];

    for (const item of mergeDictOrdered) {
        const priorityKey = item[0];

        for (let i = 0; i < bytes.length - 1; i++) {
            const b1 = bytes[i];
            const b2 = bytes[i + 1];

            if (priorityKey === `${b1}-${b2}`) {
                // good to replace
                bytes[i] = item[1];
                bytes[i + 1] = null as never; // will be removed below

                // skip the next byte (it's going to be null)
                i++;
            }
        }

        // drop the null placeholders before moving on to the next merge,
        // so later merges see the newly created tokens as adjacent
        bytes = bytes.filter((t) => t != null);
    }

    return bytes;
}

The key insight: the merges are applied in the same order they were learned. First merge is applied first, second merge second, and so on.

Why does order matter? Because later merges depend on earlier ones. If the tokenizer learned to merge "ll" first and "hello" second, it needs to merge "ll" before looking for "hello".

Encoding "hello world!" with this approach:

  1. Start: [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]
  2. Merge "ll" → "hello w[256]d!"
  3. Merge "o " → "hell[257]world!"
  4. Continue with all learned merges...
  5. Final: Maybe [280, 257, 281, 33] (much shorter)

Decoding: The Reverse Process

Decoding is trickier. The algorithm needs to reverse the merges:

function decode(tokens: number[]) {
    const bytes = [...tokens];

    const reverseDictionary: Record<
        number,
        { n1: number; n2: number } | undefined
    > = {};

    for (const item of mergeDictOrdered) {
        const [n1, n2] = item[0].split('-').map((t) => parseInt(t, 10)) as [
            number,
            number
        ];
        reverseDictionary[item[1]] = { n1, n2 };
    }

    for (let i = 0; i < bytes.length; i++) {
        const lookup = reverseDictionary[bytes[i]!];
        if (lookup != null) {
            bytes[i] = lookup.n1;
            bytes.splice(i + 1, 0, lookup.n2);
            i--; // process this byte again because there could be layers and layers of encoding
        }
    }

    return Buffer.from(bytes).toString('utf-8');
}

The reverse dictionary maps token IDs back to their constituent pairs. Token 256 maps back to [108, 108] ("ll").

The trick is that each position needs to be processed multiple times (i--). A token might have been created from a merge of two other merged tokens. The algorithm needs to keep expanding until it gets back to raw bytes (0-255).

For example:

  • Token 280 might expand to [256, 111]
  • Token 256 then expands to [108, 108]
  • Final: [108, 108, 111] → "llo"

By working backward through all the merges, I reconstruct the original bytes.

Testing It: The Moment of Truth

Running the full pipeline:

console.log('Encoding', encode('hello world!'));
console.log('Decoding', decode(encode('hello world!')));

Output:

Encoding [ 280, 257, 281, 33 ]
Decoding hello world!

It works! I can encode text into tokens and decode it back to the original text. The round trip preserves the message perfectly.

What I Learned About Tokenization

Building this taught me several things that explain LLM behavior:

Why LLMs Struggle with Spelling

Tokenizers merge common patterns. This means "strawberry" might become tokens like ["straw", "berry"]. The LLM never sees the individual letters 'r', 'r' in sequence—it sees one token representing "berry".

When you ask "How many 'r's are in strawberry?", the LLM has to reconstruct the letter sequence from its tokens. It's not counting letters directly; it's remembering what letters make up each token. This is error-prone.

Why Common Words are Cheap

The word "the" is probably a single token in GPT's vocabulary—it appears so frequently that it got merged early in training. So "the" costs 1 token.

But "antidisestablishmentarianism" might be 4-5 tokens because it's rare. The tokenizer never learned that pattern.

This is why, within the same context window, an LLM can fit far more everyday prose than rare technical jargon.

Why Code Sometimes Breaks

When you give an LLM code, the tokenizer might split it awkwardly. A variable name like user_auth_token might become ["user", "_auth", "_token"] or ["user_", "auth_", "token"] depending on what patterns the tokenizer learned.

If the tokenizer was mostly trained on natural language, it might not have learned optimal code patterns. This can confuse the LLM because the token boundaries don't align with semantic boundaries in the code.

Why Punctuation Matters

Notice how my tokenizer treats "hello world!" differently than "helloworld". The space is a byte (32), so "hello " and " world" are different patterns than "hello" and "world".

This means punctuation and spacing dramatically affect tokenization. "Let's go!" and "Lets go!" become completely different token sequences, even though they're semantically similar to humans.
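
You can see this at the byte level, before any merging even happens:

[...Buffer.from("Let's go!", 'utf-8')]; // [ 76, 101, 116, 39, 115, 32, 103, 111, 33 ]
[...Buffer.from('Lets go!', 'utf-8')];  // [ 76, 101, 116, 115, 32, 103, 111, 33 ]

One apostrophe shifts every adjacent pair after it, so entirely different merges apply downstream.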

Why Different Languages Get Different Treatment

English text was probably overrepresented in the training data for most tokenizers. So English patterns got learned first and most thoroughly.

A language like Thai or Arabic might not have as many learned patterns, meaning it takes more tokens to represent the same semantic content. This effectively gives English-language users more context window capacity.

Limitations of This Implementation

This tokenizer works, but it's not production-ready:

No BPE-dropout: some training setups randomly skip merges during encoding (BPE-dropout) so the model sees more varied segmentations and learns more robust representations. Mine is fully deterministic.

Small vocabulary: 300 tokens is tiny. GPT-4 has 100k+ tokens. Larger vocabularies mean better compression and fewer out-of-vocabulary issues.

No special tokens: Production tokenizers have special tokens like <|endoftext|> for marking document boundaries. Mine doesn't.

Byte-level fallback only: a sequence the tokenizer never learned to merge simply stays as raw bytes, so unfamiliar text costs far more tokens than expected. Production tokenizers work the same way but soften the problem with much larger vocabularies.

No optimization: The implementation is O(n²) or worse. Real tokenizers use efficient data structures like tries or hash maps for fast lookup.

But for understanding how tokenization works? This is perfect.

Why Order Matters: A Non-Obvious Detail

One thing that might not be immediately obvious: why does mergeDictOrdered need to be ordered?

Consider this scenario. The training data has:

  • "ll" appears 1,000 times (gets merged first → token 256)
  • "hello" appears 500 times (gets merged second → token 257)

If "hello" were merged first, it would replace every occurrence with token 257. Then when trying to merge "ll", it wouldn't be found anymore—it's already part of token 257.

But if "ll" is merged first, the result is:

  • "he[256]o" (hello becomes he+token256+o)
  • Then later merges might combine "he[256]o" into a single token

The order of merges fundamentally changes what patterns get learned. That's why they're recorded in mergeDictOrdered and applied in the same sequence during encoding.

Real-World Tokenizers: What's Different?

How does this implementation compare to tiktoken (GPT's tokenizer)?

Much larger vocabulary: GPT-4 has ~100,000 tokens vs the 300 in this example. This allows better compression and representation of rare words.

Pre-trained on massive corpora: the vocabularies tiktoken ships with were learned from an enormous amount of text. This example used a single small file.

Special handling for numbers: Real tokenizers often treat digits specially so "2024" doesn't become four separate tokens.

Regex pre-processing: Before BPE, tiktoken uses regex to split text intelligently (e.g., keeping whitespace attached to words). This improves pattern learning.
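
To give a feel for the idea, here's a simplified sketch of pre-splitting in the same spirit; this is an illustrative pattern, not tiktoken's actual regex:

const pretokenize = (text: string): string[] =>
    text.match(/'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+/g) ?? [];

console.log(pretokenize("Let's go, world!"));
// [ "Let", "'s", " go", ",", " world", "!" ]

BPE then runs inside each chunk, so merges never cross those boundaries.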

Better compression: GPT's tokenizer achieves ~4:1 compression (4 characters per token on average). This implementation is closer to 2:1 because of the small vocabulary.

But the core algorithm? Exactly the same. Count pairs, merge the most frequent, repeat.

The Elegant Simplicity

What strikes me most about BPE is how simple it is. There's no machine learning model, no neural network, no gradient descent. Just counting and merging.

Yet this simple algorithm is the foundation of every modern LLM. GPT-4, Claude, Llama—they all use some variant of BPE to convert text to tokens.

The complexity in LLMs isn't in tokenization. It's in what happens after tokenization—the transformer architecture, attention mechanisms, billions of parameters. But none of that works without first converting text to numbers, and BPE does that job elegantly.

Try It Yourself

Want to see tokenization in action? Here's what I recommend:

  1. Clone this code and run it on different text files. See how the vocabulary changes based on training data.

  2. Try encoding text that's very different from your training data. Watch how it uses more tokens because it hasn't learned those patterns.

  3. Experiment with vocabulary size. Train a 500-token vocabulary vs a 1,000-token vocabulary. See the compression difference.

  4. Look at the mergeDictOrdered output. You'll see which patterns your tokenizer learned. Common words, common suffixes, common prefixes—they all show up.

The code is simple enough to modify and experiment with. That's the whole point.

Final Thoughts

Implementing this tokenizer from Karpathy's tutorial demystified a piece of LLM architecture that's easy to take for granted. That moment when the encoder and decoder successfully did a round trip—turning "hello world!" into [280, 257, 281, 33] and back—shows how elegantly simple the core algorithm is.

Now when I see weird LLM behavior around spelling, or notice that it handles certain languages better than others, or watch it count tokens in API usage, I understand why. It's all rooted in this simple pattern-matching algorithm that runs before the neural network ever sees the text.

If you've ever wondered what "tokens" actually are or why OpenAI charges by them, I encourage you to follow along with the tutorial and build something like this. The concepts that seem complex in NLP papers become clear when you're debugging why the pair counter isn't sorting correctly.


The full code is shown above. You can also check out Karpathy's original tutorial for more details and additional context.