Deep Dive into LLMs

Chapter 2: Tokenization

written by Tamás Fodor

These articles are loosely based on Andrej Karpathy’s technical deep dive into Large Language Models. Karpathy is the former Director of AI at Tesla and an OpenAI founding member. His recent lecture provides one of the most comprehensive technical explanations of how systems like ChatGPT actually work—from raw internet text to sophisticated AI assistants. This 8-part series breaks down his technical content into digestible chapters.

Previous: Data Collection & Preprocessing • Next: Neural Networks • Training • Inference • Post-Training • Advanced Capabilities • Evaluation & Deployment

The Bridge Between Text and Numbers

Following the collection and preprocessing of massive text datasets, we encounter a fundamental challenge: neural networks process exclusively numerical data, while human communication consists of words, sentences and paragraphs. Tokenization serves as the critical transformation layer that converts human-readable text into numerical sequences suitable for neural network processing.

While this may appear to be a straightforward technical requirement, tokenization decisions profoundly impact AI system capabilities. How text is segmented directly influences how well a model handles specific words, how efficiently it processes different languages, and how much computation each request costs.

Converting text to numbers presents immediate design challenges. Should individual characters serve as tokens? Should complete words be tokens? How should punctuation, whitespace, and previously unseen words be handled? Each approach involves significant trade-offs. Character-level tokenization produces extremely long sequences, increasing training time and computational costs. Word-level tokenization requires vocabularies of hundreds of thousands of symbols and fails when encountering novel words outside the training vocabulary.

Subword Tokenization

Modern AI systems employ subword tokenization as an elegant solution to these challenges. Rather than using characters or complete words, the system learns to segment text into statistically optimal chunks that balance efficiency with flexibility.

Before text reaches the response-generating neural network, it passes through a specialized preprocessing system: the tokenizer. This system has been trained to segment text into optimal units based on statistical patterns observed in language data. The GPT-4 tokenizer provides a concrete example. The phrase “hello world” becomes two tokens: “hello” (token ID 15339) and “ world” (token ID 1917, including the leading space). Capitalizing it as “Hello world” produces different tokens, and adding extra spaces changes the tokenization again. The system exhibits high sensitivity to these details, having learned these patterns from billions of text examples.
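You can observe this sensitivity directly. The short sketch below uses OpenAI’s open-source tiktoken library (an assumption here; the article itself does not prescribe a library) to encode a few variants of the phrase with the cl100k_base encoding used by GPT-4, showing how capitalization and whitespace change the resulting token IDs.

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-4 models
enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello world", "Hello world", "hello  world"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # decode each token back to its text
    print(f"{text!r:16} -> {ids} {pieces}")
```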

Modern tokenization employs algorithms like Byte Pair Encoding (BPE) that learn efficient text segmentation strategies. The system begins with individual characters (in GPT-style tokenizers, raw bytes) and progressively merges frequently co-occurring pairs into dedicated tokens. The tokenizer might identify that the suffix “ing” appears frequently enough to warrant its own token. Common prefixes like “un” or “re” similarly receive dedicated tokens. This approach handles both common words (which receive single tokens) and rare words (which decompose into recognizable subword components).
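The toy sketch below illustrates the core BPE training loop just described: count adjacent token pairs, merge the most frequent pair into a new token, and repeat. It is a simplified illustration under arbitrary assumptions (the sample text and the number of merges), not the actual procedure used to build the GPT-4 vocabulary.

```python
from collections import Counter

def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (ids 0-255) and perform a few merges.
text = "low lower lowest slowing glowing"
ids = list(text.encode("utf-8"))
next_id = 256
for _ in range(5):
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, next_id)
    print(f"merged {pair} -> {next_id}, sequence length is now {len(ids)}")
    next_id += 1
```

Production tokenizers run this loop tens of thousands of times over enormous text corpora; the merges learned there become the vocabulary.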

Numerical Representation and Vocabulary

After text segmentation, each token receives a unique numerical ID from a fixed vocabulary. Contemporary language models typically employ vocabularies containing 50,000 to 100,000 distinct tokens. This creates a mathematical representation of language. The sentence “The cat sat on the mat” might transform into [464, 2355, 7563, 319, 262, 2603]. While these numbers appear arbitrary, they represent fundamental units of meaning that the neural network learns to manipulate.

Vocabulary size represents a crucial engineering decision. Larger vocabularies let more complete words or phrases map to single tokens, shortening sequences and improving processing efficiency. However, larger vocabularies require more neural network parameters and increase training complexity. Smaller vocabularies necessitate more subword splitting, creating longer sequences but requiring fewer parameters. Modern systems have largely converged on roughly 50,000 to 100,000 tokens as a practical balance point.
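To see where real systems land in this range, the snippet below (again assuming the tiktoken library) prints the vocabulary sizes of a GPT-2-era encoding and the GPT-4 encoding.

```python
import tiktoken

# r50k_base is the GPT-2-era encoding; cl100k_base is used by GPT-4
for name in ["r50k_base", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name:12} vocabulary size: {enc.n_vocab}")
# Prints roughly 50,000 and 100,000 respectively
```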

This intelligent segmentation enables the system to process previously unseen words by decomposing them into familiar components. A technical neologism like “bioengineering” might split into “bio”, “engineer”, and “ing”—all components the system recognizes from other contexts. Every possible text—from classical literature to technical documentation—can be represented as sequences drawn from this fixed vocabulary.
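A quick way to observe this decomposition is to encode an uncommon word and decode each token individually. The exact split depends on the merges a particular tokenizer has learned, so treat the output as illustrative rather than guaranteed.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "bioengineering"
ids = enc.encode(word)
# decode_single_token_bytes returns the raw bytes behind one token id
pieces = [enc.decode_single_token_bytes(i).decode("utf-8") for i in ids]
print(ids, pieces)  # a handful of familiar subword chunks
```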

Context Windows and Computational Constraints

Tokenization directly impacts one of the most significant limitations in current AI systems: context windows. Every language model has a maximum number of tokens it can process at once, known as its context window. The original GPT-4 handled roughly 8,000 tokens, while newer models support 100,000 or even a million tokens.

This limitation means tokenization efficiency directly impacts model capabilities. Inefficient tokenization that uses excessive tokens for simple concepts causes premature context window saturation, forcing the model to discard earlier conversation or document portions. This drives continued research into improved tokenization methods. More efficient tokenization enables models to process longer documents, maintain extended conversational contexts, and utilize computational resources more effectively.
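In practice, applications often count tokens before sending text to a model to avoid overflowing the context window. Below is a minimal sketch, assuming tiktoken and using the roughly 8,000-token figure mentioned above as the limit; substitute the limit of whatever model you actually target.

```python
import tiktoken

def fits_in_context(text: str, max_tokens: int = 8_000,
                    encoding_name: str = "cl100k_base") -> bool:
    """Return True if `text` fits within the model's context window."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text)) <= max_tokens

document = "The quick brown fox jumps over the lazy dog. " * 2_000  # stand-in text
print(fits_in_context(document))  # False: roughly 20,000 tokens exceeds the limit
```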

The quality of tokenization directly impacts model performance in non-obvious ways. Suboptimal tokenization can impede concept learning, create representational inconsistencies, or reduce text processing efficiency. If technical terminology consistently fragments into many subword pieces, the model may struggle to develop robust representations for those concepts. Conversely, domain-specific token optimization can yield exceptional performance in specialized areas.

Language-Specific Challenges

Tokenization exhibits inherent language biases. Most current systems were optimized for English, manifesting in measurable performance differences. English words frequently map to single tokens, while equivalent concepts in other languages may require multiple subword pieces.

This creates processing disparities: models process English text with fewer tokens, improving speed and preserving context window capacity. Non-English text requires more tokens for equivalent semantic content, potentially degrading performance. This technical limitation reflects broader challenges in developing globally equitable AI systems.
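The disparity is easy to measure: encode comparable sentences in different languages and compare token counts. The sketch below does this with tiktoken; the sample sentences are illustrative assumptions, and the size of the gap varies by language pair and tokenizer.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":   "How are you doing today?",
    "Hungarian": "Hogy vagy ma?",
    "Japanese":  "今日の調子はどうですか？",
}

for language, sentence in samples.items():
    tokens = enc.encode(sentence)
    print(f"{language:10} {len(sentence):3} characters -> {len(tokens):3} tokens")
```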

Languages with complex morphology require different tokenization strategies than English. Programming languages, mathematical notation and specialized terminology each present unique tokenization challenges. Organizations developing specialized AI systems often invest substantially in custom tokenizers optimized for their specific use cases, rather than adopting general-purpose tokenization designed for broad internet text.

Foundation for Neural Processing

Understanding tokenization is essential because it affects every subsequent processing stage. The neural network never encounters raw text—only numerical tokens. Training learns patterns between tokens. Generation produces token sequences that convert back to human-readable text.

This means tokenization limitations or biases propagate throughout the entire system. If the tokenizer struggles with certain text types, the final AI system will likely exhibit similar difficulties. Efficient tokenization for specific concepts typically correlates with strong model performance in those areas.

Research groups are developing more language-agnostic tokenization methods, but this remains an active area with significant implications for global AI accessibility. The transformation from human language to machine-processable format through tokenization represents a critical bottleneck that shapes the capabilities and limitations of modern AI systems.

Looking Ahead

With text successfully converted to numerical sequences for neural network processing, we can examine the neural networks themselves. The next chapter explores the transformer architecture—the sophisticated mathematical system that learns to predict subsequent tokens in sequences. We’ll examine how billions of parameters collaborate to capture patterns within token sequences, and how training adjusts these parameters to align network predictions with observed human text patterns.

