Deep Dive into LLMs
Chapter 1: Data Collection & Preprocessing
written by Tamás Fodor
These articles are loosely based on Andrej Karpathy’s technical deep dive into Large Language Models. Karpathy is the former Director of AI at Tesla and an OpenAI founding member. His recent lecture provides one of the most comprehensive technical explanations of how systems like ChatGPT actually work—from raw internet text to sophisticated AI assistants. This 8-part series breaks down his technical content into digestible chapters.
Next articles: Tokenization • Neural Networks • Training • Inference • Post-Training • Advanced Capabilities • Evaluation & Deployment
The Foundation of AI
When you chat with ChatGPT, you’re interacting with a system built on something surprisingly mundane: web scraping. Every large language model starts with an enormous collection of text data scraped from the internet. The foundation isn’t elegant algorithms—it’s billions of text fragments from websites, systematically collected and processed into training data.
Companies building LLMs deploy web crawlers that systematically visit millions of websites, extracting text from news sites, educational content, forums, and technical documentation. The scale is massive: modern datasets like FineWeb contain 44 terabytes of text—approximately 15 trillion tokens. To put this in perspective: reading one token per second, it would take nearly half a million years to read through a single training dataset.
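As a quick sanity check on that comparison, the arithmetic is simple: 15 trillion seconds works out to roughly 475,000 years. A back-of-envelope calculation confirms it:

```python
# Back-of-envelope check: how long would it take to read ~15 trillion tokens
# at a rate of one token per second?
tokens = 15_000_000_000_000            # ~15 trillion tokens (FineWeb-scale dataset)
seconds_per_year = 60 * 60 * 24 * 365  # about 31.5 million seconds per year

years = tokens / seconds_per_year
print(f"{years:,.0f} years")           # roughly 475,000 years, i.e. nearly half a million
```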
This isn’t a random collection. The crawling targets high-quality sources across the spectrum of human knowledge and communication: major news publications, educational institutions, technical documentation sites, well-moderated forums, and reference materials. The goal is capturing not just the breadth of human knowledge, but also the patterns of how we communicate, argue, explain, and reason through text.
From Web Pages to Clean Text
The real technical challenge begins once millions of web pages have been downloaded: extracting meaningful content from the digital mess that is modern web design. Every web page is a complex mixture of actual content surrounded by navigation menus, advertisements, user comments, tracking scripts, and countless other elements.
Consider a typical news article. When you read it, you naturally focus on the headline and article text while ignoring everything else. But the raw HTML source contains dozens of other elements: sidebar advertisements, social media sharing buttons, related article suggestions, cookie consent banners, and navigation menus. The preprocessing pipeline must act like an intelligent human reader, identifying and extracting only the valuable content while discarding all the digital noise.
This extraction process requires sophisticated algorithms that can work across thousands of different website layouts and content management systems. Each major website has its own unique structure, and the system must be flexible enough to correctly identify the main content whether it’s coming from a WordPress blog, a major news site, or a technical documentation platform.
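Production pipelines lean on dedicated extraction tooling rather than hand-written parsers, but the core idea can be sketched with Python’s standard library. The class below is a simplified illustration, not any real pipeline’s extractor: it skips script, style, and navigation-style tags and keeps whatever visible text remains.

```python
from html.parser import HTMLParser

# Simplified sketch of boilerplate removal: keep visible text, skip tags that
# usually hold no article content (scripts, styles, navigation, footers, asides).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}

class MainTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while we are inside a tag we want to ignore
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def extract_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

page = "<html><nav>Home | About</nav><article><h1>Headline</h1><p>Actual story.</p></article><script>track()</script></html>"
print(extract_text(page))   # -> "Headline" and "Actual story." only; nav and script are dropped
```

Real extractors go much further, scoring text density, stripping boilerplate phrases, and detecting languages, but the principle is the same: separate the content a human reader would care about from everything else on the page.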
After successful text extraction, something fascinating happens: all this diverse content gets concatenated into massive, continuous streams of text. Imagine taking a Wikipedia article about quantum physics, immediately followed by a Reddit discussion about cooking techniques, followed by a news article about climate change, followed by a technical programming tutorial. This creates what Karpathy calls a “massive tapestry of text data”—a continuous stream that captures the incredible diversity of human communication patterns.
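In code, this “tapestry” step is little more than concatenation with a document-boundary marker. The separator below follows the <|endoftext|> convention popularized by GPT-2-style tokenizers; the exact marker varies by model and is an assumption here, not a detail fixed by any particular dataset.

```python
# Stitch cleaned documents into one continuous training stream.
# "<|endoftext|>" is a common document-boundary marker (GPT-2 convention);
# the exact separator differs between models.
documents = [
    "Quantum entanglement links the states of two particles...",
    "For a crispier crust, preheat the baking steel for an hour...",
    "Global average temperatures rose again last year...",
]

SEPARATOR = "<|endoftext|>"
stream = SEPARATOR.join(documents)

print(stream[:120])
```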
The Numerical Challenge
Neural networks work exclusively with numbers—they have no inherent understanding of letters, words, or sentences. Every piece of human text must be converted into numerical representations while somehow preserving the meaning, context, and relationships that make language meaningful.
The most straightforward approach would be to use UTF-8 encoding, the standard way computers represent text. In this scheme, every character is converted into one or more bytes, each a sequence of bits: the letter ‘A’ becomes 01000001, ‘B’ becomes 01000010, and so on. But this seemingly simple approach creates a massive computational problem. Represented as raw bits, text turns into extremely long sequences drawn from a vocabulary of just two symbols, and training a neural network on sequences where each “symbol” is a single bit would be computationally prohibitive.
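To make the length problem concrete, here is what UTF-8 does to a short phrase, using nothing beyond the Python standard library:

```python
text = "Hello world"

raw_bytes = text.encode("utf-8")                  # 11 bytes for 11 ASCII characters
bits = "".join(f"{b:08b}" for b in raw_bytes)     # 88 binary symbols

print(format(ord("A"), "08b"))    # 01000001  (the letter 'A')
print(len(raw_bytes), len(bits))  # 11 88
```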
This leads to one of the most fundamental trade-offs in language model design: the balance between vocabulary size and sequence length. Character-level representation uses a small vocabulary of around 100-200 symbols but produces very long sequences. Word-level representation treats entire words as single symbols, giving vocabularies of 50,000-100,000 entries, but struggles with rare words, typos, and morphological variations. Subword representation splits the difference by using pieces of words as symbols; this is the modern approach because it strikes a practical balance between the two extremes.
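The trade-off is easy to see in a rough comparison. The subword split below is hand-written purely for illustration; a real subword tokenizer (such as a BPE model) learns its own pieces from data.

```python
sentence = "Unbelievably, the preprocessing pipeline worked."

# Character-level: tiny vocabulary (~100-200 symbols), long sequences.
char_symbols = list(sentence)

# Word-level: huge vocabulary (50k-100k+ words), short sequences,
# but rare forms like "Unbelievably," may not have their own entry.
word_symbols = sentence.split()

# Subword-level (illustrative split, not from a real tokenizer):
# frequent fragments get their own symbols, rare words are built from pieces.
subword_symbols = ["Un", "believ", "ably", ",", " the", " pre", "processing",
                   " pipeline", " worked", "."]

print(len(char_symbols), len(word_symbols), len(subword_symbols))  # 48 5 10
```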
Quality Control and Filtering
Not every piece of text on the internet is suitable for training sophisticated AI systems. The preprocessing pipeline includes extensive filtering to remove content that would degrade model performance or introduce unwanted behaviors. This filtering process removes obvious spam and automatically generated content, but it goes much deeper.
Duplicate content must be identified and removed—the same news article might appear on dozens of different websites, and training on identical content multiple times can cause the model to memorize rather than learn patterns. Personal information, contact details, and sensitive content must be stripped out to protect privacy. Perhaps most importantly, the filtering process must make quality judgments about text that might be technically correct but poorly written, biased, or misleading.
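A toy version of two of these steps, exact deduplication by content hash plus a crude quality and PII heuristic, might look like the sketch below. Real pipelines use fuzzy near-duplicate detection (for example MinHash) and far more careful PII handling; the word-count threshold and the email pattern here are illustrative assumptions.

```python
import hashlib
import re

# Illustrative email pattern; real PII scrubbing is far more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def dedupe_and_filter(docs):
    seen_hashes = set()
    kept = []
    for doc in docs:
        # Exact deduplication: hash the normalized text and skip repeats.
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)

        # Crude quality heuristic: drop very short fragments (threshold is arbitrary).
        if len(doc.split()) < 20:
            continue

        # Strip obvious personal contact details before keeping the document.
        kept.append(EMAIL_RE.sub("[EMAIL]", doc))
    return kept
```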
The quality of the training data directly determines the ceiling of what the AI system can achieve. A model can only be as knowledgeable, nuanced, and capable as the text it was trained on. When ChatGPT demonstrates understanding of quantum physics, it draws on physics textbooks, research papers, and educational materials that were included in its training data. When it writes sophisticated code, it’s leveraging patterns learned from millions of open-source repositories and technical documentation.
This is why understanding the data collection process is crucial for understanding AI capabilities and limitations. The training data defines not just what the model knows, but how it thinks, what perspectives it can represent, and what blind spots it might have. The preprocessing decisions—what content to include, how to clean it, and how to represent it—fundamentally shape the AI system’s capabilities.
The Engineering Infrastructure
What makes this entire process remarkable is that it happens largely invisibly, creating the foundation for AI systems that billions of people now interact with daily. Behind every ChatGPT conversation lies this massive infrastructure of data collection, cleaning, and preprocessing. The engineering challenges involved are immense: building crawling systems that can process millions of web pages, developing parsing algorithms that work across countless different website formats, creating filtering systems that can make nuanced quality judgments, and organizing terabytes of text data into formats suitable for neural network training.
The scale creates emergent properties that weren’t explicitly programmed. The patterns hidden in 15 trillion tokens of human text, when processed by neural networks, give rise to sophisticated reasoning and communication abilities. Understanding data collection gives us the foundation to appreciate how the collective knowledge of humanity transforms into intelligent systems.
Looking Ahead
Data collection and preprocessing create the raw material for AI intelligence, but they’re just the first step in a complex pipeline. The next crucial stage is tokenization—the process of converting this carefully processed text into the specific numerical sequences that neural networks can actually work with. We’ll see exactly how the simple phrase “Hello world” gets transformed into numerical tokens, and why these seemingly technical choices have profound implications for how AI systems understand and generate language.