Deep Dive into LLMs

Chapter 3: Neural Network Architecture

written by Tamás Fodor

These articles are loosely based on Andrej Karpathy’s technical deep dive into Large Language Models. Karpathy is the former Director of AI at Tesla and an OpenAI founding member. His recent lecture provides one of the most comprehensive technical explanations of how systems like ChatGPT actually work—from raw internet text to sophisticated AI assistants. This 8-part series breaks down his technical content into digestible chapters.

Previous: Data Collection • Tokenization • Next: Training Process • Inference • Post-Training • Advanced Capabilities • Evaluation & Deployment

The Transformer Architecture

Following the conversion of text to numerical token sequences, we arrive at the core computational engine of large language models: the Transformer neural network. Unlike earlier recurrent architectures, which processed tokens one at a time, the Transformer processes entire sequences in parallel.

The Transformer’s design allows training of models with billions of parameters. Major language models including GPT, BERT, and T5 are built on this architecture. The key advantage of Transformers over previous architectures is their ability to process all tokens in a sequence simultaneously, dramatically reducing training time and enabling the development of much larger models.

Scale and Parameters

Modern language models contain billions of parameters—individual numerical weights adjusted during training. GPT-2, released in 2019, contained 1.6 billion parameters. Current models have grown substantially larger, with some containing hundreds of billions of parameters or more. These parameters organize into structured layers that process token sequences and generate probability distributions over possible next tokens.

Each parameter contributes to the model’s predictions through mathematical operations including multiplication, addition, and non-linear transformations. The scale represents a fundamental shift in computational approach: rather than encoding explicit rules, the system learns patterns from data through massive parallel optimization. During training, these parameters are adjusted to minimize prediction errors across billions of text examples.
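
To make the arithmetic concrete, the short sketch below estimates the parameter count of a GPT-2-scale configuration from a few basic dimensions. The configuration values and the simplified formula are assumptions for illustration (biases, positional embeddings, and normalization weights are ignored), but they land close to the figure quoted above.

```python
# Rough parameter-count estimate for a GPT-2-scale Transformer.
# The configuration values below are illustrative, not exact GPT-2 hyperparameters.
vocab_size = 50_257   # tokens in the vocabulary
d_model    = 1_600    # embedding / hidden dimension
n_layers   = 48       # number of Transformer blocks

embedding_params = vocab_size * d_model          # token embedding table
attention_params = 4 * d_model * d_model         # query, key, value, and output projections
ffn_params       = 2 * d_model * (4 * d_model)   # expand to 4x width, then project back
per_layer        = attention_params + ffn_params

total = embedding_params + n_layers * per_layer
print(f"~{total / 1e9:.2f} billion parameters")  # prints ~1.55 billion
```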

Token Representations and Embeddings

The neural network converts input token IDs into vectors through an embedding layer. Each token in the vocabulary maps to a unique vector of numbers—a distributed representation. The vocabulary typically contains tens of thousands of unique tokens, each associated with its own learned vector representation.
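
A minimal sketch of this lookup, written with PyTorch's embedding layer; the vocabulary size, vector width, and token IDs below are illustrative rather than taken from any particular model.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not tied to any specific model.
vocab_size = 50_000   # number of distinct tokens
d_model    = 768      # width of each token's vector

# One learned vector per vocabulary entry, randomly initialized.
embedding = nn.Embedding(vocab_size, d_model)

# A short sequence of arbitrary example token IDs, as a tokenizer might produce.
token_ids = torch.tensor([[464, 3290, 318, 257, 1332]])  # shape: (batch=1, seq_len=5)

# Each ID is replaced by its learned vector.
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 5, 768])
```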

These embeddings start with random initialization but evolve during training. The training process adjusts these vectors so that tokens appearing in similar contexts develop similar representations, which allows the model to generalize beyond the exact sequences seen in training data. For instance, tokens that name related concepts, or that tend to fill the same grammatical role, end up with nearby vectors.

The embedding vectors serve as the initial representation that flows through the network’s layers. As these representations pass through successive layers, they become increasingly abstract and context-dependent, capturing not just the identity of individual tokens but their relationships within the sequence.

Attention Mechanisms

The Transformer’s key innovation is the attention mechanism, which allows the network to look at all previous tokens when predicting the next one. This happens through the attention blocks that form the core of the Transformer architecture. The attention mechanism computes relationships between token positions, determining which previous tokens are most relevant for predicting the next token.

The computation involves three learned transformations of the input: queries, keys, and values. These transformations allow the network to compute compatibility scores between different positions in the sequence. The resulting attention weights determine how much each position influences the representation at every other position. This mechanism scales quadratically with sequence length—doubling the sequence length quadruples the computational cost of attention.
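
To make this concrete, here is a minimal single-head sketch in PyTorch. It assumes the queries, keys, and values have already been produced by their learned projections, and it applies a causal mask so that each position can only attend to earlier tokens, matching the left-to-right prediction setup. The seq_len × seq_len score matrix is exactly where the quadratic cost comes from.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Minimal scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (..., seq_len, d_head), already produced from the
    token representations by learned projection matrices.
    """
    d_head = q.size(-1)
    seq_len = q.size(-2)

    # Compatibility score between every pair of positions: (..., seq_len, seq_len).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)

    # Causal mask: position i may only attend to positions <= i.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    # Attention weights sum to 1 over the allowed (earlier) positions.
    weights = F.softmax(scores, dim=-1)

    # Each output is a weighted mixture of the value vectors.
    return weights @ v
```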

Multi-head attention extends this concept by running multiple attention operations in parallel, each potentially focusing on different types of relationships: syntactic dependencies, semantic associations, or other linguistic patterns. The outputs of these parallel attention heads are combined to form the layer’s final output.
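
Multi-head attention is a small extension of the sketch above: split the model dimension into several smaller heads, run the same attention computation in each, and recombine the results. The sketch below is a minimal version that uses PyTorch's built-in scaled_dot_product_attention (available in PyTorch 2.x) for the per-head computation, which performs the same masked, scaled softmax attention shown above; the head count and width are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Sketch of causal multi-head self-attention; dimensions are illustrative."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One linear layer produces queries, keys, and values for all heads at once.
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)

        # Split the model dimension into heads: (batch, n_heads, seq_len, d_head).
        def split(t):
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        # Each head runs the same masked, scaled attention on its own slice.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        # Recombine the heads and mix them with a final learned projection.
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)
```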

Layer Architecture and Depth

Transformers stack multiple layers of attention and feed-forward operations. While simple demonstration models might have just a few layers, modern state-of-the-art networks often have on the order of 100 layers or more. GPT-2 used 48 layers in its largest configuration; GPT-3's largest model used 96.

Information flows through these layers sequentially, with each layer processing the output of the previous one. The representations become progressively more abstract as they move through deeper layers. Early layers tend to capture surface-level patterns like word co-occurrences and basic syntax. Middle layers develop understanding of grammatical structures and semantic relationships. Deeper layers encode more complex patterns and relationships.

Between attention operations, the network includes multi-layer perceptron blocks or feed-forward networks. These provide additional computational transformations between attention layers. The feed-forward networks typically expand the representation to a larger dimension, apply a non-linear activation function, and then project back to the original dimension. These operations account for a significant portion of the model’s parameters and computational cost.
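
A sketch of one such block, assuming the common four-fold expansion and a GELU activation; real models differ in the exact widths and activation functions they use.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward block: expand, apply a non-linearity, project back."""

    def __init__(self, d_model=768, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),  # expand to a wider dimension
            nn.GELU(),                                # non-linear activation
            nn.Linear(expansion * d_model, d_model),  # project back to d_model
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return self.net(x)
```

With the four-fold expansion, the two linear layers contribute roughly 8 × d_model² weights per block, which is why these blocks account for so much of the parameter count.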

Context Windows and Memory

Every Transformer has a maximum context length—the number of tokens it can process at once. GPT-2 had a context length of 1,024 tokens. Modern models have much larger context windows, with some handling hundreds of thousands of tokens. This context window fundamentally limits what the model can “see” when making predictions. Tokens beyond this window cannot influence the current prediction.
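
In practice, one common way this constraint is handled is simple truncation before the forward pass: anything older than the window is dropped. A toy sketch, using the 1,024-token GPT-2 limit mentioned above as an example:

```python
max_context = 1_024  # e.g., GPT-2's context length

def clip_to_context(token_ids, max_context=max_context):
    # Keep only the most recent tokens; anything earlier simply cannot be "seen".
    return token_ids[-max_context:]
```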

Within the context window, the attention mechanism allows any token to influence any other token’s representation. This creates a form of working memory where all information within the context is simultaneously available. However, information from beyond the context window must be encoded in the model’s parameters during training—there’s no way to access it during inference.

The context limitation drives various architectural innovations. Some approaches use hierarchical attention patterns to extend effective context length. Others employ recurrence or compression mechanisms to summarize information from earlier in the sequence.

Computational Flow and Constraints

The entire network is essentially a massive mathematical expression. Tokens enter at the top, flow through all the layers of computation, and produce probabilities for the next token at the output. This expression involves billions of simple operations—multiplications, additions, and non-linear functions—organized in a precise structure.
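
Put together, one forward pass is just this fixed composition of pieces. The skeleton below is a simplified sketch that reuses the MultiHeadSelfAttention and FeedForward modules sketched earlier in this chapter; real models add further details (positional information, layer normalization, dropout), but the overall shape of the computation, embed, repeat a fixed stack of blocks, project to vocabulary probabilities, is the same.

```python
import torch
import torch.nn as nn

class TinyTransformerLM(nn.Module):
    """Simplified sketch of the end-to-end computation, from token IDs to next-token probabilities.

    Reuses the MultiHeadSelfAttention and FeedForward modules sketched above.
    """

    def __init__(self, vocab_size=50_000, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            [nn.ModuleDict({
                "attn": MultiHeadSelfAttention(d_model, n_heads),
                "ffn": FeedForward(d_model),
            }) for _ in range(n_layers)]
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, d_model)
        for block in self.blocks:              # a fixed number of layers, so a fixed
            x = x + block["attn"](x)           # amount of computation per token
            x = x + block["ffn"](x)
        logits = self.lm_head(x)               # (batch, seq_len, vocab_size)
        return torch.softmax(logits, dim=-1)   # probability of every possible next token
```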

Importantly, there’s a fixed amount of computation per token. The model cannot do arbitrary amounts of computation in a single forward pass—there are only so many layers, and thus a bounded amount of processing that happens for each prediction. This constraint means the model must learn to perform complex reasoning within a fixed computational budget. Each layer can only perform a limited transformation of its input, and the total number of layers bounds the complexity of the overall computation.

This fixed computation has important implications. The model cannot “think harder” about difficult problems by using more computation. It must learn to allocate its fixed computational resources efficiently across different types of problems during training.

Parameters as Distributed Memory

The billions of parameters in these networks serve as the model’s knowledge storage. During training, information from the training data gets encoded into these parameter values. Unlike a database with explicit facts stored in specific locations, this knowledge is distributed across all the parameters in the network. A single fact might influence millions of parameters in subtle ways, and each parameter contributes to encoding many different pieces of information.

This distributed representation has both advantages and limitations. It allows the model to generalize and blend information in flexible ways, finding patterns and connections that weren’t explicitly present in the training data. However, it also makes it impossible to precisely edit or remove specific knowledge without affecting other capabilities.

Looking Ahead

The Transformer architecture provides the computational framework, but the model’s capabilities emerge through the training process. The architecture defines what computations are possible, but training determines which specific computations the model learns to perform. The next chapter examines how these networks learn: how the training process adjusts billions of parameters to capture patterns in text data, and how simple prediction objectives lead to complex emergent capabilities.

