How LLMs Actually Work: Transformers Explained

Summary: Most explanations of how LLMs work stop at “it’s a neural network trained on lots of text.” That’s not wrong, but it’s not useful either. This post goes layer by layer — from how your words get chopped into tokens, through the geometry of meaning, through the actual mechanism of attention, to the training objective that surprisingly scales into reasoning-like behavior. If you’ve used an LLM professionally and want to understand what the machine is actually doing, this is the honest picture.

There is a specific kind of frustration that comes from reading ten different explanations of how large language models work and still not being able to answer the question: but what is it actually computing?

The marketing layer tells you LLMs are “trained on the internet” and can “understand context.” The hyperbolic layer tells you they are either proto-sentient or elaborate autocomplete. Neither is helpful if you are trying to make a real decision — about whether to trust the output, about why it fails the way it fails, about what the architectural limits actually are.

What follows is the mechanical picture. Not a graduate course, but not a lie either.

Step 1: Your Text Doesn’t Enter as Words

When you type a message to an LLM, the first thing that happens has nothing to do with neural networks. Your text gets sliced into tokens — subword units that are the actual currency of the system.

Most modern LLMs use a variant of Byte Pair Encoding (BPE), a compression algorithm adapted for NLP by Sennrich, Haddow, and Birch [1]. The algorithm works by starting with individual characters and iteratively merging the most frequently co-occurring pairs until it has a fixed vocabulary — typically 50,000 to 100,000 tokens for large models.

The result is a vocabulary where common words get their own token (cat, the, running), but rarer or longer words get split into pieces. The word “strawberry” in GPT-4’s tokenizer splits into straw and berry. “Tokenization” itself splits into token + ization. A word like “uncharacteristically” might become four or five tokens.

Analogy: Think of it like a library card catalog that assigns every book a unique ID number. Instead of looking up “strawberry,” the system looks up catalog entry 14823 (“straw”) and catalog entry 6271 (“berry”) — two separate lookups, two separate processing steps. The catalog was designed to save space, not to map cleanly onto human word boundaries.

The practical consequences of this are not trivial. The model never sees “strawberry” as a single unit. It sees a two-step sequence, and any pattern the model has learned about the whole word has to be reconstructed from those pieces. This is one reason LLMs are surprisingly bad at character-level tasks (like counting the letters in a word): the model’s fundamental unit is not letters or words, it is token IDs.

Once tokenized, each token is mapped to an integer. For a sentence of 20 words, you might have 25–30 integer IDs being passed into the actual model.

Step 2: Integers Become Geometry

A list of integer IDs is not useful to a neural network on its own — you cannot do calculus on token number 14823 in a way that encodes that it is semantically similar to token number 6271. So the next step is embedding: each integer is mapped to a high-dimensional vector, typically 768 to 12,288 numbers depending on the model size.

Think of this as placing each token at a specific coordinate in a very high-dimensional space. Unlike GPS coordinates (which have two dimensions. I’m simplifying here to avoid adding axis Z), embedding vectors might have 4,096 dimensions. But the same geometric logic applies: tokens that are semantically related end up close together in that space, and the distances and directions encode meaningful relationships.

The groundwork for this idea was laid by Mikolov, Sutskever, Chen, Corrado, and Dean at Google in 2013 with word2vec [2]. Their Skip-gram model demonstrated that you could train vectors purely by predicting context words, and the geometry that emerged was surprisingly meaningful. The classic example: the vector for “king” minus “man” plus “woman” lands close to “queen.” Direction in the space encodes conceptual relationships.

Word2vec was a revelation, but it had a fundamental limitation: each word got exactly one vector, regardless of context. The word “bank” — financial institution or riverbank — got a single point in space, a crude average of its meanings. That static representation was too imprecise for language understanding at scale.

The Transformer architecture, and specifically the self-attention mechanism, solved this. In a Transformer, the embedding of a token is not fixed — it evolves through the network as a function of all the other tokens in the sequence. By the time “bank” has passed through several layers, its vector has been pulled toward the financial-institution cluster if it appeared near “loan” and “interest rate,” or toward the geographic cluster if it appeared near “river” and “flood.” The embedding becomes contextual.

Step 3: The Transformer — What “Attention Is All You Need” Actually Changed

The 2017 paper by Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin [3] is one of the most cited papers in the history of machine learning. Its title, “Attention Is All You Need,” was a pointed claim: you do not need recurrence, and you do not need convolutions. You need attention.

To understand why this was a departure, you need to know what came before.

Recurrent neural networks (RNNs) and their more sophisticated descendants, Long Short-Term Memory networks (LSTMs), processed sequences one token at a time, left to right. Each step compressed everything that came before into a fixed-size hidden state and passed it forward. This worked, but it had two problems. First, the hidden state was a bottleneck — a 512-dimensional vector trying to encode the entire context of a paragraph. Second, processing was sequential, which made it slow to train on modern parallel hardware.

Self-attention replaced the sequential bottleneck with a different mechanism: every token looks at every other token simultaneously and decides how much to weight each one.

The Q/K/V Mechanism

The way this works is through three learned linear projections of each token’s embedding: Query (Q), Key (K), and Value (V).

Analogy first: Imagine a reference librarian (the Query) looking through a filing system of index cards (the Keys). Each card describes the topic of a book. The librarian computes a relevance score for each card by comparing her query to the card’s description. High-scoring cards get retrieved; the actual books behind those cards (the Values) are pulled out in proportion to those scores. The librarian’s final answer is a weighted combination of the retrieved book contents.

The actual mechanism: for a sequence of tokens, each token’s Query vector is compared against every other token’s Key vector using a dot product, scaled by the square root of the vector dimension (a normalization trick to prevent very large dot products from collapsing the gradients). These raw scores are passed through a softmax function to produce a probability distribution — the attention weights. The output for each token is then computed as a weighted sum of all Value vectors, where the weights are the attention scores.

Expressed more directly: the output for token i is:

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V

where d_k is the dimension of the key vectors.

This is computed in parallel for all tokens simultaneously, which makes it dramatically faster to train than sequential RNNs. And because each token can attend to any other token with no distance penalty, long-range dependencies — “the subject of this sentence from three clauses ago” — are learned as easily as short-range ones.

Multi-head attention runs this process in parallel across multiple sets of Q/K/V projections. Each “head” can learn to attend to a different type of relationship — one head might capture syntactic dependencies, another semantic similarity, another positional proximity. The outputs of all heads are concatenated and projected back down to the original dimension.

This was the core architectural unlock. The Transformer replaced a sequential bottleneck with a parallel, dynamic, context-aware reweighting of the entire input at every layer.

Step 4: Going Deep — Layers, Residuals, and Layer Norm

A single attention layer is not a language model. The actual architecture stacks many such layers — GPT-3 uses 96 layers [4]. What does depth buy you?

The practical answer is compositional abstraction. Early layers tend to encode surface-level patterns: whether tokens are nouns or verbs, whether they are capitalized, their positions relative to punctuation. Middle layers encode phrase-level relationships: subject-verb agreement, pronoun reference, named entity boundaries. Deep layers encode abstract semantics: that this paragraph is making a causal argument, that this sentence is a concession to a counterpoint.

Analogy: Think of the layers as geological strata. A geologist reading a rock core from top (shallow, recent) to bottom (deep, ancient) sees progressively more fundamental structures. Transformer layers stack similarly — the deeper you go, the more abstract and compressed the representation.

Two architectural features make this depth practical:

Residual connections (borrowed from the computer vision literature, notably He et al.’s ResNet work [5]) add the input of each sub-layer directly to its output. In formula form: output = F(x) + x. This means the gradient signal can flow backward through the network without vanishing across dozens of layers — the residual path gives gradients a direct highway. It also means each layer needs to learn only an incremental correction on top of what the previous layers already computed, rather than learning the full representation from scratch.

Layer normalization [6], introduced by Ba, Kiros, and Hinton, normalizes the activations within each layer to have zero mean and unit variance before the next computation. Without this, the magnitude of activations can explode or collapse unpredictably as they pass through many layers. LayerNorm keeps training stable.

Together, residuals and LayerNorm are the engineering scaffolding that makes 96-layer networks trainable in the first place.

Step 5: The Training Objective That Punches Above Its Weight

Here is the most surprising fact in all of modern AI: these models are trained on a single task.

The entire pre-training objective is next-token prediction. Given a sequence of tokens, predict the probability distribution over what comes next. That’s it. The model sees a token sequence, makes a prediction, the prediction is compared to the actual next token, and the parameters are nudged via gradient descent. Repeat this across hundreds of billions of tokens of text.

Analogy: A child learns language partly by completing sentences: “The cat sat on the ___." A language model does exactly this — at 500 billion repetitions, across essentially every text pattern in written human language.

What makes this strange is that the task seems too simple to produce what it produces. Predicting the next token in a chemistry paper requires understanding chemistry. Predicting the next token in a legal brief requires understanding legal argumentation. Predicting the next token in a Python code block requires understanding the syntax and semantics of the language. The model is forced, by the sheer diversity of its training corpus, to build internal representations capable of reasoning across all of these domains.

Radford et al.’s GPT-2 paper (2019) [7] was the first to make this point compellingly at scale — a model trained purely on next-token prediction, with no task-specific fine-tuning, could generate coherent text, answer reading comprehension questions, and translate between languages it was never explicitly taught to translate. The GPT-3 paper by Brown et al. (2020) [4] scaled this to 175 billion parameters and showed that few-shot learning — the ability to perform new tasks given only a handful of examples in the prompt — emerged as a function of scale.

The Kaplan et al. scaling laws paper (2020) [8] gave this phenomenon a quantitative backbone. The researchers found clean power-law relationships between model size, training data volume, compute budget, and performance. Bigger models trained on more data, with more compute, perform better in smooth, predictable ways — no mysterious phase transitions, just a curve. The follow-up Chinchilla paper (Hoffmann et al., 2022) [9] refined this: prior large models were significantly undertrained relative to their size. The compute-optimal recipe calls for scaling model size and training tokens in roughly equal proportion. Chinchilla, at 70 billion parameters trained on 1.4 trillion tokens, outperformed GPT-3 despite being 2.5x smaller.

Step 6: From Pre-trained Model to Useful Assistant

A model trained purely on next-token prediction is technically capable but practically difficult to work with. It will continue whatever pattern it detects in the prompt — including unhelpful, harmful, or off-topic patterns. The raw pre-trained model is not ChatGPT.

The step that closes this gap is post-training, now typically a combination of supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF).

The process, documented in the InstructGPT paper by Ouyang, Wu, Jiang, and colleagues at OpenAI (2022) [10], works in three stages:

1. Supervised Fine-Tuning (SFT): Show the model examples of good instruction-following behavior — human-written responses to user prompts — and fine-tune on those. This teaches the model the format of helpful responses.

2. Reward Model Training: Collect human preference data — present two model outputs to a human rater, ask which is better — and train a separate neural network (the reward model) to predict human preference scores.

3. Policy Optimization with PPO: Fine-tune the language model using proximal policy optimization, treating the reward model’s scores as the reward signal. The language model learns to generate outputs that the reward model rates highly.

The InstructGPT paper put it directly: “Making language models bigger does not inherently make them better at following a user’s intent.” RLHF is what aligns a raw statistical text predictor with the notion of a helpful response. Instruction-tuned models like BERT [11] — which was fine-tuned on specific downstream tasks using labeled data — represent an earlier version of this intuition: pre-train generally, then specialize.

What This Architecture Is Not

Having described what the Transformer does, it is worth being equally direct about what it does not do — because the architecture’s limits are often mischaracterized as bugs that will be patched, when they are structural properties of how the system was built.

There is no world model baked in. The model learned statistical patterns over text. It did not learn a causal model of the world. It knows that “antibiotics treat bacterial infections” because that phrase appears with high frequency in its training corpus, not because it has a model of bacteria, immune systems, or molecular biology. For many practical purposes this distinction does not matter. For some purposes — like novel scientific reasoning or tasks that require reliable causal inference in domains with sparse training data — it matters enormously.

Hallucinations are not a bug to be patched; they are a structural consequence. A next-token prediction model is optimized to produce plausible continuations. When it encounters a question whose true answer is absent from its training distribution, it produces a plausible-sounding continuation regardless — because that is what it was trained to do. The model does not have a “I don’t know” state that activates when its confidence is low. Mitigation strategies (RLHF, retrieval augmentation, chain-of-thought prompting) reduce the frequency of confident wrong answers but do not eliminate the underlying cause.

Context windows are a real constraint, not an arbitrary setting. The self-attention mechanism has quadratic memory complexity in sequence length — processing a sequence of length N requires O(N²) memory. This is why context windows exist. Extending them requires architectural changes, hardware upgrades, or approximation methods, not just turning a dial. Recent research on “context rot” has also shown that performance can degrade with very long contexts even when they are technically within the window limit — the model’s effective attention is not uniform across all positions.

The model does not “read” your prompt; it processes a fixed-size token sequence. There is no distinction between “the system prompt” and “user input” at the architectural level — both are just token sequences concatenated together. The separation exists at the API layer, not in the model.

Why This Architecture Won

It is worth pausing to appreciate how clean this is. A relatively simple objective — predict the next token — applied to a relatively elegant architecture — stacked self-attention with residuals — trained at sufficient scale produces systems that write code, summarize legal documents, draft emails, explain concepts, and hold conversations. The “emergent capabilities” that appear at large scale are not magic; they are the consequence of a training objective that rewards the model for learning whatever internal representations are necessary to predict text well across all of human writing.

BERT [11] showed that the same architecture, pointed in a different direction (predicting masked tokens bidirectionally rather than the next token autoregressively), learned powerful representations for understanding tasks — classification, question answering, named entity recognition. The Transformer proved to be an architectural general-purpose tool, not a narrow solution.

The scaling laws tell us the system has not hit a wall. Performance continues to improve with data and compute in predictable ways. That does not mean the current architecture is the final form of AI — it almost certainly is not. But it means the next ten years of progress are more likely to come from scaling, from better data curation, from improved post-training methods, and from architectural refinements than from a wholesale departure from the Transformer.

Understanding these gears — tokens, embeddings, attention, stacking, next-token prediction, RLHF — is not a prerequisite for using LLMs effectively. But it is a prerequisite for using them honestly: knowing why they fail the way they fail, what claims about them are credible, and where the actual frontier of improvement lies.

References

[1] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

[2] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv:1310.4546

[3] Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762

[4] Brown, T. B. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165

[5] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. arXiv:1512.03385

[6] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450

[7] Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog

[8] Kaplan, J. et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361

[9] Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556

[10] Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155

[11] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805

The Honest Guts of a Language Model: Transformers Explained Without the Fluff