The Math Behind Large Language Models

Summary: You have heard that transformers use attention, embeddings, and softmax. This post shows you the actual numbers — the matrix multiplications, the exponentials, the normalizations — in a toy model small enough to verify on paper. By the end, you will understand every calculation a transformer performs between raw text and a predicted next token.

Every explanation of large language models eventually reaches the same moment: "...and then the attention mechanism weighs the tokens." The phrase gestures at something real, but the gesture is not the mathematics. The mathematics is specific. It involves concrete matrix multiplications, scaling factors, exponential functions, and normalization steps whose exact form determines everything about what the model can and cannot do.

This post does not gesture. It computes.

We will walk through the complete transformer pipeline — from text to token IDs to embedding vectors to attention scores to output probabilities — using a toy model with four-dimensional embeddings and a vocabulary of five tokens. Every number in every worked example is original and chosen so the arithmetic stays tractable. You will not need a computer. A calculator and patience are sufficient.

One important note before we begin: the numerical examples here use dimensions far smaller than any real model. GPT-3 has an embedding dimension of 12,288 and 175 billion parameters [Brown et al., 2020]. Our toy model has an embedding dimension of 4 and roughly 100 parameters (95 weights and biases across the embedding, attention, feed-forward, and output matrices used below). The operations are identical; only the scale differs. That is the point: understanding the operation in miniature is the prerequisite for understanding it at scale.

1. Tokenization and Embeddings: How Text Becomes Vectors

What a token is

A language model never reads text directly. It reads integers. Before any computation begins, the input string is split into discrete units called tokens, and each token is mapped to a unique integer ID via a lookup table called the vocabulary.

Tokenization in modern models typically uses Byte-Pair Encoding (BPE), first applied to neural machine translation by Sennrich et al. [2016] and now standard across GPT-family and many other architectures. BPE begins with individual characters and iteratively merges the most frequent adjacent pair into a new token, until the vocabulary reaches a target size (50,000–100,000 is common). Common words become single tokens; rare words decompose into subword fragments.

For our purposes, we will use a minimal vocabulary of exactly five tokens:

Token vocabulary table:

"the" → ID 0 | "cat" → ID 1 | "sat" → ID 2 | "on" → ID 3 | "mat" → ID 4

The input sentence "the cat sat" produces the token ID sequence: [0, 1, 2].
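The lookup can be sketched in a few lines of Python. This is a minimal sketch, assuming plain whitespace splitting, which only works because every word of the toy sentence is already in the vocabulary; a real tokenizer would apply BPE merges instead. The names `VOCAB` and `encode` are illustrative, not part of any library.

```python
# Toy vocabulary from the table above: token string -> integer ID.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}


def encode(text):
    """Map a whitespace-separated string to a list of token IDs.

    Assumption: every word is in VOCAB. A real BPE tokenizer would instead
    split unknown words into subword pieces.
    """
    return [VOCAB[word] for word in text.split()]


token_ids = encode("the cat sat")
print(token_ids)
```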

The embedding matrix

Each token ID needs to be converted into a vector of real numbers so that mathematical operations can be applied to it. This conversion is handled by an embedding matrix E ∈ ℝ^(|V| × d), where |V| is the vocabulary size and d is the embedding dimension.

In our toy model, |V| = 5 and d = 4, so E ∈ ℝ^(5 × 4).

Each row of E is the embedding vector for the corresponding token ID. These vectors are learned during training — they are initialized randomly and then updated by gradient descent so that tokens that appear in similar contexts converge to similar vectors.

Toy embedding matrix E (5×4):

[ 0.2,  0.4, -0.1,  0.3 ]  ← "the" (id=0)
[ 0.5, -0.2,  0.6,  0.1 ]  ← "cat" (id=1)
[-0.3,  0.7,  0.2, -0.4 ]  ← "sat" (id=2)
[ 0.1,  0.3, -0.5,  0.8 ]  ← "on" (id=3)
[ 0.6, -0.1,  0.4,  0.2 ]  ← "mat" (id=4)

To embed token "the" (id=0), we simply look up row 0: e₀ = [0.2, 0.4, -0.1, 0.3]

To embed "cat" (id=1): e₁ = [0.5, -0.2, 0.6, 0.1]

To embed "sat" (id=2): e₂ = [-0.3, 0.7, 0.2, -0.4]

So the input sequence "the cat sat" becomes the matrix:

X (input matrix, 3×4):
[ 0.2,  0.4, -0.1,  0.3 ]  ← row 0 ("the")
[ 0.5, -0.2,  0.6,  0.1 ]  ← row 1 ("cat")
[-0.3,  0.7,  0.2, -0.4 ]  ← row 2 ("sat")

where each row X_i is the embedding of the i-th token. This matrix X ∈ ℝ^(3 × 4) is the input to the first transformer layer.
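In code, the embedding step is nothing more than row indexing. The sketch below uses the toy matrix E from this section; the helper name `embed` is an illustrative choice.

```python
# Toy embedding matrix E (5 x 4); row i is the vector for token ID i.
E = [
    [0.2, 0.4, -0.1, 0.3],   # "the"
    [0.5, -0.2, 0.6, 0.1],   # "cat"
    [-0.3, 0.7, 0.2, -0.4],  # "sat"
    [0.1, 0.3, -0.5, 0.8],   # "on"
    [0.6, -0.1, 0.4, 0.2],   # "mat"
]


def embed(token_ids):
    """Return the input matrix X: one copied row of E per token ID."""
    return [list(E[i]) for i in token_ids]


X = embed([0, 1, 2])  # "the cat sat"
for row in X:
    print(row)
```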

Positional encodings

Embeddings encode identity but not order. The word "cat" has the same embedding whether it appears first, second, or tenth in the sentence. Yet position matters: "cat sat" and "sat cat" are different. To inject positional information, the original transformer [Vaswani et al., 2017] adds sinusoidal positional encodings to each embedding before any attention computation:

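PE(pos, 2i) = sin( pos / 10000^(2i/d) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d) )

where pos is the position index and i indexes pairs of embedding dimensions (i = 0, 1, ..., d/2 − 1), so even dimensions use sine and odd dimensions use cosine.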

For our worked examples we will operate directly on the embedding matrix X without adding positional encodings, to keep the arithmetic clean. In a real model, you would add the PE matrix element-wise to X before passing it to the first layer.
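For readers who want to see the encodings anyway, here is a minimal sketch of the sinusoidal formula for our toy sizes (3 positions, d = 4). It assumes d is even; the function name `positional_encoding` and the variable `X_with_pos` are illustrative.

```python
import math


def positional_encoding(seq_len, d):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d // 2):  # i indexes (sin, cos) pairs; assumes d is even
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe


X = [[0.2, 0.4, -0.1, 0.3], [0.5, -0.2, 0.6, 0.1], [-0.3, 0.7, 0.2, -0.4]]
PE = positional_encoding(seq_len=len(X), d=len(X[0]))

# Element-wise sum: this is what a real model would feed to the first layer.
X_with_pos = [[x + p for x, p in zip(x_row, pe_row)] for x_row, pe_row in zip(X, PE)]

for row in PE:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), so the first token's embedding is shifted by a constant pattern while later positions receive distinct offsets.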

2. Self-Attention: The Heart of the Transformer

Self-attention is the mechanism by which a transformer allows every token in a sequence to "look at" every other token and decide how much to borrow from each. This is what makes transformers fundamentally different from recurrent networks, which process tokens one at a time: in a transformer, all positions interact simultaneously [Vaswani et al., 2017].

The Q, K, V projections

Self-attention operates on three derived representations of the input: Queries (Q), Keys (K), and Values (V). These are produced by applying three separate learned linear projection matrices to the input:

Q = X·W_Q
K = X·W_K
V = X·W_V

where W_Q, W_K ∈ ℝ^(d × d_k) and W_V ∈ ℝ^(d × d_v) are learned weight matrices, d_k is the dimensionality of the query/key space, and d_v is the dimensionality of the value space. For our toy model: d = 4 (embedding dim), d_k = d_v = 2, single attention head.

Weight matrices (4×2 each):

W_Q:
[ 1.0,  0.0 ]
[ 0.0,  1.0 ]
[-0.5,  0.2 ]
[ 0.3, -0.1 ]

W_K:
[ 0.5,  0.2 ]
[-0.3,  0.8 ]
[ 0.7, -0.1 ]
[ 0.1,  0.4 ]

W_V:
[ 0.6, -0.2 ]
[ 0.3,  0.5 ]
[-0.4,  0.1 ]
[ 0.2,  0.7 ]

Computing Q = X · W_Q

X ∈ ℝ^(3×4), W_Q ∈ ℝ^(4×2), so Q ∈ ℝ^(3×2)

Row 0 ("the"), x₀ = [0.2, 0.4, -0.1, 0.3]:
Q₀₁ = (0.2)(1.0)+(0.4)(0.0)+(-0.1)(-0.5)+(0.3)(0.3) = 0.20+0.00+0.05+0.09 = 0.34
Q₀₂ = (0.2)(0.0)+(0.4)(1.0)+(-0.1)(0.2)+(0.3)(-0.1) = 0.00+0.40-0.02-0.03 = 0.35
→ Q₀ = [0.34, 0.35]

Row 1 ("cat"), x₁ = [0.5, -0.2, 0.6, 0.1]:
Q₁₁ = (0.5)(1.0)+(-0.2)(0.0)+(0.6)(-0.5)+(0.1)(0.3) = 0.50+0.00-0.30+0.03 = 0.23
Q₁₂ = (0.5)(0.0)+(-0.2)(1.0)+(0.6)(0.2)+(0.1)(-0.1) = 0.00-0.20+0.12-0.01 = -0.09
→ Q₁ = [0.23, -0.09]

Row 2 ("sat"), x₂ = [-0.3, 0.7, 0.2, -0.4]:
Q₂₁ = (-0.3)(1.0)+(0.7)(0.0)+(0.2)(-0.5)+(-0.4)(0.3) = -0.30+0.00-0.10-0.12 = -0.52
Q₂₂ = (-0.3)(0.0)+(0.7)(1.0)+(0.2)(0.2)+(-0.4)(-0.1) = 0.00+0.70+0.04+0.04 = 0.78
→ Q₂ = [-0.52, 0.78]

Q =
[[ 0.34,  0.35],
 [ 0.23, -0.09],
 [-0.52,  0.78]]

Computing K = X · W_K

Row 0 ("the"):
K₀₁ = (0.2)(0.5)+(0.4)(-0.3)+(-0.1)(0.7)+(0.3)(0.1) = 0.10-0.12-0.07+0.03 = -0.06
K₀₂ = (0.2)(0.2)+(0.4)(0.8)+(-0.1)(-0.1)+(0.3)(0.4) = 0.04+0.32+0.01+0.12 = 0.49
→ K₀ = [-0.06, 0.49]

Row 1 ("cat"):
K₁₁ = (0.5)(0.5)+(-0.2)(-0.3)+(0.6)(0.7)+(0.1)(0.1) = 0.25+0.06+0.42+0.01 = 0.74
K₁₂ = (0.5)(0.2)+(-0.2)(0.8)+(0.6)(-0.1)+(0.1)(0.4) = 0.10-0.16-0.06+0.04 = -0.08
→ K₁ = [0.74, -0.08]

Row 2 ("sat"):
K₂₁ = (-0.3)(0.5)+(0.7)(-0.3)+(0.2)(0.7)+(-0.4)(0.1) = -0.15-0.21+0.14-0.04 = -0.26
K₂₂ = (-0.3)(0.2)+(0.7)(0.8)+(0.2)(-0.1)+(-0.4)(0.4) = -0.06+0.56-0.02-0.16 = 0.32
→ K₂ = [-0.26, 0.32]

K =
[[-0.06,  0.49],
 [ 0.74, -0.08],
 [-0.26,  0.32]]

Computing V = X · W_V

Row 0 ("the"):
V₀₁ = (0.2)(0.6)+(0.4)(0.3)+(-0.1)(-0.4)+(0.3)(0.2) = 0.12+0.12+0.04+0.06 = 0.34
V₀₂ = (0.2)(-0.2)+(0.4)(0.5)+(-0.1)(0.1)+(0.3)(0.7) = -0.04+0.20-0.01+0.21 = 0.36
→ V₀ = [0.34, 0.36]

Row 1 ("cat"):
V₁₁ = (0.5)(0.6)+(-0.2)(0.3)+(0.6)(-0.4)+(0.1)(0.2) = 0.30-0.06-0.24+0.02 = 0.02
V₁₂ = (0.5)(-0.2)+(-0.2)(0.5)+(0.6)(0.1)+(0.1)(0.7) = -0.10-0.10+0.06+0.07 = -0.07
→ V₁ = [0.02, -0.07]

Row 2 ("sat"):
V₂₁ = (-0.3)(0.6)+(0.7)(0.3)+(0.2)(-0.4)+(-0.4)(0.2) = -0.18+0.21-0.08-0.08 = -0.13
V₂₂ = (-0.3)(-0.2)+(0.7)(0.5)+(0.2)(0.1)+(-0.4)(0.7) = 0.06+0.35+0.02-0.28 = 0.15
→ V₂ = [-0.13, 0.15]

V =
[[ 0.34,  0.36],
 [ 0.02, -0.07],
 [-0.13,  0.15]]
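The three projections are plain matrix products, so the hand arithmetic can be checked with a short sketch. It uses only the toy X, W_Q, W_K, and W_V from this section; the helper `matmul` is an illustrative pure-Python routine, not a library call.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[0.2, 0.4, -0.1, 0.3], [0.5, -0.2, 0.6, 0.1], [-0.3, 0.7, 0.2, -0.4]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [-0.5, 0.2], [0.3, -0.1]]
W_K = [[0.5, 0.2], [-0.3, 0.8], [0.7, -0.1], [0.1, 0.4]]
W_V = [[0.6, -0.2], [0.3, 0.5], [-0.4, 0.1], [0.2, 0.7]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)

for name, m in (("Q", Q), ("K", K), ("V", V)):
    print(name, [[round(v, 2) for v in row] for row in m])
```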

The scaled dot-product attention formula

With Q, K, and V in hand, we compute attention following the formula from Vaswani et al. [2017]:

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

This formula has four distinct steps that we will compute in sequence.

Step 1: Compute the raw attention scores, Q · Kᵀ

Q ∈ ℝ^(3×2) and Kᵀ ∈ ℝ^(2×3), so the score matrix S = Q·Kᵀ ∈ ℝ^(3×3). Each entry S_ij measures how much token i (as a query) attends to token j (as a key): S_ij = Q_i · K_j

Row 0 (query = "the"):
S₀₀ = (0.34)(-0.06)+(0.35)(0.49) = -0.0204+0.1715 = 0.1511
S₀₁ = (0.34)(0.74)+(0.35)(-0.08) = 0.2516-0.0280 = 0.2236
S₀₂ = (0.34)(-0.26)+(0.35)(0.32) = -0.0884+0.1120 = 0.0236

Row 1 (query = "cat"):
S₁₀ = (0.23)(-0.06)+(-0.09)(0.49) = -0.0138-0.0441 = -0.0579
S₁₁ = (0.23)(0.74)+(-0.09)(-0.08) = 0.1702+0.0072 = 0.1774
S₁₂ = (0.23)(-0.26)+(-0.09)(0.32) = -0.0598-0.0288 = -0.0886

Row 2 (query = "sat"):
S₂₀ = (-0.52)(-0.06)+(0.78)(0.49) = 0.0312+0.3822 = 0.4134
S₂₁ = (-0.52)(0.74)+(0.78)(-0.08) = -0.3848-0.0624 = -0.4472
S₂₂ = (-0.52)(-0.26)+(0.78)(0.32) = 0.1352+0.2496 = 0.3848

S = Q·Kᵀ:
[[ 0.1511,  0.2236,  0.0236],
 [-0.0579,  0.1774, -0.0886],
 [ 0.4134, -0.4472,  0.3848]]

Step 2: Scale by √d_k

We divide every entry by √d_k = √2 ≈ 1.4142.

Why scale? When d_k is large, the dot products grow in magnitude, pushing the softmax into regions where gradients become vanishingly small [Vaswani et al., 2017]. Dividing by √d_k keeps the scores in a range where softmax produces meaningful gradients. The choice of √d_k is not arbitrary: if the components of a query and a key are independent with mean 0 and variance 1, their dot product has mean 0 and variance d_k, so its typical magnitude grows like √d_k. Dividing by √d_k brings the variance back to 1 regardless of the dimension.
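The statistical argument can be checked empirically. The sketch below is a small simulation, assuming query and key entries drawn independently from a standard normal distribution (an idealization, not a property of trained weights). The function name `dot_product_variance`, the trial count, and the seed are illustrative choices.

```python
import random
import statistics


def dot_product_variance(d_k, trials=5000, seed=0):
    """Empirical variance of q.k when q and k have i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pvariance(dots)


variances = {d_k: dot_product_variance(d_k) for d_k in (2, 16, 64)}
for d_k, var in variances.items():
    # Dividing every score by sqrt(d_k) divides the variance by d_k.
    print(f"d_k={d_k:3d}  var(q.k)={var:7.2f}  var(q.k / sqrt(d_k))={var / d_k:5.2f}")
```

The unscaled variance should come out close to d_k in each case, and the scaled variance close to 1.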

Ŝ = S / √2:
[[ 0.1068,  0.1581,  0.0167],
 [-0.0409,  0.1254, -0.0626],
 [ 0.2923, -0.3162,  0.2720]]

Step 3: Softmax over each row

Softmax converts each row of scores into a probability distribution. For a row vector s = [s₁, s₂, ..., sₙ]:

softmax(s)_i = exp(s_i) / Σ_j exp(s_j)

Row 0: Ŝ₀ = [0.1068, 0.1581, 0.0167]
exp(0.1068) ≈ 1.1127, exp(0.1581) ≈ 1.1712, exp(0.0167) ≈ 1.0168
Sum = 3.3007
A₀ = [1.1127/3.3007, 1.1712/3.3007, 1.0168/3.3007] = [0.3371, 0.3549, 0.3081]

Row 1: Ŝ₁ = [-0.0409, 0.1254, -0.0626]
exp(-0.0409) ≈ 0.9599, exp(0.1254) ≈ 1.1336, exp(-0.0626) ≈ 0.9393
Sum = 3.0328
A₁ = [0.3165, 0.3738, 0.3097]

Row 2: Ŝ₂ = [0.2923, -0.3162, 0.2720]
exp(0.2923) ≈ 1.3393, exp(-0.3162) ≈ 0.7288, exp(0.2720) ≈ 1.3126
Sum = 3.3807
A₂ = [0.3963, 0.2156, 0.3882]

Attention weight matrix A = softmax(Ŝ):
[[0.3371, 0.3549, 0.3081],
 [0.3165, 0.3738, 0.3097],
 [0.3963, 0.2156, 0.3882]]

Each row of A sums to 1.0. Read row i as the distribution of attention that token i pays to each other token (including itself). Notice that in our toy example with random weights the distributions are fairly uniform — no token dominates. In a trained model, these weights become sharp: a pronoun might attend heavily to the noun it refers to; a verb might attend to its subject.

Step 4: Weighted sum with V

The final attention output is the weighted sum of Value vectors, with weights given by A:

Attention(Q, K, V) = A · V

A ∈ ℝ^(3×3), V ∈ ℝ^(3×2), so output Z ∈ ℝ^(3×2)

Row 0 ("the"):
Z₀₁ = (0.3371)(0.34)+(0.3549)(0.02)+(0.3081)(-0.13) = 0.1146+0.0071-0.0401 = 0.0816
Z₀₂ = (0.3371)(0.36)+(0.3549)(-0.07)+(0.3081)(0.15) = 0.1214-0.0248+0.0462 = 0.1428

Row 1 ("cat"):
Z₁₁ = (0.3165)(0.34)+(0.3738)(0.02)+(0.3097)(-0.13) = 0.1076+0.0075-0.0403 = 0.0748
Z₁₂ = (0.3165)(0.36)+(0.3738)(-0.07)+(0.3097)(0.15) = 0.1139-0.0262+0.0465 = 0.1342

Row 2 ("sat"):
Z₂₁ = (0.3963)(0.34)+(0.2156)(0.02)+(0.3882)(-0.13) = 0.1347+0.0043-0.0505 = 0.0885
Z₂₂ = (0.3963)(0.36)+(0.2156)(-0.07)+(0.3882)(0.15) = 0.1427-0.0151+0.0582 = 0.1858

Z = A·V:
[[0.0816, 0.1428],
 [0.0748, 0.1342],
 [0.0885, 0.1858]]

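This is the output of one self-attention head: a new 2-dimensional representation for each of our three tokens, enriched by information from every other position in the sequence. One caveat: decoder-only language models such as GPT apply a causal mask before the softmax, setting the scores of later positions to −∞ so that a token can attend only to itself and to earlier tokens. We omit the mask here to keep the arithmetic short.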
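The four steps can be collected into one function. This is a minimal single-head sketch without masking, using the rounded Q, K, and V from the worked example; the helper names `matmul`, `softmax`, and `attention` are illustrative. Because the code does not round intermediate values, its output can differ from the hand computation in the fourth decimal place.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Softmax of one row; subtracting the max avoids overflow."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, without masking."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]                             # transpose K
    scores = matmul(Q, K_T)                                          # step 1
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # step 2
    A = [softmax(row) for row in scaled]                             # step 3
    return A, matmul(A, V)                                           # step 4


Q = [[0.34, 0.35], [0.23, -0.09], [-0.52, 0.78]]
K = [[-0.06, 0.49], [0.74, -0.08], [-0.26, 0.32]]
V = [[0.34, 0.36], [0.02, -0.07], [-0.13, 0.15]]

A, Z = attention(Q, K, V)
for row in Z:
    print([round(v, 4) for v in row])
```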

3. Multi-Head Attention: Running Attention in Parallel

Single-head attention learns one kind of relationship between tokens: maybe it learns to track grammatical agreement, or co-reference, or proximity. But language is layered — a single attention pattern cannot simultaneously capture syntax, semantics, and discourse structure.

Multi-head attention solves this by running h independent attention operations in parallel, each with its own W_Q^(i), W_K^(i), W_V^(i) projection matrices [Vaswani et al., 2017]. Each head sees the same input X but projects it into a different subspace before computing attention.

The outputs of all h heads are concatenated along the feature dimension and then projected back to d-dimensional space via a learned output projection W_O ∈ ℝ^(h·d_v × d):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O

where head_i = Attention(X·W_Q^(i), X·W_K^(i), X·W_V^(i))

In GPT-3, h = 96 and d_k = d_v = 128 [Brown et al., 2020]. The concatenated head outputs have dimension 96 × 128 = 12,288, which matches d exactly. The output projection W_O then blends the views of all 96 heads into a single coherent representation.

Re-deriving the full arithmetic for multi-head attention would triple the length of this post without adding new conceptual content. The computation inside each head is identical to what we worked through above. The key insight is additive: more heads give the model more simultaneous "points of view" on the sequence, and the output projection learns to integrate them.
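A compact sketch still shows how the pieces fit together. It assumes two heads with d_k = d_v = 2, so the concatenated output has dimension 4 = d. Head 1 is the toy head from the previous section. Head 2 and W_O are hypothetical: head 2 simply reuses the same three matrices in rotated roles, and W_O is set to the identity so that the concatenation is easy to inspect. None of these choices come from a trained model.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X, W_Q, W_K, W_V):
    """One head: project X, then apply scaled dot-product attention."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(A, V)


def multi_head(X, heads, W_O):
    """Run each head on X, concatenate per token, then project with W_O."""
    outputs = [attention_head(X, *weights) for weights in heads]
    concat = [sum((out[t] for out in outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)


X = [[0.2, 0.4, -0.1, 0.3], [0.5, -0.2, 0.6, 0.1], [-0.3, 0.7, 0.2, -0.4]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [-0.5, 0.2], [0.3, -0.1]]
W_K = [[0.5, 0.2], [-0.3, 0.8], [0.7, -0.1], [0.1, 0.4]]
W_V = [[0.6, -0.2], [0.3, 0.5], [-0.4, 0.1], [0.2, 0.7]]

head_1 = (W_Q, W_K, W_V)  # the toy head worked through above
head_2 = (W_K, W_V, W_Q)  # hypothetical second head: same matrices, roles rotated
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity (assumption)

out = multi_head(X, [head_1, head_2], W_O)
for row in out:
    print([round(v, 4) for v in row])
```

With the identity as W_O, the first two columns of each output row are head 1's result and the last two are head 2's. A learned W_O would mix them.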

4. Feed-Forward Network, Residual Connections, and Layer Normalization

After multi-head attention, each token's representation passes through three more operations before reaching the next layer: a position-wise feed-forward network (FFN), a residual connection, and layer normalization. We treat each in turn.

The position-wise feed-forward network

The FFN in the original transformer applies the same two-layer fully connected network independently to each token position [Vaswani et al., 2017]:

FFN(x) = W₂ · σ(W₁·x + b₁) + b₂

where:
σ is a non-linearity (ReLU in the original transformer, GELU in GPT-2 and later)
W₁ ∈ ℝ^(d_ff × d) expands to the inner dimension d_ff (typically 4d)
W₂ ∈ ℝ^(d × d_ff) projects back down to d

Why expand and then contract? The wider inner layer gives each position extra capacity to store and retrieve learned associations, behaving more like a key-value memory than like attention. Geva et al. [2021] present evidence that FFN layers act as key-value memories, though that is a separate post.

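For our worked example, we apply the FFN to the raw embedding of "sat", x₂ = [-0.3, 0.7, 0.2, -0.4], rather than to the attention output. In a full layer the FFN input would be the attention output after the W_O projection, the residual connection, and layer normalization; the attention output Z above is only 2-dimensional, so we skip that step to keep the arithmetic short. We use a tiny inner dimension d_ff = 3 and ReLU for clarity (ReLU(x) = max(0, x)).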

Toy FFN weights:

W₁ (3×4):
[ 0.5, -0.3,  0.4,  0.2 ]
[-0.2,  0.8, -0.1,  0.6 ]
[ 0.3,  0.1,  0.7, -0.5 ]
b₁ = [0.1, -0.1, 0.0]

W₂ (4×3):
[ 0.4, -0.3,  0.5 ]
[ 0.2,  0.6, -0.2 ]
[-0.1,  0.4,  0.3 ]
[ 0.7, -0.2,  0.1 ]
b₂ = [0.0, 0.0, 0.0, 0.0]

Step 1: Compute h = W₁·x₂ + b₁ (where x₂ = [-0.3, 0.7, 0.2, -0.4])

h₁ = (0.5)(-0.3)+(-0.3)(0.7)+(0.4)(0.2)+(0.2)(-0.4)+0.1 = -0.15-0.21+0.08-0.08+0.10 = -0.26
h₂ = (-0.2)(-0.3)+(0.8)(0.7)+(-0.1)(0.2)+(0.6)(-0.4)-0.1 = 0.06+0.56-0.02-0.24-0.10 = 0.26
h₃ = (0.3)(-0.3)+(0.1)(0.7)+(0.7)(0.2)+(-0.5)(-0.4)+0.0 = -0.09+0.07+0.14+0.20+0.00 = 0.32

h = [-0.26, 0.26, 0.32]

Step 2: Apply ReLU — ReLU(x) = max(0, x)

h⁺ = ReLU(h) = [max(0,-0.26), max(0,0.26), max(0,0.32)] = [0.00, 0.26, 0.32]

Step 3: Project back with W₂

FFN(x₂) = W₂·h⁺ + b₂

FFN₁ = (0.4)(0.00)+(-0.3)(0.26)+(0.5)(0.32) = 0.00-0.078+0.160 = 0.082
FFN₂ = (0.2)(0.00)+(0.6)(0.26)+(-0.2)(0.32) = 0.00+0.156-0.064 = 0.092
FFN₃ = (-0.1)(0.00)+(0.4)(0.26)+(0.3)(0.32) = 0.00+0.104+0.096 = 0.200
FFN₄ = (0.7)(0.00)+(-0.2)(0.26)+(0.1)(0.32) = 0.00-0.052+0.032 = -0.020

FFN(x₂) = [0.082, 0.092, 0.200, -0.020]
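The three FFN steps translate directly into code. The sketch below uses the toy weights from this section and treats x as a column vector, matching the formula FFN(x) = W₂·ReLU(W₁·x + b₁) + b₂; the helper names `matvec` and `ffn` are illustrative.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by column vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 . ReLU(W1 . x + b1) + b2 for a single token vector x."""
    pre = [h + b for h, b in zip(matvec(W1, x), b1)]  # expand to d_ff
    hidden = [max(0.0, h) for h in pre]               # ReLU
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]  # project back to d


W1 = [[0.5, -0.3, 0.4, 0.2], [-0.2, 0.8, -0.1, 0.6], [0.3, 0.1, 0.7, -0.5]]
b1 = [0.1, -0.1, 0.0]
W2 = [[0.4, -0.3, 0.5], [0.2, 0.6, -0.2], [-0.1, 0.4, 0.3], [0.7, -0.2, 0.1]]
b2 = [0.0, 0.0, 0.0, 0.0]
x2 = [-0.3, 0.7, 0.2, -0.4]  # embedding of "sat"

ffn_out = ffn(x2, W1, b1, W2, b2)
print([round(v, 3) for v in ffn_out])
```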

Residual connections

Both the attention sublayer and the FFN sublayer are wrapped in a residual connection [He et al., 2016], also called a skip connection:

y = Sublayer(x) + x

The sublayer output is added element-wise to the original input before being passed to layer normalization. For our "sat" token:

y₂ = FFN(x₂) + x₂
   = [0.082, 0.092, 0.200, -0.020] + [-0.300, 0.700, 0.200, -0.400]
   = [-0.218, 0.792, 0.400, -0.420]

Residual connections are not a cosmetic addition. They enable gradients to flow directly from later layers back to earlier ones without vanishing — the critical property that allows very deep networks (hundreds of layers) to train effectively [He et al., 2016].

Layer Normalization

Before passing the residual output to the next sublayer, the transformer applies Layer Normalization (LayerNorm) [Ba et al., 2016]. Unlike Batch Normalization, which normalizes across the batch dimension, LayerNorm normalizes across the feature dimension within a single training example. This makes it well-suited to variable-length sequences.

LayerNorm(y) = γ ⊙ (y - μ) / √(σ² + ε) + β

where:
μ, σ² = mean and variance of y across the d feature dimensions
ε = small constant for numerical stability (typically 10⁻⁵)
γ, β ∈ ℝ^d = learned scale and shift (initialized to all-ones and all-zeros)

Worked example: LayerNorm on y₂ = [-0.218, 0.792, 0.400, -0.420]

Mean:
μ = (-0.218 + 0.792 + 0.400 + (-0.420)) / 4 = 0.554 / 4 = 0.1385

Variance:
σ² = [(-0.218-0.1385)² + (0.792-0.1385)² + (0.400-0.1385)² + (-0.420-0.1385)²] / 4
   = [(-0.3565)² + (0.6535)² + (0.2615)² + (-0.5585)²] / 4
   = [0.1271 + 0.4271 + 0.0684 + 0.3119] / 4
   = 0.9345 / 4 = 0.2336

Standard deviation: σ = √0.2336 ≈ 0.4833

Normalized values (γ=[1,1,1,1], β=[0,0,0,0], ε ignored for brevity):
ŷ₁ = (-0.218 - 0.1385) / 0.4833 = -0.3565 / 0.4833 ≈ -0.7376
ŷ₂ = (0.792 - 0.1385) / 0.4833 = 0.6535 / 0.4833 ≈ 1.3522
ŷ₃ = (0.400 - 0.1385) / 0.4833 = 0.2615 / 0.4833 ≈ 0.5411
ŷ₄ = (-0.420 - 0.1385) / 0.4833 = -0.5585 / 0.4833 ≈ -1.1556

LayerNorm(y₂) ≈ [-0.738, 1.352, 0.541, -1.156]

Verify: mean ≈ 0, variance ≈ 1 ✓
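The residual addition and the normalization fit in a few lines. This is a minimal sketch, assuming γ is all ones and β is all zeros as in the worked example; unlike the hand computation it keeps ε = 10⁻⁵, so results can differ slightly in the fourth decimal place. The function name `layer_norm` is illustrative.

```python
import math


def layer_norm(y, gamma, beta, eps=1e-5):
    """Normalize y across its feature dimension, then scale and shift."""
    mu = sum(y) / len(y)
    var = sum((v - mu) ** 2 for v in y) / len(y)  # population variance
    return [g * (v - mu) / math.sqrt(var + eps) + b for v, g, b in zip(y, gamma, beta)]


x2 = [-0.3, 0.7, 0.2, -0.4]             # sublayer input: embedding of "sat"
ffn_out = [0.082, 0.092, 0.200, -0.020]  # sublayer output from the FFN example

y2 = [f + x for f, x in zip(ffn_out, x2)]  # residual connection
normed = layer_norm(y2, gamma=[1.0] * 4, beta=[0.0] * 4)

print([round(v, 3) for v in y2])
print([round(v, 3) for v in normed])
```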

5. Output Projection and Softmax Over the Vocabulary

After N transformer layers have processed the input sequence, the final hidden state of the last token position — call it h_last ∈ ℝ^d — is the representation the model uses to predict the next token.

This representation is projected to logit scores over the entire vocabulary via a learned matrix W_out ∈ ℝ^(|V| × d):

ℓ = W_out · h_last
P(next token = t | context) = softmax(ℓ)_t = exp(ℓ_t) / Σ_j exp(ℓ_j)

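In many implementations, W_out is tied to (i.e., shares weights with) the embedding matrix E, a technique called weight tying that reduces the parameter count and can improve perplexity [Press & Wolf, 2017]. GPT-2 ties these weights, though not every later model does. In the worked example below, W_out is a separate toy matrix rather than a copy of E.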

Worked example — using LayerNorm output from previous section as h_last:

h_last = [-0.738, 1.352, 0.541, -1.156]

W_out (5×4):
[ 0.3, -0.2,  0.5,  0.1 ]  ← "the"
[-0.1,  0.6, -0.3,  0.4 ]  ← "cat"
[ 0.4,  0.2,  0.1, -0.2 ]  ← "sat"
[ 0.2,  0.5,  0.3,  0.6 ]  ← "on"
[-0.3,  0.1,  0.4,  0.2 ]  ← "mat"

Logits ℓ = W_out · h_last:
ℓ₀ ("the") = (0.3)(-0.738)+(-0.2)(1.352)+(0.5)(0.541)+(0.1)(-1.156) = -0.221-0.270+0.271-0.116 = -0.336
ℓ₁ ("cat") = (-0.1)(-0.738)+(0.6)(1.352)+(-0.3)(0.541)+(0.4)(-1.156) = 0.074+0.811-0.162-0.462 = 0.261
ℓ₂ ("sat") = (0.4)(-0.738)+(0.2)(1.352)+(0.1)(0.541)+(-0.2)(-1.156) = -0.295+0.270+0.054+0.231 = 0.260
ℓ₃ ("on") = (0.2)(-0.738)+(0.5)(1.352)+(0.3)(0.541)+(0.6)(-1.156) = -0.148+0.676+0.162-0.694 = -0.004
ℓ₄ ("mat") = (-0.3)(-0.738)+(0.1)(1.352)+(0.4)(0.541)+(0.2)(-1.156) = 0.221+0.135+0.216-0.231 = 0.341

Logit vector: ℓ = [-0.336, 0.261, 0.260, -0.004, 0.341]

Softmax:
exp(-0.336) ≈ 0.7146, exp(0.261) ≈ 1.2982, exp(0.260) ≈ 1.2969, exp(-0.004) ≈ 0.9960, exp(0.341) ≈ 1.4063
Sum = 5.7120

P ≈ [0.1251, 0.2272, 0.2270, 0.1744, 0.2462]
     (the)   (cat)   (sat)   (on)    (mat)

Highest probability: "mat" (0.2462)

The model assigns its highest probability to "mat" (0.2462), followed very closely by "cat" and "sat". In a toy random model this is unsurprising — the probabilities are close to uniform. A trained model would be far more decisive: after "the cat sat on the ___", the probability mass for "mat" would dominate, perhaps exceeding 0.99.
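The output step is one matrix-vector product followed by a softmax. The sketch below uses the toy W_out and h_last from this section; the names `TOKENS` and `softmax` are illustrative. Because the code does not round each product before summing, its logits and probabilities can differ from the hand-computed ones in the third or fourth decimal place.

```python
import math

TOKENS = ["the", "cat", "sat", "on", "mat"]

W_out = [
    [0.3, -0.2, 0.5, 0.1],   # "the"
    [-0.1, 0.6, -0.3, 0.4],  # "cat"
    [0.4, 0.2, 0.1, -0.2],   # "sat"
    [0.2, 0.5, 0.3, 0.6],    # "on"
    [-0.3, 0.1, 0.4, 0.2],   # "mat"
]
h_last = [-0.738, 1.352, 0.541, -1.156]


def softmax(values):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# One logit per vocabulary entry: the dot product of that row with h_last.
logits = [sum(w * h for w, h in zip(row, h_last)) for row in W_out]
probs = softmax(logits)
prediction = TOKENS[probs.index(max(probs))]

print([round(l, 3) for l in logits])
print([round(p, 4) for p in probs])
print(prediction)
```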

6. Sampling: Temperature, Top-k, and Top-p

The probability vector above tells us how likely the model believes each next token to be. But it does not determine which token is actually chosen. That is the job of the sampling strategy.

Temperature scaling

Temperature T reshapes the probability distribution by dividing logits by T before applying softmax:

P_T(t) = exp(ℓ_t / T) / Σ_j exp(ℓ_j / T)

Using ℓ = [-0.336, 0.261, 0.260, -0.004, 0.341]:

T = 0.5 (sharpens; more decisive):
ℓ/0.5 = [-0.672, 0.522, 0.520, -0.008, 0.682]
P₀.₅ ≈ [0.0746, 0.2461, 0.2456, 0.1449, 0.2888] ← more peaked toward "mat"

T = 1.0 (baseline softmax):
P₁.₀ ≈ [0.1251, 0.2272, 0.2270, 0.1744, 0.2462]

T = 2.0 (flattens; more random/creative):
ℓ/2.0 = [-0.168, 0.1305, 0.130, -0.002, 0.1705]
P₂.₀ ≈ [0.1593, 0.2147, 0.2146, 0.1880, 0.2234] ← closer to uniform

Summary table:
Temp  | "the" | "cat" | "sat" | "on"  | "mat"
T=0.5 | 0.075 | 0.246 | 0.246 | 0.145 | 0.289
T=1.0 | 0.125 | 0.227 | 0.227 | 0.174 | 0.246
T=2.0 | 0.159 | 0.215 | 0.215 | 0.188 | 0.223

As T→0: the distribution collapses onto the argmax (greedy decoding)
As T→∞: all tokens become equally likely
Rule of thumb: T ≈ 0.7–1.0 is common for creative tasks and T ≈ 0.0–0.3 for factual tasks. Because the formula divides by T, a setting of T = 0 is implemented as greedy argmax rather than as a literal division.
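Temperature scaling is a one-line change to softmax. The sketch below applies it to the logit vector from the previous section; the function name `softmax_with_temperature` is illustrative, and T must be strictly positive.

```python
import math


def softmax_with_temperature(logits, T):
    """Divide logits by T (T > 0), then apply a numerically stable softmax."""
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.336, 0.261, 0.260, -0.004, 0.341]
by_temperature = {T: softmax_with_temperature(logits, T) for T in (0.5, 1.0, 2.0)}

for T, probs in by_temperature.items():
    print(T, [round(p, 4) for p in probs])
```

Lower temperatures move probability mass toward the largest logit, and higher temperatures spread it out; every row still sums to 1.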

Top-k sampling

Top-k restricts sampling to the k highest-probability tokens in the vocabulary, renormalizing among them. At k=3 in our example, we would sample only from {"mat", "cat", "sat"}, discarding "the" and "on".

Top-p (nucleus) sampling

Top-p (also called nucleus sampling, introduced by Holtzman et al. [2020]) selects the smallest set of tokens whose cumulative probability reaches or exceeds p, then renormalizes and samples from that set. At p=0.75 with T=1.0: sort probabilities descending: [0.2462, 0.2272, 0.2270, 0.1744, 0.1251]. Cumulative sums: [0.2462, 0.4734, 0.7004, 0.8748, 0.9999] (the last value is 1.0 up to rounding). The first three tokens reach only 0.7004, which is below 0.75, so the nucleus at p=0.75 is {"mat", "cat", "sat", "on"}: the fourth token brings the cumulative probability to 0.8748. The model samples from this set, and only "the" is excluded.
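Both filters can be written as small functions over the probability vector. This is a minimal sketch operating on the T = 1.0 probabilities; it uses the "cumulative mass reaches at least p" convention for the nucleus. The helper names `top_k_filter`, `top_p_filter`, and `sample`, and the fixed random seed, are illustrative.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely token IDs and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the shortest sorted prefix whose cumulative mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng):
    """Draw one token ID according to the renormalized probabilities."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


TOKENS = ["the", "cat", "sat", "on", "mat"]
probs = [0.1251, 0.2272, 0.2270, 0.1744, 0.2462]

top3 = top_k_filter(probs, k=3)
nucleus = top_p_filter(probs, p=0.75)
rng = random.Random(0)  # fixed seed so the sketch is repeatable

print(sorted(TOKENS[i] for i in top3))
print(sorted(TOKENS[i] for i in nucleus))
print(TOKENS[sample(nucleus, rng)])
```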

7. Training: Cross-Entropy Loss and Gradient Descent

During pre-training, the model learns by predicting the next token at every position in a large text corpus and adjusting its parameters whenever it gets the prediction wrong. The measure of wrongness is the cross-entropy loss.

Cross-entropy loss

If the ground-truth next token at a given position is token t*, the loss is:

L = -log P(t*) where P(t*) is the probability the model assigned to the correct token.

Worked example (log is the natural logarithm):
P = [0.1251, 0.2272, 0.2270, 0.1744, 0.2462]
Ground truth: "on" (t* = 3)
L = -log P(t*=3) = -log(0.1744) ≈ 1.7464

If the model had assigned P("on") = 0.99: L = -log(0.99) ≈ 0.0101 (very good)
If the model had assigned P("on") = 0.01: L = -log(0.01) ≈ 4.6052 (very bad)

The loss is large when the model assigns low probability to the correct answer and small when it assigns high probability. Minimizing average cross-entropy across all positions in the training corpus is exactly what makes the model predictively accurate.

Perplexity is often used as an evaluation metric. It is the exponential of the average cross-entropy loss over the evaluated tokens, exp(mean L), with the loss measured in nats as above. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 equally likely options.
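Loss and perplexity are each a single line of arithmetic. The sketch below uses the natural logarithm and the T = 1.0 probability vector from the output section; the function names `cross_entropy` and `perplexity` are illustrative.

```python
import math


def cross_entropy(probs, target):
    """Negative natural-log probability assigned to the correct token."""
    return -math.log(probs[target])


def perplexity(losses):
    """Exponential of the mean per-token cross-entropy loss."""
    return math.exp(sum(losses) / len(losses))


probs = [0.1251, 0.2272, 0.2270, 0.1744, 0.2462]
loss = cross_entropy(probs, target=3)  # ground truth: "on"

print(round(loss, 4))
print(round(perplexity([loss]), 4))
```

For a single position, the perplexity is just 1 / P(t*), which is why a model that spreads its probability evenly over n tokens has a perplexity of n.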

Gradient descent and backpropagation

With a loss computed, the model updates its parameters via gradient descent:

θ ← θ - η · ∇_θ L

where η is the learning rate and ∇_θ L is the gradient of the loss with respect to all parameters θ.

In practice, optimizers like Adam [Kingma & Ba, 2015] maintain per-parameter adaptive learning rates.

Backpropagation [Rumelhart et al., 1986] is the algorithm that computes ∇_θ L efficiently by applying the chain rule through every operation in the forward pass — from the output probabilities back through the output projection, the transformer layers, the attention operations, and the embedding lookup. The forward pass we computed above contains every operation that backpropagation must differentiate.

The important intuition: every weight matrix we used above (W_Q, W_K, W_V, W₁, W₂, W_out, and the embedding matrix E) has its gradient computed during backprop and is updated in the direction that decreases the loss on the current training batch. After hundreds of thousands of such updates, which together cover hundreds of billions to trillions of training tokens, weights that start out as arbitrary as our toy numbers become the precise, information-dense weights of a functioning language model.
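One gradient step can be worked by hand for the output layer alone. The sketch below relies on a standard result: for softmax followed by cross-entropy, the gradient of the loss with respect to logit j is p_j minus 1 if j is the target and p_j otherwise, so the gradient for W_out is the outer product of that vector with h_last. It updates only W_out and treats h_last as fixed; the learning rate 0.1 is a hypothetical value chosen for illustration, and plain gradient descent is used rather than Adam.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(W_out, h, target):
    """Cross-entropy loss for one position and its gradient w.r.t. W_out."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_out]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # d(loss)/d(logit_j) = p_j - 1[j == target]; the chain rule through
    # logit_j = W_out[j] . h gives one scaled copy of h per row.
    d_logits = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
    grad = [[d * x for x in h] for d in d_logits]
    return loss, grad


W_out = [
    [0.3, -0.2, 0.5, 0.1],
    [-0.1, 0.6, -0.3, 0.4],
    [0.4, 0.2, 0.1, -0.2],
    [0.2, 0.5, 0.3, 0.6],
    [-0.3, 0.1, 0.4, 0.2],
]
h_last = [-0.738, 1.352, 0.541, -1.156]
target = 3   # ground truth: "on"
eta = 0.1    # hypothetical learning rate

loss_before, grad = loss_and_grad(W_out, h_last, target)
W_new = [[w - eta * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W_out, grad)]
loss_after, _ = loss_and_grad(W_new, h_last, target)

print(round(loss_before, 4), round(loss_after, 4))
```

After the update the loss on this example is lower, because the row for "on" has moved toward h_last and the other rows have moved away from it.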

8. Scaling Laws: The Empirical Relationship Between Compute, Data, Parameters, and Loss

Everything we have described so far is fixed transformer architecture — the same operations regardless of model size. What determines how good the model gets is scale: how many parameters, how many training tokens, and how much compute is invested.

Kaplan et al. [2020] at OpenAI established that language model loss follows a clean power law relationship with model size N (number of parameters), dataset size D (training tokens), and compute budget C (floating point operations):

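L(N) ≈ (N_c / N)^α_N
L(D) ≈ (D_c / D)^α_D
L(C) ≈ (C_c / C)^α_C

where α_N ≈ 0.076, α_D ≈ 0.095, and α_C ≈ 0.050 are empirical estimates, and N_c, D_c, C_c are fitted reference constants. Each law describes the loss when the other two factors are not the bottleneck.

If you double the model size, the loss falls by a predictable fraction: L(2N) / L(N) = 2^(−α_N) = 2^(−0.076) ≈ 0.949, a reduction of about 5%.

Kaplan et al. report that some of these trends span more than seven orders of magnitude.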
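The power-law form makes "a predictable fraction" easy to compute, because the reference constant cancels in the ratio. This is a minimal sketch, assuming the single-variable law L(N) ≈ (N_c / N)^α_N and the exponents quoted above; the function name `loss_ratio` is illustrative.

```python
ALPHA_N = 0.076  # model-size exponent quoted in the text
ALPHA_D = 0.095  # dataset-size exponent quoted in the text


def loss_ratio(scale_factor, alpha):
    """L(scale_factor * N) / L(N) under L(N) = (N_c / N) ** alpha; N_c cancels."""
    return scale_factor ** (-alpha)


for factor in (2, 10, 100):
    print(factor, round(loss_ratio(factor, ALPHA_N), 4), round(loss_ratio(factor, ALPHA_D), 4))
```

The small exponents are why each constant-factor improvement in loss requires a multiplicative increase in model size or data.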

A consequential conclusion from the original scaling laws analysis: for a fixed compute budget, it is almost always better to train a larger model for fewer steps than a smaller model for more steps [Kaplan et al., 2020].

Hoffmann et al. [2022] at DeepMind revisited this conclusion with a more carefully controlled experimental design, training over 400 models of different sizes on different amounts of data and holding total compute fixed. Their finding — the Chinchilla result — overturned the Kaplan recommendation:

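For a given compute budget C ≈ 6·N·D (in FLOPs), the compute-optimal model size N* and number of training tokens D* both grow roughly as the square root of compute:

N* ∝ C^0.5, D* ∝ C^0.5

Practical implication: a compute-optimal model should train on approximately 20 tokens per parameter [Hoffmann et al., 2022]. Substituting D* ≈ 20·N* into C ≈ 6·N·D gives N* ≈ √(C / 120) and D* ≈ 20·N*.

GPT-3 (175B params) was trained on 300B tokens, far below the Chinchilla recommendation of 175B × 20 = 3.5 trillion tokens. Chinchilla (70B params, 1.4T tokens) outperformed both GPT-3 and Gopher (280B params). Relative to Gopher, which used the same training compute, it has 4× fewer parameters and is therefore cheaper at inference time.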
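The rule of thumb can be turned into a small calculator. This is a sketch under two stated assumptions: training compute is approximated as C ≈ 6·N·D FLOPs, and the compute-optimal ratio is taken to be 20 tokens per parameter. The function name `chinchilla_optimal` is illustrative.

```python
import math


def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Compute budget implied by Chinchilla's own configuration (70B params, 1.4T tokens).
C = 6 * 70e9 * 1.4e12
n_opt, d_opt = chinchilla_optimal(C)
print(f"{n_opt:.3e} parameters, {d_opt:.3e} tokens")

# Tokens the same rule would suggest for a GPT-3-sized model.
print(f"{175e9 * 20:.2e} tokens for 175B parameters")
```

Feeding Chinchilla's own budget back in recovers its 70B-parameter, 1.4T-token configuration, which is a consistency check on the two assumptions rather than an independent result.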

The Chinchilla paper reoriented how frontier labs allocate compute: the emphasis shifted from "make the model bigger" to "make the model bigger AND train it on more tokens in the right proportion." Later models such as Meta's LLaMA, Google's PaLM 2, and Mistral's 7B model are generally seen as reflecting this shift. LLaMA, for instance, trains its smaller models on well over 20 tokens per parameter so that they are cheaper to run at inference time.

What scaling laws do not say: they describe loss on next-token prediction, which correlates with but does not fully determine performance on downstream tasks. At sufficient scale, emergent capabilities appear that are not linearly predicted by loss alone — abilities like multi-step arithmetic, chain-of-thought reasoning, and in-context learning appear discontinuously as scale crosses certain thresholds [Wei et al., 2022]. The mathematical mechanisms that produce emergence are an active research frontier.

Key Takeaways

1. Text becomes vectors through an embedding lookup: each token ID indexes a row in a learned matrix E ∈ ℝ^(|V| × d). The vectors encode semantic relationships and are learned through training.

2. Self-attention is three linear projections plus a scaled dot-product: Q = X·W_Q, K = X·W_K, V = X·W_V, followed by softmax(Q·Kᵀ / √d_k)·V. The scaling by √d_k prevents gradient vanishing in high dimensions.

3. Multi-head attention runs this process in parallel h times, each head projecting into a different subspace and learning a different type of token relationship. Outputs are concatenated and projected back.

4. Feed-forward layers add capacity between attention layers: a two-layer MLP applied independently at each position. The inner dimension is typically 4× the embedding dimension, acting as associative memory.

5. Residual connections and LayerNorm are not optional: residuals enable gradient flow in deep networks; LayerNorm stabilizes training by normalizing feature activations across the feature dimension (not the batch dimension).

6. The output is a probability distribution over the vocabulary: logits from a linear projection are softmaxed. Temperature, top-k, and top-p control how that distribution is sampled during generation.

7. Training minimizes cross-entropy loss via gradient descent: the model assigns probability to the correct next token; backpropagation distributes the gradient of that loss through every parameter.

8. Scale follows power laws: loss decreases predictably with model size and data. The Chinchilla result shows that compute-optimal training requires roughly 20 training tokens per parameter — a result that has reshaped how frontier labs build models.

References

Primary Sources (peer-reviewed papers and preprints):

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS, 30. arXiv:1706.03762. https://arxiv.org/abs/1706.03762

[2] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. https://arxiv.org/abs/2001.08361

[3] Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556. https://arxiv.org/abs/2203.15556

[4] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450. https://arxiv.org/abs/1607.06450

[5] Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415. https://arxiv.org/abs/1606.08415

[6] Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS, 33. arXiv:2005.14165. https://arxiv.org/abs/2005.14165

[7] Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). ACL. arXiv:1508.07909. https://arxiv.org/abs/1508.07909

[8] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR. arXiv:1512.03385. https://arxiv.org/abs/1512.03385

[9] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arXiv:1409.0473. https://arxiv.org/abs/1409.0473

[10] Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration (nucleus sampling). ICLR 2020. arXiv:1904.09751. https://arxiv.org/abs/1904.09751

[11] Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980. https://arxiv.org/abs/1412.6980

[12] Press, O., & Wolf, L. (2017). Using the Output Embedding to Improve Language Models. EACL. arXiv:1608.05859. https://arxiv.org/abs/1608.05859

[13] Wei, J., Tay, Y., Bommasani, R., et al. (2022). Emergent Abilities of Large Language Models. TMLR. arXiv:2206.07682. https://arxiv.org/abs/2206.07682

[14] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 6 (Backpropagation). https://www.deeplearningbook.org/

[15] Alammar, J. (2018). The Illustrated Transformer. Blog post. https://jalammar.github.io/illustrated-transformer/

[16] Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed. draft). Chapter 10: Transformers and Large Language Models. https://web.stanford.edu/~jurafsky/slpdraft/

Discussion: What aspect of the transformer pipeline surprised you most when you saw the actual numbers? And which section would you like to see extended into a follow-up post — MoE architectures, the mathematics of RLHF, or speculative decoding? Let us know in the comments.
