How Language Models Work — and How We Teach Them

It's autocomplete that read everything

Strip away the hype and a language model does one humble thing: guess the next word. The magic is what it had to learn to get good at it.

You already use a tiny language model every day. When your phone keyboard suggests the next word, it's guessing from the few words you just typed. It's often wrong, because it has read almost nothing.

Now imagine that same idea, but the model has read a vast share of what humans have written: books, code, conversations, science, the whole messy internet, trillions of words in all. To get genuinely good at guessing the next word across all of that, it had no choice but to absorb grammar, facts, reasoning patterns, the structure of arguments, how code works. That absorbed understanding is the whole point. The "guess the next word" game is just how it was forced to learn.

Think of it like

A student who, to ace one weird exam — "predict the next word in any text ever written" — ended up having to learn a little bit of everything. The exam was trivial. The studying made them brilliant.

So whenever you read "the model generates text," picture this loop: it looks at everything so far, produces a ranked guess for the next word, picks one, adds it, and looks again. One word at a time, very fast. Everything in this guide is about either how that guessing machine is built (Act I) or how we shape its guesses to be helpful, well-mannered, and correct (Acts II–IV).
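The loop described above can be sketched with the phone-keyboard version of the idea. This is a minimal toy, assuming a made-up one-line corpus and a "model" that looks only at the previous word; `train_bigrams` and `generate` are illustrative names, not part of any library.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for every word, which words followed it in the text."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def generate(follows, start, max_words=8):
    """The loop: look at the last word, take the top guess, append, repeat."""
    out = [start]
    for _ in range(max_words):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # nothing ever followed this word in the corpus
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigrams(corpus)
print(generate(model, "the", max_words=3))
```

A real language model replaces the count table with a neural network and looks at the whole context rather than one word, but the outer loop has the same shape.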

Tokens: the puzzle pieces it reads

A model doesn't see letters or words. It sees chunks called tokens — and each one is really just a number.

Here's the first surprise: the model never sees the word "tokenization." It sees pieces — maybe token, iz, ation. These pieces are called tokens, and a typical token is about ¾ of a word. Common words are a single token; rare ones get split into parts; even an emoji or a snippet of code becomes tokens.

Why chop things up this way? It's a sweet spot. Whole words would need a dictionary of millions and still miss new ones. Single letters would make sequences painfully long. Sub-word pieces give you a manageable set that can still spell out anything.

Text → tokens → ID numbers. The model only ever sees the numbers.
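The pipeline in the caption (text, then tokens, then ID numbers) can be sketched with a hand-made vocabulary. Everything here is an assumption for illustration: the `VOCAB` table is invented, and greedy longest-match is a simplification, not the merge procedure real tokenizers use.

```python
# Hypothetical toy vocabulary: token text -> ID. Real vocabularies are learned.
VOCAB = {"token": 0, "ization": 1, "hello": 2, " world": 3, " ": 4,
         "t": 5, "o": 6, "k": 7, "e": 8, "n": 9}

def encode(text, vocab=VOCAB):
    """Greedy longest-match: at each position take the longest known piece."""
    ids, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

def decode(ids, vocab=VOCAB):
    """Map IDs back to their pieces and glue them together."""
    inverse = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(inverse[token_id] for token_id in ids)

print(encode("tokenization"))
print(encode("hello world"))
```

Note that the space travels with `" world"`, and that the model downstream would only ever receive the list of integers.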

The vocabulary

The full set of pieces a model is allowed to use is its vocabulary — usually 30,000 to 250,000 tokens, each with a fixed ID number. A bigger vocabulary means fewer pieces per sentence (faster, cheaper) but a heavier model at both ends. The vocabulary is built once by scanning a big pile of text and repeatedly merging the most common pairs of characters into tokens until the list is full.
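The "repeatedly merge the most common pair" procedure can be sketched in a few lines. This is a toy under stated assumptions: it works on characters of a six-word made-up corpus, whereas production tokenizers typically work on bytes, pre-split the text, and run tens of thousands of merges. `learn_merges` is an illustrative name.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Toy byte-pair-encoding trainer over a list of words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
        merges.append(best)
        # Rewrite the corpus with that pair fused into one symbol.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges, corpus

merges, corpus = learn_merges(["low", "low", "lower", "lowest", "new", "newer"], 3)
print(merges)
```

After two merges the frequent word "low" has become a single symbol, which is the same effect that makes common words single tokens in a real vocabulary.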

Why you'll care later

If you ever add new words to a model (special tags, domain jargon), you're growing this vocabulary, and those new pieces start as meaningless noise until you train them. That's a real gotcha we'll meet again in fine-tuning.

+Go deeper
What tokens really look like, what a real vocabulary file is, and the weird failures it causes

What tokens actually look like

Run a few strings through a typical tokenizer (roughly — exact splits vary by model):

Text	Tokens	What to notice
`hello world`	`["hello", " world"]`	The space is glued to the front of " world". Spaces live inside tokens.
`tokenization`	`["token", "ization"]`	Common stem + ending. Frequent chunks stay whole.
`antidisestablish…`	`["ant","idis","establish","ment","arian","ism"]`	A rare word shatters into many pieces.
`The` · `the` · ` the`	three different IDs	Capitalisation and a leading space each change the token.
`1234567`	`["123","4567"]`	Numbers chop inconsistently — a big reason arithmetic is shaky.
an emoji / 中文	several tokens	Under the hood it's bytes, so non-Latin text costs more tokens.

Common words are a single token; rare or invented ones get spelled out in sub-word pieces. English averages about ¾ of a word (~4 characters) per token.

What a real vocabulary file looks like

The vocabulary isn't an abstraction — it's a file on disk, and one common format (OpenAI's tiktoken) stores every token as base64 of its bytes, paired with its ID. Decode the base64 and the actual token text pops out:

IHRoZQ==   1169    # → " the"   (note the leading space is part of it)
aGVsbG8=   31373   # → "hello"
0LTQsA==   48745   # → "да"     (Cyrillic: two UTF-8 bytes per letter)

The base64 here is a storage wrapper, not the tokenization: tokens are raw bytes, and they include things that wreck a plain text file (leading spaces, newlines, punctuation, and partial multi-byte sequences for Cyrillic or emoji). Base64 turns any bytes into safe printable ASCII so the file stays clean. The vocabulary, written down — nothing more.
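Decoding such a file takes only the standard library. The sketch below assumes the three sample lines shown above (with their IDs copied as-is) and a simple "base64, whitespace, ID" layout; `load_vocab` is an illustrative helper, not the tiktoken API.

```python
import base64

# The three sample lines from above, without the trailing comments.
SAMPLE = """IHRoZQ== 1169
aGVsbG8= 31373
0LTQsA== 48745"""

def load_vocab(text):
    """Parse 'base64 id' lines into a {token_bytes: id} dictionary."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        b64, token_id = line.split()
        vocab[base64.b64decode(b64)] = int(token_id)
    return vocab

vocab = load_vocab(SAMPLE)
for token_bytes, token_id in vocab.items():
    # errors="replace" because a real token may be a partial multi-byte sequence
    print(token_id, token_bytes, repr(token_bytes.decode("utf-8", errors="replace")))
```

The keys are bytes, not strings, which is the point: a token does not have to be valid text on its own.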

Three conventions you'll meet in the wild

Format	How a token is stored
tiktoken	base64 of the bytes + an ID number (the file you saw).
Hugging Face `vocab.json`	Visible placeholder characters — a leading space becomes `Ġ`, so `" the"` shows as `Ġthe`.
SentencePiece `.model`	A binary protobuf — not human-readable at all.

And cp1251 (Windows-1251) is a different kind of encoding: it maps bytes ↔ Cyrillic letters (which characters bytes mean), whereas base64 maps bytes ↔ safe ASCII (how to store bytes). Seeing both around a Russian text file makes sense: one says "these bytes are Cyrillic," the other says "here are the token bytes, packed for the file."
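The difference between the two kinds of encoding shows up directly in Python. A small sketch, using the word "да" from the sample above; the variable names are illustrative.

```python
import base64

word = "да"

# Character encodings: which bytes stand for which letters.
utf8_bytes = word.encode("utf-8")      # two bytes per Cyrillic letter
cp1251_bytes = word.encode("cp1251")   # one byte per Cyrillic letter

# Base64: a printable wrapper around whatever bytes you give it.
stored = base64.b64encode(utf8_bytes).decode("ascii")

print(len(utf8_bytes), len(cp1251_bytes), stored)
```

Base64 never decides what the bytes mean; it only makes them safe to write into a text file, and decoding it hands back exactly the same bytes.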

The thing to remember

If you ever open a tokenizer and see base64 strings next to numbers (or odd Ġthe tokens), you're looking at the vocabulary itself: the model's entire alphabet, serialized.

Why it matters — and your security angle

Cost & context are measured in tokens — you're billed and capped by them, and a bigger vocabulary means fewer tokens per sentence.
Arithmetic & letter-counting stumble because the model sees chunks, not digits or letters — "how many r's in strawberry?" is hard when it sees str/aw/berry.
Other languages cost more tokens (they're under-represented in the vocabulary), so they're pricier and sometimes weaker.
It's an attack surface. Odd Unicode and token boundaries enable "token smuggling" past filters, and glitch tokens — strings like the infamous SolidGoldMagikarp that sat in the vocabulary but were barely trained — make models behave bizarrely when they appear. A genuinely interesting red-team thread.

Say it in one breath

Tokens are learned, meaningful sub-word chunks, not a fixed mechanical re-encoding like Base64, and the way text gets chopped quietly explains a lot of model weirdness and a few real exploits.

Embeddings: turning words into a map of meaning

A number like "7782" means nothing on its own. The first thing the model does is turn each token into coordinates on a vast invisible map — where distance is meaning.

A token ID is just a name tag; it carries no meaning. So the model's very first layer looks each ID up in a giant table and replaces it with a long list of numbers — often thousands of them. Think of those numbers as coordinates. They place the token as a single point on an enormous map.

This map isn't drawn by hand. The model arranges it during training so that tokens used in similar ways end up near each other. "King" and "queen" land in the same neighbourhood. "Banana" is far away in fruit-country. Even relationships become directions you can travel: the step from "man" to "woman" is roughly the same step as "king" to "queen." That single list of coordinates is called an embedding.

Real maps have thousands of dimensions; here's the gist in two. Nearness = similarity; direction = relationship.

This is why models feel like they "understand" — they're doing geometry on meaning. It's also the engine behind search and the "look things up" trick we'll meet in Act IV: to find relevant text, you just look for nearby points on the map.
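"Nearness = similarity; direction = relationship" can be tried on a hand-drawn map. The coordinates below are invented for illustration (real embeddings are learned and have hundreds or thousands of dimensions), and `cosine` and `nearest` are illustrative helpers.

```python
import math

# Hand-made 2-D "embeddings", invented for illustration only.
EMB = {
    "king":   (0.9, 0.8),
    "queen":  (0.9, 0.2),
    "man":    (0.5, 0.8),
    "woman":  (0.5, 0.2),
    "banana": (-0.8, 0.1),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, negative means opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(vec, exclude=()):
    """Return the word whose embedding points most nearly the same way as vec."""
    return max((w for w in EMB if w not in exclude),
               key=lambda w: cosine(EMB[w], vec))

# Travel the "man -> woman" direction, starting from "king".
target = tuple(k - m + w for k, m, w in zip(EMB["king"], EMB["man"], EMB["woman"]))
print(nearest(target, exclude={"king", "man", "woman"}))
print(cosine(EMB["king"], EMB["queen"]), cosine(EMB["king"], EMB["banana"]))
```

Retrieval works the same way: embed the query, then return whichever stored points score highest on `cosine`.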

One subtlety worth keeping

The map gives each token a fixed starting point: "bank" begins in the same spot whether you mean a river or money. The next chapter is about how the model un-sticks that, letting context pull "bank" toward the right meaning.

+Go deeper
Vocabulary, the embedding matrix, static vs contextual, and where the weights live

Vocabulary, in depth

The fixed set of tokens (30k–250k), each with an ID. It's built once by scanning text and repeatedly merging the most common character pairs (Byte-Pair Encoding). Byte-level BPE means anything decomposes to bytes — no "unknown token." Beyond words there are control tokens (begin/end, and chat markers like <|im_start|> that delimit roles). Add new tokens and you must train their fresh embedding rows, or they stay random noise.

The embedding matrix

A lookup table of shape [vocabulary × hidden] — pick a token's row, get its vector. The width hidden (768 small → 4096–8192 large) is the model's "thickness" and stays constant through every layer. The vectors start random and the meaning-geometry is learned during pretraining.

Static vs contextual — the nuance most people miss

The embedding-matrix vector is static: "bank" is the same everywhere. As it flows up through attention, its hidden state becomes contextual — "river bank" and "bank account" diverge. (Old word2vec = static; transformer hidden states = contextual.) At the very end, an output head of shape [hidden × vocabulary] turns the final vector back into a score per token — and it's often the same matrix as the input embedding ("weight tying"), saving a big chunk of parameters.
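A minimal sketch of the lookup and of weight tying, assuming a tiny random matrix (6 tokens, width 4) in place of a trained one. The sizes and the names `embed` and `output_scores` are hypothetical.

```python
import random

random.seed(0)
VOCAB_SIZE, HIDDEN = 6, 4

# Embedding matrix of shape [vocabulary x hidden]; random before training.
embedding = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Static lookup: each ID simply selects its row."""
    return [embedding[t] for t in token_ids]

def output_scores(hidden_vec):
    """Tied output head: score each token by a dot product with its own row."""
    return [sum(h * e for h, e in zip(hidden_vec, row)) for row in embedding]

vectors = embed([3, 1, 3])
scores = output_scores(vectors[0])
print(len(vectors), len(scores))
```

Token 3 gets the identical vector at both positions, which is the "static" part; only the layers above the lookup can make the two occurrences differ.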

Sentence embeddings & embedding models

Pool a whole text's token vectors into one vector and you can compare entire documents by meaning. Dedicated embedding models (E5, BGE, OpenAI's, your VibeGuard ONNX model) are trained so similar meanings sit close. This is the RAG engine (Ch 13): a fast bi-encoder retrieves candidates, a precise cross-encoder reranks them.

Where the weights actually live

Component	Share
Feed-forward (FFN) matrices	the largest slice (~⅔) — most of the "knowledge"
Attention projections (Q/K/V/O)	next biggest
Embedding + output head	scales with vocab (big vocab = big matrices)

Full fine-tuning nudges all of these; LoRA adds small notes without touching them; quantization stores the same numbers smaller. The weights, collectively, are the model's knowledge.
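The shares in the table can be checked with back-of-envelope arithmetic. This sketch assumes a classic block (four hidden × hidden attention projections, a feed-forward step that widens 4× and narrows back, no biases or norm weights, tied embeddings); the example shape of 4096 wide, 32 layers and a 32,000-token vocabulary is hypothetical.

```python
def param_breakdown(hidden, layers, vocab, ffn_mult=4, tied=True):
    """Rough parameter counts per component, under the assumptions above."""
    attention = 4 * hidden * hidden * layers          # Q, K, V, O projections
    ffn = 2 * hidden * (ffn_mult * hidden) * layers   # widen, then narrow
    embeddings = vocab * hidden * (1 if tied else 2)  # input table (+ output head)
    total = attention + ffn + embeddings
    return {"attention": attention, "ffn": ffn,
            "embeddings": embeddings, "total": total}

counts = param_breakdown(hidden=4096, layers=32, vocab=32_000)
for name, n in counts.items():
    print(f"{name:>10}: {n / 1e9:.2f}B  ({n / counts['total']:.0%})")
```

Under these assumptions the feed-forward matrices come out at about two thirds of the total. Gated feed-forward variants and grouped attention shift the exact split.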

Say it in one breath

Vocabulary = the alphabet of tokens; embeddings = a learned lookup turning each into a meaning-vector; weights = all the learned numbers, mostly in the feed-forward layers; and fine-tuning just nudges them.

Attention: how words look at each other

This is the idea that made modern AI possible. It's simpler than it sounds: to understand a word, let it glance at the other words and decide which ones matter.

Read this sentence: "The trophy didn't fit in the suitcase because it was too big." What does "it" mean — the trophy or the suitcase? You know instantly it's the trophy. How? You let "it" look back at the other words and lean on the one that makes sense.

That move — a word gathering meaning from other words — is called attention, and a model does it for every token at once. Each word looks around, decides how much to "pay attention" to every other word, and rebuilds its own meaning as a blend of the ones it cared about. After this, "bank" near "river" has quietly absorbed some "river," and now means the right thing.

Think of it like

Every word fills out a tiny dating profile. A Query says "here's what I'm looking for." A Key says "here's what I offer." A Value is "here's what I'll actually share if you pick me." Each word matches its query against everyone's keys, then blends the values of whoever matched best. "It" is looking for a thing-that-can-be-big; "trophy" advertises exactly that; they match; "it" pulls in "trophy."

"it" spreads its attention across the sentence, but pours most of it onto "trophy" — so that's what "it" comes to mean.

Stacking it up: the transformer block

Attention is one half of the model's repeating unit. The other half is a small per-word "thinking" step called a feed-forward network — once a word has gathered context, this step processes it on its own. Wrap both in "shortcuts" (so nothing important gets lost) and a bit of math hygiene to keep numbers stable, and you have a transformer block. Stack that block dozens of times and you have the model.

Attention mixes words together; the feed-forward step thinks about each one; shortcuts keep the original safe. Repeat ~30–80 times.

Three shapes: encoder, decoder, or both

Once you can stack blocks, one quiet design choice splits the whole field into three families: which direction may the words look? Let every word look both ways and you get a brilliant reader that can't write. Let each word look only backwards and you get a writer — because to honestly generate the next word, you must not be allowed to peek at it. Bolt a reader onto a writer and you get a translator.

📖

Encoder-only

Reads the whole input at once, in both directions, and builds a rich representation of it. Superb for classification, search, and embeddings — but it doesn't generate text. BERT.

✍️

Decoder-only

Predicts the next token given everything before it; a causal mask stops it peeking ahead. This is every modern chat model. GPT, Claude, Llama, Mistral.

🔁

Encoder-decoder

An encoder digests the input; a decoder writes the output while glancing back at it. Natural for translation and summarisation — sequence in, sequence out. The original Transformer, T5.

Here's a common surprise: the assistants you actually chat with — GPT, Claude, Llama, Mistral — are decoder-only, not encoder-decoder. There's no separate "understanding" module that reads your message before a "writing" module replies. One causal stack does both jobs: it understands your prompt simply by predicting its way through it, then keeps predicting to produce the answer. The encoder-decoder shape is the translation-era design (it's what the original 2017 Transformer paper built); the encoder-only shape lives on quietly inside search and the embedding models from Chapter 3.

Why decoder-only won the chat era

One architecture, one training game, every task. Translation, summarisation, Q&A, code: all of it can be phrased as "continue this text," so the simplest shape that scales ate the specialised ones. When you hear "causal language model" or "autoregressive," it means exactly this: decoder-only, left to right, no peeking.

+Go deeper
Encoder vs decoder vs encoder-decoder — the three families, mapped

Shape	Each word sees	Trained by	Generates?	Examples
Encoder-only	everything, both directions	fill-in-the-blank (masked tokens)	no	BERT, RoBERTa, most embedding models
Decoder-only	only what came before (causal mask)	predict the next token	yes — left to right	GPT, Claude, Llama, Mistral, Qwen
Encoder-decoder	encoder: everything · decoder: its own past + the encoder's output	map input sequence → output sequence	yes, from a digested input	original Transformer, T5, translation models; Whisper for speech

Encoder-only — the reader

With no need to generate, attention can run bidirectionally — every token sees the full sentence, both sides. Training hides a random ~15% of tokens and asks the model to fill in the blanks (masked language modelling). The output isn't text; it's a rich vector per token, which is exactly what you want for classification, named-entity tagging, rerankers — and the bi-encoders and cross-encoders doing retrieval in Chapter 13's RAG pipeline.

Decoder-only — the writer

The causal mask from this chapter is the defining feature: block all attention to future positions, train on plain next-token prediction, and the same stack learns to understand and to write. "Understanding" is implicit — by the time the model has predicted its way through your prompt, its hidden states already encode what you meant. The KV cache in Chapter 12 exists precisely because of this left-to-right shape.

Encoder-decoder — the translator

Two stacks. The encoder reads the source bidirectionally; the decoder writes the target causally, and in every block an extra cross-attention step lets the decoder's words query the encoder's output — its Q against the encoder's K and V. A clean fit when input and output are genuinely different sequences: languages, document → summary, audio → transcript.

So why did chat go decoder-only?

Three compounding reasons: simplicity scales (one stack, one objective, no cross-attention plumbing — easier to grow to hundreds of billions of parameters); everything is continuation (any task can be posed as "continue this text," so no per-task architecture); and in-context learning emerges (a big enough next-token predictor picks up patterns from examples in the prompt — the few-shot trick of Chapter 13 — which encoder-style training never produced). The encoder-decoder shape still wins some seq-to-seq benchmarks per-parameter, but the decoder-only recipe is what scaled into the assistants you use.

Say it in one breath

Encoders read (both directions, no generation), decoders write (one direction, causal mask), encoder-decoders translate (read then write), and every chat assistant you've used, Claude and GPT included, is decoder-only.

+Go deeper
Multi-head attention, the causal mask, the feed-forward layer, residuals, norm & position

Multi-head attention

The model doesn't run attention once — it runs several in parallel ("heads"), each with its own small Q/K/V. Different heads learn different relationships (grammar, coreference, long-range links), and their results are combined. It's like a committee where each member tracks a different kind of connection.

The causal mask

When generating, a word must not peek at future words — so those attention scores are blocked. That's the "causal" in causal language model: each token sees only what came before it.

The feed-forward network

After attention mixes words together, the FFN processes each word alone — two linear layers with a nonlinearity, widening then narrowing. This is where a large share of parameters and stored knowledge live, and exactly what Mixture-of-Experts swaps for many specialist sub-networks.

Residuals & normalisation

Each step adds its input back to its output (a "shortcut"), giving gradients a clean path down dozens of layers and letting a block leave things unchanged if best. Normalisation (RMSNorm, applied before each step) keeps the numbers in a healthy range so deep training doesn't diverge.
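The shortcut-plus-normalisation pattern is short enough to write out. A minimal sketch, assuming the pre-norm arrangement described above and leaving out the learned per-dimension gain that real RMSNorm layers carry; `rms_norm` and `residual_step` are illustrative names.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale the vector so its root-mean-square is about 1 (no learned gain here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def residual_step(x, sublayer):
    """Pre-norm residual: output = x + sublayer(norm(x))."""
    update = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, update)]

x = [3.0, 4.0]
print(rms_norm(x))
# A sublayer that contributes nothing leaves the input untouched.
print(residual_step(x, lambda h: [0.0] * len(h)))
```

Because the input is added back unchanged, a block that has nothing useful to say can output zeros and pass its input straight through.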

Position (RoPE)

Attention alone is order-blind. Position is injected by rotating the query/key vectors by an angle set by each token's position, so relative order is baked into the matching. Stretching those rotations is how context windows get extended.
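The claim that rotation bakes in relative order can be checked in two dimensions. This sketch assumes a single 2-D pair and one hypothetical rotation rate `theta`; real RoPE applies the same trick to many pairs of dimensions, each with its own rate.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by an angle proportional to its position."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]

q, k = (1.0, 0.5), (0.3, 0.8)
print(score(q, k, 5, 2), score(q, k, 103, 100))  # same gap of 3 positions
print(score(q, k, 5, 4))                         # gap of 1: a different score
```

The first two scores agree (up to floating-point rounding) because only the distance between the positions survives the dot product.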

Say it in one breath

Attention (in parallel heads) mixes words; the feed-forward layer thinks about each one; shortcuts and norm keep the deep stack trainable; RoPE tells attention what order things came in.

for the curious — the actual attention formula

Every token produces three vectors: a query Q, key K, and value V. The whole operation is one line:

$$\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

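QKᵀ compares every query to every key (the "who matches whom" grid). Dividing by √d_k, where d_k is the length of each key vector, keeps those scores from growing with vector size, so the softmax doesn't collapse onto a single match. softmax turns the scores into percentages that add to 100%. Multiplying by V blends everyone's values by those percentages. That's the dating-profile analogy, written in math.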
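The formula translates almost line for line into code. The sketch below is a single head in plain Python lists, with an optional causal mask; the three toy vectors are made up, and a real implementation would use batched matrix operations on a GPU.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; Q, K, V are lists of vectors, one per token."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Match this token's query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Block attention to later positions: their weight becomes exactly 0.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Blend the value vectors by those weights.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors reused as Q, K and V
out, w = attention(X, X, X, causal=True)
for row in w:
    print([round(p, 3) for p in row])
```

With the mask on, the first token can only attend to itself, so its output is its own value vector; each later row spreads weight over everything up to and including its own position.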

Pretraining: where the knowledge comes from

Take that stack of transformer blocks, freshly built and totally clueless, and play one game with it a trillion times.

A fresh model is random — its map of meaning is noise, its attention points nowhere. Pretraining is the marathon that fixes that, and the game is exactly the one from Chapter 1: hide the next word, ask the model to guess, and gently correct it when it's wrong. Do this across trillions of words of text.

To win this game consistently, the model is forced to organise its map of meaning, aim its attention sensibly, and store patterns that look a lot like facts and reasoning. Nobody programs in "Paris is the capital of France" — it falls out of needing to finish millions of sentences correctly. This is the single most expensive step in all of AI, and it produces the base model: brilliant, knowledgeable, and completely unrefined.

Think of it like

Raising a prodigy who has read every library on Earth but has never been told how to behave. Ask a question and it might answer, or continue your sentence, or write three more questions. It knows almost everything and how to do almost nothing. That's where Act II comes in.

How "gently correct it" actually works

We've been leaning on the most hand-wavy phrase in this guide: the model guesses, and we gently correct it. Time to open that box, because the same machinery runs underneath every training method to come. Each correction has three parts: a loss that scores how wrong the guess was, a gradient that says which way to nudge each weight, and an optimizer that takes the step. That trio, repeated trillions of times, is training.

The loss: one number for "how wrong"

Before each hidden word is revealed, the model has spread its bets across the whole vocabulary — 90% on "the", 4% on "a", a sliver on everything else. The loss collapses the whole situation into a single number: how badly did you just do? Lower is better. For next-token prediction the loss is cross-entropy, and the idea fits in one word: surprise. It's the negative log of the probability the model gave the token that actually came next. Bet 90% on the right word and the surprise is near zero. Gave it 0.1%? Enormous. Training is nothing more than the long, patient lowering of average surprise.

for the curious — cross-entropy, written down

For one position, where $p(\text{correct})$ is the probability the model assigned to the true next token:

$$\mathcal{L}=-\log p\big(\text{correct next token}\big)$$

Right and confident → $-\log(0.9)\approx 0.1$. Wrong and confident → $-\log(0.001)\approx 6.9$. Average this over every position in the batch and you have the number training pushes down. Perplexity, often quoted on benchmarks, is just $e^{\mathcal{L}}$ — "how many equally-likely options the model is torn between."
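The two numbers above, and the perplexity relationship, can be reproduced directly. A small sketch; the function names are illustrative and the probabilities in the last call are made up.

```python
import math

def cross_entropy(prob_of_correct):
    """Surprise at the true next token: negative log of its probability."""
    return -math.log(prob_of_correct)

def mean_loss_and_perplexity(probs):
    """Average the per-position losses, then exponentiate for perplexity."""
    losses = [cross_entropy(p) for p in probs]
    mean = sum(losses) / len(losses)
    return mean, math.exp(mean)

print(round(cross_entropy(0.9), 3))    # confident and right
print(round(cross_entropy(0.001), 3))  # gave the truth almost no chance
print(mean_loss_and_perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that always gives the right token a 25% chance has a perplexity of 4: it is, in effect, torn between four equally likely options.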

The gradient: which way is downhill

Think of it like

Standing on a hillside in thick fog. You can't see the valley — but you can feel the slope under your feet, and that's enough: step downhill, feel again, step again. The altitude is the loss; your position is the current setting of all the weights; the slope under your feet is the gradient.

For each individual weight, ask one tiny question: if I nudge this weight up a hair, does the loss go up or down — and how steeply? The answer is a slope. The gradient is simply the complete list of those slopes, one per weight, all billions of them at once. By convention it points in the direction of steepest increase of the loss — so the model steps the opposite way. That's the entire idea of "learning": measure the slope, walk downhill.

Asking that nudge-question billions of times separately would be hopeless. Backpropagation is the shortcut that answers all of them in one pass: because the network is a chain of simple steps, calculus's chain rule lets the error flow backwards through the layers, handing every weight its slope along the way. One forward pass to compute the loss, one backward pass to get every gradient. It's not a separate learning method — it's just bookkeeping, done fast.
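Both ways of getting a slope can be compared on a network small enough to hold in your head. This sketch assumes a hypothetical two-weight model, y = w2 · tanh(w1 · x) with a squared-error loss (not the cross-entropy used for language models), chosen only because its chain rule fits in four lines.

```python
import math

def forward(w1, w2, x, target):
    """Forward pass: returns the loss plus the intermediates backprop reuses."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return (y - target) ** 2, h, y

def backward(w1, w2, x, target):
    """Backprop: apply the chain rule from the loss back to each weight."""
    _, h, y = forward(w1, w2, x, target)
    dloss_dy = 2 * (y - target)
    dw2 = dloss_dy * h              # y = w2 * h
    dh = dloss_dy * w2
    dw1 = dh * (1 - h * h) * x      # derivative of tanh is 1 - tanh^2
    return dw1, dw2

def nudge_gradient(w1, w2, x, target, eps=1e-6):
    """The slow way: nudge each weight a hair and watch the loss."""
    base = forward(w1, w2, x, target)[0]
    dw1 = (forward(w1 + eps, w2, x, target)[0] - base) / eps
    dw2 = (forward(w1, w2 + eps, x, target)[0] - base) / eps
    return dw1, dw2

args = (0.5, -1.2, 2.0, 0.3)  # w1, w2, x, target
print(backward(*args))
print(nudge_gradient(*args))
```

The two answers agree to several decimal places. The difference is cost: nudging needs one extra forward pass per weight, while backprop gets every slope from a single backward sweep.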

The optimizer: how the step gets taken

The simplest move is gradient descent: new weight = old weight − learning rate × slope. The learning rate is the stride length — too bold and you overshoot the valley and diverge; too timid and training takes geological time. In practice nobody measures the slope against the entire internet at once: you grade a small random mini-batch of examples and step on that — the "stochastic" in SGD. Each step is cheap but noisy, like feeling the slope during an earthquake.
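The effect of stride length is easy to see on a one-weight toy problem. The sketch assumes the loss f(w) = w², whose slope is 2w and whose valley floor is at w = 0; it uses the exact slope rather than a noisy mini-batch estimate, and `descend` is an illustrative name.

```python
def descend(lr, start=5.0, steps=50):
    """Plain gradient descent on f(w) = w**2."""
    w = start
    for _ in range(steps):
        slope = 2 * w          # gradient of w**2
        w = w - lr * slope     # new weight = old weight - learning rate * slope
    return w

print(descend(lr=0.1))    # sensible stride: ends very close to 0
print(descend(lr=0.001))  # too timid: barely moved from 5.0
print(descend(lr=1.1))    # too bold: overshoots further every step
```

The same three behaviours (converge, crawl, diverge) show up in real training runs, only with billions of weights instead of one.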

Momentum tames the noise: keep a running average of recent gradients — a velocity — so you roll through the jitter like a heavy ball instead of twitching at every bump. The modern default, Adam, goes one further: it tracks both the gradient's recent average (which way, typically?) and its recent variance (how jumpy?) for every weight, and gives each one its own step size — bold where the signal is steady, cautious where it's erratic. Its weight-decay-corrected variant, AdamW, is what trains essentially every modern LLM. You've already met it in the wild: the paged_adamw_8bit in Chapter 9's recipe is exactly this, stored small.

for the curious — the update rule, and Adam's two memories

Plain gradient descent, for each weight $w$ with learning rate $\eta$:

$$w \;\leftarrow\; w-\eta\,\frac{\partial \mathcal{L}}{\partial w}$$

Adam keeps two running averages of the gradient $g$ — its mean $m$ (1st moment) and its uncentred variance $v$ (2nd moment):

$$m \leftarrow \beta_1 m+(1-\beta_1)\,g$$

$$v \leftarrow \beta_2 v+(1-\beta_2)\,g^{2}$$

$$w \;\leftarrow\; w-\eta\,\frac{m}{\sqrt{v}+\epsilon}$$

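Dividing by $\sqrt{v}$ is the adaptive part: a weight whose gradients swing wildly gets a smaller effective step, a steady one gets a bigger one, giving each parameter a personal learning rate. (The full algorithm also divides $m$ by $1-\beta_1^{t}$ and $v$ by $1-\beta_2^{t}$ at step $t$, correcting the early bias that comes from starting both averages at zero; that detail is left out of the equations above for readability.) AdamW's fix: apply weight decay directly to $w$ instead of mixing it into the gradient, which makes the decay actually work as intended.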
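For a single weight, the update can be sketched as below. This follows the equations above and additionally includes the standard bias-correction terms (dividing m and v by 1 − βᵗ); the default hyperparameters are commonly used values, and `adam_step` is an illustrative function, not a framework API.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One AdamW-style update for one weight. t is the step count, starting at 1."""
    m = beta1 * m + (1 - beta1) * g          # running mean of the gradient
    v = beta2 * v + (1 - beta2) * g * g      # running mean of its square
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w            # decoupled decay, applied to w itself
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# The same weight, hit by a huge gradient and by a tiny one.
big, _, _ = adam_step(1.0, g=5.0, m=0.0, v=0.0, t=1)
small, _, _ = adam_step(1.0, g=0.01, m=0.0, v=0.0, t=1)
print(big, small)
```

Both weights move by almost exactly the learning rate, even though the gradients differ by a factor of 500. That is the per-weight stride at work: direction comes from the gradient, step size from its recent history.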

+Go deeper
Loss, gradients & optimizers — backprop, warmup, schedules and the family tree

Backprop, slightly more precisely

A network is a deep composition of simple functions, and the chain rule says the slope of a composition is the product of the slopes along the way. Backprop evaluates that product from the loss backwards, layer by layer, reusing every intermediate result, so getting all the gradients costs only a small constant multiple of a forward pass (roughly two times, as a rule of thumb) rather than one pass per weight. Frameworks do this automatically ("autograd"): you define the forward computation, they record it and derive the backward one. The residual shortcuts from Chapter 4 exist largely to give this backward flow a clean, un-shrunk path down dozens of layers.

The optimizer family tree

Optimizer	The idea	What it fixed
Gradient descent	step against the full gradient	— (the textbook starting point)
SGD	same step, but on random mini-batches	made the step affordable at scale
+ Momentum	keep a velocity; average out the noise	stops twitching, rolls through ravines
Adam	momentum + per-weight step size from running mean & variance	different parameters need wildly different strides
AdamW	Adam with decoupled weight decay	made the regularisation honest — today's LLM default

The learning rate — the #1 hyperparameter

Nothing else breaks a run faster. Too high: the loss spikes or NaNs. Too low: it crawls and may settle somewhere mediocre. Typical scales you've already seen in Chapter 9's table: ~1e-4 for LoRA SFT, ~10× lower for full fine-tuning, ~5e-6 for DPO, ~1e-6 for GRPO — the gentler the surgery, the smaller the stride.

Warmup & schedules

The learning rate isn't constant. Warmup ramps it from near-zero over the first few percent of steps — early on, Adam's mean-and-variance estimates are built from almost no data, so big strides on garbage statistics destabilise training. After warmup it decays, usually along a cosine curve, so training ends with fine, careful steps. That's exactly the "cosine / 3–10% warmup" line in Chapter 9's hyperparameter table.
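A warmup-then-cosine schedule fits in one function. This is a sketch with assumed values (a peak of 1e-4, 5% linear warmup, decay to zero); frameworks ship their own schedulers with slightly different conventions, and `lr_at` is an illustrative name.

```python
import math

def lr_at(step, total_steps, peak_lr=1e-4, warmup_frac=0.05, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine curve from the peak down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 25, 49, 50, 500, 999):
    print(step, f"{lr_at(step, total_steps=1000):.2e}")
```

The rate climbs during the first 50 steps, sits at its peak around step 50, and then eases down so the last steps are very small.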

The hidden memory bill

Adam's two memories cost real VRAM: m and v are two extra full-size numbers per weight, which is why optimizer state — not weights — dominates full fine-tuning's memory (Chapter 12), and why 8-bit optimizers like paged_adamw_8bit exist: same algorithm, moments stored small.
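The memory bill is simple multiplication. The sketch below assumes 16-bit weights and gradients (2 bytes each), 32-bit Adam moments (4 bytes each), and a hypothetical 7-billion-parameter model; it ignores activations, any extra master copy of the weights, and the small bookkeeping overhead of 8-bit optimizers.

```python
def full_finetune_memory_gb(params, weight_bytes=2, grad_bytes=2, moment_bytes=4):
    """Rough memory per component, in GB (1 GB = 1e9 bytes)."""
    gb = 1e9
    return {
        "weights": params * weight_bytes / gb,
        "gradients": params * grad_bytes / gb,
        "adam_moments": params * 2 * moment_bytes / gb,  # m and v
    }

params = 7_000_000_000
print(full_finetune_memory_gb(params))
print(full_finetune_memory_gb(params, moment_bytes=1))  # moments stored in 8 bits
```

Under these assumptions the two moments take four times the space of the weights themselves, and storing them in 8 bits brings that back down to the size of the weights.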

Say it in one breath

Loss = surprise at the right answer (cross-entropy); gradient = every weight's personal "which way is downhill," computed all at once by backprop; optimizer = how the step is taken. AdamW, momentum plus a per-weight stride, is the one that trains everything.

Bigger isn't automatically better

A famous finding (nicknamed Chinchilla) showed many giant models were actually undertrained: they needed more data, not more size. Roughly 20 tokens of training text per parameter is the healthy balance. Data quality and quantity matter as much as raw scale.

+Go deeper
The objective, decoder vs encoder, scaling laws & emergence

The objective, precisely

At each position, predict the next token; score the guess by cross-entropy (the surprise score you met above) and nudge the weights to make the right token more likely. Trillions of times. No facts are typed in — they're a side effect of needing to finish sentences correctly.

Decoder vs encoder

As Chapter 4's three shapes laid out: decoder-only models (GPT, Claude, Llama, Qwen) predict left-to-right — they generate. Encoder models (BERT) fill in masked blanks — good for understanding tasks. Today's generative LLMs are decoder-only, so "pretraining" in practice means this next-token game.

Scaling laws

Performance improves predictably with more parameters, data and compute — but the Chinchilla balance (~20 tokens per parameter) means data, not just size, is the lever. Many famous models left performance on the table by being too big for their data.
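The rule of thumb is a one-line calculation. A sketch using the ~20 tokens-per-parameter ratio quoted above; the second model in the example (175 billion parameters, 300 billion tokens) is a hypothetical illustration of an oversized, underfed run.

```python
TOKENS_PER_PARAM = 20  # the rough Chinchilla ratio quoted above

def balanced_tokens(params):
    """Training tokens suggested by the rule of thumb for a given model size."""
    return params * TOKENS_PER_PARAM

def tokens_per_param(params, tokens):
    """How much data a model actually saw, per parameter."""
    return tokens / params

print(balanced_tokens(70_000_000_000))                      # 70B parameters
print(tokens_per_param(175_000_000_000, 300_000_000_000))   # hypothetical run
```

A 70-billion-parameter model wants about 1.4 trillion tokens by this rule, while the hypothetical second run sees fewer than 2 tokens per parameter, far below the balance point.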

Emergence

Some abilities appear fairly suddenly as scale grows rather than improving smoothly — part of why scaling has been such a powerful (and surprising) bet.

Say it in one breath

One trivial game, guess the next token, played at planetary scale forces a model to learn grammar, facts and reasoning; scaling laws tell you how to balance size against data.

Generation: how it actually writes

The model never outputs a sentence. It outputs a single next-word lottery — and we get to set how reckless the draw is.

At every step the model produces a score for every token in its vocabulary — a giant ranked list of "how likely is each word to come next." A final squeeze turns those scores into clean percentages. Then we draw one. Add it. Repeat. That repetition is the whole of "generating text."

The interesting knob is temperature — how reckless the draw is. Turn it low and the model almost always takes its top pick: safe, focused, a little repetitive. Turn it up and longshots get a real chance: surprising, creative, sometimes unhinged. There are tidier variants (only consider the top few options, or the smallest set that covers 90% of the probability) but temperature is the one to hold in your head.

Low temperature trusts the favourite; high temperature spreads the odds and invites surprises.
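Temperature is a division applied to the scores before the "final squeeze" (softmax). A minimal sketch with three made-up scores; `softmax_with_temperature` and `sample` are illustrative names.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the scores by the temperature, then turn them into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, temperature=1.0, rng=random):
    """Draw one token from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["the", "a", "banana"]
logits = [2.0, 1.0, 0.1]  # made-up scores
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(sample(tokens, logits, temperature=1.0))
```

At 0.2 nearly all the probability sits on the favourite; at 2.0 the three options are much closer together, so the longshots get drawn far more often.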

Why this matters soon

Because generation is a dice roll, the same prompt gives different answers, and that's a feature. In Act III, a method called GRPO will deliberately roll the dice several times and learn from which answers came out best.

+Go deeper
All the decoding knobs, and the loss & learning that power it

The decoding knobs

Greedy: always the top token — deterministic, repetitive.
Temperature: the recklessness dial; low = confident, high = creative, near-zero = greedy.
Top-k: only consider the k most-likely tokens.
Top-p (nucleus): consider the smallest set covering, say, 90% of the probability — adapts to how sure the model is.
Repetition penalties: down-weight tokens already used, to stop loops.
Beam search: keep several candidate sequences alive; great for translation, rare for chat.
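Three of the knobs above (greedy, top-k and top-p) are filters applied to the probability list before the draw. A sketch over a made-up four-token distribution; the function names are illustrative, and libraries usually apply these filters to raw scores rather than to probabilities.

```python
def renormalise(probs, keep):
    """Zero out everything not in keep and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def ranked(probs):
    """Token indices from most to least likely."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def greedy(probs):
    """Always the single most likely token."""
    return ranked(probs)[0]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    return renormalise(probs, ranked(probs)[:k])

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities reach p."""
    keep, cumulative = [], 0.0
    for i in ranked(probs):
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalise(probs, keep)

probs = [0.5, 0.3, 0.15, 0.05]
print(greedy(probs))
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

Top-k always keeps the same number of options, whereas top-p keeps fewer when the model is confident and more when it is unsure.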

Cross-entropy — the training loss, recapped

The same surprise score from Chapter 5: for each token the loss is the negative log of the probability the model gave the correct next token. Confident and right → tiny loss; confident and wrong → huge loss. Averaging this is what pretraining and SFT minimise. Perplexity (e to that loss) re-expresses it as "how many equally-likely options the model is torn between" — lower is better.

Backpropagation — how it learns, recapped

After a forward pass produces the loss, backprop walks backwards through the layers assigning each weight a "gradient" — how much it contributed to the error and which way to move it (Chapter 5, in full). The optimizer nudges each weight a little down its gradient. Repeat millions of times. It's automatic blame-assignment.
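
The same loop shrunk to a single weight, as a sketch only. With one weight the "walk backwards" is a single chain-rule step, and the squared-error loss here is a simplifying assumption (language models use cross-entropy), but the forward, gradient, nudge rhythm is the same.

```python
def train_one_weight(x, target, w=0.0, lr=0.1, steps=50):
    """Fit pred = w * x to a target by repeated gradient steps."""
    for _ in range(steps):
        pred = w * x              # forward pass
        error = pred - target     # the loss is error ** 2
        grad = 2 * error * x      # chain rule: d(loss) / d(w)
        w -= lr * grad            # nudge the weight against its gradient
    return w

learned = train_one_weight(x=2.0, target=6.0)  # settles close to 3.0
```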

Say it in one breath

The model scores every next word; decoding picks one; cross-entropy grades how much probability it gave the right one; backprop walks the error back and nudges every weight.

The map of fine-tuning

Three stages turn a rambling genius into a helpful assistant. Knowing which stage does what is half of understanding modern AI.

There's a crucial idea to internalise first: fine-tuning mostly changes behaviour, not knowledge. If a model is missing facts, you usually hand it the facts (Act IV's "look it up" trick), not retrain it. Fine-tuning is for teaching it how to act — to answer instead of ramble, to adopt a tone, to follow a format, to reason carefully.

Pretraining (Act I) builds the brain. The next three stages — our Act II and III — shape how it behaves.

📝

SFT

Show it thousands of great example answers until it copies the style. Teaches format and instruction-following.

⚖️

Preference tuning

Show it "answer A is better than B" so it develops taste — helpfulness, tone, safety. (RLHF / DPO.)

🎯

RL with rewards

For tasks with a checkable answer, reward correctness directly. This is how reasoning models are made (GRPO).

SFT: learning by example

The simplest and most important fine-tuning step. Show, don't tell.

Supervised fine-tuning (SFT) is exactly what it sounds like: you collect a pile of example pairs — a question and an ideal answer — and let the model practise on them using the very same "guess the next word" game from pretraining. The only difference is the diet. Feed it thousands of polite, well-formatted, helpful answers and it learns to produce polite, well-formatted, helpful answers.

Think of it like

An apprenticeship. The genius already knows everything; now it watches a few thousand worked examples of "this is how we answer here," and starts matching the house style.

The craft is almost entirely in the data. A few thousand excellent examples beat a hundred thousand mediocre ones. Push too hard — too many repetitions — and the model overfits: it parrots your examples and loses some of its general spark. So SFT is a light touch: a little high-quality practice, then stop.

The #1 real-world bug

Each model expects answers wrapped in a specific invisible format (who's the "user," who's the "assistant"). Get that wrapper wrong during SFT and the model learns garbage, even with perfect data. Boring, and the cause of endless wasted training runs.

Go deeper
Chat templates, loss masking, packing, and avoiding catastrophic forgetting

Chat templates

Every instruct model expects turns wrapped in special tokens marking system/user/assistant. You must format with the model's own template — a mismatch silently wrecks results.
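
To show what "wrapped in special tokens" means, here is a toy renderer. The markers below are hypothetical and belong to no real model; every model family defines its own, which is exactly why a mismatch hurts.

```python
# Hypothetical markers for illustration only; real models define their own.
ROLE_MARKERS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_OF_TURN = "<|end|>"

def render_chat(messages, add_generation_prompt=False):
    """Wrap each turn in its role marker and an end-of-turn marker."""
    parts = []
    for message in messages:
        marker = ROLE_MARKERS[message["role"]]
        parts.append(f"{marker}\n{message['content']}{END_OF_TURN}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append(f"{ROLE_MARKERS['assistant']}\n")
    return "".join(parts)

text = render_chat([{"role": "user", "content": "Hi"}], add_generation_prompt=True)
```

In practice you would not hand-write this. With Hugging Face tokenizers, `tokenizer.apply_chat_template(messages)` renders the template shipped with the model, which is the safest way to match its expected format.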

Completion-only loss masking

Compute the loss only on the assistant's reply, masking the prompt — otherwise the model wastes capacity learning to predict the user's questions instead of answering them.
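
A minimal sketch of the masking itself, assuming the common PyTorch convention that a label of -100 is skipped by the cross-entropy loss. The token ids are made up and `build_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value PyTorch's cross-entropy ignores by default

def build_labels(prompt_ids, reply_ids):
    """Return (input_ids, labels) where only the reply contributes to the loss."""
    input_ids = list(prompt_ids) + list(reply_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(reply_ids)
    return input_ids, labels

# Made-up token ids: three prompt tokens, two reply tokens.
input_ids, labels = build_labels([5, 6, 7], [8, 9])
```

The model still reads the whole sequence; the mask only decides which positions are graded.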

Packing

Stuff several short examples into one max-length sequence so you're not wasting compute on padding — a big throughput win.
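
One simple way to pack, sketched as greedy concatenation in arrival order. This is an illustrative assumption, not how any particular library does it; real packers also track example boundaries so attention does not leak between neighbours.

```python
def pack_examples(examples, max_length):
    """Greedily concatenate tokenised examples into sequences of at most max_length."""
    packed, current = [], []
    for tokens in examples:
        tokens = tokens[:max_length]  # truncate anything longer than one sequence
        if current and len(current) + len(tokens) > max_length:
            packed.append(current)    # this one would overflow: start a new sequence
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed

# Four short made-up examples (lengths 3, 4, 2, 5) packed into length-8 sequences.
examples = [[1] * 3, [2] * 4, [3] * 2, [4] * 5]
packed = pack_examples(examples, max_length=8)
```

Unpacked, these four examples would occupy four padded sequences; packed, they fit in two.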

Catastrophic forgetting

Push SFT too hard (high learning rate, too many epochs) and the model narrows — it parrots your data and loses general ability. Mitigations: low LR, 1–3 epochs, use LoRA (the frozen base preserves knowledge), and mix in some general data. Always re-check a general benchmark afterwards.

Data > everything

2k–10k excellent examples beat 100k mediocre ones. Curation — filtering, deduping, matching real usage — is the actual lever.

Say it in one breath

Show curated examples in the model's exact chat format, train only on the answers, keep it light: quality data and a gentle touch beat brute force every time.

LoRA: the sticky-note trick

A 70-billion-number brain is far too big to retrain casually. So we don't. We clip a tiny adjustable note onto it instead.

Fully retraining a large model means nudging every one of its billions of numbers — enormous memory, enormous cost, and a fresh full-size copy for every task. LoRA (Low-Rank Adaptation) noticed something lovely: the change you actually need is small and simple. So freeze the whole giant model, and learn only a tiny pair of "correction" matrices that get added on top. You end up training under 1% of the numbers.

The giant brain stays locked. You train a small, cheap "note" that adds your changes — and can be peeled off or swapped.

The payoff is huge: you can fine-tune on a single modest GPU, keep dozens of tiny notes for different tasks on one shared brain, and peel a note off if it misbehaves. There are two dials worth a name: rank (how much the note can hold — start around 16) and which parts of the model get notes (attention plus the thinking layers, for bigger changes).

And its partner, QLoRA

You can shrink the frozen brain too, storing its numbers at lower resolution, like a smaller JPEG (we'll see this in Chapter 13). That's QLoRA, and it's what lets a 70-billion-parameter model be fine-tuned on a single 48 GB GPU instead of a multi-GPU server. LoRA + compression is the default starting move in the field today.

Go deeper
The actual config — every knob, recipes, VRAM math & how to debug it

This is the part that makes you sound like you've run training, not just read about it. Field names drift between library versions — treat them as "check the current docs" — but the knobs and the reasoning are stable.

The LoRA config, field by field

Field	Controls	Typical	Why
`r`	Rank = the note's capacity	8–64 (start 16)	Higher = more expressive, more memory, more overfit risk. The main knob.
`lora_alpha`	How loud the note is (gain = α/r)	r or 2r	Raising r without α quietly weakens the update — a classic trap.
`target_modules`	Which layers get a note	attn+MLP or `all-linear`	More coverage = bigger behavioural shifts.
`lora_dropout`	Light regularisation	0.0–0.1	0.05 is a safe default.
`use_dora`	DoRA variant	off / on	Better quality at low rank, a bit slower.
`use_rslora`	Stable scaling (α/√r)	off / on	Lets you push rank ≥ 64 without the update fading, since plain α/r shrinks it as r grows.
`init_lora_weights`	How the note starts	default	`pissa`/`loftq` converge faster, esp. with quantised bases.
`modules_to_save`	Extra fully-trained parts	none	Set `embed_tokens`/`lm_head` when you add new tokens.

Choosing rank & alpha

Rank	Good for
4–8	Tone, style, formatting, light instruction-following.
16–32	The workhorse — domain adaptation, tool-use, most jobs. Start here.
64–128	Big shifts / lots of data — pair with rsLoRA.

The note is applied as (α/r) × note, so α and r together set the effective strength. Conventions: α = r (gain 1) or α = 2r (gain 2).
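
Here is that formula as a tiny pure-Python sketch. The matrices are toy-sized and the helper names are hypothetical; `B` and `A` are the two small "correction" matrices, and `r` is their shared inner dimension.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha):
    """W + (alpha / r) * (B @ A), where r is the number of rows in A."""
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def trainable_fraction(d_out, d_in, r):
    """Share of a d_out x d_in layer's numbers that the note trains."""
    return r * (d_in + d_out) / (d_out * d_in)

# Toy 2x2 frozen weight with a rank-1 note and gain alpha / r = 2.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]        # d_out x r (standard LoRA starts B at zero)
A = [[0.0, 2.0]]          # r x d_in
merged = lora_effective_weight(W, A, B, alpha=2)

fraction = trainable_fraction(4096, 4096, 16)  # a 4096 x 4096 layer at rank 16
```

For a 4096 × 4096 layer at rank 16, the note is about 0.8% of the layer's numbers, which is where the "under 1%" figure comes from.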

Which target_modules

Cheapest: just the query & value projections (q_proj, v_proj).
All attention: add k_proj, o_proj.
Recommended: attention + the MLP layers (gate/up/down_proj) — the MLP holds most of the "knowledge," so adapting it matters for bigger changes.
"all-linear" targets everything and sidesteps model-specific names.

The PEFT family beyond LoRA

Method	Idea	When
LoRA / QLoRA	Low-rank notes on frozen weights	The default for ~everything
DoRA	Splits magnitude + direction	More quality at low rank
VeRA	Shared random notes + tiny scalings	Extreme efficiency, many tasks
(IA)³	Learned scaling vectors	Even fewer params than LoRA
Prompt/Prefix tuning	Train soft "prompt" vectors, model frozen	Cheapest; weak for big shifts

The training hyperparameters around it

Knob	Typical (LoRA SFT)	Note
learning_rate	1e-4 – 3e-4	Full FT is 10× lower; DPO ~5e-6; GRPO ~1e-6.
scheduler / warmup	cosine / 3–10%	Warmup avoids early instability.
epochs	1–3	More → overfitting.
batch × grad-accum	fit × 4–32	Effective batch = batch × accum × #GPUs.
optimizer	paged_adamw_8bit	8-bit/paged saves memory.
bf16 · grad-checkpointing	on · on	bf16 over fp16; checkpointing trades ~30% speed for memory.
max_grad_norm	1.0	Clips spikes for stability.

Effective batch size

batch_size × gradient_accumulation_steps × number_of_GPUs. Can only fit 2 but want 32 on one GPU? Set accumulation to 16. Bigger effective batches = smoother training.

A starting recipe you can copy

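# QLoRA + SFT: a sane default
# Assumes recent `peft` and `trl` releases; field names drift, so check the current docs.
from peft import LoraConfig
from trl import SFTConfig, DPOConfig, GRPOConfig

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM")

# The 4-bit base that makes this QLoRA is chosen when the model is loaded (not shown).
# output_dir values are placeholders.
args = SFTConfig(output_dir="sft-out", learning_rate=2e-4, lr_scheduler_type="cosine",
    warmup_ratio=0.03, num_train_epochs=2,
    per_device_train_batch_size=4, gradient_accumulation_steps=8,  # eff batch 32 on one GPU
    optim="paged_adamw_8bit", bf16=True, gradient_checkpointing=True,
    max_grad_norm=1.0, max_length=2048, packing=True)

# for DPO instead: lower LR, fewer epochs
dpo = DPOConfig(output_dir="dpo-out", beta=0.1, learning_rate=5e-6, num_train_epochs=1)
# for GRPO: a group of samples, tiny LR, reward fns do the work
grpo = GRPOConfig(output_dir="grpo-out", num_generations=8, beta=0.04, learning_rate=1e-6)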

Will it fit? (rough VRAM)

Full fine-tune: ~16–20 GB per billion params → a 7B needs ~120 GB. Multi-GPU territory.
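LoRA (16-bit base): ~2 GB/B for the frozen weights → 7B ≈ 14 GB of weights, 16–20 GB in practice once adapters, optimizer state and activations are counted.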
QLoRA (4-bit base): ~0.5–0.75 GB/B → 7B base ≈ 5 GB; even 70B fits on a 48 GB card.
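
The rules of thumb above turn into a back-of-envelope calculator. This sketch only multiplies the per-billion figures from the list; it knows nothing about sequence length, batch size or activations, so treat its output as a first guess.

```python
# Rough GB-per-billion-parameter ranges copied from the list above.
GB_PER_BILLION = {
    "full": (16.0, 20.0),
    "lora": (2.0, 2.0),      # frozen 16-bit weights only; training overhead adds more
    "qlora": (0.5, 0.75),
}

def estimate_vram_gb(params_billions, method):
    """Return a (low, high) estimate in GB for the given fine-tuning method."""
    low, high = GB_PER_BILLION[method]
    return params_billions * low, params_billions * high

full_7b = estimate_vram_gb(7, "full")      # (112.0, 140.0)
qlora_70b = estimate_vram_gb(70, "qlora")  # (35.0, 52.5)
```

Note what the 70B QLoRA range says: the low end fits on a 48 GB card and the high end does not, so that fit is tight and depends on keeping sequences and batches small.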

When you run out of memory, in order: raise grad-accumulation (lower batch) → gradient checkpointing → QLoRA → 8-bit optimizer → shorter sequences → shard across GPUs.

Debugging — symptom → cause → fix

Symptom	Likely cause → fix
Won't learn (loss flat)	LR too low · rank too low · too few target modules → raise LR/rank, target more.
Loss NaN / spikes	LR too high · fp16 overflow → lower LR, switch to bf16, add warmup.
Gibberish / lost skills	Too many epochs · wrong chat template → fewer epochs, verify the template.
Out of memory	Batch/seq too big → accumulation, checkpointing, QLoRA.
Train loss ↓ but eval ↑	Overfitting → fewer epochs, more data, dropout, lower rank.

Merging & serving

Merge the note into the base for zero-overhead inference (but you lose swappability), or keep it separate and hot-swap many notes on one shared model — ideal for per-customer adapters, which is a strong pattern for a multi-tenant trace product.

Preference tuning: learning good taste

You can't write the one perfect joke. But you can say which of two jokes is funnier — and that's enough to teach taste.

SFT teaches "a good answer." But for open-ended things — helpfulness, tone, harmlessness, what counts as a better reply — there's no single right answer to copy. There's only comparison. So we collect human judgments of the form "between these two answers, this one is better," and train the model to lean toward the winners. This is preference tuning, and it's what makes a model feel genuinely helpful rather than merely correct.

There are two ways to do it. The original, RLHF (Reinforcement Learning from Human Feedback), is powerful but a heavy machine: you train a separate "taste judge" model, then use reinforcement learning to push the model toward what the judge likes. That takes four moving parts and is finicky to balance. Then someone found a shortcut. DPO (Direct Preference Optimization) showed you can skip the judge and the RL entirely and optimise the same underlying objective with one supervised-style loss on the comparison pairs, using just two moving parts.

Same destination — a model with better taste — but DPO throws away two of the four moving parts. It's why most teams reach for it first.

There's a whole family of DPO-style methods for special cases — for example, when all you have is a thumbs-up / thumbs-down rather than neat pairs. But the core idea never changes: learn from comparisons, lean toward the better answer.

Reward and penalty: the carrot and the leash

The RL picture keeps two words doing heavy lifting, so let's pin them down. A reward is a single number scoring a whole answer — not each token — handed out after the model finishes writing. Usually it comes from that "taste judge," a reward model: a network trained on the human comparisons until it can read a prompt-and-answer and emit one scalar — "this is an 8.3-of-10 kind of reply." Reinforcement learning then adjusts the model so the answers it tends to produce score higher. Higher reward, more of that; lower reward, less.

Left alone, that chase ends somewhere ugly. A model hunting reward and nothing else will find the judge's blind spots — flattery, padding, confident waffle, eventually outright degenerate text that happens to score well but reads like noise. So a penalty rides along: the KL leash, a term that measures how far the model has drifted from the frozen copy of its original self and charges it for the distance. The result is a tight bargain — get better at what the reward wants, without forgetting how to write.

Think of it like

A chef chasing a single star rating per dish — one number for the whole plate, no notes per ingredient — while the restaurant insists they stay recognisably themselves. Chase only the stars and they'll end up deep-frying everything the critics secretly favour; the "stay yourself" rule is the leash that keeps the cooking honest while the ratings climb.

And the reward doesn't have to come from a learned judge at all. When the task has a checkable answer — does the maths come out right, do the unit tests pass — the score can be programmatic: right or wrong, decided by a rule, with nothing to flatter. Those verifiable rewards are exactly the fuel the next chapter's GRPO runs on to teach models to reason.

for the curious — the whole game in one line

The model $\pi$ is trained to maximise reward minus $\beta$ times its drift from the frozen reference $\pi_{\text{ref}}$:

$$\max_{\pi}\;\underbrace{\mathbb{E}\big[\,r(y)\,\big]}_{\text{the carrot}}\;-\;\underbrace{\beta\,\mathrm{KL}\!\big(\pi\,\|\,\pi_{\text{ref}}\big)}_{\text{the leash}}$$

$r(y)$ is the reward for a whole answer $y$, and $\beta$ sets how tight the leash is. It's the same $\beta$ you'll meet in DPO's loss below — DPO is this exact objective solved in closed form — and the same beta knob in Chapter 9's GRPO config.
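
A toy version of that objective, assuming a single three-way choice instead of whole answers (real KL is accumulated over every token of the reply). The distributions and reward values are invented; the point is how β trades reward against drift.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalised_objective(expected_reward, policy, reference, beta):
    """The carrot minus the leash: reward minus beta times the drift."""
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.3, 0.2]       # the frozen model's bets
drifted = [0.9, 0.05, 0.05]       # a tuned model that wandered far

no_drift = kl_divergence(reference, reference)
big_drift = kl_divergence(drifted, reference)

loose_leash = penalised_objective(1.0, drifted, reference, beta=0.01)
tight_leash = penalised_objective(1.0, drifted, reference, beta=1.0)
```

With the same reward and the same drift, a larger β leaves less of the objective standing, so the optimiser has less reason to wander.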

Go deeper
Reward models, reward hacking & the KL leash — how the carrot gets gamed

How a reward model is trained

Take the preference pairs, and train a copy of the language model (with a small score-head bolted on) so the winner outscores the loser — under the Bradley–Terry view, a bigger score gap means a more confident preference. The result reads any prompt-and-answer and returns one scalar. Note what that scalar is: a learned guess at what humans would prefer — not truth, not correctness — which is exactly why it can be gamed.
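
The Bradley–Terry view reduces to a few lines. In this sketch the scores are plain numbers standing in for the score-head's output; the loss is the negative log of the probability that the winner is preferred.

```python
import math

def preference_probability(score_winner, score_loser):
    """Bradley-Terry: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(score_winner - score_loser)))

def reward_model_loss(score_winner, score_loser):
    """Small when the winner clearly outscores the loser, large when it doesn't."""
    return -math.log(preference_probability(score_winner, score_loser))

tie = preference_probability(1.0, 1.0)     # no gap: a coin flip
clear_win = reward_model_loss(3.0, 0.0)    # big gap in the right direction
wrong_way = reward_model_loss(0.0, 3.0)    # big gap in the wrong direction
```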

One number for a thousand tokens

The reward lands once, at the end — but the model made a thousand token-level choices on the way. Which of them deserve the credit or the blame? Spreading that single end-of-answer score back across every decision is RL's credit assignment problem; it's what PPO's value "critic" estimates, and what GRPO sidesteps by comparing whole answers against their siblings instead.

Reward hacking — the failure mode to respect

Optimise hard against any imperfect judge and the model becomes an exploit-finder: longer answers that feel thorough, agreeable answers that feel aligned ("sycophancy"), confident tone over correct content — and, against programmatic checkers, literal cheats like hard-coding a unit test's expected output. The model isn't malicious; it's doing precisely what it was told. A gameable reward is a vulnerability, which is why the verifier-integrity mindset from the security world maps onto this so cleanly.

What KL actually measures — and the β trade-off

KL divergence compares the next-token bets the tuned model makes against the bets the frozen reference would have made, summed over the answer — near zero when they'd write almost the same things, large when the new model has wandered. beta prices that distance: too high and the leash is so short nothing improves; too low and the model sprints to the reward and degenerates. Typical values sit around 0.1 for DPO and lower for GRPO — and some recent recipes (DAPO, next chapter) drop the term entirely and control drift other ways.

Say it in one breath

Reward = one scalar for a whole answer, usually from a judge trained on human comparisons; penalty = the KL leash that charges the model for drifting from its frozen self. Optimise the first minus β times the second, and watch for the hacks.

Go deeper
The RLHF machine, why DPO replaced it, and the whole DPO family

The classic RLHF pipeline, in full

Three stages: (1) SFT a base model; (2) collect preference pairs and train a separate reward model to score answers — under the Bradley–Terry idea, a bigger score gap means a more confident preference; (3) optimise the model with PPO reinforcement learning to chase high reward, minus a penalty (a KL "leash") for drifting too far from the SFT model.

PPO takes deliberately small, clipped steps so it can't lurch and destabilise. The cost is the four-models-in-memory rig — the policy, a frozen reference, the reward model, and a value "critic" — plus genuine training fragility. Powerful (it built the first great assistants) but heavy.

Why DPO won the default slot

A neat derivation shows you can rewrite "reward" purely in terms of the model's own probabilities versus the frozen reference — collapsing the whole reward-model-plus-RL rig into a single supervised-style loss on the comparison pairs. Two models, no RL loop, far more stable. The trade-off: it's offline (fixed dataset, no fresh sampling) and at the very frontier PPO can still edge ahead.

The family — pick by what data you have

Method	Needs ref?	Data	One-liner
DPO	yes	pairs	The default direct method.
IPO	yes	pairs	Anti-overfit tweak to DPO.
ORPO	no	pairs	Folds SFT + preference into one stage.
KTO	yes	thumbs 👍/👎	Learns from unpaired good/bad labels — great for product telemetry.
SimPO	no	pairs	Reference-free, simpler, competitive.

Each one, explained

DPO — Direct Preference Optimization. The baseline everything else tweaks. It turns "the winner should be more likely than the loser, relative to a frozen reference copy of the model" into a single classification loss. Stable, simple, two models. Its weakness: because it only ever sees a fixed dataset of pairs, it can over-optimise — pushing the winner up so hard it also drags unrelated text down, or exploiting that longer answers tend to score higher.

IPO — Identity Preference Optimization. A fix for exactly that over-optimisation. DPO's loss has no natural "stop" — if the data says winner > loser, it'll keep widening that gap forever, eventually overfitting. IPO swaps the loss shape so there's a built-in target margin it settles at instead of running away, which makes it more robust when your preference data is small or noisy.

ORPO — Odds-Ratio Preference Optimization. Removes a whole stage. Normally you do SFT first, then preference-tune on top with a reference model. ORPO fuses them: a single loss that teaches the model the good answers and to prefer them over the bad ones at the same time, with no separate reference copy. Fewer moving parts and one training pass — appealing when you want a lean pipeline.

KTO — Kahneman–Tversky Optimization. The practical one for real products. DPO/IPO/ORPO/SimPO all need pairs (this answer beat that answer), which are expensive to collect. KTO learns from unpaired single judgments — a plain 👍 or 👎 on one answer — using a model of how humans over-weight losses versus gains. Since production telemetry is almost always individual thumbs, not curated comparisons, KTO turns the feedback you actually have straight into training signal.

SimPO — Simple Preference Optimization. Strips DPO down. It drops the frozen reference model entirely (less memory, simpler setup) and instead scores answers by their average per-token likelihood with a target margin — which also naturally counters DPO's bias toward long answers. Reference-free and competitive in quality, at the cost of being a little more sensitive to its settings.

The knobs that matter

beta (~0.1) sets how tightly it stays near the reference; loss_type swaps the family member (sigmoid = vanilla DPO, ipo, kto_pair…); label_smoothing helps when your preference labels are noisy.

Interview-ready

DPO is simpler, stable, two models, offline; PPO is heavier, online, higher ceiling. If all they have is thumbs-up/down telemetry (very likely for a product), KTO is the slick answer because it needs no pairs.

for the curious — what DPO's one step optimizes

For each pair (winner $y_w$, loser $y_l$), DPO simply increases how much more the model prefers the winner over the loser, compared to where it started:

$$\mathcal{L}_{\text{DPO}}=-\log\sigma\!\Big(\beta\big[\,\underbrace{\log\tfrac{\pi(y_w)}{\pi_{\text{ref}}(y_w)}}_{\text{raised the winner}}-\underbrace{\log\tfrac{\pi(y_l)}{\pi_{\text{ref}}(y_l)}}_{\text{lowered the loser}}\big]\Big)$$

$\beta$ controls how far it's allowed to drift from the original model. No judge, no reinforcement learning — just "make the gap between good and bad bigger."
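
The loss above, computed for one pair. The inputs are whole-answer log-probabilities (the sum of token log-probs) under the model being trained and under the frozen reference; the numbers are invented.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(winner shift) - (loser shift)]) for one pair."""
    winner_shift = logp_w - ref_logp_w   # how much the winner was raised
    loser_shift = logp_l - ref_logp_l    # how much the loser was raised
    margin = beta * (winner_shift - loser_shift)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At the start the model equals the reference, so the margin is zero.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After some training: winner more likely, loser less likely than before.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

At the start the loss is log 2 (about 0.693) no matter what the pair is, because the model has not yet moved away from the reference. Training pushes it down by widening the gap.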

GRPO: how models learned to reason

For anything with a checkable answer — maths, code — you don't need a human judge at all. You need a curve.

Preference tuning needs someone to say which answer is better. But for a maths problem, the universe already knows: the answer is right or wrong. GRPO (Group Relative Policy Optimization) exploits this. For one problem, it has the model generate a whole group of attempts (remember, generation is a dice roll, so they differ). It scores each against the checker, then grades them on a curve: beat the group's average and you get reinforced; fall below it and you get discouraged.

Think of it like

A teacher who hands the same problem to eight students, marks them, and tells each one "do more of what you just did" or "do less" depending on whether they beat the class average. No answer key shown, no praise for absolute scores — just relentless, relative improvement. Repeat for millions of problems and the model teaches itself to reason.

No human judge, no answer key revealed — just "did you beat your siblings?" This simple loop is what produced the recent wave of reasoning models.

The catch that's also your edge

The whole thing rests on the checker being honest. If the reward can be gamed (a loophole that scores high without truly solving the problem), the model will find it. That's "reward hacking," and spotting it before the model does is exactly the kind of adversarial thinking a security background brings.

Go deeper
GRPO mechanics, verifiable rewards, the cheaper alternative, and the variants

Why it drops the critic

PPO needs a separate "value" network to judge how good each move was — double the memory, more instability. GRPO replaces it with the group itself: an answer's "advantage" is just how far above or below the group's average it scored, divided by the group's spread. The siblings are the baseline.
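
The grading-on-a-curve step as a sketch. It assumes the population standard deviation and a small epsilon to avoid dividing by zero; implementations differ on both details.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Score each attempt by how far it sits above or below its siblings."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    return [(r - mean) / (spread + eps) for r in rewards]

# Eight attempts at one problem: 1 = checker says correct, 0 = wrong.
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = group_advantages(rewards)

# If every attempt scores the same there is nothing to learn from.
no_signal = group_advantages([1, 1, 1, 1])
```

Correct attempts get a positive advantage and wrong ones a negative one. When the whole group ties, every advantage is zero and the problem contributes no gradient.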

RLVR — where the reward comes from

Reinforcement Learning with Verifiable Rewards swaps a learned, hackable judge for a deterministic checker: did the maths match, did the unit tests pass, is the output valid? Cheap, much harder to fool, and exactly what GRPO consumes. DeepSeek-R1-Zero famously applied GRPO with rule-based rewards straight to a base model, with no SFT stage first, and long chain-of-thought reasoning emerged on its own.

The cheaper cousin: rejection-sampling

Before full RL, often you just generate many attempts, keep the ones the checker marks correct, and SFT on those — repeat. (Names: STaR, ReST, rejection-sampling fine-tuning.) It captures much of the gain with a plain supervised loop, and it's the first thing to try when you have a checker and lots of traces.
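
The generate, filter, keep loop in miniature. The "model" here is a hypothetical noisy adder and the checker is exact arithmetic, both stand-ins; in a real pipeline `generate` samples from the model and the kept pairs become the next round's SFT data.

```python
import random

def rejection_sample(problems, generate, check, attempts_per_problem=8):
    """Keep only the (problem, attempt) pairs the checker accepts."""
    kept = []
    for problem in problems:
        for _ in range(attempts_per_problem):
            attempt = generate(problem)
            if check(problem, attempt):
                kept.append((problem, attempt))
    return kept

# Hypothetical stand-ins for a model and a verifier.
rng = random.Random(0)

def noisy_adder(problem):
    """Adds two numbers but is off by one about half the time."""
    a, b = problem
    return a + b + rng.choice([-1, 0, 0, 1])

def exact_check(problem, attempt):
    return attempt == sum(problem)

sft_data = rejection_sample([(2, 3), (10, 5)], noisy_adder, exact_check)
```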

The variants people name-drop

Vanilla GRPO works, but it has measurable biases — most famously it tends to inflate answer length (longer answers sneak higher scores) and its grading can be skewed by how it normalises. A fast-moving set of variants patches these:

DAPO. An open, reproducible recipe (Yu et al., 2025) that hardens GRPO with four tweaks: Clip-Higher (asymmetric clipping that gives good answers more room to rise than bad ones to be crushed, avoiding the "entropy collapse" where the model stops exploring), dynamic sampling (discard prompts where every answer scored the same — all-right or all-wrong give zero gradient — and resample until the batch has a useful spread), a token-level loss (so long answers don't dominate the update), and a soft penalty for over-long, truncated answers. It also drops the KL term entirely. The payoff is stable training that doesn't stall — with the full recipe published.

Dr.GRPO ("GRPO Done Right", Sea AI Lab). It shows that two of GRPO's normalisation terms quietly bias training: dividing by answer length inflates response length (especially for wrong answers), and dividing by the group's standard deviation over-weights very easy or very hard questions. Dr.GRPO removes both for cleaner, unbiased credit assignment — and notably curbs the runaway answer-length problem.

GSPO — Group Sequence Policy Optimization (Qwen team, used in Qwen3). It pinpoints GRPO's instability in its token-level importance weighting, which injects high variance that compounds on long answers and can collapse the training of large mixture-of-experts models. GSPO moves the weighting and clipping to the whole-sequence level, aligning the unit of optimisation with the sequence-level reward — markedly more stable for long generations and MoE.

The meta-point for an interview: GRPO is a strong baseline, not the final word — knowing that people actively patch its length bias and stability shows you follow the field, not just the headline.

Process vs outcome rewards

An outcome reward grades only the final answer; a process reward grades each reasoning step — denser signal, catches right-answer-wrong-reasoning, but needs step-level labels.

Say it in one breath

Generate several attempts, grade them on a curve against each other, do more of what beat the average: no critic, no reward model, just a checker.

Making it fit and fly: quantization & the memory it carries

Two tricks decide whether a model needs a server farm or runs on your desk: how small you store it, and how it remembers what it just read.

Quantization — the smaller-JPEG trick

A model's knowledge lives in billions of numbers, and by default each is stored at high precision — heavy. Quantization stores them at lower resolution instead, like saving a photo as a smaller JPEG. A bit of detail is lost, but usually you can't tell — and a model that needed 140 GB suddenly fits in 35. This is what makes big models runnable on ordinary hardware, and it's the "compression" half of QLoRA from Chapter 9.

The KV cache — the model's sticky-notes

Here's a problem hiding in "generate one word at a time": to write word 500, attention needs to look back at words 1–499. Redoing that from scratch every single step would be agonisingly slow. So the model keeps sticky-notes of what it already computed for each past word — the KV cache — and just glances at them. Brilliant for speed.

The KV cache is a space-for-time bargain: it buys huge speed, but for long contexts it — not the model's weights — becomes the thing that fills your GPU.

Almost every serving trick you'll hear about — clever ways to manage that pile of notes, or to let many users share one model — is really about taming the KV cache. It's the quiet bottleneck of running models at scale.

Go deeper
Precision & quantization types, bitsandbytes, the KV-cache cluster & the serving stack

Precision — how many bits per number

A weight can be stored at different resolutions. FP32 (4 bytes) is the safe default; FP16 (2 bytes) is fast but can overflow; BF16 (2 bytes) keeps FP32's range with less precision — the modern training default because it's stable without tricks; FP8 (1 byte, on H100-class chips) pushes further. Mixed precision computes in bf16 for speed while keeping a master copy in fp32.

What quantization actually is

Pick a scale that maps a range of real values onto a small grid of integer levels, then store the level. Doing this per small group of weights (say every 64) keeps it accurate. Two distinctions to know:

PTQ vs QAT: quantize an already-trained model (fast) vs simulate quantization during training so the model adapts (more accurate, costlier).
The outlier problem: LLMs have a few huge activation values that naive low-bit quantization wrecks — the methods below mostly differ in how they protect those.
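
The "pick a scale, store the level" idea in its plainest form: symmetric absmax quantization of one group. This is a generic sketch, not NF4, GPTQ or AWQ, and the weights are made up.

```python
def quantize_group(weights, bits=4):
    """Map one group of weights onto signed integer levels using a shared scale."""
    levels = 2 ** (bits - 1) - 1                     # 7 for 4-bit signed
    scale = max(abs(w) for w in weights) / levels or 1.0  # all-zero group: any scale works
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Recover approximate weights from the stored levels."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_group(weights)       # small integers plus one scale
restored = dequantize_group(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Only the small integers and one scale per group are stored. The rounding error per weight is at most half a step, which is why small groups (each with its own scale) stay accurate and why a single huge outlier in a group hurts everyone sharing its scale.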

The format zoo

Name	What it is
LLM.int8()	8-bit inference that keeps rare outliers in 16-bit. (Lives in bitsandbytes.)
NF4	4-bit format tuned to the bell-curve of weights — QLoRA's format.
GPTQ	One-shot 3–4-bit weight-only quantization using curvature info. Popular for inference.
AWQ	Protects the ~1% most important weight channels. Fast, accurate 4-bit.
GGUF	Not an algorithm — the file format (llama.cpp) for running quantized models on CPUs/Macs.

bitsandbytes: the library people mention

It's the plumbing, not a method: it gives you 4-bit/8-bit weights and 8-bit optimizers in one line, and it's the engine doing the work under QLoRA. NF4 and LLM.int8 are the techniques inside it.

Fitting training in memory

Gradient checkpointing: don't store all the intermediate work of the forward pass — recompute it during backprop. ~30% more compute for a big memory saving.
Optimizer states: the hidden memory hog. Adam keeps two extra numbers per weight, so full fine-tuning's memory is mostly optimizer state, not weights. 8-bit optimizers shrink it ~4×.
Flash Attention: computes the exact same attention far faster by working in fast on-chip tiles instead of writing the giant token×token grid to slow memory. Unlocks long context.
FSDP / ZeRO (DeepSpeed): when a model won't fit on one GPU, shard its weights, gradients and optimizer state across many — gathering each piece only when needed.

The KV-cache cluster (serving)

The cache grows with context length × number of concurrent requests, so it dominates memory at scale. Everything here exists to tame it:

PagedAttention (vLLM): manage the cache in small fixed blocks like an operating system's memory paging — almost no waste, and shared prompt-prefixes — for far higher throughput.
GQA / MQA: let query heads share key/value heads, directly shrinking the cache. GQA is the modern default.
KV-cache quantization: store the cache in 8-bit to halve its memory.
Continuous batching: swap finished requests out and new ones in on the fly, keeping the GPU saturated instead of idling.
Speculative decoding: a small fast model drafts several tokens; the big model verifies them in one pass — lossless speed-up.
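
To see why the cache dominates, and what GQA and cache quantization buy, here is the standard size arithmetic as a sketch. The model shape (32 layers, 32 heads of dimension 128) is a hypothetical example, not any specific model.

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len,
                 batch_size=1, bytes_per_value=2):
    """Keys plus values for every layer, KV head, position and request, in GiB."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / 1024 ** 3

# Hypothetical 32-layer model at a 4096-token context, one request, 16-bit cache.
full_heads = kv_cache_gib(32, 32, 128, 4096)                     # every head keeps its own K/V
grouped = kv_cache_gib(32, 8, 128, 4096)                         # GQA: 8 shared K/V heads
grouped_8bit = kv_cache_gib(32, 8, 128, 4096, bytes_per_value=1) # plus an 8-bit cache
```

For this shape the cache is 2 GiB per request with full heads, 0.5 GiB with GQA, and 0.25 GiB with GQA plus an 8-bit cache. Multiply by the number of concurrent requests to see how quickly it outgrows the weights.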

Two more worth a sentence

Mixture of Experts (MoE): many "expert" sub-networks, only a couple active per token — huge capacity at small running cost (but all experts still sit in memory).
Distillation: train a small "student" to imitate a big "teacher" — the cheapest way to get big-model behaviour in a deployable size, and it ties straight to the flywheel: traces are distillation data.

Giving it a library and hands: RAG & agents

Two patterns turn a clever text-guesser into something that knows current facts and can actually do things.

RAG — let it look things up

A model's knowledge is frozen at training time and it can't see your private documents — so it guesses, and sometimes confidently makes things up. RAG fixes this by giving it a library card. Before answering, the system finds the most relevant pages from your documents — using the map-of-meaning trick from Chapter 3, where "relevant" just means "nearby on the map" — and pastes them into the prompt. Now the model answers from real, current sources instead of hazy memory.

Agents — let it act

An agent is a model put in a loop and given tools — a search engine, a calculator, your code, an API. It thinks about the goal, picks a tool, sees the result, and decides the next move. Round and round until the job's done. The model stops being a text generator and becomes a doer.

Think → act → observe, in a loop. And a full recording of a run through this loop has a name you've been waiting for: a trace.

Go deeper
Prompting tricks, the RAG pipeline in full, and how agents really call tools

Prompting & in-context learning

Steering a frozen model through its input alone. Show a few examples and it picks up the pattern with no training (few-shot). Ask it to "think step by step" (chain-of-thought) and hard reasoning improves — the seed that reasoning models later bake in via RL. Always try prompting before fine-tuning: it's instant, free, reversible.

The RAG pipeline, step by step

Offline: split documents into chunks, turn each into an embedding, store in a vector database.
At query time: embed the question, find nearest chunks (cosine similarity / nearest-neighbour search), paste the top matches into the prompt.
Refine: a reranker (cross-encoder) re-scores candidates; hybrid search blends keyword + vector; chunking strategy matters a lot.
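
The first two steps above can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, a Python list stands in for the vector database, and the documents and function names are made up for illustration. The reranking and hybrid-search refinements are left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Offline: chunk the documents and store (chunk, embedding) pairs.
chunks = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "password resets require a verified email address",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    """Query time: embed the question and return the k nearest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, k=1):
    """Paste the top matches into the prompt ahead of the question."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("how long do refunds take")
print(prompt)
```

A real system would swap `embed` for a trained embedding model so that "relevant" means nearby in meaning rather than sharing exact words.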

RAG vs fine-tuning: knowledge that changes → RAG; durable behaviour/format → fine-tune. Mature answer: both.

How agents call tools

The model emits a structured request — which tool, what arguments (the same structured-output problem that makes valid-JSON guarantees matter); your code runs it; the result returns to the context. The dominant pattern is ReAct: Reason → Act → Observe, looped until done. Multi-agent systems add an orchestrator delegating to workers — turning traces into branching trees.
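
A minimal sketch of that Reason → Act → Observe loop follows. Everything named here is an assumption for illustration: `fake_model` is a scripted stand-in for a real LLM call, the JSON request shape is invented, and `calculator` is a deliberately tiny tool.

```python
import json

def calculator(expression):
    """A deliberately tiny tool: adds numbers written as 'a + b + ...'."""
    return str(sum(float(part) for part in expression.split("+")))

TOOLS = {"calculator": calculator}

def fake_model(context):
    """Scripted stand-in for an LLM: asks for the calculator once,
    then answers from the latest observation."""
    observations = [m for m in context if m["role"] == "tool"]
    if not observations:
        return json.dumps({"tool": "calculator", "arguments": {"expression": "19 + 23"}})
    return json.dumps({"final_answer": f"The total is {observations[-1]['content']}"})

def run_agent(goal, model=fake_model, max_steps=5):
    context = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        request = json.loads(model(context))       # Reason: the model emits a structured request
        if "final_answer" in request:
            return request["final_answer"], context
        tool = TOOLS[request["tool"]]               # Act: your code runs the chosen tool
        result = tool(**request["arguments"])
        context.append({"role": "tool", "content": result})  # Observe: the result returns to the context
    raise RuntimeError("agent did not finish within max_steps")

answer, context = run_agent("What is 19 + 23?")
print(answer)
```

The `max_steps` cap is a design choice worth keeping in any real loop: without it, a confused model can call tools forever.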

What a trace really is

A structured recording of that loop — prompt, each reasoning step, every tool call and result, the final output, and the outcome. (Often standardised via OpenTelemetry-style "spans.") That recording is both the debugging surface and the training data of the flywheel in the next chapter.
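
As a rough picture of what such a recording might look like, here is a sketch using plain dataclasses. The field names (`kind`, `spans`, `outcome`) are hypothetical and do not follow the OpenTelemetry schema; they only mirror the pieces listed above.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Span:
    kind: str      # "reasoning", "tool_call", or "tool_result"
    content: str

@dataclass
class Trace:
    prompt: str
    spans: list = field(default_factory=list)
    final_output: str = ""
    outcome: str = "unknown"   # for example "success" or "failure"

    def record(self, kind, content):
        """Append one step of the loop to the recording."""
        self.spans.append(Span(kind, content))

trace = Trace(prompt="What is 19 + 23?")
trace.record("reasoning", "I should use the calculator.")
trace.record("tool_call", '{"tool": "calculator", "arguments": {"expression": "19 + 23"}}')
trace.record("tool_result", "42.0")
trace.final_output = "The total is 42.0"
trace.outcome = "success"

serialized = json.dumps(asdict(trace), indent=2)  # one JSON document per run
print(serialized)
```

The `outcome` field is the part that turns a debugging log into training data: without it, there is no way to tell the wins from the losses.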

Say it in one breath

RAG hands the model the right pages before it answers; an agent is the model in a think-act-observe loop calling tools; a trace is the recording of that loop, and recordings are training data.

The flywheel: where everything connects

Every idea in this guide clicks into one loop — and that loop happens to be a product.

A trace is the recording of an agent's run: every thought, every tool call, every result, and — crucially — whether it worked. Once you have piles of those, something powerful becomes possible.

Because a trace contains the outcome, it's training data in disguise. Keep the runs that succeeded and you have examples for SFT (Chapter 8). Pair a good run against a bad one and you have a comparison for preference tuning (Chapter 10). Have a checkable result and you have a reward for GRPO (Chapter 11). The model improves, runs again, and produces fresh traces. That's a flywheel — and it spins faster the more it's used.

Run → record → learn from the wins → improve → run again. A trace product sits exactly here, owning the most valuable thing in the loop: the data.

And this is also where the unglamorous, important problems live — the ones a security mind sees instantly. Traces are full of private data that must be scrubbed before they become training fuel. A malicious tool result hidden in a trace can quietly poison a model trained on it. A gameable reward is a vulnerability. Most people building these loops don't think this way. That's the gap worth standing in.

Go deeper
Turning traces into training data — and the safety layer that's your edge

How a trace becomes training signal

From a trace you have…	You build…	Train with…
Runs that succeeded	Demonstrations	Rejection-sampling SFT (cheapest — do first)
A good run vs a bad run	Preference pairs	DPO
Only 👍/👎 per run	Binary labels	KTO
A checkable outcome	A verifiable reward	GRPO
Representative runs	Eval / regression sets	Continuous evaluation
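
The first three rows of the table can be sketched as simple filters over stored traces. The trace shape used here (a dict with `prompt`, `output`, `success`) and the function names are assumptions for illustration; real traces carry far more structure and need curation first.

```python
traces = [
    {"prompt": "Sum 2 and 3", "output": "5", "success": True},
    {"prompt": "Sum 2 and 3", "output": "6", "success": False},
    {"prompt": "Capital of France?", "output": "Paris", "success": True},
]

def to_sft_examples(traces):
    """Rejection sampling: keep only the runs that succeeded."""
    return [{"prompt": t["prompt"], "completion": t["output"]} for t in traces if t["success"]]

def to_preference_pairs(traces):
    """Pair a successful and a failed run of the same prompt (DPO-style data)."""
    by_prompt = {}
    for t in traces:
        by_prompt.setdefault(t["prompt"], {True: [], False: []})[t["success"]].append(t["output"])
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for prompt, runs in by_prompt.items()
        for good in runs[True]
        for bad in runs[False]
    ]

def to_binary_labels(traces):
    """Unpaired thumbs-up / thumbs-down labels (KTO-style data)."""
    return [{"prompt": t["prompt"], "completion": t["output"], "label": t["success"]} for t in traces]

sft_data = to_sft_examples(traces)
pairs = to_preference_pairs(traces)
labels = to_binary_labels(traces)
print(len(sft_data), len(pairs), len(labels))
```

Note that the same three traces feed all three datasets; which one you build depends on which training method the outcome signal can support.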

The hard parts that show seniority: curation beats collection (raw traces are noisy — quality dominates), defining the outcome signal is the whole game, and the model drifts as usage shifts, so the loop must run continuously.

Safety & red-teaming — where your background pays off

Alignment & guardrails

Alignment trains good behaviour in (RLHF/DPO, or Constitutional AI, where the model critiques itself against written principles instead of needing human labels). Guardrails enforce it at runtime: input/output filters, moderation classifiers, and structured-output validation. They are layered because any one check can be bypassed.

The attacks

Jailbreaks: prompts that talk a model out of its safety rules — role-play, obfuscation, many-shot, gradual escalation.
Prompt injection — the big one: untrusted content carries instructions the model then obeys. Indirect injection (a web page, document, or tool result hiding instructions) is the top agent vulnerability — and for a trace product, a poisoned tool result inside a captured trace becomes data poisoning the moment you fine-tune on it.
Extraction & inversion: coaxing out training data or secrets; membership inference and model inversion — your IOInversion territory.

Red-teaming as a fine-tuning loop

Attack your own system, turn what breaks into training data (refusals, preference pairs), fix it, re-test. It's the flywheel pointed at safety — and the same discipline catches reward hacking in GRPO before the model finds the exploit.

The synthesis to land in the room

A trace-and-fine-tuning product has four safety problems most ML engineers can't speak to: PII in traces (redact before training), data poisoning via injected tool results, verifier integrity (a gameable reward is a security hole), and tenant isolation on trace data. You can speak to all four, so make sure they hear it.
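
The first of those four, redacting PII before training, can be illustrated with a deliberately naive sketch. The two regular expressions below are assumptions that only catch simple email addresses and US-style phone numbers; a production scrubber would need much broader coverage and review.

```python
import re

# Assumed patterns for this sketch: simple emails and US-style phone numbers only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text):
    """Mask emails and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def redact_trace(trace):
    """Return a copy of a trace (a flat dict here) with string fields scrubbed."""
    return {key: redact(value) if isinstance(value, str) else value
            for key, value in trace.items()}

raw = {"prompt": "Email jane.doe@example.com or call 555-123-4567", "success": True}
clean = redact_trace(raw)
print(clean["prompt"])
```

Returning a scrubbed copy rather than editing in place keeps the raw trace available for access-controlled debugging while only the clean version flows into training.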

The toolbox: what you actually build with

Nobody writes attention by hand. Every idea in this guide ships as a library — and knowing which tool does which job is half of being useful on day one.

You stand on a stack. At the bottom, a base library loads and runs models. On top of it, fine-tuning libraries implement LoRA and the trainers from Acts II–III. Quantization tools shrink the result. Serving engines run it fast for many users. And application tools wire it into RAG and agents. Here's the whole stack, with the names you'll hear — bitsandbytes among them — placed where they belong.

Five layers, bottom to top. Most projects touch all of them — and most "what tool do I use?" questions are really "which layer am I on?"

Go deeper
The four tool families, mapped — what each one is for

1 · Training & fine-tuning

Tool	What it's for
HF Transformers	The base library — load, run, and train almost any open model. The lingua franca everything else builds on.
PEFT	Implements LoRA / QLoRA / DoRA and friends — the `LoraConfig` from Chapter 9 lives here.
TRL	The trainers for the methods in Acts II–III: SFT, DPO, GRPO, PPO.
Unsloth	Heavily optimised fine-tuning — markedly faster and lighter on VRAM; a favourite for single-GPU QLoRA.
Axolotl	Config-file-driven fine-tuning — write a YAML recipe instead of code; popular for reproducible runs.
DeepSpeed / Accelerate / FSDP	Distributed training & sharding across many GPUs (ZeRO). Accelerate is HF's device/distribution glue.

2 · Quantization & precision

Tool	What it's for
bitsandbytes	The one you keep hearing — 4-bit/8-bit weights + 8-bit optimizers in one line. The engine under QLoRA. A toolkit, not a single method.
GPTQ (AutoGPTQ)	One-shot 4-bit quantization for fast inference.
AWQ (AutoAWQ)	Activation-aware 4-bit — protects the most important weights; fast and accurate.
llama.cpp / GGUF	Run quantized models on CPUs and Macs; GGUF is the file format (Q4_K_M, etc.).
HQQ · AQLM · QuIP#	Research-grade pushes toward 2–3-bit.

3 · Serving & inference

Tool	What it's for
vLLM	The default high-throughput server — PagedAttention, continuous batching, serves many LoRA adapters on one base.
TGI	Hugging Face's production inference server.
Ollama	Dead-simple local model running (wraps llama.cpp) — the easy button for laptops and desktops.
SGLang	High-performance serving with very fast structured output.
TensorRT-LLM	NVIDIA's maximally-optimised engine — top performance on NVIDIA hardware, more setup.

4 · RAG, agents & data

Tool	What it's for
Vector DBs	Store & search embeddings: FAISS (library), Chroma, Qdrant, pgvector, Pinecone, Weaviate, Milvus.
Embedding models	Turn text into meaning-vectors: sentence-transformers, BGE, E5, OpenAI text-embedding-3.
Orchestration	Wire prompts, retrieval & tools together: LangChain, LlamaIndex (RAG-first), Haystack; LangGraph for agent loops.
Structured output	Force valid JSON/schemas: Outlines, Instructor, Guidance, XGrammar.
Eval & observability	Test and trace runs: LangSmith, Langfuse, Arize Phoenix, Braintrust; Ragas for RAG quality.

One honest caveat: this ecosystem moves fast — new tools arrive and leaders shift. Treat the categories as durable and the specific names as a snapshot to re-check.

The methods, decoded

Every acronym in one place — its full name and a deep explanation, no clicking required. This is the page to skim before an interview.

Fine-tuning — touching the weights

SFT

Supervised Fine-Tuning

The first and most important post-training step. You collect example pairs — a prompt and an ideal answer — and train the model on them using the same next-token-prediction game as pretraining, but scoring only the answer. This teaches it to answer instead of ramble, hold a format, follow instructions, and adopt a tone. The craft is almost entirely in the data: a few thousand excellent examples beat a hundred thousand mediocre ones, and over-training causes "catastrophic forgetting" of general ability. Everything after SFT is refinement on top of it.
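
"Scoring only the answer" can be made concrete with a small sketch. It assumes we already have the probability the model assigned to each correct token and a mask marking which tokens belong to the answer; the function name and the numbers are illustrative.

```python
import math

def completion_only_loss(token_probs, is_answer_token):
    """Mean negative log-likelihood over answer tokens only.

    token_probs[i] is the probability the model gave to the correct token i;
    is_answer_token[i] is True for answer tokens and False for prompt tokens.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_answer_token) if keep]
    return sum(losses) / len(losses)

# Hypothetical sequence: 3 prompt tokens followed by 2 answer tokens.
probs = [0.01, 0.02, 0.01, 0.5, 0.25]
mask = [False, False, False, True, True]

loss = completion_only_loss(probs, mask)
print(round(loss, 4))
```

The poorly predicted prompt tokens contribute nothing, so the model is never pushed to memorise users' questions, only to produce good answers to them.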

PEFT

Parameter-Efficient Fine-Tuning

The umbrella term for fine-tuning that updates only a tiny fraction of a model's weights instead of all of them. Full fine-tuning is expensive: you carry gradients and optimizer state for every one of billions of parameters. PEFT methods freeze the base model and train small add-ons, cutting the trainable parameters (and the gradient and optimizer memory that comes with them) by 100× or more while keeping most of the quality. LoRA is the dominant member; it's why fine-tuning is practical on ordinary hardware.

LoRA

Low-Rank Adaptation

The flagship PEFT method. It freezes the whole model and learns two small "adapter" matrices whose product is added onto chosen weight matrices — exploiting the fact that the useful change during fine-tuning is low-rank (it lives in a small subspace). You train under 1% of the parameters, can keep many swappable adapters on one shared base, and can merge or peel them off. The two dials that matter: rank (how much the adapter can hold) and which layers get adapters.
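
A sketch of the arithmetic, in plain Python so nothing is hidden. It assumes the usual LoRA conventions (the update is scaled by alpha / rank, and one adapter matrix starts at zero so training begins from the unchanged base); the matrix sizes are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """W' = W + (alpha / r) * B @ A, with W frozen and only A and B trained."""
    r = len(a)                    # rank = number of rows in A
    delta = matmul(b, a)          # (d_out x r) @ (r x d_in) gives d_out x d_in
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d_out, d_in, r = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen base weight
a = [[0.5] * d_in for _ in range(r)]      # trainable, r x d_in
b = [[0.0] * r for _ in range(d_out)]     # trainable, d_out x r, zero-initialised

w_start = lora_effective_weight(w, a, b, alpha=2)  # B is zero, so the adapter starts as a no-op

def trainable_fraction(d_out, d_in, r):
    """Adapter parameters as a fraction of the full matrix."""
    return r * (d_in + d_out) / (d_in * d_out)

fraction_4096 = trainable_fraction(4096, 4096, r=8)
print(f"{fraction_4096:.2%}")
```

For a 4096 × 4096 matrix at rank 8, the adapter holds about 0.39% of the parameters of the matrix it modifies, which is where the "under 1%" figure comes from.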

QLoRA

Quantized Low-Rank Adaptation

LoRA on top of a base model compressed to 4-bit. Even frozen, a 70B model in 16-bit precision needs about 140 GB for its weights and won't fit on one GPU; quantizing it to 4-bit (via the NF4 format) shrinks it ~4×, to roughly 35 GB, so the adapters fit alongside, and gradients flow through the frozen 4-bit weights into the 16-bit adapters. This is what lets a model of that size be fine-tuned on a single 48 GB-class GPU, and 7B to 13B models on a consumer card. It is the default starting move today.
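
The ~4× shrink is simple arithmetic. The sketch below counts the weights alone; it deliberately ignores quantization constants, activations, the adapters, and optimizer state, so real usage is somewhat higher.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(70, 16)   # 16-bit weights
nf4_gb = weight_memory_gb(70, 4)     # 4-bit weights

print(f"70B at 16-bit: {fp16_gb:.0f} GB, at 4-bit: {nf4_gb:.0f} GB")
```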

DoRA

Weight-Decomposed Low-Rank Adaptation

A refinement of LoRA that splits each weight matrix into a magnitude and a direction, letting LoRA adapt the direction while a separately learned magnitude vector handles the scale. It closes much of the remaining quality gap to full fine-tuning, especially at low rank, for a small extra cost.

Alignment — learning taste from preferences

RLHF

Reinforcement Learning from Human Feedback

The original method for aligning a model to human preferences, and what turned raw models into usable assistants. Three stages: SFT a base; train a separate "reward model" on human judgments of which answer is better; then use reinforcement learning (PPO) to push the model toward higher reward while a KL penalty stops it drifting too far from its SFT starting point (the frozen reference model). Powerful but heavy: four models in memory (policy, reward model, reference, critic) and genuinely finicky to stabilize. Its complexity is exactly what DPO was invented to avoid.

PPO

Proximal Policy Optimization

The reinforcement-learning algorithm at the heart of RLHF. It improves the model in small, "clipped" steps so a single update can't lurch too far and destabilize training, using a separate "value" network to estimate how good each move was. Effective, but the value network plus the reward and reference models make it memory-hungry and fiddly — which is what GRPO later streamlines for verifiable tasks.

DPO

Direct Preference Optimization

A shortcut that reaches RLHF's goal — learning from "A is better than B" comparisons — in a single supervised-style training step, with no reward model and no reinforcement learning. A clever derivation lets you express "reward" purely in terms of the model's own probabilities versus a frozen reference, collapsing the whole rig into one loss that raises the chosen answer and lowers the rejected one. Two models instead of four, far more stable — which is why most teams reach for it first. The trade-off: it's offline (fixed dataset), and at the frontier PPO can still edge it out.
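
The "one loss" can be written out for a single preference pair. This sketch assumes each input is the summed log-probability of a whole answer; the default `beta=0.1` and the example numbers are illustrative choices, not recommendations.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full answer under the
    trainable policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)        # implicit reward of the chosen answer
    rejected_reward = beta * (policy_rejected - ref_rejected)  # implicit reward of the rejected answer
    margin = chosen_reward - rejected_reward
    return math.log1p(math.exp(-margin))                       # equals -log(sigmoid(margin))

# Before training the policy equals the reference, so the margin is zero.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After some training: chosen answer more likely, rejected answer less likely.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(round(start, 4), round(improved, 4))
```

Only two sets of log-probabilities are needed, from the policy and the reference, which is the "two models instead of four" saving in concrete form.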

KTO

Kahneman–Tversky Optimization

A preference method that learns from unpaired thumbs-up / thumbs-down labels instead of neat A-vs-B pairs, drawing on prospect-theory ideas about how humans weigh gains and losses. This matters in practice because real product telemetry is usually individual 👍/👎 signals, not curated comparisons — KTO turns that raw feedback straight into alignment training.

ORPO

Odds-Ratio Preference Optimization

A "monolithic" method that folds SFT and preference alignment into a single stage with no separate reference model. It adds an odds-ratio penalty to the normal fine-tuning loss, so the model learns the demonstrated answers and to prefer the better of a pair at the same time — fewer moving parts, one pass.

SimPO

Simple Preference Optimization

A reference-free simplification of DPO using a length-normalized reward and a target margin, removing the need for a frozen reference model. Simpler and lighter on memory, and competitive with DPO in quality.

Reasoning — learning to be correct

GRPO

Group Relative Policy Optimization

The method behind the recent wave of reasoning models. For a task with a checkable answer, it generates a group of attempts per problem, scores each, and grades them on a curve — reinforcing attempts that beat the group's average and discouraging those below it. By using the group itself as the baseline, it throws away PPO's value network entirely, making RL cheaper and more stable. It pairs with verifiable rewards, and it's how models like DeepSeek-R1 effectively taught themselves to reason. Its weak point — a gameable checker invites "reward hacking" — is exactly where adversarial thinking pays off.
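
"Grading on a curve" boils down to a few lines. The sketch below assumes a group of eight attempts scored 0 or 1 by a checker; dividing by the group's standard deviation is a common normalisation that some variants drop, so treat it as an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt against its own group: above the mean is positive,
    below the mean is negative. No value network is involved."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: 8 attempts at one problem, 2 of them correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

print([round(a, 2) for a in advantages])
```

If every attempt in a group gets the same reward, all advantages are zero and that problem teaches nothing, which is one reason task difficulty matters so much in practice.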

RLVR

Reinforcement Learning with Verifiable Rewards

The reward style that powers reasoning models. Instead of a learned, hackable reward model, the reward comes from a deterministic checker: did the maths match, did the unit tests pass, is the output valid? Cheap, much harder to game than a learned reward model (though a sloppy checker can still be exploited), and exactly what GRPO consumes. It's the difference between teaching taste (RLHF) and teaching correctness (RLVR).
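
A deterministic checker can be very small. The sketch below is a hypothetical maths checker that compares the last number in the output with the expected answer; the rule it uses is an assumption chosen for brevity, and the final example shows how such a rule can be exploited.

```python
import re

def math_reward(model_output, expected):
    """Return 1.0 if the last number in the output equals the expected answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0

print(math_reward("19 + 23 = 42", "42"))         # correct answer
print(math_reward("I think it's 41", "42"))      # wrong answer
# A weak spot: hedging with several guesses still earns full reward,
# because this checker only looks at the last number.
print(math_reward("The answer is 7. Or maybe 42", "42"))
```

Tightening the checker (for example, requiring a single clearly marked final answer) is the kind of verifier-integrity work that keeps the reward from being gamed.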

Using & running the model

RAG

Retrieval-Augmented Generation

Not a fine-tuning method but the main alternative for knowledge. Before answering, the system retrieves the most relevant passages from a document store (by embedding similarity) and pastes them into the prompt, so the model answers from real, current sources instead of frozen memory. Rule of thumb: RAG for knowledge that changes, fine-tuning for durable behaviour — mature systems use both.

CoT

Chain of Thought

A prompting technique: asking the model to reason step by step before answering, which sharply improves hard problems by giving it room to work instead of guessing in one leap. Reasoning models bake this in through RL rather than relying on the prompt to ask for it.

MoE

Mixture of Experts

An architecture (not a training method) where each layer holds many "expert" sub-networks but a router activates only a couple per token. This decouples capacity from compute — huge total knowledge, but only a fraction runs per token, so inference stays cheap. The catch: all experts still sit in memory.
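
The routing step can be sketched as follows. The experts here are trivial functions standing in for feed-forward sub-networks, the router scores are hard-coded rather than learned, and re-normalising a softmax over just the chosen experts is one convention among several; all of these are assumptions for illustration.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax their scores into mixing weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the chosen experts' outputs; the other experts are never run."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts: 8 tiny functions, each scaling its input differently.
experts = [lambda x, scale=s: scale * x for s in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # would come from a learned router

weights = top_k_route(router_logits)
output = moe_layer(1.0, experts, router_logits)
print(weights, output)
```

All eight experts exist in the `experts` list (the memory cost), but only two are called per input (the compute cost), which is the decoupling described above.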
