
How LLMs Actually Work: Tokenization, Training & Inference

Kishore Gunnam

Developer & Writer

You've seen what LLMs can do. Now let's understand how they do it.

This builds on the transformer architecture from Part 2.


The Three Core Processes

  1. Tokenization: text → numbers (happens on every message)
  2. Training: learn patterns (costs millions, done once)
  3. Inference: generate the response (happens on every API call)


Tokenization

LLMs don't see text. They see numbers. Tokenization is the bridge.

If you've ever seen “input tokens / output tokens” on an API bill, this is what's being counted. Tokens are the units the model works with internally, and you pay (and wait) roughly in proportion to how many you send.

Beginner intuition: tokenization is like chopping text into LEGO pieces. Some words are one piece. Some words are multiple pieces. The model only ever sees the pieces.

[Interactive tokenizer demo: the input "Hello, how are you?" is split into tokens, then mapped to a sequence of token IDs.]

LLMs don't see words—they see token IDs. Common words are single tokens; rare words split into pieces.
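To make this concrete, here's a small sketch using OpenAI's open-source tiktoken library (assuming you have it installed; a GPT-style tokenizer is just one example). Other models use different tokenizers, so the exact splits and IDs will differ.

```python
# pip install tiktoken  -- OpenAI's open-source BPE tokenizer
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # encoding used by several OpenAI models

ids = enc.encode("Hello, how are you?")       # text -> token IDs
print(ids)                                    # a short list of integers
print([enc.decode([i]) for i in ids])         # each ID mapped back to its text piece
```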

Tokens aren't always words:

  • Whole words: "Hello"
  • Word pieces: "un" + "believ" + "able"
  • Single characters: punctuation

BPE Tokenization:

  1. Start with characters (split text into individual letters)
  2. Count pairs (find most common adjacent pairs)
  3. Merge top pair (combine into one token)
  4. Repeat 50K times (until vocabulary complete)
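Here's a toy sketch of that merge loop on a tiny corpus. It's illustrative only: real tokenizers (GPT-2's BPE, tiktoken, SentencePiece) operate on bytes, handle spaces carefully, and run the loop tens of thousands of times over huge corpora.

```python
# Toy byte-pair-encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent pair of symbols into a new token.
from collections import Counter

def bpe_merges(text, num_merges=10):
    symbols = list(text)                               # 1. start with characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))     # 2. count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]            # 3. pick the most common pair
        merges.append(a + b)
        merged, i = [], 0
        while i < len(symbols):                        #    merge every occurrence of it
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged                               # 4. repeat with the new symbols
    return symbols, merges

tokens, merges = bpe_merges("low lower lowest", num_merges=5)
print(tokens)   # characters gradually fuse into larger pieces like "low"
print(merges)   # the learned merge rules, in order
```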

Training

The core task is simple:

The Language Modeling Objective: given the tokens so far, predict the next token.

Think of it like autocomplete trained on the whole internet, tuned to be useful in conversation.

The model isn't storing sentences and "looking them up." It's compressing patterns into weights. That's why it can generalize—and also why it can confidently make things up.

Training Loop:

  1. Sample text (grab chunk from training data)
  2. Tokenize (convert to token IDs)
  3. Forward pass (model predicts next token)
  4. Calculate loss (how wrong was the prediction?)
  5. Update weights (nudge parameters to reduce loss)
  6. Repeat billions of times (until patterns emerge)
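If you squint, that loop fits in a few lines of PyTorch. The sketch below uses a toy embedding-plus-linear "model" and random token IDs as stand-ins; a real LLM would have transformer blocks in the middle and real tokenized text, but the five steps are the same.

```python
# Minimal next-token-prediction training loop (PyTorch). Everything here is a
# toy placeholder: tiny model, random "data", 100 steps instead of billions.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 32

model = nn.Sequential(                      # real LLMs: transformer blocks go here
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # 1-2. Sample a batch of already-tokenized text (random IDs stand in here)
    tokens = torch.randint(0, vocab_size, (8, context_len + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one: predict the NEXT token

    # 3. Forward pass: logits over the vocabulary at every position
    logits = model(inputs)                             # shape: (batch, seq, vocab)

    # 4. Loss: how wrong was each next-token prediction?
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

    # 5. Update weights to nudge the loss down
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```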

Training Costs

Estimated training costs:

  • GPT-2 (2019): ~$50K
  • GPT-3 (2020): ~$4.6M
  • GPT-4 (2023): $100M+

This is why only a few companies can train frontier models. See Part 3 for the full GPT history.


Inference

What happens when you chat with AI:

This is the “live” part. Training is done once (by the model provider). Inference happens every time you hit Enter—on your laptop, on your phone, in an API call.

When You Press Enter:

  1. Tokenize input (your message → token IDs)
  2. Forward pass (process through all layers)
  3. Get probabilities (distribution over vocabulary)
  4. Sample token (pick one based on probabilities)
  5. Repeat (until done)
  6. Detokenize (token IDs → readable text)
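Strung together, those steps are just a loop that feeds the model its own output. The sketch below assumes a causal language model that returns logits of shape (batch, sequence, vocab); `model`, `tokenize`, and `detokenize` are placeholders for whatever stack you actually use.

```python
# Sketch of autoregressive generation: one token at a time until done.
import torch

def generate(model, tokenize, detokenize, prompt, max_new_tokens=50, eos_id=0):
    ids = tokenize(prompt)                              # 1. text -> token IDs
    for _ in range(max_new_tokens):                     # 5. repeat until done
        logits = model(torch.tensor([ids]))             # 2. forward pass through all layers
        probs = torch.softmax(logits[0, -1], dim=-1)    # 3. distribution over the vocabulary
        next_id = torch.multinomial(probs, 1).item()    # 4. sample one token
        ids.append(next_id)
        if next_id == eos_id:                           # stop token ends the response
            break
    return detokenize(ids)                              # 6. token IDs -> readable text
```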

Temperature

Low Temperature (0.1) vs. High Temperature (1.5):

  • Selection: almost always the top choice vs. more random
  • Output: focused, deterministic vs. creative, varied
  • Best for: code and facts vs. creative writing

Temperature Sampling

Example at temperature 0.60, for the prompt "The capital of France is":

  • " Paris": 92.5%
  • " Lyon": 3.3%
  • " Marseille": 2.4%
  • ".": 1.2%
  • " definitely": 0.6%

Lower temperature = more deterministic, higher temperature = more random
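Under the hood, temperature just rescales the logits before the softmax: dividing by a small number sharpens the distribution, dividing by a large number flattens it. The sketch below uses made-up logits (loosely mirroring the demo above) to show the effect.

```python
# Temperature sampling sketch: scale logits, softmax, then sample.
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature   # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())                    # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [9.0, 5.5, 5.2, 4.5, 3.8]     # hypothetical scores for " Paris", " Lyon", ...
_, cold = sample_with_temperature(logits, temperature=0.1)
_, hot = sample_with_temperature(logits, temperature=1.5)
print(cold.round(3))   # nearly all probability on the top token
print(hot.round(3))    # much flatter: other tokens get a real chance
```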


Context Windows

Context window growth:

  • GPT-3 (2020): 4K tokens
  • GPT-4 (2023): 32K tokens
  • Claude 3 (2024): 200K tokens
  • Gemini 2.5 (2025): 2M tokens

Deep dive: why context windows matter (optional)

Bigger context windows let you include more chat history and documents in one shot. It's still not long-term memory—if it's not in the prompt, the model can't use it.
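A common consequence for builders: chat apps trim or summarize old messages so the prompt fits the window. Here's a rough sketch of the trimming approach; `count_tokens` stands in for whatever tokenizer your model uses, and the budget is made up.

```python
# Keep only as much recent chat history as fits the context window.
def fit_to_context(messages, count_tokens, max_tokens=8000):
    kept, used = [], 0
    for msg in reversed(messages):       # walk from the newest message backwards
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                        # older history simply falls off the prompt
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order
```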

Compare models in Part 4.


Why Hallucinations Happen

Key Mental Model

If an LLM is asked about something that isn't in its training data, it doesn't say "I don't know." It generates plausible-sounding text anyway.

The most practical way to think about hallucinations: the model will happily fill in missing information if your prompt implies there is an answer. Your job as a builder is to either:

  • provide sources (RAG),
  • ask the model to quote sources / show uncertainty,
  • or restrict it to a known database/tool.
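Here's a rough sketch of the first two options combined: stuff retrieved sources into the prompt and explicitly allow "I don't know." The function and wording are illustrative, not a standard API.

```python
# Build a grounded prompt from retrieved sources, with an explicit escape hatch.
def build_grounded_prompt(question, sources):
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer ONLY from the sources below, and cite them like [1].\n"
        'If the sources do not contain the answer, reply "I don\'t know."\n\n'
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```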

LLMs ARE:

  • Pattern matchers
  • Probability generators
  • Trained on past data

LLMs ARE NOT:

  • Databases of facts
  • Truth engines
  • Aware of current events

Learn about reducing hallucinations with RAG in Part 7.


Key Takeaways

Tokenization converts text to numbers (subword pieces).

Training is next-token prediction, billions of times.

Inference generates one token at a time.

Temperature controls randomness.

Hallucinations happen because pattern matching ≠ truth.


Common beginner mistakes

  • Assuming the model “knows” what it said earlier. If it’s not in context, it can’t reliably use it.
  • Using high temperature for factual tasks, then being surprised by inconsistency.
  • Blaming the model for hallucinations when the app provides no grounding or “I don’t know” path.

Quick Check


What is an LLM trained to do?


What's Next?

In Part 6, we explore AI alignment. How do we make sure these systems behave? RLHF, Constitutional AI, and the challenges ahead.