
How LLMs Actually Work: Tokenization, Training & Inference

Kishore Gunnam

Developer & Writer

You've seen what LLMs can do. Now let's understand how they do it.

This builds on the transformer architecture from Part 2.


The Three Core Processes

  1. Tokenization: text → numbers (happens on every message)
  2. Training: learn patterns (costs millions, done once)
  3. Inference: generate the response (happens on every API call)


Tokenization

LLMs don't see text. They see numbers. Tokenization is the bridge.

If you've ever seen “input tokens / output tokens” on an API bill, this is what's being counted. Tokens are the units the model works with internally, and you pay (and wait) roughly in proportion to how many you send.

Beginner intuition: tokenization is like chopping text into LEGO pieces. Some words are one piece. Some words are multiple pieces. The model only ever sees the pieces.

[Interactive tokenizer demo: the input "Hello, how are you?" is split into tokens, then mapped to a sequence of token IDs.]

LLMs don't see words—they see token IDs. Common words are single tokens; rare words split into pieces.
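To make this concrete, here's a small sketch using OpenAI's open-source tiktoken library (assuming you have it installed; a GPT-style tokenizer is just one example). Other models use different tokenizers, so the exact splits and IDs will differ.

```python
# pip install tiktoken  -- OpenAI's open-source BPE tokenizer
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")    # encoding used by several OpenAI models

ids = enc.encode("Hello, how are you?")       # text -> token IDs
print(ids)                                    # a short list of integers
print([enc.decode([i]) for i in ids])         # each ID mapped back to its text piece
```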

Tokens aren't always words:

  • Whole words: "Hello"
  • Word pieces: "un" + "believ" + "able"
  • Single characters: punctuation

BPE Tokenization:

  1. Start with characters (split text into individual letters)
  2. Count pairs (find most common adjacent pairs)
  3. Merge top pair (combine into one token)
  4. Repeat 50K times (until vocabulary complete)
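Here's a toy sketch of that merge loop on a tiny corpus. It's illustrative only: real tokenizers (GPT-2's BPE, tiktoken, SentencePiece) operate on bytes, handle spaces carefully, and run the loop tens of thousands of times over huge corpora.

```python
# Toy byte-pair-encoding (BPE) sketch: repeatedly merge the most frequent
# adjacent pair of symbols into a new token.
from collections import Counter

def bpe_merges(text, num_merges=10):
    symbols = list(text)                               # 1. start with characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))     # 2. count adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]            # 3. pick the most common pair
        merges.append(a + b)
        merged, i = [], 0
        while i < len(symbols):                        #    merge every occurrence of it
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged                               # 4. repeat with the new symbols
    return symbols, merges

tokens, merges = bpe_merges("low lower lowest", num_merges=5)
print(tokens)   # characters gradually fuse into larger pieces like "low"
print(merges)   # the learned merge rules, in order
```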

Training

The core task is simple:

The Language Modeling Objective: given the tokens so far, predict the next token.

Think of it like autocomplete trained on the whole internet, tuned to be useful in conversation.

The model isn't storing sentences and "looking them up." It's compressing patterns into weights. That's why it can generalize—and also why it can confidently make things up.

Training Loop:

  1. Sample text (grab chunk from training data)
  2. Tokenize (convert to token IDs)
  3. Forward pass (model predicts next token)
  4. Calculate loss (how wrong was the prediction?)
  5. Update weights (nudge parameters to reduce loss)
  6. Repeat billions of times (until patterns emerge)
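If you squint, that loop fits in a few lines of PyTorch. The sketch below uses a toy embedding-plus-linear "model" and random token IDs as stand-ins; a real LLM would have transformer blocks in the middle and real tokenized text, but the five steps are the same.

```python
# Minimal next-token-prediction training loop (PyTorch). Everything here is a
# toy placeholder: tiny model, random "data", 100 steps instead of billions.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 32

model = nn.Sequential(                      # real LLMs: transformer blocks go here
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # 1-2. Sample a batch of already-tokenized text (random IDs stand in here)
    tokens = torch.randint(0, vocab_size, (8, context_len + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]    # shift by one: predict the NEXT token

    # 3. Forward pass: logits over the vocabulary at every position
    logits = model(inputs)                             # shape: (batch, seq, vocab)

    # 4. Loss: how wrong was each next-token prediction?
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))

    # 5. Update weights to nudge the loss down
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```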

Training Costs

Estimated training costs:

  • GPT-2 (2019): ~$50K
  • GPT-3 (2020): ~$4.6M
  • GPT-4 (2023): $100M+

This is why only a few companies can train frontier models. See Part 3 for the full GPT history.


Inference

What happens when you chat with AI:

This is the “live” part. Training is done once (by the model provider). Inference happens every time you hit Enter—on your laptop, on your phone, in an API call.

When You Press Enter:

  1. Tokenize input (your message → token IDs)
  2. Forward pass (process through all layers)
  3. Get probabilities (distribution over vocabulary)
  4. Sample token (pick one based on probabilities)
  5. Repeat (until done)
  6. Detokenize (token IDs → readable text)
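Strung together, those steps are just a loop that feeds the model its own output. The sketch below assumes a causal language model that returns logits of shape (batch, sequence, vocab); `model`, `tokenize`, and `detokenize` are placeholders for whatever stack you actually use.

```python
# Sketch of autoregressive generation: one token at a time until done.
import torch

def generate(model, tokenize, detokenize, prompt, max_new_tokens=50, eos_id=0):
    ids = tokenize(prompt)                              # 1. text -> token IDs
    for _ in range(max_new_tokens):                     # 5. repeat until done
        logits = model(torch.tensor([ids]))             # 2. forward pass through all layers
        probs = torch.softmax(logits[0, -1], dim=-1)    # 3. distribution over the vocabulary
        next_id = torch.multinomial(probs, 1).item()    # 4. sample one token
        ids.append(next_id)
        if next_id == eos_id:                           # stop token ends the response
            break
    return detokenize(ids)                              # 6. token IDs -> readable text
```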

Temperature

Low Temperature (0.1) vs. High Temperature (1.5):

  • Selection: almost always the top choice vs. more random
  • Output: focused, deterministic vs. creative, varied
  • Best for: code and facts vs. creative writing

Temperature Sampling

Example at temperature 0.60, for the prompt "The capital of France is":

  • " Paris": 92.5%
  • " Lyon": 3.3%
  • " Marseille": 2.4%
  • ".": 1.2%
  • " definitely": 0.6%

Lower temperature = more deterministic, higher temperature = more random
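Under the hood, temperature just rescales the logits before the softmax: dividing by a small number sharpens the distribution, dividing by a large number flattens it. The sketch below uses made-up logits (loosely mirroring the demo above) to show the effect.

```python
# Temperature sampling sketch: scale logits, softmax, then sample.
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature   # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())                    # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [9.0, 5.5, 5.2, 4.5, 3.8]     # hypothetical scores for " Paris", " Lyon", ...
_, cold = sample_with_temperature(logits, temperature=0.1)
_, hot = sample_with_temperature(logits, temperature=1.5)
print(cold.round(3))   # nearly all probability on the top token
print(hot.round(3))    # much flatter: other tokens get a real chance
```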


Context Windows

Context window growth:

  • GPT-3 (2020): 4K tokens
  • GPT-4 (2023): 32K tokens
  • Claude 3 (2024): 200K tokens
  • Gemini 2.5 (2025): 2M tokens

Deep dive: why context windows matter (optional)

Bigger context windows let you include more chat history and documents in one shot. It's still not long-term memory—if it's not in the prompt, the model can't use it.
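A common consequence for builders: chat apps trim or summarize old messages so the prompt fits the window. Here's a rough sketch of the trimming approach; `count_tokens` stands in for whatever tokenizer your model uses, and the budget is made up.

```python
# Keep only as much recent chat history as fits the context window.
def fit_to_context(messages, count_tokens, max_tokens=8000):
    kept, used = [], 0
    for msg in reversed(messages):       # walk from the newest message backwards
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break                        # older history simply falls off the prompt
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order
```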

Compare models in Part 4.


Why Hallucinations Happen

Key Mental Model

If an LLM is asked about something that isn't in its training data, it doesn't say "I don't know." It generates plausible-sounding text anyway.

The most practical way to think about hallucinations: the model will happily fill in missing information if your prompt implies there is an answer. Your job as a builder is to either:

  • provide sources (RAG),
  • ask the model to quote sources / show uncertainty,
  • or restrict it to a known database/tool.
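Here's a rough sketch of the first two options combined: stuff retrieved sources into the prompt and explicitly allow "I don't know." The function and wording are illustrative, not a standard API.

```python
# Build a grounded prompt from retrieved sources, with an explicit escape hatch.
def build_grounded_prompt(question, sources):
    context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer ONLY from the sources below, and cite them like [1].\n"
        'If the sources do not contain the answer, reply "I don\'t know."\n\n'
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```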

LLMs ARE:

  • Pattern matchers
  • Probability generators
  • Trained on past data

LLMs ARE NOT:

  • Databases of facts
  • Truth engines
  • Aware of current events

Learn about reducing hallucinations with RAG in Part 7.


Key Takeaways

Tokenization converts text to numbers (subword pieces).

Training is next-token prediction, billions of times.

Inference generates one token at a time.

Temperature controls randomness.

Hallucinations happen because pattern matching ≠ truth.


Common beginner mistakes

  • Assuming the model “knows” what it said earlier. If it’s not in context, it can’t reliably use it.
  • Using high temperature for factual tasks, then being surprised by inconsistency.
  • Blaming the model for hallucinations when the app provides no grounding or “I don’t know” path.

Quick Check


What is an LLM trained to do?


What's Next?

In Part 6, we explore AI alignment. How do we make sure these systems behave? RLHF, Constitutional AI, and the challenges ahead.