The Transformer Revolution: How Attention Changed Everything
Kishore Gunnam
Developer & Writer
Complete Guide to LLMs · Part 2 of 8
In Part 1, we traced AI from Turing's 1950 thought experiment to the 2017 transformer paper. Now let's crack open the black box.
By the end, you'll understand the core mechanism behind every modern AI.
This single idea is why modern AI feels different from older NLP:
- Better long-range coherence (it can connect “it” to the right noun 100 tokens back)
- Speed (attention can be computed in parallel, which makes training at scale possible)
- Transfer (once the architecture works, you can scale it with more data/compute and get surprising new skills)
You’ll also see words like embedding, vector, and softmax. You don’t need to memorize them. You just need one idea: the model turns words into numbers, then uses math to decide what should influence what.
What You'll Learn
- Self-Attention: how words find relevant context
- Query-Key-Value: the search mechanism inside transformers
- Multi-Head Attention: multiple perspectives in parallel
- Positional Encoding: how order is preserved
- Full Architecture: putting it all together
The Problem Transformers Solved
RNNs from Part 1 processed words one at a time, squeezing everything they had read so far into a single running memory. The transformer's core insight: instead of reading sequentially, look at ALL words at once.
Attention: The Big Idea
Every word asks one question: "Which other words should influence my meaning?"
Consider a sentence like "The cat sat on the mat because it was soft." Self-attention lets each word decide which other words are relevant to its meaning. When processing "it," the model calculates that "mat" is highly relevant (what was soft? the mat). This happens for every word, all at once.
For the word you're currently processing, the model asks "what other words should influence me?" Then it mixes information from those words.
This lets the model resolve references (like "it") and keep long sentences coherent without relying on fragile memory.
A tiny attention example (what it’s doing under the hood)
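To make this concrete, here's a minimal sketch in Python with NumPy. The sentence and the relevance scores are made up for illustration (a real model learns them), but the softmax-then-mix mechanics are exactly what attention does.

```python
import numpy as np

# Hand-picked "relevance scores" for the word "it" against the other
# words in "the cat sat on the mat because it was soft". In a real
# model these scores come from comparing learned vectors.
words  = ["the", "cat", "sat", "on", "the", "mat", "because", "was", "soft"]
scores = np.array([0.1, 1.0, 0.3, 0.1, 0.1, 4.0, 0.2, 0.5, 2.5])

# Softmax: exponentiate, then normalize so the weights sum to 1
weights = np.exp(scores) / np.exp(scores).sum()
for word, w in zip(words, weights):
    print(f"{word:>8}: {w:.2f}")   # "mat" dominates, "soft" comes second

# Each word also carries a value vector; the output for "it" is the
# weighted sum of those vectors (4 dimensions here, just for show).
values = np.random.randn(len(words), 4)
output_for_it = weights @ values   # a new 4-dim vector representing "it"
```

Run it and you'll see "mat" take roughly 70% of the weight, which is exactly the "borrow 70% from 'mat'" behavior described in the deep dive below.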
Deep dive: what 'softmax attention' is doing (no math required; optional)
The model scores how relevant each other word is, then turns those scores into weights that add up to 1. For example: "borrow 70% from 'mat', 20% from 'soft', 10% from everything else." Those weights change for every word, in every layer.
Query, Key, Value
This is the heart of attention. Every word becomes three vectors: a Query, a Key, and a Value. The word's Query is compared against every other word's Key, the matches retrieve a weighted mix of Values, and the result is that word's updated representation.
If “vectors” sounds scary: don’t let it. It’s just a long list of numbers. The model uses those numbers like coordinates to represent meaning. The details are mathy, but the behavior is simple: match what you need (Q) with what exists (K), then copy the useful info (V).
Here’s a beginner-friendly way to read the formula-heavy versions you might see online:
- Q and K are used to decide where to look
- V is what gets copied forward once you know where to look
The intuition:
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?
Think "search and copy." Query is the search you're running. Keys are what each word advertises. Values are what each word contributes once selected. The model compares Q against all K to get relevance weights, then uses those weights to mix the V vectors into a new representation.
How Attention Computes:
1. Create Q, K, V (each word becomes 3 vectors)
2. Compare Query to Keys (dot product gives relevance scores)
3. Softmax (turn scores into probabilities)
4. Weight the Values (multiply Values by probabilities)
5. Sum up (final output = weighted combination)
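Here are those five steps as a runnable NumPy sketch. Everything is a toy: the five "words" are random vectors and the projection matrices are random instead of learned, but the computation is the standard scaled dot-product attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8              # 5 words, toy dimensions
X = rng.normal(size=(seq_len, d_model))       # one embedding per word

# Step 1: create Q, K, V, three different views of the same words
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Step 2: compare every Query to every Key (dot products = relevance),
# scaled by sqrt(d_k) so scores stay well-behaved as dimensions grow
scores = Q @ K.T / np.sqrt(d_k)               # shape (5, 5)

# Step 3: softmax turns each row of scores into weights summing to 1
weights = softmax(scores)

# Steps 4-5: weight the Values and sum them up
output = weights @ V                          # one updated vector per word

print(weights.sum(axis=-1))                   # every row sums to 1.0
```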
Multi-Head Attention
One attention mechanism is good. Multiple running in parallel? Better.
The original transformer used 8 heads per layer; each head runs the full Q-K-V computation in parallel with its own learned projections, and the results are combined. Some heads learn grammar-ish connections (subject-verb agreement), some learn coreference ("it" = "trophy"), some learn semantic links. Multi-head attention is basically the model looking through a few different lenses at once, then combining what it found. Large models push this further: GPT-3, for example, uses 96 heads per layer.
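A sketch of how the multi-head trick works in practice: the model dimension is split across heads, each head attends independently in its smaller subspace, and the outputs are concatenated and mixed by one final projection. The weights here are random stand-ins, not any real model's.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads                 # each head gets a smaller subspace
    head_outputs = []
    for _ in range(n_heads):
        # Each head has its own projections: its own "lens" on the input
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(weights @ V)
    # Concatenate all heads, then mix them with a final projection
    concat = np.concatenate(head_outputs, axis=-1)   # back to (seq_len, d_model)
    W_o = rng.normal(size=(d_model, d_model))
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                    # 5 words, 16-dim embeddings
print(multi_head_attention(X, n_heads=8, rng=rng).shape)   # (5, 16)
```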
Positional Encoding
If we process all words in parallel, how does the model know word order?
Solution: add a unique "position signal" to each word before processing. The original paper used sine/cosine waves. Modern models learn positions during training.
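Here's the original sine/cosine scheme as a short sketch: each position gets a unique fingerprint built from waves of different frequencies, and that fingerprint is simply added to the word's embedding before the first layer.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positions from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]              # positions 0, 1, 2, ...
    i = np.arange(d_model // 2)[None, :]           # index of each dimension pair
    angles = pos / (10000 ** (2 * i / d_model))    # one frequency per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

# Usage: add the position signal to the word embeddings, then process.
# X = word_embeddings + positional_encoding(seq_len, d_model)
print(positional_encoding(seq_len=4, d_model=8).round(2))
```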
Common beginner mistakes
- Treating attention like a search engine. It’s not retrieving from the internet; it’s deciding what parts of the input influence each other.
- Thinking Q/K/V are “three different words.” They’re three different views of the same word (three vectors derived from the same token).
- Thinking “more heads” means “more intelligence.” Heads help because they let the model learn different relationships in parallel—not because each head is a mini-brain.
The Full Architecture
One Transformer Layer:
- Input Embeddings (words → vectors + position)
- Multi-Head Attention (each word attends to all others)
- Add & Normalize (residual connection)
- Feed Forward (process each position)
- Add & Normalize (another residual)
- Output (pass to next layer)
Stack this 12 to 96 times, depending on model size. Each layer refines the representation: early layers tend to capture syntax and surface patterns, later layers more abstract meaning.
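And here's the whole layer as one self-contained sketch (single-head attention for brevity, simplified layer norm, random weights standing in for learned ones). The skeleton matches the list above: attention, add & normalize, feed-forward, add & normalize, then on to the next layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: real models also learn a per-dimension scale and shift
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X, rng):
    # One head for brevity; see the multi-head sketch above
    d = X.shape[-1]
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def transformer_layer(X, rng):
    d = X.shape[-1]
    # Self-attention + residual connection, then normalize
    X = layer_norm(X + self_attention(X, rng))
    # Feed-forward at each position: expand, ReLU, project back down
    W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
    ffn = np.maximum(X @ W1, 0) @ W2
    # Another residual + normalize; the result feeds the next layer
    return layer_norm(X + ffn)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))       # embeddings + positional signal
for _ in range(4):                 # real models stack 12 to 96 of these
    X = transformer_layer(X, rng)
print(X.shape)                     # still (5, 16): same shape, refined meaning
```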
Model Sizes
Transformer models have grown enormously since the original 2017 paper. We cover the complete GPT evolution in Part 3.
Why This Worked
The biggest reason: transformers scale. Attention runs in parallel, so you can train on enormous amounts of data, and transformers follow scaling laws: more parameters + more data = predictably better results. This is why companies invest billions.
Key Takeaways
Attention lets each word find relevant context directly.
Q-K-V is a search mechanism - Query searches Keys, retrieves Values.
Multi-head gives multiple perspectives simultaneously.
Positional encoding preserves word order.
Stacking layers builds deeper understanding.
Quick check
In the Q-K-V story, what does the Query represent? (Answer: what the current word is looking for, the question it asks of every other word's Key.)
What's Next?
In Part 3, we trace the GPT journey - from GPT-1's 117M parameters to GPT-5's "PhD-level" intelligence.