
The Transformer Revolution: How Attention Changed Everything

Kishore Gunnam

Developer & Writer

In Part 1, we traced AI from Turing's 1950 thought experiment to the 2017 transformer paper. Now let's crack open the black box.

By the end, you'll understand the core mechanism behind virtually every modern large language model.

This single idea is why modern AI feels different from older NLP:

  • Better long-range coherence (it can connect “it” to the right noun 100 tokens back)
  • Speed (attention can be computed in parallel, which makes training at scale possible)
  • Transfer (once the architecture works, you can scale it with more data/compute and get surprising new skills)

You’ll also see words like embedding, vector, and softmax. You don’t need to memorize them. You just need one idea: the model turns words into numbers, then uses math to decide what should influence what.
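To make "words into numbers" concrete, here's a toy sketch in Python. The three-dimensional vectors and the tiny vocabulary are invented for illustration; real models learn embeddings with hundreds or thousands of dimensions:

```python
# Toy embedding table: each word maps to a list of numbers (a "vector").
# These values are made up for illustration; real models learn them.
embeddings = {
    "cat":  [0.8, 0.1, 0.9],
    "mat":  [0.2, 0.0, 0.1],
    "soft": [0.6, 0.3, 0.2],
}

# Once words are numbers, "how related are these?" becomes arithmetic,
# e.g. a dot product as a crude similarity score.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(embeddings["cat"], embeddings["soft"]))  # higher = more similar
print(dot(embeddings["mat"], embeddings["soft"]))
```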


What You'll Learn

  1. Self-Attention - how words find relevant context
  2. Query-Key-Value - the search mechanism inside transformers
  3. Multi-Head Attention - multiple perspectives in parallel
  4. Positional Encoding - how order is preserved
  5. Full Architecture - putting it all together


The Problem Transformers Solved

RNNs from Part 1 processed words one at a time:

  • Speed: sequential processing means GPUs can't be fully used
  • Memory: early words are forgotten, so long documents fail
  • Training: it takes weeks, which is expensive and slow

The core insight: instead of reading sequentially, look at ALL words at once.


Attention: The Big Idea

Every word asks one question:

""
The Transformer's Core Question

Attention: what does "it" look at?

Self-attention lets each word decide which other words are relevant to its meaning

When processing "it," the model calculates: "mat" is relevant (what is soft?). This happens for every word, all at once.

For the word you're currently processing, the model asks "what other words should influence me?" Then it mixes information from those words.

This lets the model resolve references (like "it") and keep long sentences coherent without relying on fragile memory.

[Interactive: Attention Explorer. Click a target word in "The cat sat on the mat because it was soft." to see how much it "attends" to every other word. Focusing on "it" puts the heaviest weight on "mat" (~75%) and "soft" (~45%), with only a few percent on function words like "the" and "on."]

A tiny attention example (what it’s doing under the hood)


Deep dive: what ‘softmax attention’ is doing (no math required, optional)

The model scores how relevant each other word is, then turns those scores into weights that add up to 1. For example: "borrow 70% from 'mat', 20% from 'soft', 10% from everything else." Those weights change for every word, in every layer.
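To make that scores-to-weights step concrete, here's a minimal Python sketch. The relevance scores are invented for illustration:

```python
import math

# Invented relevance scores for the word currently being processed.
scores = {"mat": 2.0, "soft": 0.8, "the": -1.0, "on": -1.2}

# Softmax: exponentiate each score, then normalize so the weights sum to 1.
exp_scores = {w: math.exp(s) for w, s in scores.items()}
total = sum(exp_scores.values())
weights = {w: e / total for w, e in exp_scores.items()}

for word, weight in weights.items():
    print(f"borrow {weight:.0%} from '{word}'")
# Roughly: ~72% from 'mat', ~22% from 'soft', the rest split among the others.
```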


Query, Key, Value

This is the heart of attention. Every word becomes three vectors:

  • Query (Q): What am I looking for?
  • Key (K): What do I contain?
  • Value (V): What information do I provide?

If “vectors” sounds scary: don’t let it. It’s just a long list of numbers. The model uses those numbers like coordinates to represent meaning. The details are mathy, but the behavior is simple: match what you need (Q) with what exists (K), then copy the useful info (V).

Here’s a beginner-friendly way to read the formula-heavy versions you might see online:

  • Q and K are used to decide where to look
  • V is what gets copied forward once you know where to look


Think "search and copy." Query is the search you're running. Keys are what each word advertises. Values are what each word contributes once selected. The model compares Q against all K to get relevance weights, then uses those weights to mix the V vectors into a new representation.

How Attention Computes (see the code sketch after these steps):

  1. Create Q, K, V (each word becomes 3 vectors)
  2. Compare Query to Keys (dot product gives relevance scores)
  3. Softmax (turn scores into probabilities)
  4. Weight the Values (multiply Values by probabilities)
  5. Sum up (final output = weighted combination)
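Here are those five steps as a runnable sketch in Python with NumPy. The sizes are toy-scale, and the matrices W_q, W_k, W_v and the inputs are random stand-ins, not real learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 5, 16                     # 5 words, 16 numbers per word
x = rng.normal(size=(seq_len, d_model))      # word vectors (random stand-ins)

# Step 1: create Q, K, V by multiplying each word by three learned matrices.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Step 2: compare every Query to every Key (dot products = relevance scores),
# scaled by the square root of the vector size, as in the original paper.
scores = Q @ K.T / np.sqrt(d_model)          # shape: (5, 5)

# Step 3: softmax turns each row of scores into weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Steps 4-5: weight the Values, then sum them into a new representation.
output = weights @ V                         # shape: (5, 16)

print(weights[0].round(2))  # how word 0 distributes its attention
print(output.shape)
```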

Multi-Head Attention

One attention mechanism is good. Multiple running in parallel? Better.

Multi-Head Attention (8 heads in the original paper):

Different heads learn different patterns: some track grammar (subject-verb agreement), some track coreference (what does "it" refer to?), some pick up semantic relationships. Multi-head attention is the model looking through several different lenses at once, then combining what each one found.

GPT-3 runs 96 attention heads in each of its layers, and each head learns its own patterns.
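Here's a sketch of the mechanics, again with toy sizes and random stand-in data. (Real models learn separate projection matrices per head plus a final output projection; this just shows the split-attend-concatenate pattern.)

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads               # each head works in a smaller space

Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Split the vectors into n_heads chunks, run attention per chunk ("lens"),
# then concatenate the results back into one vector per word.
heads = [
    attention(Q[:, i*d_head:(i+1)*d_head],
              K[:, i*d_head:(i+1)*d_head],
              V[:, i*d_head:(i+1)*d_head])
    for i in range(n_heads)
]
output = np.concatenate(heads, axis=-1)   # shape: (5, 16)
print(output.shape)
```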


Positional Encoding

If we process all words in parallel, how does the model know word order?

  • Without position: the model sees "cat the sat mat on" - a bag of words
  • With position: the model sees "The cat sat on the mat" - an ordered sequence

Solution: add a unique "position signal" to each word before processing. The original paper used sine/cosine waves. Modern models learn positions during training.
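Here's a minimal sketch of the original paper's sine/cosine scheme (toy sizes): each position gets a unique pattern of waves at different frequencies, and that pattern is added to the word's vector.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sine/cosine position signals from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]        # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]     # dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

# Each word's vector gets its position signal added before attention runs.
pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)   # (10, 16): one unique signal per position
```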


Common beginner mistakes

  • Treating attention like a search engine. It’s not retrieving from the internet; it’s deciding what parts of the input influence each other.
  • Thinking Q/K/V are “three different words.” They’re three different views of the same word (three vectors derived from the same token).
  • Thinking “more heads” means “more intelligence.” Heads help because they let the model learn different relationships in parallel—not because each head is a mini-brain.

The Full Architecture

One Transformer Layer:

  1. Input Embeddings (words → vectors + position)
  2. Multi-Head Attention (each word attends to all others)
  3. Add & Normalize (residual connection)
  4. Feed Forward (process each position)
  5. Add & Normalize (another residual)
  6. Output (pass to next layer)

Stack this layer 12 to 96 times, depending on the model. Each layer refines the representation: early layers tend to capture syntax, later layers tend to capture meaning. The sketch below wires the six steps together.
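This is a compact sketch of one layer in that order, using single-head attention and random stand-in weights. Real layers add multi-head projections, masking, dropout, and learnable normalization parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64

def layer_norm(x):
    """Normalize each position's vector (simplified: no learned scale/shift)."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention (step 2)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    s = Q @ K.T / np.sqrt(d_model)
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

def transformer_layer(x, params):
    # Steps 2-3: attention, then add (residual connection) & normalize.
    x = layer_norm(x + attention(x, *params["attn"]))
    # Steps 4-5: feed-forward on each position, then add & normalize again.
    W1, W2 = params["ffn"]
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)   # ReLU feed-forward
    return x                                          # step 6: pass it on

params = {
    "attn": [rng.normal(size=(d_model, d_model)) for _ in range(3)],
    "ffn": (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))),
}

x = rng.normal(size=(seq_len, d_model))  # step 1: embeddings + positions
for _ in range(4):                       # stack layers (toy: shared weights;
    x = transformer_layer(x, params)     # real models learn fresh ones per layer)
print(x.shape)
```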


Model Sizes

Transformer Growth

  • GPT-1 (2018): 117M parameters
  • GPT-2 (2019): 1.5B
  • GPT-3 (2020): 175B
  • GPT-4 (2023): ~1.7T

We cover the complete GPT evolution in Part 3.


Why This Worked

  • Processing: sequential (RNNs) vs. parallel (transformers)
  • Long-range: information fades vs. direct connections
  • Scaling: diminishing returns vs. bigger = better

That last point changed everything. Transformers follow scaling laws: more parameters + more data = better results. This is why companies invest billions.
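As a worked example, one published fit (Kaplan et al., 2020, "Scaling Laws for Neural Language Models") models loss as a power law in parameter count. The constants below are from that paper; the model sizes are the ones from the chart above, so treat the outputs as illustrative only:

```python
# Kaplan et al. (2020) fit for loss vs. non-embedding parameter count N:
#   L(N) ≈ (N_c / N) ** alpha_N, with N_c ≈ 8.8e13 and alpha_N ≈ 0.076.
N_C, ALPHA_N = 8.8e13, 0.076

def predicted_loss(n_params):
    return (N_C / n_params) ** ALPHA_N

for name, n in [("GPT-1", 117e6), ("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: predicted loss ≈ {predicted_loss(n):.2f}")
# Loss keeps dropping smoothly as N grows - no wall in sight,
# which is why "bigger = better" kept paying off.
```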


Key Takeaways

Attention lets each word find relevant context directly.

Q-K-V is a search mechanism - Query searches Keys, retrieves Values.

Multi-head gives multiple perspectives simultaneously.

Positional encoding preserves word order.

Stacking layers builds deeper understanding.


Quick Check

In the Q-K-V story, what does the Query represent?

Answer: what the current word is looking for - the search it runs against every other word's Key.

What's Next?

In Part 3, we trace the GPT journey - from GPT-1's 117M parameters to GPT-5's "PhD-level" intelligence.