8 min read

The Genesis of Language AI: From Turing to Transformers (1950-2017)

Kishore Gunnam

Developer & Writer

How did we get from Turing's 1950 thought experiment to ChatGPT? Here's the story.

That's exactly what we're exploring in this 8-part series on Large Language Models.

If you're a beginner, here’s the promise: by the end of this series, you’ll be able to explain (in normal words) why chatbots can write, why they hallucinate, why context windows matter, and what changed in 2017 that kicked off the modern AI boom.


The Timeline

  • 1950 - Turing Test Proposed: Alan Turing asks 'Can machines think?'
  • 1966 - ELIZA Chatbot: MIT creates the first conversational program
  • 1990s - Statistical Revolution: Counting patterns beats hand-written rules
  • 2013 - Word2Vec: Words become numbers that capture meaning
  • 2017 - Transformers: The architecture behind every modern AI


1950: Can Machines Think?

It's 1950. Computers are the size of rooms. And Alan Turing asks:

""
Alan Turing, 1950

Turing proposed a test: if you're texting with someone and can't tell whether it's a human or a computer - does it matter if the machine is "really" thinking?

When ChatGPT launched in November 2022, many argued we'd finally passed it. 72 years later.


The 1960s: ELIZA

In 1966, Joseph Weizenbaum at MIT built ELIZA - a program that pretended to be a therapist:

Talking to ELIZA


Pattern matching: ELIZA finds keywords and responds with templates

ELIZA understood nothing. It just matched patterns: find "mother" → respond about family.

But people genuinely confided in it. We want to believe machines understand us.
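
To see how shallow the trick is, here's a minimal sketch of the keyword-and-template idea in Python. The patterns and canned responses are invented for illustration - the real ELIZA used a much richer pattern-matching script - but the spirit is the same: no understanding, just matching.

```python
import random
import re

# A few invented keyword -> template rules, in the spirit of ELIZA's script.
# (Illustrative only; the real ELIZA's pattern language was far richer.)
RULES = [
    (r"\bmother\b|\bfather\b|\bfamily\b", ["Tell me more about your family."]),
    (r"\bi am (.+)", ["How long have you been {0}?", "Why do you say you are {0}?"]),
    (r"\bi feel (.+)", ["Do you often feel {0}?"]),
]
FALLBACKS = ["Please go on.", "I see. Can you tell me more?"]

def respond(text: str) -> str:
    for pattern, templates in RULES:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            reply = random.choice(templates)
            # Echo back whatever fragment the pattern captured, if any.
            return reply.format(*match.groups()) if match.groups() else reply
    return random.choice(FALLBACKS)

print(respond("I am worried about my mother"))  # -> "Tell me more about your family."
```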


Why Rules Failed

Through the 1970s-80s, researchers tried encoding grammar rules. It didn't work:

| Intent | Rules expect | Humans say |
| --- | --- | --- |
| Going somewhere | "I am going to the store" | "gonna grab stuff" |
| Agreement | "That is not acceptable" | "nah that ain't it" |
| Approval | "The meal was delicious" | "bro that slapped" |

Language is infinite creativity. You can't write rules for every expression.

Here’s the real reason this matters: once you accept that rules won’t cover real language, you need systems that learn from examples. That one decision (learn from data) is the throughline that leads to Word2Vec, transformers, and today’s chatbots.


The 1990s: Just Count Things

The breakthrough: instead of rules, count patterns in real text.

After reading millions of sentences, patterns emerge:

  • After "I want to eat..." → "pizza" 30%, "lunch" 20%
  • After "the cat sat on the..." → "mat" 40%

It's pattern recognition. And it works.

Deep dive: why 'counting' beat rules (optional)

Rules fail because language is messy. People shorten words, break grammar, invent slang, and rely on context. Statistical methods don’t need perfect grammar—they just need enough examples to learn what usually follows what. That shift (from “encode language” → “learn from data”) is basically the root of everything that comes later.
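
Here's a tiny sketch of the counting idea, using an invented four-sentence corpus: count which word follows which, then turn the counts into probabilities. Real systems did this over millions of sentences and longer n-grams, but the mechanics are the same.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; real systems counted millions of sentences.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "i want to eat pizza",
    "i want to eat lunch",
]

# Count how often each word follows the previous one (a bigram model).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_probs(prev: str) -> dict:
    """Turn raw counts into 'what usually comes next' probabilities."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probs("eat"))  # {'pizza': 0.5, 'lunch': 0.5}
print(next_word_probs("the"))  # {'cat': 0.5, 'mat': 0.25, 'sofa': 0.25}
```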


2013: Words Become Numbers

In 2013, Tomas Mikolov and his colleagues at Google published Word2Vec: represent each word as a list of numbers (a vector). Words that appear in similar contexts end up with similar numbers.

The magical part? You can do math with words:

Word Vector Arithmetic

king - man + woman ≈ queen
Paris - France + Italy ≈ Rome

Nobody taught the model these relationships. It learned them from reading text.

Word2Vec doesn’t store a dictionary definition. It stores a vector that’s useful for predicting context. If two words appear in similar contexts (“king” and “queen”), the vectors end up close together. The “king - man + woman = queen” trick is a fun demo of the geometry, not the whole point.

The important beginner takeaway: this is the moment “words become numbers” stops being a metaphor. Once you can represent language as vectors, you can build bigger systems on top of those vectors.
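
Here's a toy sketch of that arithmetic, using hand-picked 3-dimensional vectors invented purely for illustration. Real Word2Vec embeddings have hundreds of dimensions and are learned from text, but the geometry works the same way: similar words sit close together, and some relationships show up as consistent directions.

```python
import numpy as np

# Hand-picked toy vectors, invented for illustration only.
# Real Word2Vec embeddings are ~100-300 dimensions, learned from text.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.7]),
    "pizza": np.array([0.1, 0.2, 0.1]),  # unrelated distractor word
}

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means 'pointing the same way'."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(target, exclude):
    """Return the stored word whose vector is most similar to `target`."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# king - man + woman lands closest to queen in this toy space.
result = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```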


2014-2016: The Memory Problem

Word2Vec handled single words. But language is about sequences.

"The bank was steep" - river bank.
"The bank was closed" - financial bank.

Context is everything.

Recurrent Neural Networks (RNNs) tried to solve this:

RNN Forward Pass

An RNN reads "The cat sat..." one word at a time, updating a hidden state h that carries context forward and is used to predict the next word ("on").

Problem: RNNs had terrible long-term memory. By word 50, word 1 is mostly forgotten. And because they read strictly one word after another, they're slow to train and can't take full advantage of GPUs.
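
For the curious, here's a minimal sketch of a vanilla RNN with random toy weights (no training, invented sizes) - just enough to show the one-word-at-a-time loop and the single hidden state that has to carry everything forward.

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim, hidden_dim = 4, 8  # tiny, illustrative sizes
W_xh = rng.normal(size=(hidden_dim, embed_dim)) * 0.1   # input -> hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden weights

# Random stand-in embeddings for the words in "the cat sat on".
sentence = ["the", "cat", "sat", "on"]
embeddings = {word: rng.normal(size=embed_dim) for word in sentence}

h = np.zeros(hidden_dim)  # the hidden state: the network's only "memory"
for word in sentence:
    x = embeddings[word]
    # Each step squeezes the current word plus all previous context into h.
    # Over long sequences, information from early words gets diluted away.
    h = np.tanh(W_xh @ x + W_hh @ h)

print(h)  # final hidden state after reading the sequence, strictly one word at a time
```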

So the world needed a model that could:

  • Keep long-range context without forgetting
  • Train fast on GPUs
  • Scale up with more data and compute

June 2017: Everything Changes

Eight researchers at Google published:

""
Vaswani et al., 2017

They introduced the Transformer. Every AI you use today - ChatGPT, Claude, Gemini - is built on this paper.


What Made Transformers Special

The key: self-attention. Instead of reading words sequentially, look at ALL words at once:

Attention: what does "it" look at?

Self-attention lets each word decide which other words are relevant to its meaning

In a sentence like "The cat sat on the mat because it was tired," when the model processes "it," it calculates which other words matter: "cat" is highly relevant; "The" is not.
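
Here's a minimal sketch of scaled dot-product self-attention with random toy embeddings and projection matrices (no training, no multiple heads - Part 2 covers the real thing). The key point: one matrix multiplication scores every word against every other word, with no sequential loop.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 5, 8                  # 5 toy "words", 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))  # stand-in word embeddings

# In a real transformer these projections are learned; random stand-ins here.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every word scores every other word in one shot - no word-by-word loop.
scores = Q @ K.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V  # each word becomes a weighted mix of all words

print(weights.shape)  # (5, 5): how much each word attends to every other word
print(output.shape)   # (5, 8): a context-aware representation for each word
```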

|  | RNNs | Transformers |
| --- | --- | --- |
| Processing | One word at a time | All words at once |
| Memory | Forgets after ~50 words | Direct access to the whole context window |
| Training | Days to weeks | Hours |

We dive deep into how transformers work in Part 2.


The Explosion

  • 2018 - GPT-1: 117M parameters. The first GPT.
  • 2019 - GPT-2: Released in stages over fears it was 'too dangerous to release'.
  • 2020 - GPT-3: 175B parameters. The world notices.
  • Nov 2022 - ChatGPT: 100M users in 2 months.
  • 2025 - GPT-5: Billed as 'PhD-level' intelligence.

The complete GPT history is covered in Part 3.


Key Takeaways

Rules don't scale. You can't hand-code human language.

Statistics work. Count patterns instead of defining rules.

Words can be numbers. Word2Vec showed that word meaning and relationships can be captured as geometry.

Attention changed everything. Transformers process all words at once.


Common beginner mistakes

  • Thinking this history is trivia: it explains why modern models behave the way they do.
  • Thinking “language understanding” is the same as fluent text: ELIZA was fluent-ish; it still didn’t understand.
  • Thinking the transformer was a small improvement: it was the scaling unlock.

What's Next?

In Part 2, we crack open the transformer. Self-attention, Query-Key-Value, the architecture that powers everything.