How GPT works: 15-minute intuition
Andrej Karpathy was a founding member of OpenAI and led the computer-vision team for Tesla's Autopilot. Between leaving Tesla and rejoining OpenAI, he released this gem of a guide.
Key Takeaways
- Attention is a communication mechanism. Each token (a sub-word unit) "searches" the past for relevant information, broadcasting something like "I am a vowel at this position, looking for consonants in earlier positions."
- This search is data-dependent: what a token looks for is computed from the token's own content, not hard-wired into the architecture
- You can chop any data (images, audio, …) into chunks, mark each chunk's position, and feed it into the same architecture, and it works
- GPT is a decoder-only model. It has no memory or internal state; it only looks at a fixed context window of input tokens to generate outputs. GPT-4 Turbo accepts 128k input tokens and attends only to those.
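The "search the past" mechanism in the takeaways above can be sketched in a few lines of NumPy. This is a minimal single-head causal self-attention, with all shapes and weight matrices chosen arbitrarily for illustration: each token emits a query ("what am I looking for?") and a key ("what do I contain?"), their dot products give data-dependent affinities, and a causal mask keeps every token from looking at the future.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny sizes: T tokens, C embedding channels, head size H.
T, C, H = 4, 8, 16
x = rng.standard_normal((T, C))      # one embedding vector per position

# Projections producing queries, keys, and values from token content.
Wq = rng.standard_normal((C, H))
Wk = rng.standard_normal((C, H))
Wv = rng.standard_normal((C, H))
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Affinity of every token pair, scaled by sqrt(head size).
scores = q @ k.T / np.sqrt(H)        # shape (T, T)

# Decoder (causal) mask: a token may attend only to itself and the past.
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Softmax over each row: token i's weighted "search" over positions <= i.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ v                    # information gathered from the past
```

Note that nothing here is position-aware yet; in a real transformer, positional embeddings are added to `x` first, which is what lets chunked images or audio be fed through the same machinery.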
Why the transformer architecture matters
- It can “learn” during inference (in-context learning): given examples in the prompt, it adapts its behavior without any weight updates
- The architecture has not changed much in the six years since the 2017 paper “Attention Is All You Need”. Many have tried to improve on it, but it has proven remarkably resilient.
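The fixed-context, no-internal-state behavior described in the takeaways can be made concrete with a toy decoding loop. The model below is a stand-in that returns random logits (a real forward pass would go where `next_token_logits` is), and `vocab_size`/`block_size` are made-up numbers; the point is the crop `idx[-block_size:]`: at every step the model is handed at most the last `block_size` tokens and nothing else.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, block_size = 10, 8       # hypothetical tiny model

def next_token_logits(context):
    """Stand-in for a transformer forward pass; ignores its input
    and returns random logits purely for illustration."""
    return rng.standard_normal(vocab_size)

idx = [0]                            # generated sequence so far
for _ in range(20):
    # The model has no memory: its entire "state" is this cropped window.
    context = idx[-block_size:]
    logits = next_token_logits(context)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx.append(int(rng.choice(vocab_size, p=probs)))
```

Everything the model "remembers" must fit in that window, which is why context length (e.g. GPT-4 Turbo's 128k tokens) is such a headline number.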
