How GPT works: 15-minute intuition
Andrej Karpathy was a founding member of OpenAI and led the computer-vision team for Tesla's Autopilot. Between leaving Tesla and rejoining OpenAI, he released this gem of a guide.
Key Takeaways
- Attention is a communication mechanism. Each token (a sub-word unit) "searches" the past for relevant information, broadcasting something like "I am a vowel at this position, looking for consonants in earlier positions."
- This search is data-dependent: what a token looks for is computed from the token's own content, not hard-wired into the architecture
- You can chop any data (images, audio, …) into chunks, mark each chunk's position, and feed it into the same architecture, and it works
- GPT is a decoder-only model. It has no memory or internal state; it only looks at a fixed context window of input tokens to generate outputs. GPT-4 Turbo accepts 128k input tokens and attends only to those.
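The "search the past" mechanism in the takeaways above can be sketched in a few lines of NumPy. This is a minimal single-head causal self-attention, with all shapes and weight matrices chosen arbitrarily for illustration: each token emits a query ("what am I looking for?") and a key ("what do I contain?"), their dot products give data-dependent affinities, and a causal mask keeps every token from looking at the future.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny sizes: T tokens, C embedding channels, head size H.
T, C, H = 4, 8, 16
x = rng.standard_normal((T, C))      # one embedding vector per position

# Projections producing queries, keys, and values from token content.
Wq = rng.standard_normal((C, H))
Wk = rng.standard_normal((C, H))
Wv = rng.standard_normal((C, H))
q, k, v = x @ Wq, x @ Wk, x @ Wv

# Affinity of every token pair, scaled by sqrt(head size).
scores = q @ k.T / np.sqrt(H)        # shape (T, T)

# Decoder (causal) mask: a token may attend only to itself and the past.
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)

# Softmax over each row: token i's weighted "search" over positions <= i.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ v                    # information gathered from the past
```

Note that nothing here is position-aware yet; in a real transformer, positional embeddings are added to `x` first, which is what lets chunked images or audio be fed through the same machinery.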
Why the transformer architecture matters
- It can “learn” during inference (in-context learning): given examples in the prompt, it adapts its behavior without any weight updates
- The architecture has not changed much in the six years since the 2017 paper “Attention Is All You Need”. Many have tried to improve on it, but it has proven remarkably resilient.
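The fixed-context, no-internal-state behavior described in the takeaways can be made concrete with a toy decoding loop. The model below is a stand-in that returns random logits (a real forward pass would go where `next_token_logits` is), and `vocab_size`/`block_size` are made-up numbers; the point is the crop `idx[-block_size:]`: at every step the model is handed at most the last `block_size` tokens and nothing else.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, block_size = 10, 8       # hypothetical tiny model

def next_token_logits(context):
    """Stand-in for a transformer forward pass; ignores its input
    and returns random logits purely for illustration."""
    return rng.standard_normal(vocab_size)

idx = [0]                            # generated sequence so far
for _ in range(20):
    # The model has no memory: its entire "state" is this cropped window.
    context = idx[-block_size:]
    logits = next_token_logits(context)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx.append(int(rng.choice(vocab_size, p=probs)))
```

Everything the model "remembers" must fit in that window, which is why context length (e.g. GPT-4 Turbo's 128k tokens) is such a headline number.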
