
Episode 13 · 18 min
Attention Is All You Need Explained: Part 2 — Why Self-Attention Wins & Training Details
Show Notes
Continuing the Transformer deep dive, Alex and Thuy explore why self-attention outperforms RNNs and CNNs, examining computational complexity, parallelization, and the training setup that made this architecture practical.
In this episode:
- Computational complexity comparison — Self-attention vs recurrent vs convolutional layers
- Path lengths for long-range dependencies — Why attention connects everything in O(1) steps
- Interpretability advantages — What attention heads actually learn to do
- Training data — WMT 2014 datasets and byte-pair encoding
- Hardware setup — 8 P100 GPUs, 12-hour training for the base model
- Adam optimizer with warmup schedule — The learning rate recipe
- Regularization — Dropout and label smoothing techniques
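The warmup-then-decay learning rate recipe discussed in the episode can be sketched in a few lines. This follows the schedule given in the paper, with `d_model=512` and `warmup_steps=4000` being the base-model values the authors report; treat it as an illustrative sketch rather than a drop-in training config.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at training step `step` (step >= 1).

    Increases linearly for the first `warmup_steps` steps, then
    decays proportionally to the inverse square root of the step.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks exactly at `warmup_steps` and falls off afterward, which is why the schedule is often described as "warmup followed by inverse square-root decay."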
Paper
- Title: Attention Is All You Need
- Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (Google Brain / Google Research / University of Toronto)
- Published: June 2017 (NeurIPS 2017)
- Link: arxiv.org/abs/1706.03762
Series
This is Part 2 of a 3-part series on the original Transformer paper.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)