Attention Is All You Need Explained: Part 2 — Why Self-Attention Wins & Training Details
Episode 13 · 18 min

Show Notes

Continuing the Transformer deep dive, Alex and Thuy explore why self-attention outperforms RNNs and CNNs, examining computational complexity, parallelization, and the training setup that made this architecture practical.

In this episode:

  • Computational complexity comparison — Self-attention vs recurrent vs convolutional layers
  • Path lengths for long-range dependencies — Why attention connects everything in O(1) steps
  • Interpretability advantages — What attention heads actually learn to do
  • Training data — WMT 2014 datasets and byte-pair encoding
  • Hardware setup — 8 P100 GPUs, 12-hour training for the base model
  • Adam optimizer with warmup schedule — The learning rate recipe
  • Regularization — Dropout and label smoothing techniques
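For the curious, the learning-rate recipe mentioned above boils down to a single formula from Section 5.3 of the paper: the rate grows linearly for the first `warmup_steps` updates, then decays proportionally to the inverse square root of the step number. A minimal sketch in Python (the function name and defaults are ours; the paper uses d_model = 512 and 4000 warmup steps for the base model):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from "Attention Is All You Need", Sec. 5.3:
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # guard against step 0 (division by zero)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks exactly at `warmup_steps`, which is why the warmup length and model width jointly set the maximum learning rate.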

Paper

  • Title: Attention Is All You Need
  • Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (Google Brain / Google Research / University of Toronto)
  • Published: June 2017 (NeurIPS 2017)
  • Link: arxiv.org/abs/1706.03762

Series

This is Part 2 of a 3-part series on the original Transformer paper.


Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)