Attention Is All You Need Explained: Part 2 — Why Self-Attention Wins & Training Details
Episode 13 · 18 min

Show Notes

Continuing the Transformer deep dive, Alex and Thuy explore why self-attention outperforms RNNs and CNNs, examining computational complexity, parallelization, and the training setup that made this architecture practical.

In this episode:

  • Computational complexity comparison — Self-attention vs recurrent vs convolutional layers
  • Path lengths for long-range dependencies — Why attention connects everything in O(1) steps
  • Interpretability advantages — What attention heads actually learn to do
  • Training data — WMT 2014 datasets and byte-pair encoding
  • Hardware setup — 8 P100 GPUs, 12-hour training for the base model
  • Adam optimizer with warmup schedule — The learning rate recipe
  • Regularization — Dropout and label smoothing techniques
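For the curious, the learning-rate recipe mentioned above boils down to a single formula from Section 5.3 of the paper: the rate grows linearly for the first `warmup_steps` updates, then decays proportionally to the inverse square root of the step number. A minimal sketch in Python (the function name and defaults are ours; the paper uses d_model = 512 and 4000 warmup steps for the base model):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from "Attention Is All You Need", Sec. 5.3:
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    """
    step = max(step, 1)  # guard against step 0 (division by zero)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks exactly at `warmup_steps`, which is why the warmup length and model width jointly set the maximum learning rate.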

Paper

  • Title: Attention Is All You Need
  • Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (Google Brain / Google Research / University of Toronto)
  • Published: June 2017 (NeurIPS 2017)
  • Link: arxiv.org/abs/1706.03762

Series

This is Part 2 of a 3-part series on the original Transformer paper.


Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)