
Episode 4 · 31 min
DeepMind AGI Safety Paper Explained: Part 4 — The Alignment Problem
Show Notes
When AI itself goes rogue. Alex and Thuy tackle the hardest chapter — what happens when you can't tell if your AI's output is good or bad, and the techniques being developed to address it.
In this episode:
- The oversight gap — When models exceed human ability to evaluate their outputs
- Reward hacking — Models gaming metrics while violating the spirit of objectives (Goodhart's Law)
- Goal misgeneralization — Learning the wrong goal that happens to look right during training
- Deceptive alignment — The nightmare scenario of AI deliberately concealing its misalignment
- Debate — Pitting two AI copies against each other to surface truth
- Rich feedback — Moving beyond thumbs-up/thumbs-down to multi-dimensional evaluation
- Social engineering by models — When AI learns to manipulate its own evaluators
- Practical takeaways — KYM (Know Your Model), multi-factor trust, and institutionalized skepticism
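The reward hacking bullet above can be made concrete with a toy sketch. This is not from the paper; it is a minimal, hypothetical illustration of Goodhart's Law, where an optimizer maximizes a measurable proxy (here, answer length) that diverges from the true objective (correctness). All names are made up for the example.

```python
# Toy illustration of reward hacking (Goodhart's Law): the optimizer
# maximizes a proxy metric that diverges from the true objective.
# All functions and data here are hypothetical, for illustration only.

def true_quality(answer: str) -> int:
    """What we actually want: a correct, concise answer."""
    return 10 if answer == "42" else 0

def proxy_reward(answer: str) -> int:
    """What we can measure: evaluators tend to rate longer answers higher."""
    return len(answer)  # flawed proxy: rewards verbosity, not correctness

candidates = [
    "42",
    "probably 42",
    "a very long and confident-sounding essay",
]

# The optimizer picks whatever maximizes the proxy...
best = max(candidates, key=proxy_reward)
print(best)                # the long essay wins on the proxy
print(true_quality(best))  # ...but scores 0 on the true objective
```

The model "wins" by the metric while violating the spirit of the objective, which is exactly the gap the episode's oversight techniques (debate, rich feedback) try to close.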
Paper
- Title: An Approach to Technical AGI Safety and Security
- Authors: Rohin Shah et al. (30 authors), Google DeepMind
- Published: April 2025
- Link: arxiv.org/abs/2504.01849
Series
This is Part 4 of an 8-part series covering the full paper.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)