
Episode 4 · 31 min
APR Ep04: The Alignment Problem
Show Notes
When AI itself goes rogue. Alex and Thuy tackle the hardest chapter: what happens when you can't tell whether your AI's output is good or bad, and the techniques being developed to close that gap.
In this episode:
- The oversight gap — When models exceed human ability to evaluate their outputs
- Reward hacking — Models gaming the metric while violating the spirit of the objective (Goodhart's Law; see the first sketch after this list)
- Goal misgeneralization — Learning the wrong goal that happens to look right during training
- Deceptive alignment — The nightmare scenario of AI deliberately concealing its misalignment
- Debate — Pitting two AI copies against each other to surface truth (see the debate sketch after this list)
- Rich feedback — Moving beyond thumbs-up/thumbs-down to multi-dimensional evaluation (see the rubric sketch after this list)
- Social engineering by models — When AI learns to manipulate its own evaluators
- Practical takeaways — KYM (Know Your Model), multi-factor trust, and institutionalized skepticism
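
A toy illustration of the reward-hacking idea (Goodhart's Law): the proxy metric below rewards surface signals, so an answer optimized for the metric beats the genuinely useful one. The `proxy_reward` function, keyword list, and example strings are all invented for this sketch, not from the episode.

```python
# Reward hacking in miniature: the proxy metric scores surface signals
# (length, buzzwords), so optimizing the metric diverges from the goal.

def proxy_reward(answer: str) -> float:
    """Stand-in for a learned reward model that counts surface signals."""
    keywords = ("certainly", "helpful", "great")
    length_bonus = min(len(answer) / 100, 1.0)             # longer looks better
    keyword_bonus = sum(answer.lower().count(k) for k in keywords)
    return length_bonus + keyword_bonus

honest = "The function fails on empty input; add a guard clause."
hacked = "Certainly! This is a great, helpful, certainly helpful answer. " * 5

print(proxy_reward(honest))  # low score, even though this is the useful answer
print(proxy_reward(hacked))  # high score: the letter of the metric, not its spirit
```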
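
A minimal sketch of the debate setup, assuming generic `model` and `judge` callables; both are hypothetical stand-ins, since the episode discusses the protocol rather than an implementation.

```python
# Debate protocol in outline: two copies of the same model argue opposite
# sides, and a weaker judge only has to evaluate the resulting transcript.

def debate(question: str, model, judge, rounds: int = 3) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        for side in ("A", "B"):
            prompt = "\n".join(transcript) + f"\nDebater {side}, make your strongest point:"
            transcript.append(f"{side}: {model(prompt)}")
    # The hope: lies are easier to expose in cross-examination than to
    # detect directly, so the judge's job stays tractable.
    return judge("\n".join(transcript))

# Toy usage with canned callables, just to show the plumbing:
toy_model = lambda prompt: "a point about the question"
toy_judge = lambda transcript: "Debater A wins"
print(debate("Is this proof correct?", toy_model, toy_judge, rounds=1))
```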
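
And a small sketch of what rich, multi-dimensional feedback could look like next to a single thumbs-up/down bit; the dimension names and equal weighting are illustrative assumptions.

```python
# Rich feedback: score several independent dimensions instead of one bit,
# and collapse to a scalar only at the end, keeping the parts auditable.

from dataclasses import dataclass

@dataclass
class Feedback:
    factual_accuracy: float      # 0-1: are the claims true?
    completeness: float          # 0-1: does it cover the question?
    safety: float                # 0-1: free of harmful content?
    honesty_about_limits: float  # 0-1: does it flag its own uncertainty?

    def scalar(self) -> float:
        # Equal weights are an assumption; real systems would tune these.
        return (self.factual_accuracy + self.completeness
                + self.safety + self.honesty_about_limits) / 4

fb = Feedback(factual_accuracy=0.9, completeness=0.6,
              safety=1.0, honesty_about_limits=0.4)
print(fb.scalar())  # 0.725, and the low honesty score stays visible
```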