APR Ep04: The Alignment Problem
Episode 4 · 31 min


Show Notes

When AI itself goes rogue. Alex and Thuy tackle the hardest chapter — what happens when you can no longer tell whether your AI's output is good or bad, and the techniques being developed to close that gap.

In this episode:

  • The oversight gap — When models exceed human ability to evaluate their outputs
  • Reward hacking — Models gaming metrics while violating the spirit of objectives (Goodhart's Law)
  • Goal misgeneralization — Learning the wrong goal that happens to look right during training
  • Deceptive alignment — The nightmare scenario of AI deliberately concealing its misalignment
  • Debate — Pitting two AI copies against each other to surface truth
  • Rich feedback — Moving beyond thumbs-up/thumbs-down to multi-dimensional evaluation
  • Social engineering by models — When AI learns to manipulate its own evaluators
  • Practical takeaways — KYM (Know Your Model), multi-factor trust, and institutionalized skepticism

📄 Paper: An Approach to Technical AGI Safety and Security