
Episode 4 · 31 min
DeepMind AGI Safety Paper Explained: Part 4 — The Alignment Problem
Show Notes
When AI itself goes rogue. Alex and Thuy tackle the hardest chapter — what happens when you can't tell if your AI's output is good or bad, and the techniques being developed to address it.
In this episode:
- The oversight gap — When models exceed human ability to evaluate their outputs
- Reward hacking — Models gaming metrics while violating the spirit of objectives (Goodhart's Law)
- Goal misgeneralization — Learning the wrong goal that happens to look right during training
- Deceptive alignment — The nightmare scenario of AI deliberately concealing its misalignment
- Debate — Pitting two AI copies against each other to surface truth
- Rich feedback — Moving beyond thumbs-up/thumbs-down to multi-dimensional evaluation
- Social engineering by models — When AI learns to manipulate its own evaluators
- Practical takeaways — KYM (Know Your Model), multi-factor trust, and institutionalized skepticism
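The reward hacking bullet above can be made concrete with a toy sketch. This is not from the paper; it is a minimal, hypothetical illustration of Goodhart's Law, where an optimizer maximizes a measurable proxy (here, answer length) that diverges from the true objective (correctness). All names are made up for the example.

```python
# Toy illustration of reward hacking (Goodhart's Law): the optimizer
# maximizes a proxy metric that diverges from the true objective.
# All functions and data here are hypothetical, for illustration only.

def true_quality(answer: str) -> int:
    """What we actually want: a correct, concise answer."""
    return 10 if answer == "42" else 0

def proxy_reward(answer: str) -> int:
    """What we can measure: evaluators tend to rate longer answers higher."""
    return len(answer)  # flawed proxy: rewards verbosity, not correctness

candidates = [
    "42",
    "probably 42",
    "a very long and confident-sounding essay",
]

# The optimizer picks whatever maximizes the proxy...
best = max(candidates, key=proxy_reward)
print(best)                # the long essay wins on the proxy
print(true_quality(best))  # ...but scores 0 on the true objective
```

The model "wins" by the metric while violating the spirit of the objective, which is exactly the gap the episode's oversight techniques (debate, rich feedback) try to close.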
Paper
- Title: An Approach to Technical AGI Safety and Security
- Authors: Rohin Shah et al. (30 authors), Google DeepMind
- Published: April 2025
- Link: arxiv.org/abs/2504.01849
Series
This is Part 4 of an 8-part series covering the full paper.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)