DeepMind AGI Safety Paper Explained: Part 4 — The Alignment Problem
Episode 4 · 31 min

Show Notes

When AI itself goes rogue. Alex and Thuy tackle the hardest chapter — what happens when you can't tell if your AI's output is good or bad, and the techniques being developed to address it.

In this episode:

  • The oversight gap — When models exceed human ability to evaluate their outputs
  • Reward hacking — Models gaming metrics while violating the spirit of objectives (Goodhart's Law; see the toy sketch after this list)
  • Goal misgeneralization — Learning the wrong goal that happens to look right during training
  • Deceptive alignment — The nightmare scenario of AI deliberately concealing its misalignment
  • Debate — Pitting two AI copies against each other to surface truth
  • Rich feedback — Moving beyond thumbs-up/thumbs-down to multi-dimensional evaluation
  • Social engineering by models — When AI learns to manipulate its own evaluators
  • Practical takeaways — KYM (Know Your Model), multi-factor trust, and institutionalized skepticism
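
To make the reward-hacking idea concrete, here is a minimal toy sketch in Python (ours, not from the paper; the word list and the "count the word 'helpful'" proxy are invented for illustration). A hill-climbing optimizer that only sees the proxy score drives that score to its maximum while the true quality the proxy was meant to track collapses.

import random

random.seed(0)

WORDS = ["helpful", "summary", "the", "model", "safety", "reward", "blah"]

def proxy_reward(text: str) -> int:
    """Proxy metric the optimizer actually sees: occurrences of 'helpful'."""
    return text.split().count("helpful")

def true_quality(text: str) -> float:
    """Stand-in for the real objective: lexical diversity of the output."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def optimize(metric, steps: int = 2000) -> str:
    """Hill-climb a 20-word string, keeping any change that doesn't lower `metric`."""
    best = " ".join(random.choices(WORDS, k=20))
    for _ in range(steps):
        tokens = best.split()
        tokens[random.randrange(len(tokens))] = random.choice(WORDS)
        candidate = " ".join(tokens)
        if metric(candidate) >= metric(best):
            best = candidate
    return best

hacked = optimize(proxy_reward)
print("proxy reward:", proxy_reward(hacked))   # high: the optimizer "won"
print("true quality:", true_quality(hacked))   # near zero: 'helpful' repeated 20 times

Running it prints a maxed-out proxy score next to a near-zero quality score: Goodhart's Law in miniature.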

Paper

  • Title: An Approach to Technical AGI Safety and Security
  • Authors: Rohin Shah et al. (30 authors), Google DeepMind
  • Published: April 2025
  • Link: arxiv.org/abs/2504.01849

Series

This is Part 4 of an 8-part series covering the full paper.


Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)