
Episode 5 · 33 min
DeepMind AGI Safety Paper Explained: Part 5 — Robust Training & System-Level Security
Show Notes
The second line of defense — what happens when alignment fails. Alex and Thuy explore how to treat AI as an untrusted insider and build containment that works even against adversarial models.
In this episode:
- The untrusted insider — Borrowing from corporate security to contain potentially misaligned AI
- Access control — Principle of least privilege applied to AI systems
- Sandboxing — Isolated environments calibrated to risk level
- Anomaly detection & logging — Behavioral baselines, drift detection, and comprehensive audit trails (see the drift sketch after this list)
- Distribution shift — When the world changes and your model hasn't (and the pandemic payments story)
- Active learning & adversarial training — Proactively finding and filling capability gaps
- Hierarchical monitoring — The economics of tiered oversight (cheap filters → ML models → human reviewers; see the routing sketch after this list)
- The collusion problem — When AI monitors share the same blind spots as the systems they watch
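For listeners who want a concrete picture of the drift-detection idea from the anomaly detection segment, here is a minimal sketch: log a behavioral baseline during evaluation, then compare recent deployment behavior against it and alert when the divergence crosses a threshold. The tracked feature, the symmetric-KL statistic, and the 0.1 threshold are illustrative assumptions, not details from the paper.

```python
# Illustrative drift detection against a behavioral baseline, assuming we track a
# single scalar feature of model behavior (e.g. tool-call rate per task).
# The statistic (symmetric KL over binned histograms) and threshold are hypothetical.
import numpy as np

def binned_dist(samples: np.ndarray, bins: np.ndarray) -> np.ndarray:
    counts, _ = np.histogram(samples, bins=bins)
    probs = (counts + 1e-6) / (counts.sum() + 1e-6 * len(counts))  # smooth zero bins
    return probs

def drift_score(baseline: np.ndarray, recent: np.ndarray) -> float:
    bins = np.histogram_bin_edges(np.concatenate([baseline, recent]), bins=20)
    p, q = binned_dist(baseline, bins), binned_dist(recent, bins)
    return float(0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))))

rng = np.random.default_rng(0)
baseline = rng.normal(1.0, 0.2, 5_000)  # behavior logged during evaluation
recent = rng.normal(1.4, 0.2, 1_000)    # behavior in deployment after the world changed
if drift_score(baseline, recent) > 0.1:
    print("drift alert: recent behavior diverges from baseline, route to review")
```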
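The hierarchical monitoring segment describes a tiered pipeline: a cheap filter handles most traffic, a costlier ML monitor screens what remains, and only the riskiest residue reaches a human reviewer. The sketch below is a hypothetical illustration of that routing logic; the function names, thresholds, and risk scores are assumptions, not code from the paper.

```python
# Illustrative sketch of hierarchical (tiered) monitoring, assuming three layers:
# a near-free rule filter, a mid-cost ML classifier, and expensive human review.
from dataclasses import dataclass

@dataclass
class Action:
    text: str          # the model output or tool call being screened
    risk_hint: float   # cheap heuristic risk score in [0, 1]

def rule_filter(action: Action) -> bool:
    """Tier 1: keyword/heuristic check that passes most benign actions."""
    return action.risk_hint < 0.2

def ml_monitor_score(action: Action) -> float:
    """Tier 2: a trained classifier (stubbed here) scoring residual risk."""
    return min(1.0, action.risk_hint * 1.5)  # placeholder for a real model

def route(action: Action) -> str:
    if rule_filter(action):
        return "allow"                  # cheap path handles the bulk of traffic
    if ml_monitor_score(action) < 0.7:
        return "allow-with-logging"     # medium risk: keep an audit trail
    return "escalate-to-human"          # only the rare, high-risk tail is expensive

print(route(Action("read public docs", risk_hint=0.05)))
print(route(Action("modify payment config", risk_hint=0.9)))
```

The economics come from the ordering: each tier is more accurate but more expensive than the one before it, so the system stays affordable only if the cheap tiers filter out the overwhelming majority of actions.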
Paper
- Title: An Approach to Technical AGI Safety and Security
- Authors: Rohin Shah et al. (30 authors), Google DeepMind
- Published: April 2025
- Link: arxiv.org/abs/2504.01849
Series
This is Part 5 of an 8-part series covering the full paper.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)