DeepMind AGI Safety Paper Explained: Part 5 — Robust Training & System-Level Security
Episode 5 · 33 min


Show Notes

The second line of defense — what happens when alignment fails. Alex and Thuy explore how to treat AI as an untrusted insider and build containment that works even against adversarial models.

In this episode:

  • The untrusted insider — Borrowing from corporate security to contain potentially misaligned AI
  • Access control — Principle of least privilege applied to AI systems
  • Sandboxing — Isolated environments calibrated to risk level
  • Anomaly detection & logging — Behavioral baselines, drift detection, and comprehensive audit trails
  • Distribution shift — When the world changes and your model hasn't (and the pandemic payments story)
  • Active learning & adversarial training — Proactively finding and filling capability gaps
  • Hierarchical monitoring — The economics of tiered oversight (cheap filters → ML models → human reviewers)
  • The collusion problem — When AI monitors share the same blind spots as the systems they watch

Paper

  • Title: An Approach to Technical AGI Safety and Security
  • Authors: Rohin Shah et al. (30 authors), Google DeepMind
  • Published: April 2025
  • Link: arxiv.org/abs/2504.01849

Series

This is Part 5 of an 8-part series covering the full paper.


Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)