
Episode 3 · 32 min
DeepMind AGI Safety Paper Explained: Part 3 — Fighting Misuse
Show Notes
How do you stop people from weaponizing AI? Alex and Thuy dig into the paper's layered approach to misuse prevention, from threat modeling through capability evaluations to red teaming.
In this episode:
- Threat modeling — Building specific adversary profiles, not vague fears
- Dangerous capability evaluations — Testing what the model can actually do and setting capability thresholds
- Safety post-training — Teaching models to refuse harmful requests and hardening against jailbreaks
- Capability suppression — The aspiration of removing dangerous knowledge from models entirely (and why that's hard)
- Monitoring — Input/output scanning, activation analysis, and manual auditing
- Access restrictions — KYC-style tiering that gates powerful capabilities to vetted users
- Weight security — Protecting the crown jewels with multi-person auth, environment hardening, and encrypted processing
- Red teaming — White-box adversarial testing with loosened thresholds
Paper
- Title: An Approach to Technical AGI Safety and Security
- Authors: Rohin Shah et al. (30 authors), Google DeepMind
- Published: April 2025
- Link: arxiv.org/abs/2504.01849
Series
This is Part 3 of an 8-part series covering the full paper.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher).