DeepMind AGI Safety Paper Explained: Part 3 — Fighting Misuse
Episode 3 · 32 min

Show Notes

How do you stop people from weaponizing AI? Alex and Thuy dig into the paper's layered approach to misuse prevention — from threat modeling to capability evaluations to red teaming.

In this episode:

  • Threat modeling — Building specific adversary profiles, not vague fears
  • Dangerous capability evaluations — Testing what the model can do and setting thresholds
  • Safety post-training — Teaching models to refuse harmful requests and hardening against jailbreaks
  • Capability suppression — The aspiration of removing dangerous knowledge (and why it's hard)
  • Monitoring — Input/output scanning, activation analysis, and manual auditing
  • Access restrictions — KYC-style tiering for AI capabilities (a code sketch follows this list)
  • Weight security — Protecting the crown jewels with multi-person auth, environment hardening, and encrypted processing
  • Red teaming — White-box adversarial testing with loosened thresholds
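
The access-restriction idea above can be made concrete with a small gate. Below is a minimal Python sketch of KYC-style tiering; the tier names and capability buckets are illustrative assumptions, not taken from the paper.

    # Hypothetical KYC-style capability tiering (illustrative names only,
    # not from the paper).
    from enum import IntEnum

    class Tier(IntEnum):
        ANONYMOUS = 0  # no identity verification
        VERIFIED = 1   # basic identity check on the account
        VETTED = 2     # vetted organization under a usage agreement

    # Assumed capability buckets and the minimum tier each one requires.
    REQUIRED_TIER = {
        "general_chat": Tier.ANONYMOUS,
        "code_generation": Tier.VERIFIED,
        "wet_lab_protocols": Tier.VETTED,
    }

    def is_allowed(user_tier: Tier, capability: str) -> bool:
        """Serve a request only if the caller's tier unlocks the capability."""
        # Unknown capabilities default to the most restricted tier (default-deny).
        return user_tier >= REQUIRED_TIER.get(capability, Tier.VETTED)

    assert is_allowed(Tier.ANONYMOUS, "general_chat")
    assert not is_allowed(Tier.VERIFIED, "wet_lab_protocols")

The default-deny lookup is the key design choice: any capability that hasn't been classified is treated as requiring the highest tier, which keeps the gate fail-safe as new capabilities are added.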

Paper

  • Title: An Approach to Technical AGI Safety and Security
  • Authors: Rohin Shah et al. (30 authors), Google DeepMind
  • Published: April 2025
  • Link: arxiv.org/abs/2504.01849

Series

This is Part 3 of an 8-part series covering the full paper.


Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)
