DeepMind AGI Safety Paper Explained: Part 7 — Making the Safety Case
Episode 7 · 39 min

Show Notes

How do you decide whether an AGI system is safe enough to deploy? Alex and Thuy walk through the paper's four types of safety cases and the role of red teaming.

In this episode:

  • Inability cases — The model simply can't do the dangerous thing (analogous to PCI compliance scans)
  • Control cases — The system around the model catches harmful outputs (analogous to SOC 2 controls)
  • Incentive-based cases — Good training process implies good model (necessary but not sufficient)
  • Understanding-based cases — Direct internal inspection (the aspirational endgame, still far away)
  • Composite cases — Layered assurance combining multiple evidence types
  • Red teaming — White-box adversarial testing with loosened thresholds, building scientific evidence
  • The collusion vulnerability — When monitors share blind spots with the systems they oversee
  • Practical advice — How to build safety cases at any organizational scale

Paper

  • Title: An Approach to Technical AGI Safety and Security
  • Authors: Rohin Shah et al. (30 authors), Google DeepMind
  • Published: April 2025
  • Link: arxiv.org/abs/2504.01849

Series

This is Part 7 of an 8-part series covering the full paper.


Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)