
Episode 3 · 32 min
DeepMind AGI Safety Paper Explained: Part 3 — Fighting Misuse
Show Notes
How do you stop people from weaponizing AI? Alex and Thuy dig into the paper's layered approach to misuse prevention, from threat modeling through capability evaluations to red teaming.
In this episode:
- Threat modeling — Building specific adversary profiles, not vague fears
- Dangerous capability evaluations — Testing what the model can actually do and setting capability thresholds
- Safety post-training — Teaching models to refuse harmful requests and hardening against jailbreaks
- Capability suppression — The aspiration of removing dangerous knowledge from models entirely (and why that's hard)
- Monitoring — Input/output scanning, activation analysis, and manual auditing
- Access restrictions — KYC-style tiering that gates powerful capabilities to vetted users
- Weight security — Protecting the crown jewels with multi-person auth, environment hardening, and encrypted processing
- Red teaming — White-box adversarial testing with loosened thresholds
Paper
- Title: An Approach to Technical AGI Safety and Security
- Authors: Rohin Shah et al. (30 authors), Google DeepMind
- Published: April 2025
- Link: arxiv.org/abs/2504.01849
Series
This is Part 3 of an 8-part series covering the full paper.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher).