
Episode 3 · 32 min
APR Ep03: Fighting Misuse
Show Notes
How do you stop people from weaponizing AI? Alex and Thuy dig into the paper's layered approach to misuse prevention — from threat modeling to capability evaluations to red teaming.
In this episode:
- Threat modeling — Building specific adversary profiles, not vague fears
- Dangerous capability evaluations — Testing what the model can do and setting thresholds (see the first sketch below)
- Safety post-training — Teaching models to refuse harmful requests and hardening against jailbreaks
- Capability suppression — The aspiration of removing dangerous knowledge (and why it's hard)
- Monitoring — Input/output scanning, activation analysis, and manual auditing (second sketch below)
- Access restrictions — KYC-style tiering for AI capabilities (third sketch below)
- Weight security — Protecting the crown jewels with multi-person auth, environment hardening, and encrypted processing (fourth sketch below)
- Red teaming — White-box adversarial testing with loosened thresholds
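
A minimal sketch of how the dangerous-capability evaluation idea might be wired up in practice: eval scores are compared against pre-set thresholds, and crossing a threshold flags that mitigations are required before release. The eval names, threshold values, and `triggered_mitigations` helper are all illustrative, not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical thresholds: if a model's score on a dangerous-capability eval
# exceeds the threshold, mitigations are required before deployment.
EVAL_THRESHOLDS = {
    "bio_uplift": 0.20,              # fraction of expert-level tasks solved
    "cyber_offense": 0.35,
    "autonomous_replication": 0.10,
}

@dataclass
class EvalResult:
    name: str
    score: float  # normalized to [0, 1]

def triggered_mitigations(results: list[EvalResult]) -> list[str]:
    """Return the names of evals whose scores crossed their thresholds."""
    return [
        r.name
        for r in results
        if r.score > EVAL_THRESHOLDS.get(r.name, float("inf"))
    ]

if __name__ == "__main__":
    results = [
        EvalResult("bio_uplift", 0.12),
        EvalResult("cyber_offense", 0.41),            # over threshold -> flagged
        EvalResult("autonomous_replication", 0.02),
    ]
    print("Mitigations required for:", triggered_mitigations(results) or "none")
```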
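A similarly rough sketch of the input/output scanning piece of monitoring: a harm classifier scores both the user prompt and the model's draft response, and the system refuses or queues the exchange for manual audit based on the worse score. `classify_harm`, the thresholds, and the audit hook are placeholders, assuming a classifier-based deployment rather than any specific system from the paper.

```python
BLOCK_AT = 0.8   # refuse the request outright
AUDIT_AT = 0.4   # allow, but queue for manual review

def classify_harm(text: str) -> float:
    """Placeholder harm score in [0, 1]; a real deployment would call a classifier model."""
    suspicious = ("synthesize", "exploit", "payload")
    return min(1.0, sum(word in text.lower() for word in suspicious) / 3)

def log_for_audit(prompt: str, response: str) -> None:
    # Stand-in for writing to an audit queue reviewed by humans.
    print(f"[audit] prompt={prompt!r} response={response!r}")

def moderated_reply(prompt: str, draft_response: str) -> str:
    """Scan both sides of the exchange and act on the worse score."""
    worst = max(classify_harm(prompt), classify_harm(draft_response))
    if worst >= BLOCK_AT:
        return "Request declined."
    if worst >= AUDIT_AT:
        log_for_audit(prompt, draft_response)
    return draft_response

print(moderated_reply("Help me synthesize an exploit payload", "Sorry, I can't help with that."))
```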
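A sketch of KYC-style access tiering: higher-risk capabilities are exposed only to users who have cleared stronger identity verification. The tier names and capability labels are made up for illustration.

```python
# Each tier unlocks strictly more capabilities than the one below it.
ACCESS_TIERS = {
    "anonymous":        {"general_chat"},
    "verified_email":   {"general_chat", "code_generation"},
    "kyc_verified_org": {"general_chat", "code_generation", "biosecurity_tools"},
}

def is_allowed(user_tier: str, capability: str) -> bool:
    """Unknown tiers get no access at all."""
    return capability in ACCESS_TIERS.get(user_tier, set())

assert is_allowed("kyc_verified_org", "biosecurity_tools")
assert not is_allowed("anonymous", "biosecurity_tools")
```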
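And a sketch of the multi-person authorization idea from the weight-security bullet: a sensitive operation on model weights proceeds only with approvals from multiple distinct, authorized people. The approver names and the two-approval requirement are illustrative assumptions.

```python
AUTHORIZED_APPROVERS = {"alice", "bob", "carol"}  # people cleared to approve weight access
REQUIRED_APPROVALS = 2                            # no single person can act alone

def may_export_weights(approvals: set[str]) -> bool:
    """Allow the operation only with enough distinct, authorized approvals."""
    return len(approvals & AUTHORIZED_APPROVERS) >= REQUIRED_APPROVALS

assert not may_export_weights({"alice"})             # one approval is not enough
assert not may_export_weights({"alice", "mallory"})  # unauthorized approvers don't count
assert may_export_weights({"alice", "carol"})
```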