APR Ep03: Fighting Misuse
Episode 3 · 32 min


Show Notes

How do you stop people from weaponizing AI? Alex and Thuy dig into the paper's layered approach to misuse prevention — from threat modeling to capability evaluations to red teaming.

In this episode:

  • Threat modeling — Building specific adversary profiles, not vague fears
  • Dangerous capability evaluations — Testing what the model can do and setting thresholds (first sketch after this list)
  • Safety post-training — Teaching models to refuse harmful requests and hardening them against jailbreaks
  • Capability suppression — The aspiration of removing dangerous knowledge (and why it's hard)
  • Monitoring — Input/output scanning, activation analysis, and manual auditing (second sketch after this list)
  • Access restrictions — KYC-style tiering for AI capabilities (third sketch after this list)
  • Weight security — Protecting the crown jewels with multi-person auth, environment hardening, and encrypted processing
  • Red teaming — White-box adversarial testing with loosened thresholds
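
To make the evaluation-threshold idea concrete, here is a minimal sketch of a dangerous-capability eval harness. The eval names, scores, and trigger levels are invented for illustration; the paper describes the approach, not a specific API.

```python
# Hypothetical dangerous-capability eval harness: each eval yields a
# score in [0, 1], and crossing a preset threshold triggers escalation.
# Eval names and threshold values are invented for illustration.
EVAL_THRESHOLDS = {
    "cyber_uplift": 0.20,
    "bio_uplift": 0.10,
    "autonomous_replication": 0.05,
}

def run_eval(model, eval_name: str) -> float:
    """Stub: a real harness would execute a battery of tasks against
    the model; here we read a precomputed score off the model object."""
    return model.scores[eval_name]

def check_capabilities(model) -> list[str]:
    """Return the names of evals whose threshold the model crossed."""
    return [
        name for name, threshold in EVAL_THRESHOLDS.items()
        if run_eval(model, name) >= threshold
    ]

class StubModel:
    scores = {"cyber_uplift": 0.25, "bio_uplift": 0.02,
              "autonomous_replication": 0.0}

# A crossed threshold would trigger the heavier mitigations below.
print(check_capabilities(StubModel()))  # ['cyber_uplift']
```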
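The monitoring bullet is similarly mechanical: scan what goes into the model and what comes out, block the worst, and queue borderline cases for manual audit. The thresholds and keyword "classifier" below are stand-ins so the sketch runs; a real deployment would use a trained harm classifier (and possibly activation probes).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

BLOCK_THRESHOLD = 0.9   # hypothetical: refuse outright above this score
AUDIT_THRESHOLD = 0.5   # hypothetical: queue for manual review above this

audit_queue: list[dict] = []

@dataclass
class ScanResult:
    score: float
    blocked: bool

def harm_score(text: str) -> float:
    """Placeholder classifier. A real system would call a trained harm
    model here; this keyword stub only keeps the sketch self-contained."""
    risky_terms = ("synthesize", "exploit", "bypass")
    hits = sum(term in text.lower() for term in risky_terms)
    return hits / len(risky_terms)

def scan(text: str, direction: str) -> ScanResult:
    score = harm_score(text)
    if score >= AUDIT_THRESHOLD:
        # Borderline traffic goes to the manual-auditing queue.
        audit_queue.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "direction": direction,
            "score": score,
            "text": text,
        })
    return ScanResult(score, score >= BLOCK_THRESHOLD)

def guarded_generate(model, prompt: str) -> str:
    # Scan the input before it reaches the model...
    if scan(prompt, "input").blocked:
        return "Request declined."
    response = model(prompt)
    # ...and the output before it reaches the user.
    if scan(response, "output").blocked:
        return "Response withheld."
    return response

print(guarded_generate(lambda p: "Sure, here you go.", "How do clouds form?"))
```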
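Finally, KYC-style access restriction reduces to a lookup: each capability carries a minimum verification tier, and unknown requests fail closed. The tier names and capability table are invented here; the paper describes the general idea, not a schema.

```python
from enum import IntEnum

class Tier(IntEnum):
    # Hypothetical verification levels, ordered by increasing trust.
    ANONYMOUS = 0       # no identity check
    VERIFIED = 1        # e.g. payment or ID verification
    VETTED_PARTNER = 2  # contractual, audited relationship

# Hypothetical capability -> minimum tier table.
REQUIRED_TIER = {
    "general_chat": Tier.ANONYMOUS,
    "code_generation": Tier.VERIFIED,
    "advanced_bio_tools": Tier.VETTED_PARTNER,
}

def authorize(user_tier: Tier, capability: str) -> bool:
    """Gate capabilities on verification tier; unknown ones fail closed."""
    required = REQUIRED_TIER.get(capability)
    return required is not None and user_tier >= required

assert authorize(Tier.VERIFIED, "general_chat")
assert not authorize(Tier.ANONYMOUS, "advanced_bio_tools")
```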

📄 Paper: An Approach to Technical AGI Safety and Security