APR Ep05: Robust Training and System-Level Security
Episode 5 · 33 min


Show Notes

The second line of defense — what happens when alignment fails. Alex and Thuy explore how to treat AI as an untrusted insider and build containment that works even against adversarial models.

In this episode:

  • The untrusted insider — Borrowing from corporate security to contain potentially misaligned AI
  • Access control — Principle of least privilege applied to AI systems
  • Sandboxing — Isolated environments calibrated to risk level
  • Anomaly detection & logging — Behavioral baselines, drift detection, and comprehensive audit trails
  • Distribution shift — When the world changes and your model hasn't (and the pandemic payments story)
  • Active learning & adversarial training — Proactively finding and filling capability gaps
  • Hierarchical monitoring — The economics of tiered oversight (cheap filters → ML models → human reviewers)
  • The collusion problem — When AI monitors share the same blind spots as the systems they watch

📄 Paper: An Approach to Technical AGI Safety and Security