
Episode 5 · 33 min
DeepMind AGI Safety Paper Explained: Part 5 — Robust Training & System-Level Security
Show Notes
The second line of defense — what happens when alignment fails. Alex and Thuy explore how to treat AI as an untrusted insider and build containment that works even against adversarial models.
In this episode:
- The untrusted insider — Borrowing from corporate security to contain potentially misaligned AI
- Access control — Principle of least privilege applied to AI systems
- Sandboxing — Isolated environments calibrated to risk level
- Anomaly detection & logging — Behavioral baselines, drift detection, and comprehensive audit trails (see the drift sketch after this list)
- Distribution shift — When the world changes and your model hasn't (and the pandemic payments story)
- Active learning & adversarial training — Proactively finding and filling capability gaps
- Hierarchical monitoring — The economics of tiered oversight (cheap filters → ML models → human reviewers; see the routing sketch after this list)
- The collusion problem — When AI monitors share the same blind spots as the systems they watch
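For listeners who want a concrete picture of the drift-detection idea from the anomaly detection segment, here is a minimal sketch: log a behavioral baseline during evaluation, then compare recent deployment behavior against it and alert when the divergence crosses a threshold. The tracked feature, the symmetric-KL statistic, and the 0.1 threshold are illustrative assumptions, not details from the paper.

```python
# Illustrative drift detection against a behavioral baseline, assuming we track a
# single scalar feature of model behavior (e.g. tool-call rate per task).
# The statistic (symmetric KL over binned histograms) and threshold are hypothetical.
import numpy as np

def binned_dist(samples: np.ndarray, bins: np.ndarray) -> np.ndarray:
    counts, _ = np.histogram(samples, bins=bins)
    probs = (counts + 1e-6) / (counts.sum() + 1e-6 * len(counts))  # smooth zero bins
    return probs

def drift_score(baseline: np.ndarray, recent: np.ndarray) -> float:
    bins = np.histogram_bin_edges(np.concatenate([baseline, recent]), bins=20)
    p, q = binned_dist(baseline, bins), binned_dist(recent, bins)
    return float(0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))))

rng = np.random.default_rng(0)
baseline = rng.normal(1.0, 0.2, 5_000)  # behavior logged during evaluation
recent = rng.normal(1.4, 0.2, 1_000)    # behavior in deployment after the world changed
if drift_score(baseline, recent) > 0.1:
    print("drift alert: recent behavior diverges from baseline, route to review")
```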
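The hierarchical monitoring segment describes a tiered pipeline: a cheap filter handles most traffic, a costlier ML monitor screens what remains, and only the riskiest residue reaches a human reviewer. The sketch below is a hypothetical illustration of that routing logic; the function names, thresholds, and risk scores are assumptions, not code from the paper.

```python
# Illustrative sketch of hierarchical (tiered) monitoring, assuming three layers:
# a near-free rule filter, a mid-cost ML classifier, and expensive human review.
from dataclasses import dataclass

@dataclass
class Action:
    text: str          # the model output or tool call being screened
    risk_hint: float   # cheap heuristic risk score in [0, 1]

def rule_filter(action: Action) -> bool:
    """Tier 1: keyword/heuristic check that passes most benign actions."""
    return action.risk_hint < 0.2

def ml_monitor_score(action: Action) -> float:
    """Tier 2: a trained classifier (stubbed here) scoring residual risk."""
    return min(1.0, action.risk_hint * 1.5)  # placeholder for a real model

def route(action: Action) -> str:
    if rule_filter(action):
        return "allow"                  # cheap path handles the bulk of traffic
    if ml_monitor_score(action) < 0.7:
        return "allow-with-logging"     # medium risk: keep an audit trail
    return "escalate-to-human"          # only the rare, high-risk tail is expensive

print(route(Action("read public docs", risk_hint=0.05)))
print(route(Action("modify payment config", risk_hint=0.9)))
```

The economics come from the ordering: each tier is more accurate but more expensive than the one before it, so the system stays affordable only if the cheap tiers filter out the overwhelming majority of actions.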
Paper
- Title: An Approach to Technical AGI Safety and Security
- Authors: Rohin Shah et al. (30 authors), Google DeepMind
- Published: April 2025
- Link: arxiv.org/abs/2504.01849
Series
This is Part 5 of an 8-part series covering the full paper.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)