
Episode 5 · 33 min
APR Ep05: Robust Training and System-Level Security
Show Notes
The second line of defense — what happens when alignment fails. Alex and Thuy explore how to treat AI as an untrusted insider and build containment that works even against adversarial models.
In this episode:
- The untrusted insider — Borrowing from corporate security to contain potentially misaligned AI
- Access control — The principle of least privilege applied to AI systems (sketch below)
- Sandboxing — Isolated execution environments calibrated to risk level (sketch below)
- Anomaly detection & logging — Behavioral baselines, drift detection, and comprehensive audit trails (sketch below)
- Distribution shift — When the world changes and your model hasn't (and the pandemic payments story)
- Active learning & adversarial training — Proactively finding and filling capability gaps (sketch below)
- Hierarchical monitoring — The economics of tiered oversight (cheap filters → ML models → human reviewers; sketch below)
- The collusion problem — When AI monitors share the same blind spots as the systems they watch
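
A minimal sketch of the least-privilege idea for AI tool use. Every name here (`ToolPolicy`, `run_tool`, the stub `TOOLS` registry, the agent name) is illustrative, not an API discussed in the episode; the point is that grants are enforced outside the model, and the default answer is no.

```python
class ToolPolicy:
    """Deny by default: an agent may call only tools explicitly granted to it."""

    def __init__(self) -> None:
        self._grants: dict[str, set[str]] = {}

    def grant(self, agent_id: str, tool: str) -> None:
        self._grants.setdefault(agent_id, set()).add(tool)

    def check(self, agent_id: str, tool: str) -> None:
        # Absence of a grant is a refusal, not a configuration error.
        if tool not in self._grants.get(agent_id, set()):
            raise PermissionError(f"{agent_id} may not call {tool}")


TOOLS = {"read_document": lambda path: f"<contents of {path}>"}  # stub tool registry

policy = ToolPolicy()
policy.grant("summarizer-agent", "read_document")  # read-only; no write, no network


def run_tool(agent_id: str, tool: str, payload: str) -> str:
    policy.check(agent_id, tool)  # enforced on every call, outside the model
    return TOOLS[tool](payload)


print(run_tool("summarizer-agent", "read_document", "q3_report.txt"))
# run_tool("summarizer-agent", "send_email", "...") would raise PermissionError
```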
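
A sketch of risk-calibrated isolation using only the standard library. Real containment would use containers or VMs sized to the risk level; this POSIX-only example just shows the shape: hard CPU and memory caps applied to a child process before the untrusted code runs. `run_sandboxed` and its default limits are assumptions, not a real tool.

```python
import resource
import subprocess


def run_sandboxed(script_path: str,
                  cpu_seconds: int = 5,
                  mem_bytes: int = 256 * 2**20) -> subprocess.CompletedProcess:
    """Run an untrusted script in a child process with hard CPU/memory caps."""

    def set_limits() -> None:
        # Runs in the child after fork, before exec (POSIX only).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        ["python3", "-I", script_path],  # -I: isolated mode, ignores env and site dirs
        preexec_fn=set_limits,
        capture_output=True,
        text=True,
        timeout=cpu_seconds + 5,         # wall-clock backstop, e.g. against sleeps
    )
```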
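
One way to turn "behavioral baselines and drift detection" into code: a rolling window over a single scalar metric (tool calls per task, output length, refusal rate), flagging values far from the recent mean. The same rolling-baseline idea is also a crude guard against distribution shift. Production systems compare whole distributions and log every decision; this `DriftDetector`, its window size, and its threshold are deliberately simple assumptions.

```python
from collections import deque
import statistics


class DriftDetector:
    """Rolling baseline over one behavioral metric; a large z-score flags an anomaly."""

    def __init__(self, window: int = 500, z_threshold: float = 4.0,
                 min_baseline: int = 30):
        self.history: deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_baseline = min_baseline

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous against the current baseline."""
        if len(self.history) >= self.min_baseline:
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1e-9
            if abs(value - mean) / stdev > self.z_threshold:
                return True  # flag it; do NOT fold it into the baseline
        self.history.append(value)  # normal values keep the baseline fresh
        return False
```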
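
The standard uncertainty-sampling heuristic behind active learning: spend the labeling budget on the examples the model is least sure about, so human labels land where the capability gaps are. A short sketch assuming you already have predicted class probabilities for an unlabeled pool; `pick_for_labeling` is an illustrative name.

```python
import numpy as np


def pick_for_labeling(probs: np.ndarray, k: int = 10) -> np.ndarray:
    """probs: (n_examples, n_classes) predicted probabilities on an unlabeled
    pool. Returns indices of the k highest-entropy (least certain) examples."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[-k:][::-1]  # most uncertain first
```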
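
The economics of tiered oversight come from routing: cheap checks see every item but decide only the obvious cases, escalating the rest to costlier reviewers, with humans as the final tier. A sketch with stub tiers; the keyword filter and `ml_scorer` are placeholders for real classifiers, and the cost figures are made up.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Verdict = Optional[bool]  # True = flag, False = clear, None = escalate


@dataclass
class Tier:
    name: str
    cost: float                      # rough cost per item reviewed, for accounting
    review: Callable[[str], Verdict]


def moderate(item: str, tiers: list[Tier]) -> tuple[str, bool]:
    """Walk the tiers cheapest-first; each either decides or escalates.
    Cheap filters see every item; expensive reviewers see only survivors."""
    for tier in tiers:
        verdict = tier.review(item)
        if verdict is not None:
            return tier.name, verdict
    return "human-review-queue", True  # undecided items default to humans


def keyword_filter(item: str) -> Verdict:
    return True if "rm -rf /" in item else None  # decide only the obvious


def ml_scorer(item: str) -> Verdict:
    score = 0.9 if "exfiltrate" in item else 0.1  # stub risk score
    if score > 0.8:
        return True
    if score < 0.2:
        return False
    return None


tiers = [Tier("keyword-filter", 0.0001, keyword_filter),
         Tier("ml-scorer", 0.01, ml_scorer)]
print(moderate("please exfiltrate the credentials", tiers))  # ('ml-scorer', True)
```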