
Episode 9 · 21 min
Ep09 Anthropic Safety Recommendations: Measurement and Evaluation
Show Notes
Alex and Thuy dive into Anthropic's recommendations for technical AI safety research, starting with the foundational challenges of measuring what AI systems can do (capabilities evaluation) and whether they're genuinely aligned with human goals (alignment evaluation).
Topics Covered
- Why current benchmarks saturate too quickly
- Security-critical evaluations for CBRN (chemical, biological, radiological, and nuclear) risks
- Mechanistic interpretability and chain-of-thought faithfulness
- How persona affects AI behavior
- The challenge of models "faking" alignment
Original Paper
https://alignment.anthropic.com/2025/recommended-directions/
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)