
Episode 9 · 21 min
Anthropic AI Safety Research Directions Explained: Part 1 — Measurement & Evaluation
Show Notes
Alex and Thuy dive into Anthropic's recommended directions for technical AI safety research, starting with the foundational challenges of measuring what AI systems can do (capabilities evaluation) and whether they're genuinely aligned with human goals (alignment evaluation).
In this episode:
- Why current benchmarks saturate too quickly — and what better evaluation looks like
- Security-critical evaluations — CBRN (chemical, biological, radiological, nuclear) risks and dangerous capabilities
- Mechanistic interpretability — Understanding what's happening inside the model
- Chain-of-thought faithfulness — Does the model's stated reasoning reflect how it actually reached its answer?
- How persona affects AI behavior — The surprising role of identity in model outputs
- The challenge of models "faking" alignment — Deceptive alignment detection
Paper
- Title: Recommended Directions for AI Safety Research
- Authors: Anthropic
- Published: 2025
- Link: alignment.anthropic.com/2025/recommended-directions
Series
This is Part 1 of a 3-part series on Anthropic's AI safety research agenda.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)