
Episode 11 · 27 min
Ep11 Anthropic Safety Recommendation: Robustness and Future Directions
Show Notes
This final episode on Anthropic's AI safety research agenda covers adversarial robustness, unlearning dangerous capabilities, and multi-agent governance, then synthesizes the key themes for building safe AI systems.
Topics Covered
- Adversarial attacks beyond jailbreaks (corpus poisoning, website hijacking)
- Realistic harm benchmarks and differential harm measurement
- Adaptive defenses and rapid response
- Unlearning dangerous information completely
- Multi-agent governance using game theory
- Key themes: measurement, defense in depth, scaling challenges
Original Paper
https://alignment.anthropic.com/2025/recommended-directions/
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)