
Episode 11 · 27 min
Ep11 Anthropic Safety Recommendation: Robustness and Future Directions
Show Notes
This final episode on Anthropic's AI safety research agenda covers adversarial robustness, unlearning dangerous capabilities, and multi-agent governance, then synthesizes the key themes for building safe AI systems.
Topics Covered
- Adversarial attacks beyond jailbreaks (corpus poisoning, website hijacking)
- Realistic harm benchmarks and differential harm measurement
- Adaptive defenses and rapid response
- Unlearning dangerous information completely
- Multi-agent governance using game theory
- Key themes: measurement, defense in depth, scaling challenges
Original Paper
https://alignment.anthropic.com/2025/recommended-directions/
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)