
Episode 11 · 27 min
Anthropic AI Safety Research Directions Explained: Part 3 — Robustness & Future Directions
Show Notes
This final episode on Anthropic's AI safety research agenda covers adversarial robustness, unlearning dangerous capabilities, and multi-agent governance, then synthesizes the key themes for building safe AI systems.
In this episode:
- Adversarial attacks beyond jailbreaks — Corpus poisoning, website hijacking, and novel attack vectors
- Realistic harm benchmarks — Differential harm measurement and meaningful metrics
- Adaptive defenses and rapid response — Building systems that evolve with threats
- Unlearning dangerous information — Can models truly forget what they've learned?
- Multi-agent governance — Using game theory for AI coordination
- Key themes — Measurement, defense in depth, and scaling challenges
Paper
- Title: Recommended Directions for AI Safety Research
- Authors: Anthropic
- Published: 2025
- Link: alignment.anthropic.com/2025/recommended-directions
Series
This is the final episode (Part 3) of our series on Anthropic's AI safety research agenda.
Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher).