Anthropic AI Safety Research Directions Explained: Part 3 — Robustness & Future Directions


Show Notes

This final episode in our series on Anthropic's AI safety research agenda covers adversarial robustness, unlearning of dangerous capabilities, and multi-agent governance, then synthesizes the key themes for building safe AI systems.

In this episode:

  • Adversarial attacks beyond jailbreaks — Corpus poisoning, website hijacking, and novel attack vectors
  • Realistic harm benchmarks — Differential harm measurement and meaningful metrics
  • Adaptive defenses and rapid response — Building systems that evolve with threats
  • Unlearning dangerous information — Can models truly forget what they've learned?
  • Multi-agent governance — Using game theory for AI coordination
  • Key themes — Measurement, defense in depth, and scaling challenges

Series

This is the final episode (Part 3) in our series on Anthropic's AI safety research agenda.


Hosted by Alex (PM at a fintech scale-up) and Thuy (AI researcher)