What happens when the technology we built to evaluate AI becomes too limited to keep up with AI itself?
In this episode, we explore a fundamental shift in how we assess artificial intelligence. For years, we relied on large language models to judge other models—a paradigm known as LLM-as-a-Judge. But as AI systems tackle increasingly complex, multi-step tasks, this approach is breaking down. The solution? Turning judges into agents—autonomous systems that can plan, use tools, collaborate, and verify their assessments against real-world evidence.
We unpack what this means for AI development pipelines, from code generation to medical diagnosis, and why the future of AI evaluation may determine the future of AI itself.
Inspired by the work of Runyang You, Hongru Cai, Caiqi Zhang, Yongqi Li, Wenjie Li, and colleagues at Hong Kong Polytechnic University, Cambridge, and Huawei, this episode was created using Google's NotebookLM. Read the original paper here: https://arxiv.org/pdf/2601.05111