
The Information Bottleneck

Ravid Shwartz-Ziv & Allen Roush

47 episodes

  • The Information Bottleneck

    Why AI Benchmarks Are Lying to You - with Wenhu Chen (Meta/University of Waterloo)

    13/06/2026 | 1 h 19 min
    In this episode, we sit down with Wenhu Chen, research scientist at Meta MSL, assistant professor at the University of Waterloo, and the person behind MMLU-Pro and MMMU. If you've read a frontier model release in the last two years, you've seen his benchmarks. That makes him one of the best people to answer the question everyone dances around: when a model jumps from 40% to 90% on your benchmark, how much of that is real? We dig into why benchmarks have become the loss function of the entire field: design a bad one, and thousands of brilliant researchers will spend months hill-climbing in the wrong direction. Wenhu is surprisingly candid about the limits of his own creations: contamination is everywhere, saturation turns frontier benchmarks into unit tests, and popular alternatives, such as LM Arena, mostly measure tone and length rather than capability. His answer is to evaluate models where they've never been: private codebases, hospital data, and the messy, live internet.
    We also talk about ClawBench, his new benchmark that deploys agents to over 140 real production websites to do things people actually want done, such as ordering food, booking tickets, and applying for jobs. The best model in the world completes about a third of these tasks. We unpack why: bot detection, models that refuse to click "pay," agents that give up the moment an environment doesn't match their training, and harnesses that can swing results by 20% without changing the model at all.
    Along the way, we cover the overlooked science of evaluating pre-training, data flywheels, and synthetic environments for agent training, and whether RL teaches models to reason or just surfaces what's already there. We close with Wenhu's predictions: exploration and adaptability will improve rapidly, but security will become the field's hardest problem as agents gain real permissions in the real world.

    Timestamps
    00:00 – Intro
    00:55 – What good evaluation means, and how it's changed since the early GPT days
    03:35 – Benchmarks as the field's loss function
    05:50 – Contamination: the problem nobody fully solves
    08:08 – MMLU-Pro scores: real progress or training on the test set?
    11:05 – Can you measure creativity?
    12:34 – Why human judges and arenas are unreliable — and what to use instead
    19:22 – What a good benchmark actually looks like
    22:34 – Chain of thought: signal or scratchpad?
    26:01 – Auto-research and hill-climbing agents
    28:52 – Harnesses: 20% swings without touching the model
    32:28 – Safety, model release, and an "FDA for models"
    36:53 – The overlooked science of pre-training evaluation
    43:49 – Designing pre-training benchmarks when one run costs a billion dollars
    49:45 – ClawBench: agents on 140+ live websites, and why the best model gets 33%
    54:42 – How MMLU-Pro and MMMU-Pro were born from public complaints
    59:16 – Pixel agents vs. APIs: will MCP kill computer use?
    1:02:11 – Training agents: data flywheels and synthetic environments
    1:05:43 – SFT vs. RL, and does RL teach reasoning or reveal it?
    1:09:21 – What gets solved next year — and what doesn't
    1:14:32 – Undervalued ideas, and what's next for ClawBench

    Music:
    "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.

    About: The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.
  • The Information Bottleneck

    Jürgen Schmidhuber - Part 2: JEPA, the Road to AGI, and Who Really Invented Modern AI

    07/06/2026 | 1 h 29 min
    In the second half of our conversation with Jürgen Schmidhuber, we focus on the key ideas he's pursued since the early 1990s and discuss why he believes these concepts are only now being rediscovered.

    We start with JEPA. Jürgen argues that the method LeCun named in 2022 is the same family he published in 1992 as Predictability Maximization. From there he traces the adversarial lineage back further still, to his 1990 world-model paper and 1991 Predictability Minimization - the curiosity-driven minimax games he sees as the real origins of GANs.
    We also talk about why these ideas took thirty years to land, why today's trillion-dollar data-center buildout is driven by AGI fear, and why he thinks Apple may come out ahead.

    The back half turns to what he sees as the real frontier: physical AI. Today's systems are superhuman behind the screen but helpless at a leaky pipe, and until a robot can use human tools, there's no AGI. He discusses self-replicating, self-improving machines as "a new kind of life," reframes continual learning and test-time training as ideas from his 1991 fast-weight work, and detours through Solomonoff's universal prior, Hutter's AIXI, and the Gödel machine.

    We close on the subject Jürgen is famous for: scientific credit. He makes his case for rigorous attribution, casts himself as a "speaker for the dead" championing forgotten pioneers like Ivakhnenko, and reflects candidly on whether the fights are personal.

    Timeline

    00:30 — What JEPA is, and the 1992 Predictability Maximization story
    04:54 — Implementing PMAX: autoencoders, Siamese networks, Infomax
    09:10 — Predictability Minimization, factorial codes, and the roots of GANs
    16:00 — Why it took 30 years: the economics of compute
    20:52 — Data, the web, and 1990 as the origin point
    23:09 — Hardware inflation, the trillion-dollar buildout, and the coming crash
    34:05 — Physical AI: the plumber problem and self-replicating machines
    41:14 — Which 90s ideas are being scaled right now
    45:26 — Continual learning and test-time training as "old hats"
    55:19 — Measuring intelligence: Solomonoff, AIXI, and the Gödel machine
    1:05:26 — Self-replication and von Neumann
    1:09:51 — Will he see AGI in his lifetime?
    1:10:42 — Credit, integrity, and being a "speaker for the dead"

    Music:
    "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    Changes: trimmed

  • The Information Bottleneck

    Jürgen Schmidhuber - World Models, RL, and the Year that changed AI (Part 1)

    04/06/2026 | 1 h 37 min
    In this episode, we host Jürgen Schmidhuber - the man, the legend, one of the godfathers of modern AI. His lab worked out many ideas behind today’s systems (LSTM, world models, artificial curiosity, Transformer variants, and even GAN-style setups) decades before they became fashionable, and he’s just as well known for making sure people remember who did what first. This is the first of two conversations with him.
    We go back to his lab in the early 90s and ask how one small group came up with so many of the ideas that are now being scaled to a thousand billion dollars, back when compute was ten million times more expensive. A lot of the episode comes down to one distinction he keeps making: prediction vs. decision-making. His take is that LLMs are very good prediction machines that imitate the web, but that’s only half the problem. To actually act in the world, you need a controller that uses a world model to plan. He talks about his 1990 work on world models and artificial curiosity, where the controller gets rewarded for running experiments that improve its own model (an adversarial setup years before GANs), why planning millisecond by millisecond doesn’t scale, and why you need sub-goals instead.
    We also talk about compression as the core of understanding, from falling apples to Kepler to Einstein, and why we still don’t have a robot that can do what a plumber does, even though the AI behind the screen keeps getting better. Then the conversation moves to credit assignment: how “to Schmidhuber” became a verb, what he thinks is broken about the award system, and a long exchange on PMAX vs. JEPA. He ends on the real origins of deep learning and a prediction about self-replicating machines in space.

    Timeline
    00:00  Intro
    00:55  1991 in Munich, and why that lab mattered
    02:38  "I'm not very smart," and why compute getting 10× cheaper every 5 years changed everything
    04:25  Chess as an AI proxy
    08:27  Artificial curiosity in the 90s vs. today's RL exploration
    09:10  Why RL is harder than supervised learning
    20:48  Coding agents vs. robots, and how a baby learns its own hands
    26:20  Compression as understanding
    33:40  What's actually missing on the road to AGI
    37:30  Why millisecond-by-millisecond planning is stupid
    47:44  Convergence to LLMs, GPUs, and how far we still are from the Bremermann limit
    51:49  Unsupervised learning, factorial codes, and predictability minimization
    58:12  Credit assignment: the fights with LeCun and the Nobel critique
    1:02:13  On his last name becoming a verb
    1:05:17  The award system's missing peer review
    1:07:03  Closed labs and the decline of open research
    1:13:23  Audience questions
    1:34:02  Closing: who really invented deep learning?

    Music:
    "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    Changes: trimmed

  • The Information Bottleneck

    AI for Science and the Thermodynamics of Generative AI - with Max Welling (UvA, CuspAI)

    29/05/2026 | 1 h 13 min
    In this episode, we sit with Max Welling, Professor of Machine Learning at the University of Amsterdam, co-founder and CTO of CuspAI, and a foundational figure behind variational autoencoders (VAEs), equivariant networks, and Bayesian deep learning. We talk about AI for science, the physics underneath generative models, and what's still missing on the road to real intelligence.
    Max starts with what impresses him and what worries him about the LLM era, then makes the case that the next leaps will come from physical AI and from science itself. We dig into how machine learning actually works in the lab, world models and whether priors like geometry and symmetry should be built in or simply learned, and whether transformers will still rule a decade from now. At the end, we talk about CuspAI's climate mission, AI risk and regulation, Max’s new book, and where neuroscience might inspire the next wave of ML.

    Timeline
    00:00 — Intro
    00:47 — Are we happy with the LLM era?
    03:14 — Embodiment and physical AI
    08:05 — Does "AGI" even matter as a term?
    11:34 — Verifiers, RL, and why math/coding are tractable
    13:17 — What actually shifted to make materials discovery work
    14:42 — From molecules to biology and wet labs
    16:26 — Working with real labs: timescales, friction, and the "Mira" agent
    20:29 — Balancing simulators vs. experiments: the exploration–exploitation trade-off
    23:44 — Active learning for experimental design
    24:23 — Why active learning hasn't been central to LLMs
    25:24 — A general loop for ML-for-science across domains
    27:10 — Foundation models for chemistry: a "mother ship" plus a zoo of fine-tuned models
    30:04 — Quantum mechanics, interpretation, and AI as a creative theorist
    31:54 — World models and Yann LeCun's view; priors vs. learning
    34:57 — Should world knowledge be explicit? (responding to Stefano Ermon)
    36:41 — Vision: equivariance vs. transformers, and the role of optimization
    40:32 — Best model for molecular properties in 10 years? Will transformers survive?
    43:16 — CuspAI's climate focus and what motivated it
    47:10 — One platform for every material class — what transfers and what doesn't
    48:42 — Where does the risk of human extinction really come from?
    51:06 — The "pause AI" debate and the arms-race reality
    52:40 — Regulating powerful models: government vs. self-regulation
    55:16 — Who should design AI regulation?
    56:29 — The new book
    1:00:31 — Compression, the information bottleneck, and renormalization
    1:03:30 — The role of foundational principles in modern AI
    1:04:06 — Waves in computing, the brain, and the next wave of innovation
    1:07:11 — Neuroscience and ML: are we in a better position now?
    1:09:17 — Conferences, the ICLR keynote, and finding the right people

    Music:
    "Kid Kodi" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    "Palms Down" - Blue Dot Sessions - via Free Music Archive - CC BY-NC 4.0.
    Changes: trimmed
About The Information Bottleneck
Two AI researchers, Ravid Shwartz-Ziv and Allen Roush, discuss the latest trends, news, and research within Generative AI, LLMs, GPUs, and Cloud Systems.