What happens when AI agents are asked to build the spreadsheets finance teams actually use?
WorkstreamBench, a benchmark for end-to-end financial spreadsheet work, exposes the gap between impressive demos and professional deliverables. It tests complete multi-sheet workbooks, not single formulas or table questions.
The benchmark scores accuracy, formula quality, and formatting, because in finance a spreadsheet model must be auditable, readable, and easy to modify.
Claude Web leads with 69.1 out of 100, but even the best systems perform worse as tasks become more complex. Enterprise AI still has a spreadsheet reliability problem.
Inspired by the work of Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, and Hongseok Namkoong, this episode was created using Google's NotebookLM.
Read the original paper here:
https://arxiv.org/pdf/2605.22664