TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
This program is tentative and subject to change.
Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model’s training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost.
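The cutoff-based restriction described in the abstract can be illustrated with a minimal sketch. It is not the benchmark's actual tooling: the `Task` record, its field names, the helper functions, and the sample data are all hypothetical, chosen only to show how filtering by change timestamp affects a success-rate metric.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not the benchmark schema."""
    task_id: str
    track: str          # "generation" or "update"
    changed_on: date    # timestamp of the test and code change
    passed: bool        # whether the agent's tests succeeded when executed


def tasks_after_cutoff(tasks: List[Task], cutoff: date) -> List[Task]:
    """Keep only tasks whose change strictly postdates the model's training cutoff."""
    return [t for t in tasks if t.changed_on > cutoff]


def success_rate(tasks: List[Task]) -> Optional[float]:
    """Fraction of tasks that passed, or None when there are no tasks."""
    if not tasks:
        return None
    return sum(t.passed for t in tasks) / len(tasks)


# Made-up sample data for illustration only.
tasks = [
    Task("a", "generation", date(2025, 1, 10), True),
    Task("b", "update", date(2025, 3, 2), True),
    Task("c", "generation", date(2025, 8, 15), False),
    Task("d", "update", date(2025, 9, 1), True),
]

cutoff = date(2025, 6, 1)  # assumed training cutoff
recent = tasks_after_cutoff(tasks, cutoff)

print("all tasks:", success_rate(tasks))
print("post-cutoff tasks:", success_rate(recent))
```

In this toy data the post-cutoff subset scores lower than the full set, which is the kind of gap the live-benchmark design is meant to expose.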
Tue 7 Jul. Displayed time zone: Eastern Time (US & Canada)
12:00 - 12:30 | Benchmarks, Datasets, and Evaluation of AIware | Benchmark & Dataset Track / ArXiv Track / Main Track | MB 1.210
12:00 | 5m | Talk | SWE-Bench+: Enhanced LLM Coding Benchmark | Benchmark & Dataset Track | Haoran Xue (York University, CA); Reem Aleithan (York University, Canada); Nafid Enan (York University, CA); Gias Uddin (York University, Canada); Song Wang (York University)
12:05 | 5m | Talk | ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation | Benchmark & Dataset Track | Yeheng Chen (Shanghai Jiao Tong University); Chaoxiang Xie (Hohai University); Yuling Shi (Shanghai Jiao Tong University); Wenhao Zeng (Shanghai Jiao Tong University); Yongpan Wang (Shanghai Jiaotong University, CN); Hongyu Zhang (Chongqing University); Xiaodong Gu (Shanghai Jiao Tong University)
12:10 | 5m | Talk | Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges | Benchmark & Dataset Track | Ali Al-Kaswan (Delft University of Technology, Netherlands); Maksim Plotnikov (Delft University of Technology, NL); Maxim Hájek (Delft University of Technology, NL); Roland Vízner (Delft University of Technology, NL); Arie van Deursen (TU Delft); Mali Izadi (Google & TU Delft)
12:15 | 5m | Talk | A Dataset of Agentic AI Coding Tool Configurations | Benchmark & Dataset Track | Matthias Galster (University of Canterbury); Seyedmoein Mohsenimofidi (Heidelberg University); Levi Böhme (Universität Bayreuth, DE); Jai Lal Lulla (Singapore Management University); Muhammad Auwal Abubakar (Otto-Friedrich Universität Bamberg, DE); Christoph Treude (Singapore Management University); Sebastian Baltes (Heidelberg University)
12:20 | 5m | Talk | AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub | Benchmark & Dataset Track
12:25 | 5m | Talk | TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution | ArXiv Track | Jiale Amber Wang (University of Waterloo); Kaiyuan Wang (Google, Inc.); Pengyu Nie (University of Waterloo)