AIware 2026
Mon 6 - Tue 7 July 2026 Montreal, Canada
co-located with FSE 2026

This program is tentative and subject to change.

Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model’s training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost.

Tue 7 Jul

Displayed time zone: Eastern Time (US & Canada)

12:00 - 12:30
Benchmarks, Datasets, and Evaluation of AIware (Benchmark & Dataset Track / ArXiv Track / Main Track) at MB 1.210
12:00
5m
Talk
SWE-Bench+: Enhanced LLM Coding Benchmark
Benchmark & Dataset Track
Haoran Xue York University, CA, Reem Aleithan York University, Canada, Nafid Enan York University, CA, Gias Uddin York University, Canada, Song Wang York University
12:05
5m
Talk
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
Benchmark & Dataset Track
Yeheng Chen Shanghai Jiao Tong University, Chaoxiang Xie Hohai University, Yuling Shi Shanghai Jiao Tong University, Wenhao Zeng Shanghai Jiao Tong University, Yongpan Wang Shanghai Jiaotong University, CN, Hongyu Zhang Chongqing University, Xiaodong Gu Shanghai Jiao Tong University
12:10
5m
Talk
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
Benchmark & Dataset Track
Ali Al-Kaswan Delft University of Technology, Netherlands, Maksim Plotnikov Delft University of Technology, NL, Maxim Hájek Delft University of Technology, NL, Roland Vízner Delft University of Technology, NL, Arie van Deursen TU Delft, Mali Izadi Google & TU Delft
12:15
5m
Talk
A Dataset of Agentic AI Coding Tool Configurations
Benchmark & Dataset Track
Matthias Galster University of Canterbury, Seyedmoein Mohsenimofidi Heidelberg University, Levi Böhme Universität Bayreuth, DE, Jai Lal Lulla Singapore Management University, Muhammad Auwal Abubakar Otto-Friedrich Universität Bamberg, DE, Christoph Treude Singapore Management University, Sebastian Baltes Heidelberg University
12:20
5m
Talk
AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
Benchmark & Dataset Track
Daniel Ogenrwot University of Nevada Las Vegas, John Businge University of Nevada, Las Vegas
12:25
5m
Talk
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
ArXiv Track
Jiale Amber Wang University of Waterloo, Kaiyuan Wang Google, Inc., Pengyu Nie University of Waterloo