AIware 2026
Mon 6 - Tue 7 July 2026 Montreal, Canada
co-located with FSE 2026

This program is tentative and subject to change.

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
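The abstract reports class-level Pass@1 without defining it. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator and averages it over tasks. It assumes a generated class counts as correct only when it passes the task's whole test suite. The function names, task IDs, and sample outcomes are hypothetical and are not taken from ClassEval-Pro.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: samples that passed, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, list[bool]], k: int = 1) -> float:
    """Average pass@k over tasks.

    Assumption: each boolean says whether one generated class passed
    every test in that task's suite (class-level correctness).
    """
    scores = [pass_at_k(len(s), sum(s), k) for s in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: three tasks, four sampled classes each.
results = {
    "task_a": [True, False, False, False],
    "task_b": [False, False, False, False],
    "task_c": [True, True, False, True],
}

score = benchmark_pass_at_k(results, k=1)
print(f"class-level pass@1 = {score:.3f}")
```

With k = 1 the estimator reduces to the fraction of passing samples per task, so the benchmark score is the mean of those fractions.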

Tue 7 Jul

Displayed time zone: Eastern Time (US & Canada)

12:00 - 12:30
Benchmarks, Datasets, and Evaluation of AIware Benchmark & Dataset Track / ArXiv Track / Main Track at MB 1.210
12:00
5m
Talk
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
Benchmark & Dataset Track
Yeheng Chen Shanghai Jiao Tong University, Chaoxiang Xie Hohai University, Yuling Shi Shanghai Jiao Tong University, Wenhao Zeng Shanghai Jiao Tong University, Yongpan Wang Shanghai Jiaotong University, CN, Hongyu Zhang Chongqing University, Xiaodong Gu Shanghai Jiao Tong University
12:05
5m
Talk
SWE-Bench+: Enhanced LLM Coding Benchmark
Benchmark & Dataset Track
Haoran Xue York University, CA, Reem Aleithan York University, Canada, Nafid Enan York University, CA, Gias Uddin York University, Canada, Song Wang York University
12:10
5m
Talk
Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
Benchmark & Dataset Track
Ali Al-Kaswan Delft University of Technology, Netherlands, Maksim Plotnikov Delft University of Technology, NL, Maxim Hájek Delft University of Technology, NL, Roland Vízner Delft University of Technology, NL, Arie van Deursen TU Delft, Mali Izadi Google & TU Delft
12:15
5m
Talk
A Dataset of Agentic AI Coding Tool Configurations
Benchmark & Dataset Track
Matthias Galster University of Canterbury, Seyedmoein Mohsenimofidi Heidelberg University, Levi Böhme Universität Bayreuth, DE, Jai Lal Lulla Singapore Management University, Muhammad Auwal Abubakar Otto-Friedrich Universität Bamberg, DE, Christoph Treude Singapore Management University, Sebastian Baltes Heidelberg University
12:20
5m
Talk
AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
Benchmark & Dataset Track
Daniel Ogenrwot University of Nevada Las Vegas, John Businge University of Nevada, Las Vegas
12:25
5m
Talk
TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution
ArXiv Track
Jiale Amber Wang University of Waterloo, Kaiyuan Wang Google, Inc., Pengyu Nie University of Waterloo