ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation (AIware 2026 - Benchmark & Dataset Track)

Who

Yeheng Chen, Chaoxiang Xie, Yuling Shi, Wenhao Zeng, Yongpan Wang, Hongyu Zhang, Xiaodong Gu

Track

AIware 2026 Benchmark & Dataset Track

This program is tentative and subject to change.

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 7 Jul 2026 12:00 - 12:05 at MB 1.210 - Benchmarks, Datasets, and Evaluation of AIware

Abstract

LLMs have achieved strong results on both function-level code synthesis and repository-level code modification, yet a capability that falls between these two extremes—compositional code creation, i.e., building a complete, internally structured class from a specification—remains underserved. Current evaluations are either confined to isolated functions or rely on manually curated class-level tasks that are expensive to scale and increasingly susceptible to data contamination. We introduce ClassEval-Pro, a benchmark of 300 class-level tasks spanning 11 domains, constructed through an automated three-stage pipeline that combines complexity enhancement, cross-domain class composition, and integration of real-world GitHub code contributed after January 2025. Every task is validated by an LLM Judge Ensemble and must pass test suites with over 90% line coverage. We evaluate five frontier LLMs under five generation strategies. The best model achieves only 45.6% class-level Pass@1, with a 17.7-point gap between the strongest and weakest models, confirming the benchmark’s discriminative power. Strategy choice strongly interacts with model capability: structured approaches such as bottom-up improve weaker models by up to 9.4 percentage points, while compositional generation collapses to as low as 1.3%. Error analysis over 500 manually annotated failures reveals that logic errors (56.2%) and dependency errors (38.0%) dominate, identifying cross-method coordination as the core bottleneck.
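The abstract reports class-level Pass@1 but this page does not say how it is computed. As a rough illustration only, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and assumes that a generated class counts as correct only when every test in its suite passes. The function names and the input layout are hypothetical and are not taken from the ClassEval-Pro artifact.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: number of correct samples, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def class_level_pass_at_k(results: list[list[list[bool]]], k: int = 1) -> float:
    """Average pass@k over tasks.

    results[t][s] is the list of test outcomes for sample s of task t.
    Assumption: a sample is correct only if it has tests and all of them pass.
    """
    scores = []
    for samples in results:
        n = len(samples)
        c = sum(1 for tests in samples if tests and all(tests))
        scores.append(pass_at_k(n, c, k))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two hypothetical tasks, one sample each: the first passes every test,
    # the second fails one test, so only the first class counts as correct.
    demo = [[[True, True, True]], [[True, False, True]]]
    print(class_level_pass_at_k(demo, k=1))
```

Under this all-tests-must-pass assumption, a single failing method test fails the whole class, which fits the abstract's observation that cross-method coordination is the core bottleneck.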

Yeheng Chen

Shanghai Jiao Tong University

Chaoxiang Xie

Hohai University

China

Yuling Shi

Shanghai Jiao Tong University

China

Wenhao Zeng

Shanghai Jiao Tong University

China

Yongpan Wang

Shanghai Jiao Tong University

China

Hongyu Zhang

Chongqing University

China

Xiaodong Gu

Shanghai Jiao Tong University

China

Session Program

Tue 7 Jul
Displayed time zone: Eastern Time (US & Canada)

12:00 - 12:30	Benchmarks, Datasets, and Evaluation of AIware (Benchmark & Dataset Track / ArXiv Track / Main Track) at MB 1.210

12:00 5m Talk	ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation (Benchmark & Dataset Track). Yeheng Chen (Shanghai Jiao Tong University); Chaoxiang Xie (Hohai University); Yuling Shi (Shanghai Jiao Tong University); Wenhao Zeng (Shanghai Jiao Tong University); Yongpan Wang (Shanghai Jiao Tong University); Hongyu Zhang (Chongqing University); Xiaodong Gu (Shanghai Jiao Tong University)
12:05 5m Talk	SWE-Bench+: Enhanced LLM Coding Benchmark (Benchmark & Dataset Track). Haoran Xue (York University, Canada); Reem Aleithan (York University, Canada); Nafid Enan (York University, Canada); Gias Uddin (York University, Canada); Song Wang (York University)
12:10 5m Talk	Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges (Benchmark & Dataset Track). Ali Al-Kaswan (Delft University of Technology, Netherlands); Maksim Plotnikov (Delft University of Technology, Netherlands); Maxim Hájek (Delft University of Technology, Netherlands); Roland Vízner (Delft University of Technology, Netherlands); Arie van Deursen (TU Delft); Mali Izadi (Google & TU Delft)
12:15 5m Talk	A Dataset of Agentic AI Coding Tool Configurations (Benchmark & Dataset Track). Matthias Galster (University of Canterbury); Seyedmoein Mohsenimofidi (Heidelberg University); Levi Böhme (Universität Bayreuth, Germany); Jai Lal Lulla (Singapore Management University); Muhammad Auwal Abubakar (Otto-Friedrich Universität Bamberg, Germany); Christoph Treude (Singapore Management University); Sebastian Baltes (Heidelberg University)
12:20 5m Talk	AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub (Benchmark & Dataset Track). Daniel Ogenrwot (University of Nevada, Las Vegas); John Businge (University of Nevada, Las Vegas)
12:25 5m Talk	TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution (ArXiv Track). Jiale Amber Wang (University of Waterloo); Kaiyuan Wang (Google, Inc.); Pengyu Nie (University of Waterloo)