AIware 2026
Mon 6 - Tue 7 July 2026 Montreal, Canada
co-located with FSE 2026

This program is tentative and subject to change.

Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, it remains unclear which trace characteristics are informative and how good the reasoning chains themselves are. In this paper, we present an empirical study of the reasoning processes of thinking LLMs and the quality of their reasoning on code generation tasks. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) on 100 BigCodeBench code generation tasks (600 model–task instances; 3,772 reasoning steps). To characterize reasoning-chain structure, we measure step count and per-step verbosity, and compare successful versus failed attempts under difficulty stratification (Hard vs. Non-Hard). We further perform a 21-participant human evaluation of reasoning quality across three dimensions (efficiency, logical consistency, and completeness), and we build a taxonomy of problematic reasoning patterns. We find that the relationship between step count and success depends on both model and difficulty, and that verbosity is not a reliable correctness signal. Human analysis indicates that completeness issues dominate failures (44.5%), most often due to missed edge cases and boundary conditions, and that incompleteness is a stronger predictor of failure on Hard tasks than on Non-Hard tasks (𝜌 = −0.219 vs. 𝜌 = −0.096).

Mon 6 Jul

Displayed time zone: Eastern Time (US & Canada)

08:50 - 10:30
Coding Agents, Software Testing, and Code Understanding
ArXiv Track / Main Track at MB 1.210
08:50
5m
Talk
Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
Main Track
Young Jo Chung, Safwat Hassan University of Toronto
08:55
5m
Talk
When Code Authors Are Agents: A Large-Scale Study of Human–Agent Collaboration in Pull Requests
Main Track
Anthonia Oluchukwu Njoku École Polytechnique de Montréal, Université de Montréal, CA, Zohreh Sharafi Polytechnique Montréal, Foutse Khomh Polytechnique Montréal
09:00
5m
Talk
Understanding Conversational Patterns in Multi-Agent Programming: A Case Study On Fibonacci Game Development
Main Track
Srijita Basu Chalmers University of Technology and University of Gothenburg, Viktor Kjellberg Göteborg University, SE, Simin Sun, Bengt Haraldsson Chalmers University of Technology and University of Gothenburg, Scania CV AB, Md Abu Ahammed Babu Volvo Cars AB, Wilhelm Meding Ericsson, Farnaz Fotrousi Chalmers University of Technology and University of Gothenburg, Miroslaw Staron Chalmers University of Technology and University of Gothenburg
09:05
5m
Talk
Recovering from Misbehaviors in Coding Agents
Main Track
Rahul Nanda Facebook, US, Chandra Maddila Meta Platforms, Inc., Smriti Jha Facebook, US, Euna Mehnaz Khan, Satish Chandra Meta Platforms, Inc., Matteo Paltenghi University of Stuttgart
09:10
5m
Talk
Configuring Agentic AI Coding Tools: An Exploratory Study
Main Track
Matthias Galster University of Canterbury, Seyedmoein Mohsenimofidi Heidelberg University, Jai Lal Lulla Singapore Management University, Muhammad Auwal Abubakar Otto-Friedrich Universität Bamberg, DE, Christoph Treude Singapore Management University, Sebastian Baltes Heidelberg University
Pre-print
09:15
5m
Talk
Execution Control Matters: Deterministic and Agentic Tool Orchestration for LLM-Based Code Translation
Main Track
Naing Oo Lwin Bucknell University, US, Rajesh Kumar Bucknell University, US
09:20
5m
Talk
Developer Experience with AI Coding Agents: HTTP Behavioral Signatures in Documentation Portals
ArXiv Track
Oleksii Borysenko Cisco DevNet
09:25
5m
Talk
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
Main Track
Prasun Saurabh Simula Research Laboratory, NO, Pablo Valle Mondragon University, Aitor Arrieta Mondragon University, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University, Paolo Arcaini National Institute of Informatics
09:30
5m
Talk
Fixpad++: Automated Bug Fix Verification Using LLM Agents
Main Track
Mustafa Özkan İr Bilkent University, TR, Mehmet Dedeler Bilkent University, TR, Anil Koyuncu Bilkent University, Eray Tüzün Bilkent University
09:35
5m
Talk
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Main Track
Éric Jacopin Cosmic AI, FR
09:40
5m
Talk
Examining LLMs Ability to Summarize Code Through Mutation-Analysis
Main Track
Lara Khatib University of Waterloo, Michael Pu University of Waterloo, Bogdan Vasilescu Carnegie Mellon University, Mei Nagappan University of Waterloo
09:45
5m
Talk
Testing AIware Systems: A Software Engineering Survey
Main Track
Karla Gonzalez Royal Military College of Canada, Mariam El Mezouar Royal Military College
09:50
5m
Talk
TestMap: Evidence Infrastructure for Foundation-Model-Assisted Test Generation
ArXiv Track
Hunter Leary Virginia Tech, Luke Hanuska Virginia Tech, Chris Brown Virginia Tech
09:55
5m
Talk
Metamorphic Testing for Clinical ML Models: A Framework Proposal and Pilot Study
ArXiv Track
Jie JW Wu Michigan Technological University, USA, Feiyu E Michigan Technological University, USA, Bo Chen Michigan Technological University, USA
10:00
5m
Talk
An Empirical Study of Reasoning Steps in Thinking Code LLMs
Main Track
Haoran Xue York University, CA, Gias Uddin York University, Canada, Song Wang York University
10:05
5m
Talk
Can LLMs really reason about Code? Studying how well LLMs understand the relation between Input, Code, and Output
Main Track
Norman Becker CISPA Helmholtz Center for Information Security, DE, Tural Mammadov CISPA Helmholtz Center for Information Security, Andreas Zeller CISPA Helmholtz Center for Information Security
10:10
5m
Talk
How Robustly do LLMs Understand Execution Semantics?
Main Track
Claudio Spiess University of California, Davis, Premkumar Devanbu UC Davis, Earl T. Barr University College London
10:15
5m
Talk
Program-as-Weights: A Programming Paradigm for Fuzzy Functions
ArXiv Track
Wentao Zhang University of Waterloo, Liliana Hotsko University of Waterloo, Woojeong Kim Cornell University, Pengyu Nie University of Waterloo, Stuart Shieber Harvard University, Yuntian Deng University of Waterloo
10:20
10m
Live Q&A
Joint Q&A and Discussion
Main Track