Towards AI as a Collaborative Partner: A Taxonomy of AI Agent Behavior in Software Engineering (AIware 2026 - Main Track) - AIware 2026

Mon 6 - Tue 7 July 2026 Montreal, Canada

co-located with FSE 2026

Who

Tao Dong, Sherry Shi, Harini Sampath, Andrew Macvean

Track

AIware 2026 Main Track

This program is tentative and subject to change.

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 7 Jul 2026 14:05 - 14:10 at MB 1.210 - Human Factors, Responsible AIware, and Benchmarks & Datasets

Abstract

The ongoing transition of Large Language Models in software engineering from one-shot code generators into agentic partners requires a shift in how we define and measure success. While models are becoming more capable, the industry lacks a clear understanding of the behavioral norms that make an interactive SWE agent effective in collaborative software development in the enterprise. This work addresses this gap by presenting a taxonomy of desirable SWE agent behaviors, synthesized from 91 sets of developer-defined rules for SWE agents and validated through interviewing 15 experienced professional developers. In this taxonomy, we identify four core expectations: Adhere to Standards and Processes, Ensure Code Quality and Reliability, Solve Problems Effectively, and Collaborate with the Developer. These findings offer a concrete vocabulary for aligning SWE agent behavior with developer preferences, enabling researchers and practitioners to move beyond correctness-only benchmarks and start designing evaluations that reflect the socio-technical nature of professional software development in enterprises.

Link to Preprint

https://storage.googleapis.com/gweb-research2023-media/pubtools/1038077.pdf

Tao Dong

Google

United States

Sherry Shi

Google

Harini Sampath

Andrew Macvean

Google, Inc.

Session Program

Tue 7 Jul
Displayed time zone: Eastern Time (US & Canada)

	14:00 - 15:30	Human Factors, Responsible AIware, and Benchmarks & Datasets (Benchmark & Dataset Track / Main Track) at MB 1.210

	14:00 5m Talk		Is Artificial Intelligence an Elixir to the Software Engineering Community? An Empirical Study Among Managers Main Track Xin Zhao Seattle University, Brian Vu Seattle University, US, Sitesh Pattanaik Donald Bren School of Information and Computer Sciences, University of California, Irvine, US
	14:05 5m Talk		Towards AI as a Collaborative Partner: A Taxonomy of AI Agent Behavior in Software Engineering Main Track Tao Dong Google, Sherry Shi Google, Harini Sampath, Andrew Macvean Google, Inc. Pre-print
	14:10 5m Talk		Auditing Who Appears to Belong: A Large-Scale Empirical Study of Bias in Deployed Text-to-Image Systems for Software Engineering Main Track Mohamad Kassab Boston University
	14:15 5m Talk		Operationalizing Ethics for AI Agents: How Developers Encode Values into Repository Context Files Main Track Christoph Treude Singapore Management University, Sebastian Baltes Heidelberg University, Marc Cheong the University of Melbourne Pre-print
	14:20 5m Talk		Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap Main Track Christoph Treude Singapore Management University Pre-print
	14:25 5m Talk		SOSecure: The Wisdom of the Crowd for Safer AI-Generated Code Main Track Manisha Mukherjee Carnegie Mellon University, Vincent J. Hellendoorn Google DeepMind
	14:30 5m Talk		SecVulEval: Context-Aware Benchmarking of LLMs for Vulnerability Detection Benchmark & Dataset Track Md Basim Uddin Ahmed York University, CA, Nima Shiri Harzevili York University, Jiho Shin York University, Hung Viet Pham York University, Song Wang York University
	14:35 5m Talk		SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection Benchmark & Dataset Track Mariam ALMutairi Virginia Polytechnic Institute and State University, US
	14:40 5m Talk		CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis Benchmark & Dataset Track Arunabh Majumdar Independent Researcher, IN
	14:45 5m Talk		REBench: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names Benchmark & Dataset Track Jun Yeon Won Ohio State University, Columbus, US, Xin Jin Meta, Shiqing Ma University of Massachusetts at Amherst, Zhiqiang Lin The Ohio State University
	14:50 5m Talk		RustBuildEq: A Benchmark for Binary Equivalence Under Build Variability Benchmark & Dataset Track Elliott Wen The University of Auckland, Chenye Ni, Valerio Terragni University of Auckland, Jens Dietrich Victoria University of Wellington
	14:55 5m Talk		TOGBench: A Developer-Written Multi-Variant Dataset and Benchmark Suite for Test Oracle Generation Benchmark & Dataset Track Tasfia Tasnim University of Texas at Dallas, US, Matthew B Dwyer University of Virginia, Soneya Binta Hossain University of Texas at Dallas
	15:00 5m Talk		HEJ-Robust: A Robustness Benchmark for LLM-based Automated Program Repair Benchmark & Dataset Track Fazle Rabbi Concordia University, Jinqiu Yang Concordia University
	15:05 5m Paper		JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks Benchmark & Dataset Track Yiran Wang Linköping University, José Antonio Hernández López Department of Computer Science and Systems, University of Murcia, Ulf Nilsson Linköping University, Daniel Varro Linköping University / McGill University Pre-print
	15:10 5m Talk		AgentTelemetry: A Fault Detection Benchmark and Toolkit for LLM Agent Observability Benchmark & Dataset Track Krishna Chaitanya Balusu Independent
	15:15 15m Live Q&A		Joint Q&A and Discussion Benchmark & Dataset Track