AIware 2026
Mon 6 - Tue 7 July 2026 Montreal, Canada
co-located with FSE 2026

This program is tentative and subject to change.

LLM-based autonomous agents fail in ways that existing observability infrastructure cannot detect. OpenTelemetry’s GenAI semantic conventions cover LLM invocation and tool execution but leave five critical agent orchestration phases — planning, reasoning, safety monitoring, inter-agent delegation, and memory management — without span-level representation. We present AgentTelemetry, an open-source benchmark suite and toolkit for evaluating fault detection in agent systems. The benchmark defines (1) a taxonomy of 14 fault types mapped to 9 agent-specific span kinds, (2) a controlled evaluation harness of 2,940 configurations (14 faults × 5 observability conditions × 7 frameworks × 6 models), and (3) a pip-installable library (3,700+ LOC, 78 tests) with adapters for seven frameworks. On the controlled benchmark, the full span taxonomy achieves a Fault Detection Rate (FDR) of 1.000 — an upper bound confirming structural completeness — compared to 0.429 for vanilla OpenTelemetry and OTel+GenAI. An ablation study proves all nine span kinds are necessary: removing any one makes at least one fault type undetectable. A case study on 112 SWE-bench Lite instances reveals that reasoning loops account for 75% of agent failures (95% CI: [66%, 82%]) — a failure mode invisible to vanilla OTel — and a telemetry-guided intervention improves the patch rate by +8.3 pp over a matched control. All code, data, and benchmark configurations are open-source for reproducibility.
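The size of the evaluation harness and the Fault Detection Rate quoted in the abstract can be sketched as follows. This is a minimal illustration, not the AgentTelemetry API: the factor labels are placeholders (the abstract gives only the counts), and `fault_detection_rate` assumes FDR is the fraction of injected faults that were detected, which the abstract does not define explicitly.

```python
from itertools import product

# Placeholder labels: the abstract states only how many levels each factor has.
FAULTS = [f"fault_{i}" for i in range(14)]
CONDITIONS = [f"condition_{i}" for i in range(5)]
FRAMEWORKS = [f"framework_{i}" for i in range(7)]
MODELS = [f"model_{i}" for i in range(6)]


def build_grid():
    """Enumerate every (fault, condition, framework, model) combination."""
    return list(product(FAULTS, CONDITIONS, FRAMEWORKS, MODELS))


def fault_detection_rate(detected):
    """Fraction of injected faults flagged as detected (assumed definition)."""
    outcomes = list(detected)
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    grid = build_grid()
    print(len(grid))  # 14 * 5 * 7 * 6 = 2940
    # Hypothetical outcome vector: 6 of 14 fault types detected.
    print(round(fault_detection_rate([True] * 6 + [False] * 8), 3))
```

Under this assumed definition, a rate of 0.429 would be consistent with 6 of 14 fault types being detected, though the abstract does not give that breakdown.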

Tue 7 Jul

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
Human Factors, Responsible AIware, and Benchmarks & Datasets
Benchmark & Dataset Track / Main Track at MB 1.210
14:00
5m
Talk
Is Artificial Intelligence an Elixir to the Software Engineering Community? An Empirical Study Among Managers
Main Track
Xin Zhao Seattle University, Brian Vu Seattle University, US, Sitesh Pattanaik Donald Bren School of Information and Computer Sciences, University of California, Irvine, US
14:05
5m
Talk
Towards AI as a Collaborative Partner: A Taxonomy of AI Agent Behavior in Software Engineering
Main Track
Tao Dong Google, Sherry Shi Google, Harini Sampath, Andrew Macvean Google, Inc.
Pre-print
14:10
5m
Talk
Auditing Who Appears to Belong: A Large-Scale Empirical Study of Bias in Deployed Text-to-Image Systems for Software Engineering
Main Track
Mohamad Kassab Boston University
14:15
5m
Talk
Operationalizing Ethics for AI Agents: How Developers Encode Values into Repository Context Files
Main Track
Christoph Treude Singapore Management University, Sebastian Baltes Heidelberg University, Marc Cheong the University of Melbourne
Pre-print
14:20
5m
Talk
Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap
Main Track
Christoph Treude Singapore Management University
Pre-print
14:25
5m
Talk
SOSecure: The Wisdom of the Crowd for Safer AI-Generated Code
Main Track
Manisha Mukherjee Carnegie Mellon University, Vincent J. Hellendoorn Google DeepMind
14:30
5m
Talk
SecVulEval: Context-Aware Benchmarking of LLMs for Vulnerability Detection
Benchmark & Dataset Track
Md Basim Uddin Ahmed York University, CA, Nima Shiri Harzevili York University, Jiho Shin York University, Hung Viet Pham York University, Song Wang York University
14:35
5m
Talk
SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection
Benchmark & Dataset Track
Mariam ALMutairi Virginia Polytechnic Institute and State University, US
14:40
5m
Talk
CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis
Benchmark & Dataset Track
Arunabh Majumdar Independent Researcher, IN
14:45
5m
Talk
REBench: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names
Benchmark & Dataset Track
Jun Yeon Won Ohio State University, Columbus, US, Xin Jin Meta, Shiqing Ma University of Massachusetts at Amherst, Zhiqiang Lin The Ohio State University
14:50
5m
Talk
RustBuildEq: A Benchmark for Binary Equivalence Under Build Variability
Benchmark & Dataset Track
Elliott Wen The University of Auckland, Chenye Ni, Valerio Terragni University of Auckland, Jens Dietrich Victoria University of Wellington
14:55
5m
Talk
TOGBench: A Developer-Written Multi-Variant Dataset and Benchmark Suite for Test Oracle Generation
Benchmark & Dataset Track
Tasfia Tasnim University of Texas at Dallas, US, Matthew B Dwyer University of Virginia, Soneya Binta Hossain University of Texas at Dallas
15:00
5m
Talk
HEJ-Robust: A Robustness Benchmark for LLM-based Automated Program Repair
Benchmark & Dataset Track
Fazle Rabbi Concordia University, Jinqiu Yang Concordia University
15:05
5m
Paper
JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks
Benchmark & Dataset Track
Yiran Wang Linköping University, José Antonio Hernández López Department of Computer Science and Systems, University of Murcia, Ulf Nilsson Linköping University, Daniel Varro Linköping University / McGill University
Pre-print
15:10
5m
Talk
AgentTelemetry: A Fault Detection Benchmark and Toolkit for LLM Agent Observability
Benchmark & Dataset Track
15:15
15m
Live Q&A
Joint Q&A and Discussion
Benchmark & Dataset Track