TOGBench: A Developer-Written Multi-Variant Dataset and Benchmark Suite for Test Oracle Generation
Test oracles determine whether a program execution is correct for a given input. Two common forms are assertion oracles, which compare observed outputs with expected results, and exception oracles, which verify that a program raises an expected exception. Automated test oracle generation (TOG) aims to reduce the manual effort of constructing such oracles. Although recent TOG methods, especially LLM-based approaches, have made rapid progress, their evaluation remains constrained by benchmarks that rely on automatically generated tests, narrow single-assert formulations, simplified developer-written tests, or limited oracle diversity. To address these limitations, we introduce OE25𝑑𝑒𝑣, a multi-variant dataset curated from developer-written unit tests across 25 open-source Java projects spanning 56 modules, and TOGBench, an end-to-end benchmark suite for TOG. OE25𝑑𝑒𝑣 captures six oracle categories and preserves realistic settings, including single- and multi-oracle configurations, mixed assertion-and-exception oracles, and developer-authored custom oracles. TOGBench supports end-to-end experimentation by reintegrating generated oracles into runnable test suites and evaluating them through compilation, execution, false-positive analysis, and mutation testing. Our evaluation shows that OE25𝑑𝑒𝑣 preserves substantially greater structural complexity than prior benchmarks and that representative TOG models degrade markedly on developer-written tests, particularly for assertion oracles.
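To make the two oracle forms defined above concrete, the following is a minimal, dependency-free Java sketch. The class `OracleSketch` and the method under test `parsePositive` are hypothetical names introduced only for illustration; they are not taken from OE25𝑑𝑒𝑣 or TOGBench. In a developer-written JUnit 5 test, the same checks would typically be written with `assertEquals` and `assertThrows`.

```java
public class OracleSketch {

    // Hypothetical method under test (an assumption for illustration only).
    static int parsePositive(String s) {
        int value = Integer.parseInt(s.trim());
        if (value <= 0) {
            throw new IllegalArgumentException("not positive: " + value);
        }
        return value;
    }

    // Assertion oracle: compare the observed output with the expected result.
    static boolean assertionOracle() {
        int observed = parsePositive(" 42 ");
        return observed == 42;
    }

    // Exception oracle: verify that the expected exception is raised.
    static boolean exceptionOracle() {
        try {
            parsePositive("-1");
            return false; // no exception was raised, so the oracle fails
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("assertion oracle holds: " + assertionOracle());
        System.out.println("exception oracle holds: " + exceptionOracle());
    }
}
```

A test that combines both kinds of check in one method corresponds to what the abstract calls a mixed assertion-and-exception configuration.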