AIware 2026
Mon 6 - Tue 7 July 2026, Montreal, Canada
co-located with FSE 2026

SWE-Bench has become one of the most widely used benchmarks for evaluating whether large language models (LLMs) can resolve real-world GitHub issues. Despite its broad adoption, however, a systematic evaluation of the benchmark's quality remains lacking. In this paper, we present SWE-Bench+, an enhanced benchmark framework that improves evaluation reliability by addressing two benchmark-quality risks in SWE-Bench: solution leakage in issue descriptions and weak tests that allow plausible but incorrect patches to pass. To construct SWE-Bench+, we first analyze 217 commonly resolved issues across three top-performing agents on the SWE-Bench leaderboard at the time of our study (SWE-agent 1.0, OpenHands+CodeAct v2.1, and AutoCodeRover-v2.0), yielding 651 model-generated patches, and we identify five quality-problem patterns under the two broader risks of solution leakage and weak tests. Based on this analysis, we develop two components: SoluLeakDetector, which identifies solution-leaking content for issue sanitization, and TestEnhancer, which strengthens test suites for more reliable patch validation. Our results show that 60.83% of the commonly resolved issues exhibit solution leakage and that 77.88% are problematic overall. After removing leaked information from issue descriptions, model resolution rates drop substantially, showing that leakage materially inflates reported performance. SoluLeakDetector achieves 80.45% accuracy on solution-leak detection, and TestEnhancer identifies plausible patches for 97.11% of weak-test issues while reducing average resolution rates by 27.00 percentage points on SWE-Bench Lite and 36.27 percentage points on SWE-Bench Verified. These findings show that existing SWE-Bench evaluations can substantially overestimate true issue-resolution performance and that SWE-Bench+ provides a more rigorous benchmark framework for LLM-based software engineering agents.
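To make the solution-leakage risk concrete, the minimal Python sketch below shows one simple heuristic a reader might use to screen for it: flagging an issue whenever a substantial code line added by the gold patch is quoted verbatim in the issue description. This is purely illustrative and is not SoluLeakDetector's method; the function and field names (`added_lines`, `looks_leaky`, `issue_text`, `gold_patch`) are assumptions introduced here for the example.

```python
# Illustrative heuristic only -- NOT the SoluLeakDetector algorithm from the paper.
# Flags an issue as potentially solution-leaking when a non-trivial line added by
# the gold patch also appears verbatim in the issue description.

def added_lines(gold_patch: str) -> list[str]:
    """Extract substantial lines added by a unified-diff patch."""
    lines = []
    for raw in gold_patch.splitlines():
        if raw.startswith("+") and not raw.startswith("+++"):
            stripped = raw[1:].strip()
            if len(stripped) >= 20:  # skip short, common fragments
                lines.append(stripped)
    return lines


def looks_leaky(issue_text: str, gold_patch: str) -> bool:
    """Return True if any substantial added line of the fix is quoted in the issue."""
    return any(line in issue_text for line in added_lines(gold_patch))


if __name__ == "__main__":
    issue = (
        "Bug: crash on empty input.\n"
        "Proposed fix:\n"
        "    return default_value if not items else max(items)"
    )
    patch = (
        "--- a/util.py\n"
        "+++ b/util.py\n"
        "+    return default_value if not items else max(items)\n"
    )
    print(looks_leaky(issue, patch))  # True: the fix line is quoted in the issue
```

A substring check like this would of course miss paraphrased or partially leaked fixes and may over-flag boilerplate; it is meant only to convey what "solution leakage in issue descriptions" looks like in practice.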