AIware 2026
Mon 6 - Tue 7 July 2026, Montreal, Canada
co-located with FSE 2026

Large Language Models (LLMs) show promise for vulnerability detection, but their evaluation is limited by the lack of high-quality benchmarks. Existing datasets rely on coarse function-level labels, overlook fine-grained vulnerability patterns, and lack critical program context, such as data/control dependencies. They also suffer from data quality issues, including mislabeling and duplication, leading to unreliable evaluation and limited real-world relevance. To address these limitations, this paper introduces SecVulEval, a comprehensive benchmark designed to support fine-grained evaluation of LLMs and other detection methods with rich contextual information. SecVulEval focuses on real-world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model’s ability to localize and understand vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for benchmarking vulnerability detection in realistic software development scenarios. The benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated state-of-the-art LLMs in both standalone and multi-agent settings. Results on our dataset indicate that current models remain far from accurately identifying vulnerable statements within a given function, although agent-based approaches yield modest but promising improvements. The best-performing agent, driven by Claude-3.7-Sonnet, achieves an F1-score of 23.83% on vulnerable statement detection. We believe this benchmark can serve as a foundation for advancing context-aware, fine-grained vulnerability detection with LLMs.
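
For readers unfamiliar with statement-level scoring, the reported F1 can be understood as set overlap between the statements a model flags and the labeled vulnerable statements in a function. The sketch below illustrates this under simple assumptions; the function name, data representation, and example values are illustrative and do not reflect SecVulEval's actual evaluation code.

    # Minimal sketch of statement-level F1, assuming predictions and ground
    # truth are sets of (file, line) pairs for one function. Illustrative
    # only; not SecVulEval's actual API.

    def statement_f1(predicted: set, ground_truth: set) -> float:
        """F1 over predicted vs. labeled vulnerable statements."""
        if not predicted and not ground_truth:
            return 1.0  # nothing to flag, nothing flagged
        tp = len(predicted & ground_truth)  # correctly flagged statements
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(ground_truth) if ground_truth else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Example: a model flags three statements; two match the labels.
    pred = {("parse.c", 41), ("parse.c", 42), ("parse.c", 58)}
    gold = {("parse.c", 42), ("parse.c", 58), ("parse.c", 60)}
    print(f"F1 = {statement_f1(pred, gold):.2%}")  # F1 = 66.67%

Under this view, a statement-level F1 of 23.83% means that even the best agent's flagged statements only partially overlap with the labeled vulnerable ones, which is a much harder target than function-level binary classification.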