SecMutBench: Evaluating LLM-Generated Security Tests via Mutation-Based Vulnerability Detection
Existing LLM security benchmarks evaluate code generation quality, leaving an open question: can LLMs generate tests that detect vulnerabilities? We address this with two technical contributions. First, we propose the Security Mutation Score (SMS), a metric that classifies mutant kills into semantic, functional, incidental, and crash categories using operator-aware heuristics, distinguishing genuine security awareness from coincidental detection. We further define Effective SMS (EffSMS = SMS × Secure-Pass Rate) to account for test validity. Second, we design 25 security-specific mutation operators spanning 30 CWE categories that transform secure Python code into realistic vulnerable variants, extending prior security mutation frameworks to Python and introducing 22 new operators. Evaluating eight LLMs and two static analysis baselines on 339 programs and 1,869 mutants reveals three findings: (i) traditional mutation scores overstate LLM security testing capability by 2.2× on average; (ii) the best LLM achieves only 19.7% EffSMS vs. 47.6% for expert-written tests, a 2.4× gap that raw metrics obscure; and (iii) functional kills, not crashes, dominate non-semantic failures (15–36%), showing that LLMs detect behavioral side effects rather than security properties. Static analysis and mutation testing provide complementary coverage across syntactic and logic-flaw CWEs. Code and data are publicly available.
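As a minimal sketch of how the two scores relate, the Python below computes a traditional mutation score, SMS, and EffSMS from per-mutant outcome labels. Only the relation EffSMS = SMS × Secure-Pass Rate comes from the abstract. The assumption that SMS counts semantic kills alone, the "survived" label, the function name `mutation_scores`, and the sample data are all hypothetical and are not taken from the paper or its released code.

```python
from collections import Counter

# The four kill categories named in the abstract.
KILL_CATEGORIES = ("semantic", "functional", "incidental", "crash")


def mutation_scores(outcomes, secure_pass_rate):
    """Return raw mutation score, SMS, and EffSMS as fractions in [0, 1].

    outcomes: one label per mutant, either a kill category or the
        hypothetical label "survived" for a mutant no test killed.
    secure_pass_rate: fraction of generated tests that pass on the
        original secure code.
    """
    if not outcomes:
        raise ValueError("at least one mutant outcome is required")
    if not 0.0 <= secure_pass_rate <= 1.0:
        raise ValueError("secure_pass_rate must lie in [0, 1]")

    counts = Counter(outcomes)
    total = len(outcomes)

    # Traditional mutation score: every kill counts, whatever its cause.
    raw = sum(counts[c] for c in KILL_CATEGORIES) / total

    # ASSUMPTION: SMS credits semantic kills only. The abstract does not
    # spell out the exact formula.
    sms = counts["semantic"] / total

    # Stated in the abstract: EffSMS = SMS x Secure-Pass Rate.
    eff_sms = sms * secure_pass_rate

    return {"raw": raw, "sms": sms, "eff_sms": eff_sms}


# Illustrative data only: 10 mutants, 6 killed, 3 of them semantically.
example = ["semantic"] * 3 + ["functional"] * 2 + ["crash"] + ["survived"] * 4
scores = mutation_scores(example, secure_pass_rate=0.5)
print(scores)
```

In this made-up example the raw score credits six kills while SMS credits three, which illustrates how a traditional mutation score can overstate security-relevant detection. Multiplying by the secure-pass rate then discounts tests that fail on the secure code.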