Metamorphic Testing for Clinical ML Models: A Framework Proposal and Pilot Study
Machine learning models for clinical prediction tasks such as in-hospital mortality and sepsis onset routinely achieve high AUROC scores, yet AUROC measures how well a model ranks patients, not whether its behavior is clinically sensible. A model can rank patients correctly in aggregate while predicting lower mortality risk when a patient’s SOFA score worsens, which contradicts established medical guidelines. This paper proposes applying metamorphic testing (MT) to clinical ML models as a way to check behavioral correctness without requiring ground-truth labels for individual predictions. We design a catalog of 12 candidate metamorphic relations (MRs) for three ICU prediction tasks on MIMIC-III/IV, each grounded in an authoritative clinical guideline. We also propose a five-layer validation strategy for ensuring that MRs are clinically sound before use. As a feasibility check, we run a pilot study on the UCI Heart Disease dataset, where all three clinical models tested (AUROC 0.849–0.900) produce violation rates of 27–87% on the five pilot MRs. An injected-fault experiment shows that a sign-negation error in a blood pressure feature goes undetected by AUROC but produces a 31–67 percentage-point shift in MT violation rate. These results suggest that MT is a useful complement to standard metrics for checking clinical model behavior.
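To make the idea of a metamorphic relation concrete, the following is a minimal sketch of a monotonicity-style check of the kind the abstract describes: worsening a severity score should not lower predicted mortality risk. It is not taken from the paper's MR catalog. The function `violation_rate`, the two toy models, the feature layout (column 0 as a SOFA-like score, column 1 as a blood-pressure-like value), and the synthetic data are all hypothetical and serve only to illustrate how a violation rate can be computed without ground-truth labels.

```python
import numpy as np


def violation_rate(predict, X, feature_idx, delta, tol=1e-9):
    """Fraction of rows where raising one feature lowers the predicted risk.

    predict: callable mapping an (n, d) array to n risk scores (assumed interface).
    """
    X = np.asarray(X, dtype=float)
    X_followup = X.copy()
    X_followup[:, feature_idx] += delta  # follow-up input: the feature is worsened
    source_risk = predict(X)
    followup_risk = predict(X_followup)
    # The relation is violated when risk drops although the patient got worse.
    violated = followup_risk < source_risk - tol
    return float(np.mean(violated))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical toy models, not the models evaluated in the paper.
def sound_model(X):
    # Risk increases with the severity score in column 0.
    return sigmoid(0.4 * X[:, 0] - 0.02 * X[:, 1])


def faulty_model(X):
    # Same model with the sign of the severity coefficient flipped.
    return sigmoid(-0.4 * X[:, 0] - 0.02 * X[:, 1])


# Synthetic inputs (assumption): a severity-like score and a pressure-like value.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 20, size=200),
    rng.normal(80, 10, size=200),
])

print(violation_rate(sound_model, X, feature_idx=0, delta=2.0))
print(violation_rate(faulty_model, X, feature_idx=0, delta=2.0))
```

In this sketch the check compares each prediction only with the prediction on its own perturbed input, so no outcome labels are needed. The sound toy model never violates the relation, while the sign-flipped one violates it on every row.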