JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks (AIware 2026 - Benchmark & Dataset Track)

Mon 6 - Tue 7 July 2026 Montreal, Canada

co-located with FSE 2026

Who

Yiran Wang, José Antonio Hernández López, Ulf Nilsson, Daniel Varro

Track

AIware 2026 Benchmark & Dataset Track

Abstract

Jupyter notebooks are widely used for machine learning (ML) prototyping and experimentation, yet debugging support for notebook-based ML development remains limited, partly due to the lack of realistic and executable bug benchmarks. We introduce JunoBench, to our knowledge the first executable benchmark of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated and reproducible crashes from public Kaggle notebooks, each paired with a verified fix. The benchmark covers widely used ML libraries (e.g., TensorFlow/Keras, PyTorch, and Scikit-learn) as well as notebook-specific failures such as out-of-order execution. To ensure reproducibility and ease of evaluation, JunoBench provides a unified execution environment that reliably reproduces all crashes. In addition, each crash is accompanied by human-validated annotations, including library cause, crash type, root cause, ML pipeline stage, and natural-language diagnostic summaries. By combining realistic crashes, verified fixes, structured labels, and reproducible execution, JunoBench enables systematic evaluation of crash detection, diagnosis, and automated repair techniques for notebook-based ML development.
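As an illustration of the notebook-specific failure class mentioned above, the following minimal sketch shows how out-of-order execution can crash a notebook that runs cleanly top to bottom. It is not taken from JunoBench: the three cells, the `CELLS` mapping, and the `run_cells` helper are hypothetical, and a shared dictionary stands in for the kernel's namespace.

```python
# Hypothetical three-cell notebook, modelled as source strings.
# These cells are illustrative only and do not come from JunoBench.
CELLS = {
    1: "import math",
    2: "radius = 2.0",
    3: "area = math.pi * radius ** 2",
}


def run_cells(order):
    """Execute cells in the given order in one shared namespace.

    The shared dict plays the role of a notebook kernel's state.
    Returns None if every cell runs, otherwise a tuple of
    (cell id, exception type name) for the first cell that crashes.
    """
    namespace = {}
    for cell_id in order:
        try:
            exec(CELLS[cell_id], namespace)
        except Exception as exc:
            return cell_id, type(exc).__name__
    return None


# Top-to-bottom execution succeeds.
in_order = run_cells([1, 2, 3])

# Running cell 3 before cell 2 uses `radius` before it is defined.
out_of_order = run_cells([1, 3, 2])

print(in_order)      # None
print(out_of_order)  # (3, 'NameError')
```

The same source code either runs or crashes depending only on execution order, which is why such failures are hard to reproduce without a recorded execution history.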

Yiran Wang

Linköping University