AIware 2026
Mon 6 - Tue 7 July 2026 Montreal, Canada
co-located with FSE 2026

Jupyter notebooks are widely used for machine learning (ML) prototyping and experimentation, yet debugging support for notebook-based ML development remains limited, partly due to the lack of realistic, executable bug benchmarks. We introduce JunoBench, to our knowledge the first executable benchmark of real-world crashes in Python-based ML notebooks. JunoBench contains 111 curated, reproducible crashes from public Kaggle notebooks, each paired with a verified fix. The benchmark covers widely used ML libraries (e.g., TensorFlow/Keras, PyTorch, and scikit-learn) as well as notebook-specific failures such as out-of-order execution. To ensure reproducibility and ease of evaluation, JunoBench provides a unified execution environment that reliably reproduces all crashes. In addition, each crash is accompanied by human-validated annotations, including library cause, crash type, root cause, ML pipeline stage, and a natural-language diagnostic summary. By combining realistic crashes, verified fixes, structured labels, and reproducible execution, JunoBench enables systematic evaluation of crash detection, diagnosis, and automated repair techniques for notebook-based ML development.
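To make the notebook-specific failure mode concrete, the following is a minimal sketch (hypothetical code, not drawn from JunoBench) of how an out-of-order execution crash arises: notebook cells share a single kernel namespace, so executing a cell before the cell that defines its inputs raises an error that a top-to-bottom run would never produce.

```python
# Hypothetical sketch of an out-of-order execution crash.
# The dict stands in for the notebook kernel's global namespace,
# and each string stands in for a notebook cell.

kernel_ns = {}

cell_0 = "features = [1.0, 2.0, 3.0]"                       # defines the data
cell_1 = "scaled = [x / max(features) for x in features]"   # consumes it

# Running cell_1 before cell_0 crashes, because `features`
# does not exist yet in the shared namespace:
try:
    exec(cell_1, kernel_ns)
except NameError as err:
    print("crash:", err)

# The intended top-to-bottom order succeeds:
exec(cell_0, kernel_ns)
exec(cell_1, kernel_ns)
print(kernel_ns["scaled"])
```

The same notebook state can thus yield either a crash or a correct result depending purely on execution order, which is why a benchmark of such failures must pin down a reproducible execution environment.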