Artifact Readiness Gates with Saturation Stop Rules and Host-Parity Admissibility for FM Release Evaluation
Release evaluation for FM-powered software often grows by habit rather than policy: teams repeat runs until budget or time is exhausted, without clear evidence that additional passes change release decisions. We study a release-evaluation protocol that separates three concerns: artifact readiness, decision-stability stopping, and cross-hardware promotion gating. The study uses 340 runs spanning seven edit families (five core plus two probes), four model families, ten seeds, and dual-host H100/H200 execution. In this matrix and under this policy setting, additional seed repetition did not change promote/block outcomes, edit-family breadth remained decision-informative, and small H100/H200 score differences could still alter promotion outcomes near strict boundaries. In the same matrix, the seed stop rule reduced measured GPU-hours by about 90% relative to fixed 10-pass seed evaluation. These findings motivate workload-conditional resource allocation for release engineering: in this evidence setting, additional budget is more decision-informative when spent on edit diversity and host-parity checks than on deeper seed repetition. The contribution is an operational decision framework, with explicit sensitivity reporting, that turns release evaluation from a fixed checklist into a defensible governance process. Numeric thresholds are workload-derived; the transferable contribution is the gate-setting process.
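To make the two decision-facing gates concrete, the following is a minimal sketch of how a decision-stability stop rule and a host-parity admissibility check could be expressed. It is not the protocol's actual implementation: the function names, the mean-score decision rule, the `patience` parameter, and every threshold, tolerance, and score value are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def decision(scores: Sequence[float], threshold: float) -> str:
    """Hypothetical decision rule: promote if the mean score meets the threshold."""
    return "promote" if sum(scores) / len(scores) >= threshold else "block"


def seeds_until_stable(
    seed_scores: Sequence[float], threshold: float, patience: int = 2
) -> int:
    """Return how many seeds are consumed before the promote/block decision
    has stayed unchanged for `patience` consecutive added seeds.

    Falls back to the full seed budget if the decision never stabilizes.
    """
    stable_steps = 0
    previous = None
    for n in range(1, len(seed_scores) + 1):
        current = decision(seed_scores[:n], threshold)
        # Count consecutive seeds that left the decision unchanged.
        stable_steps = stable_steps + 1 if current == previous else 0
        previous = current
        if stable_steps >= patience:
            return n
    return len(seed_scores)


def host_parity_admissible(
    score_a: float, score_b: float, threshold: float, tolerance: float
) -> bool:
    """Admit a promotion only if both hosts land on the same side of the
    threshold and their scores differ by no more than `tolerance`."""
    same_side = (score_a >= threshold) == (score_b >= threshold)
    return same_side and abs(score_a - score_b) <= tolerance


# Illustrative values only (assumed, not taken from the study).
scores = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.92, 0.93]
used = seeds_until_stable(scores, threshold=0.85, patience=2)
print(f"seeds used: {used} of {len(scores)}")

# A small cross-host gap that straddles a strict boundary is inadmissible.
print(host_parity_admissible(0.851, 0.849, threshold=0.85, tolerance=0.01))
```

In this sketch, a score gap of 0.002 between hosts is harmless when both scores sit well above the threshold but blocks promotion when the two scores fall on opposite sides of it, which mirrors the boundary sensitivity described above.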