Imagine 300 forecasts: 10 teams, each running 3 different models, submitting forecasts every month for 10 months. Which forecasts are performing best, and where? Which team has improved fastest? Do differences among models, or changes in the input data, drive the largest gains in forecast skill? And what does it take to answer such questions again and again, automatically and at scale? Our existing cyber-infrastructure and standards were largely designed around a one-and-done approach that does not easily support quantitative comparisons among qualitatively different models. I will present how EFI is tackling this challenge through new cyber-infrastructure and standards.
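To make the scale of the question concrete, here is a minimal sketch of the kind of automated scoring such questions require. It is not EFI's actual standard or tooling: it assumes a hypothetical long-format forecast archive (one row per team, model, and month, with a Gaussian predictive mean and standard deviation matched to an observation), scores each forecast with the closed-form continuous ranked probability score (CRPS), and then uses grouped summaries to answer "who is best?" and "who improved fastest?".

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def crps_gaussian(mu, sigma, obs):
    """Closed-form CRPS for a Gaussian predictive distribution (lower is better)."""
    z = (obs - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# Hypothetical archive: 10 teams x 3 models x 10 months = 300 forecasts.
rng = np.random.default_rng(42)
teams = [f"team{i:02d}" for i in range(10)]
models = ["m1", "m2", "m3"]
months = pd.period_range("2023-01", periods=10, freq="M")
df = pd.DataFrame(
    [(t, m, mo) for t in teams for m in models for mo in months],
    columns=["team", "model", "month"],
)
df["mu"] = rng.normal(0, 1, len(df))          # predictive mean (synthetic)
df["sigma"] = rng.uniform(0.5, 2.0, len(df))  # predictive sd (synthetic)
df["obs"] = rng.normal(0, 1, len(df))         # matched observation (synthetic)

df["crps"] = crps_gaussian(df["mu"], df["sigma"], df["obs"])

# Which team/model combination scores best overall?
leaderboard = df.groupby(["team", "model"])["crps"].mean().sort_values()
print(leaderboard.head())

# Which team improved fastest? Slope of mean monthly CRPS over the 10 months
# (most negative slope = fastest improvement).
monthly = df.groupby(["team", "month"])["crps"].mean()
trend = monthly.groupby("team").apply(
    lambda s: np.polyfit(np.arange(len(s)), s.to_numpy(), 1)[0]
)
print(trend.sort_values().head())
```

The point of the sketch is that once forecasts and observations share a common long-format standard, every one of the opening questions reduces to a score plus a group-by, which is exactly what a one-and-done, model-specific infrastructure makes hard to do repeatedly.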