@karinanguyen My whole experience doing this project was finding endless "up and to the right" graphs at all resolutions of AI R&D, from the well-known (e.g., SWE-Bench) to the more niche (like those above). It's a fractal: at every resolution you see the same trend of meaningful progress.
@karinanguyen There's also MLE-Bench, which is ecologically valid (tasks come from real Kaggle competitions) and involves building a very diverse set of ML apps to solve specific problems. The same progress shows up here.