New research: long-running agents often fail because they stop too early, not because the model can't make progress.
We tested 5 harness designs across 8 long-horizon coding tasks.
Our new orchestration harness, Zenith, wins 5/8 at 43% of the cost of the strongest baseline.