cv usk(@cv_usk): An AI agent's performance is governed not by how much it computes, but by how well that compute turns into good feedback 📈

Title: Scaling Laws for Agent Harnesses via Effective Feedback Compute
URL: https://t.co/r8YlGIA5jZ

📈 Overview
This work proposes Effective Feedback Compute (EFC), a metric that reframes agent scaling efficiency around feedback quality rather than raw compute. It measures whether computation actually improved decisions.

❓ Challenges Solved
We tend to reason about performance via raw metrics: tokens, tool calls, cost. But these mask whether feedback truly improved decision-making. Redundant, invalid, or unused feedback doesn't help.

💡 Methodology & Proposed Approach
・EFC credits feedback only when it is informative, valid, non-redundant, and retained for later decisions
・It normalizes by task demands for fair cross-task comparison
・Evaluated on synthetic tasks, code tasks, real traces, and prospective tests, vs raw-compute and SAS baselines

📊 Experimental Results
EFC's explanatory power stood out (R² vs performance).
・Raw tokens/tool calls: R²=0.33-0.42
・SAS baseline: 0.88, Oracle-EFC: 0.94, task-normalized: 0.99
・Real traces: 0.92, prospective holdout: 0.85
・Matched-budget interventions that improved feedback quality lifted success from 0.27 to 0.90

#AIAgents #ScalingLaws

2026.06.14 04:23
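
To make the crediting rule in the post concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the post does not give the paper's actual formula, so `FeedbackEvent`, its fields, `effective_feedback_compute`, and the choice to divide a simple count by a `task_demand` number are all hypothetical. It shows how two runs with the same raw number of feedback events can differ once only informative, valid, non-redundant, retained feedback is counted.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeedbackEvent:
    """One piece of feedback an agent received (hypothetical schema)."""
    signature: str      # identifies the content; repeats count as redundant
    informative: bool   # did it carry new decision-relevant information?
    valid: bool         # was the feedback actually correct?
    retained: bool      # was it still available for later decisions?


def effective_feedback_compute(events: Iterable[FeedbackEvent],
                               task_demand: float) -> float:
    """Count credited feedback events and normalize by task demand.

    An event is credited only if it is informative, valid, retained,
    and not a repeat of an already-credited signature. This is an
    assumed toy definition, not the paper's formula.
    """
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    credited_signatures = set()
    for event in events:
        if not (event.informative and event.valid and event.retained):
            continue  # fails one of the quality criteria
        credited_signatures.add(event.signature)  # set drops redundant repeats
    return len(credited_signatures) / task_demand


# Five raw feedback events, but only two earn credit.
trace = [
    FeedbackEvent("test_a_failed", True, True, True),
    FeedbackEvent("test_a_failed", True, True, True),    # redundant repeat
    FeedbackEvent("lint_warning", True, False, True),    # invalid
    FeedbackEvent("stack_trace", True, True, False),     # not retained
    FeedbackEvent("test_b_failed", True, True, True),
]
score = effective_feedback_compute(trace, task_demand=4)
print(f"raw events: {len(trace)}, normalized EFC: {score}")
```

In this toy setup a raw-compute metric would report five feedback events, while the normalized score reflects only the two that could have improved a later decision.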

