cv usk@cv_usk

2026.06.12 08:38

Can an AI actually mediate a conflict between people? 🤝 Here is a benchmark that tries to measure that reliably, under realistic conditions.

Title: SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
URL:

🤝 Overview
This work proposes SoCRATES, a comprehensive benchmark for evaluating LLMs as mediators. An agentic pipeline builds realistic conflict scenarios across eight domains from actual public disputes, enabling automated and reliable evaluation of proactive LLM mediation.

❓ Challenges Solved
Using LLMs to guide disputing parties toward agreement is gaining attention, but evaluating it is hard.
・Real conflicts shift constantly as disputants' emotions, intentions, and context change mid-mediation
・Existing benchmarks rely on a limited set of expert-authored scenarios
・They also score every turn against every topic, injecting noise that muddies the evaluation signal

💡 Methodology & Proposed Approach
SoCRATES integrates three approaches.
・Agentic scenario curation: agents find genuine public disputes, restructure them into mediation scenarios, and filter for cases that truly need intervention
・Socio-cognitive probing: vary each scenario across five independent dimensions (strategic posture, party composition, conversation-history length, emotional reactivity, cultural identity) to pinpoint capability gaps
・Topic-localized evaluation: instead of scoring every topic at every turn, rate only the turns where a topic is actively discussed, reducing noise
It spans eight domains: transactional, health, environmental, B2B, policy, international, legal, and intra-organizational.

🌍 Use Cases / Experimental Results
The results were sobering and revealing.
・The evaluator reached r=0.82 alignment with human experts (trajectory level), more than doubling baseline performance
・Among eight frontier LLMs, even the best, GPT-5.4-mini, closed only about 34.4% of the consensus gap (all-mediator average 25.9%)
・Big domain spread: 41.3% improvement in transactional disputes versus just 16.6% in intra-organizational ones

The key takeaway: meaningful progress needs better social adaptation to diverse conditions, not just general capability gains.

#LLMEvaluation #AIMediation
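
To make the topic-localized evaluation idea concrete, here is a minimal Python sketch. It is not the SoCRATES implementation: the data layout, the rating scale, the numbers, and the function names `score_all_turns` and `score_topic_localized` are all assumptions made up for illustration. It only contrasts averaging a topic's rating over every turn with averaging over the turns where that topic is actively discussed.

```python
from statistics import mean

# Hypothetical transcript: for each turn, the topics actively discussed and
# a per-topic rating from some judge. All values are invented for illustration.
turns = [
    {"active": {"price"}, "ratings": {"price": 4, "delivery": 1}},
    {"active": {"delivery"}, "ratings": {"price": 1, "delivery": 3}},
    {"active": {"price", "delivery"}, "ratings": {"price": 5, "delivery": 4}},
]

def score_all_turns(turns, topic):
    """Baseline: average the topic's rating over every turn, relevant or not."""
    return mean(t["ratings"][topic] for t in turns)

def score_topic_localized(turns, topic):
    """Average only over turns where the topic is actively discussed.

    Returns None when the topic never comes up, so it is not scored at all.
    """
    relevant = [t["ratings"][topic] for t in turns if topic in t["active"]]
    return mean(relevant) if relevant else None

for topic in ("price", "delivery"):
    print(topic, score_all_turns(turns, topic), score_topic_localized(turns, topic))
```

In this toy data the low ratings on off-topic turns drag the all-turns average down, while the localized score reflects only the turns where the mediator actually had the topic in front of it.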