Can an AI actually mediate a conflict between people? 🤝 This benchmark tries to measure exactly that, reliably and under realistic conditions.
Title: SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
URL:
🤝 Overview
This work proposes SoCRATES, a comprehensive benchmark for evaluating LLMs as mediators. An agentic pipeline builds realistic conflict scenarios across eight domains from actual public disputes, enabling automated and reliable evaluation of proactive LLM mediation.
❓ Challenges Solved
Using LLMs to guide disputing parties toward agreement is gaining attention, but evaluating it is hard.
・Real conflicts shift constantly as disputants' emotions, intentions, and context change mid-mediation
・Existing benchmarks rely on a limited set of expert-authored scenarios
・They also score every turn against every topic, injecting noise that muddies the evaluation signal
💡 Methodology & Proposed Approach
SoCRATES combines three components.
・Agentic scenario curation: agents find genuine public disputes, restructure them into mediation scenarios, and filter for cases that truly need intervention
・Socio-cognitive probing: vary each scenario across five independent dimensions (strategic posture, party composition, conversation-history length, emotional reactivity, cultural identity) to pinpoint capability gaps
・Topic-localized evaluation: instead of scoring every topic at every turn, rate only the turns where a topic is actively discussed, reducing noise
It spans eight domains: transactional, health, environmental, B2B, policy, international, legal, and intra-organizational.
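To make the topic-localized idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the rating scale, the `ratings` and `active` data, and both function names are hypothetical, and the per-topic average is just one simple way to aggregate. It only shows how restricting scores to the turns where a topic is actually discussed changes the result.

```python
from statistics import mean

# Hypothetical per-turn ratings: one dict per turn, topic -> score in [0, 1].
ratings = [
    {"price": 0.2, "delivery": 0.5},
    {"price": 0.4, "delivery": 0.5},
    {"price": 0.6, "delivery": 0.5},
    {"price": 0.6, "delivery": 0.9},
]

# Hypothetical activity labels: the topics under discussion at each turn.
active = [{"price"}, {"price"}, {"price"}, {"delivery"}]


def score_all_turns(ratings):
    """Baseline: average every topic over every turn, active or not."""
    topics = sorted({topic for turn in ratings for topic in turn})
    return {t: mean(turn[t] for turn in ratings if t in turn) for t in topics}


def score_localized(ratings, active):
    """Average each topic only over the turns where it is actively discussed."""
    topics = sorted({topic for turn in ratings for topic in turn})
    scores = {}
    for t in topics:
        values = [
            turn[t]
            for turn, act in zip(ratings, active)
            if t in act and t in turn
        ]
        if values:  # skip topics that were never actively discussed
            scores[t] = mean(values)
    return scores


print(score_all_turns(ratings))
print(score_localized(ratings, active))
```

In this toy data, "delivery" is rated at a flat 0.5 for three turns in which nobody discusses it, so the all-turns average dilutes the one turn where it actually moved. The localized score uses only that turn.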
🌍 Use Cases / Experimental Results
The results were sobering and revealing.
・The automated evaluator reached a correlation of r=0.82 with human experts at the trajectory level, more than double the baseline
・Among eight frontier LLMs, even the best, GPT-5.4-mini, closed only about 34.4% of the consensus gap (all-mediator average 25.9%)
・Big domain spread: 41.3% improvement in transactional disputes versus just 16.6% in intra-organizational ones
The key takeaway: meaningful progress needs better social adaptation to diverse conditions, not just general capability gains.
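One plausible way to read the "consensus gap closed" percentages is as a normalized gain: how much of the distance between the starting level of agreement and full consensus a mediator recovers. The post does not give the paper's exact formula, so the definition below, the function name `gap_closed`, and the example numbers are all assumptions for illustration only.

```python
def gap_closed(initial: float, final: float, full: float = 1.0) -> float:
    """Fraction of the remaining gap to full consensus that was closed.

    Assumed definition (not taken from the paper):
        (final - initial) / (full - initial)
    """
    gap = full - initial
    if gap <= 0:
        raise ValueError("initial consensus must be below full consensus")
    return (final - initial) / gap


# Made-up numbers: consensus starts at 0.50 and ends at 0.67 after mediation.
print(f"{gap_closed(0.50, 0.67):.1%}")
```

Under this reading, a score of 34.4% would mean the parties ended roughly a third of the way from where they started toward full agreement.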
#LLMEvaluation #AIMediation