Can an AI actually mediate a conflict between people? 🤝 A benchmark that tries to measure that, reliably, under realistic conditions.

Title: SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
URL:

🤝 Overview
This work proposes SoCRATES, a comprehensive benchmark for evaluating LLMs as mediators. An agentic pipeline builds realistic conflict scenarios across eight domains from actual public disputes, enabling automated and reliable evaluation of proactive LLM mediation.

❓ Challenges Solved
Using LLMs to guide disputing parties toward agreement is gaining attention, but evaluating it is hard.
・Real conflicts shift constantly as disputants' emotions, intentions, and context change mid-mediation
・Existing benchmarks rely on a limited set of expert-authored scenarios
・They also score every turn against every topic, injecting noise that muddies the evaluation signal

💡 Methodology & Proposed Approach
SoCRATES integrates three approaches.
・Agentic scenario curation: agents find genuine public disputes, restructure them into mediation scenarios, and filter for cases that truly need intervention
・Socio-cognitive probing: vary each scenario across five independent dimensions (strategic posture, party composition, conversation-history length, emotional reactivity, cultural identity) to pinpoint capability gaps
・Topic-localized evaluation: instead of scoring every topic at every turn, rate only the turns where a topic is actively discussed, reducing noise
It spans eight domains: transactional, health, environmental, B2B, policy, international, legal, and intra-organizational.

🌍 Use Cases / Experimental Results
The results were sober and revealing.
・The evaluator reached r=0.82 alignment with human experts (trajectory level), more than doubling baseline performance
・Among eight frontier LLMs, even the best, GPT-5.4-mini, closed only about 34.4% of the consensus gap (all-mediator average 25.9%)
・Big domain spread: 41.3% improvement in transactional disputes versus just 16.6% in intra-organizational ones

The key takeaway: meaningful progress needs better social adaptation to diverse conditions, not just general capability gains.

#LLMEvaluation# #AIMediation#
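
To make the topic-localized evaluation idea concrete, here is a minimal Python sketch. It is not the SoCRATES implementation: the function names, the 1-5 score scale, the topic labels, and the toy data are all assumptions made for illustration. It only contrasts averaging every (turn, topic) score with averaging the scores for turns where a topic is actually being discussed.

```python
from statistics import mean

def score_all_turns(scores):
    """Baseline: average every (turn, topic) score, discussed or not."""
    return mean(s for turn in scores for s in turn.values())

def score_topic_localized(scores, active_topics):
    """Average only the (turn, topic) pairs where the topic is active.

    `scores` is a list of dicts (one per turn) mapping topic -> score.
    `active_topics` is a parallel list of sets naming the topics under
    discussion in each turn. Both structures are hypothetical.
    """
    kept = [
        turn[topic]
        for turn, active in zip(scores, active_topics)
        for topic in active
        if topic in turn
    ]
    return mean(kept) if kept else None

# Toy data (invented): three mediation turns, two topics, scores on 1-5.
scores = [
    {"price": 4, "delivery": 1},  # delivery not discussed yet
    {"price": 5, "delivery": 1},
    {"price": 1, "delivery": 3},  # price no longer discussed
]
active_topics = [{"price"}, {"price"}, {"delivery"}]

baseline = score_all_turns(scores)
localized = score_topic_localized(scores, active_topics)
print(baseline, localized)
```

In this toy case the low scores for topics that were not on the table pull the all-turns average down, while the localized average reflects only the turns where each topic was actually mediated. How SoCRATES decides which topics are active in a turn is not described in the post, so the `active_topics` input is simply given here.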