cv usk(@cv_usk):
Interactive video world models that generate footage as you control them: here's a unified benchmark that finally measures them fairly 🎮

Title: WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
URL: https://t.co/XCR90VepJP

🎮 Overview
WBench is a unified framework for comprehensively evaluating interactive video world models. With 289 test cases and 1,058 interaction turns, it unifies text, 6-DoF pose, and discrete-action control so models with different native inputs can be compared on equal footing.

❓ Challenges Solved
Interactive world models are advancing fast, but there was no comprehensive standard to assess them. Existing benchmarks only partially covered the needed competencies, and differing input interfaces made apples-to-apples comparison hard.

💡 Methodology & Proposed Approach
Evaluation spans five core dimensions.
・Video quality
・Setting adherence
・Interaction adherence
・Consistency
・Physics compliance
Tasks cover navigation, subject action, event editing, and perspective switching. It uses 22 automatic sub-metrics combining specialist vision models with large multimodal models, all validated against human judgments.

📊 Experimental Results
Analyzing 20 state-of-the-art models revealed that no single model performs strongly across all dimensions, exposing characteristic strengths, weaknesses, and persistent challenges across approaches.

#WorldModels #Benchmark

2026.06.14 09:22
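
To make the "unified control" idea in the post concrete, here is a minimal sketch of how a multi-turn test case mixing text, 6-DoF pose, and discrete-action inputs could be represented. Every name here (`TextControl`, `PoseControl`, `DiscreteControl`, `BenchCase`, `mean_turns_per_case`) is hypothetical and chosen for illustration only; the post does not describe WBench's actual data format or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class TextControl:
    """Hypothetical: a free-form text instruction for one turn."""
    prompt: str


@dataclass(frozen=True)
class PoseControl:
    """Hypothetical: a 6-DoF camera pose (3 translation + 3 rotation values)."""
    translation: Tuple[float, float, float]
    rotation: Tuple[float, float, float]


@dataclass(frozen=True)
class DiscreteControl:
    """Hypothetical: a named discrete action such as "forward"."""
    action: str


# One interaction turn carries exactly one of the three control types.
Control = Union[TextControl, PoseControl, DiscreteControl]


@dataclass
class BenchCase:
    """Hypothetical multi-turn test case; `task` is one of the four task
    families named in the post (e.g. "navigation")."""
    case_id: str
    task: str
    turns: List[Control] = field(default_factory=list)


def mean_turns_per_case(cases: List[BenchCase]) -> float:
    """Average number of interaction turns across the given cases."""
    return sum(len(c.turns) for c in cases) / len(cases)


demo_cases = [
    BenchCase("demo-1", "navigation", [
        DiscreteControl("forward"),
        PoseControl((0.0, 0.0, 1.0), (0.0, 0.0, 0.0)),
    ]),
    BenchCase("demo-2", "event editing", [TextControl("make it rain")]),
]
print(mean_turns_per_case(demo_cases))  # 1.5

# The totals reported in the post imply roughly 3.66 turns per case.
print(round(1058 / 289, 2))  # 3.66
```

Wrapping each control type in its own small class is just one possible design; it lets a single list of turns hold mixed modalities, which is the property the post highlights.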
