Interactive video world models generate footage as you control them. Here's a unified benchmark that finally measures them fairly 🎮
Title: WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
URL:
🎮 Overview
WBench is a unified framework for comprehensively evaluating interactive video world models. With 289 test cases and 1,058 interaction turns, it unifies text, 6-DoF pose, and discrete-action control so models with different native inputs can be compared on equal footing.
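To make the "equal footing" idea concrete, here is a minimal sketch of one way a single interaction turn could carry all three control forms, so each model is fed whichever one it natively accepts. The summary does not describe WBench's actual data format, so everything here is an assumption for illustration: the `Turn` schema, the `ACTION_TABLE` values, and the `make_turn` and `control_for` helpers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# (dx, dy, dz, roll, pitch, yaw): a 6-DoF camera delta. Units are an assumption.
Pose = Tuple[float, float, float, float, float, float]


@dataclass(frozen=True)
class Turn:
    """One interaction turn stored in all three control forms (hypothetical schema)."""
    text: str          # natural-language instruction
    pose_delta: Pose   # equivalent 6-DoF pose change
    action: str        # equivalent discrete action label


# Hypothetical mapping from a discrete action to its text and pose equivalents.
ACTION_TABLE = {
    "forward": ("move forward", (0.0, 0.0, 1.0, 0.0, 0.0, 0.0)),
    "turn_left": ("turn left", (0.0, 0.0, 0.0, 0.0, 0.0, 15.0)),
}


def make_turn(action: str) -> Turn:
    """Build a turn whose three control forms describe the same motion."""
    text, pose = ACTION_TABLE[action]
    return Turn(text=text, pose_delta=pose, action=action)


def control_for(turn: Turn, interface: str) -> Union[str, Pose]:
    """Return the representation matching a model's native input interface."""
    if interface == "text":
        return turn.text
    if interface == "pose":
        return turn.pose_delta
    if interface == "action":
        return turn.action
    raise ValueError(f"unknown interface: {interface!r}")
```

With a structure like this, a text-conditioned model and a pose-conditioned model would receive the same underlying instruction for a given turn, which is what makes their outputs comparable.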
❓ Challenges Solved
Interactive world models are advancing fast, but there was no comprehensive standard to assess them. Existing benchmarks only partially covered the needed competencies, and differing input interfaces made apples-to-apples comparison hard.
💡 Methodology & Proposed Approach
Evaluation spans five core dimensions.
・Video quality
・Setting adherence
・Interaction adherence
・Consistency
・Physics compliance
Tasks cover navigation, subject action, event editing, and perspective switching. WBench scores models with 22 automatic sub-metrics that combine specialist vision models with large multimodal models, all validated against human judgments.
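Validating an automatic metric against human judgments usually means checking that the two rank the same videos in a similar order. The summary does not say which statistic WBench uses, so the sketch below assumes Spearman rank correlation, a common choice for this kind of check. The function names are illustrative and the inputs are whatever per-video scores you have on hand.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

A value near 1 would mean the automatic sub-metric orders videos almost the same way human raters do. A value near 0 would suggest the sub-metric is not tracking what people perceive.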
📊 Experimental Results
Analyzing 20 state-of-the-art models revealed that no single model performs strongly across all dimensions, exposing characteristic strengths, weaknesses, and persistent challenges across approaches.
#WorldModels #Benchmark