🕶️ Walk through a first-person world with your own body motion, and explicitly specify what exists at a given location with an image and pose, including how it evolves over time. Meet AnchorWorld, an embodied egocentric world model.
Title: AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization
URL:
📝 Overview
AnchorWorld generates first-person video controlled by full-body human motion. With "anchor views," it lets you explicitly specify what exists at a given 3D location and how it changes over time.
❓ Challenges Solved
Existing world models struggle to learn full-body motion control from egocentric video alone, and they define the environment only implicitly. As a result, they lack both natural embodied control and localized world customization.
💡 Methodology & Proposed Approach
・Since most of the body is invisible in first person, it uses third-person video as auxiliary supervision to learn body-environment positioning
・An anchor has three parts: an RGB image, a 6-DoF viewpoint pose, and an evolution prompt, which together specify local appearance and how it changes over time
・3D RoPE spatially distinguishes multiple anchors, and masked cross-attention enables anchor-specific text control
・It trains in four stages (third-person, first-person, static anchors, dynamic evolution), built on Wan 2.2 TI2V 5B
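To make the anchor idea concrete, here is a minimal Python sketch of an anchor record and of one plausible way to build an anchor-specific attention mask. It is an illustration under assumptions, not the paper's implementation: the `Anchor` class, the pose layout, the function names, and the block-diagonal masking scheme (each anchor's visual tokens attend only to the tokens of its own evolution prompt) are all hypothetical. The 3D RoPE component is not shown.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Anchor:
    """Hypothetical container for the three anchor parts named above."""
    image: np.ndarray        # RGB image, shape (H, W, 3)
    pose: np.ndarray         # 6-DoF pose; assumed layout [x, y, z, roll, pitch, yaw]
    evolution_prompt: str    # text describing how this view changes over time


def build_cross_attention_mask(anchor_token_counts, text_token_counts):
    """Return a boolean mask of shape (total_anchor_tokens, total_text_tokens).

    Entry (i, j) is True when visual token i and text token j belong to the
    same anchor. This block-diagonal layout is an assumption.
    """
    n_q = sum(anchor_token_counts)
    n_k = sum(text_token_counts)
    mask = np.zeros((n_q, n_k), dtype=bool)
    q0 = k0 = 0
    for nq, nk in zip(anchor_token_counts, text_token_counts):
        mask[q0:q0 + nq, k0:k0 + nk] = True
        q0 += nq
        k0 += nk
    return mask


def masked_cross_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with disallowed pairs masked out.

    Assumes every query row has at least one allowed key.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With two anchors of 2 and 3 visual tokens and prompts of 4 and 1 text tokens, `build_cross_attention_mask([2, 3], [4, 1])` yields a 5×5 mask in which the second anchor's tokens can see only the single token of the second prompt.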
🎯 Use Cases
It applies to embodied VR apps, first-person game environment design, embodied-AI training scenarios, and interactive video generation with localized control.
📊 Experimental Results
・On egocentric static scenes it reaches CLIP-V 0.885 and a camera-accuracy ATE of 0.112 m, outperforming PlayerOne and other baselines
・On egocentric dynamic scenes, text alignment (VideoAlign-TA) is 0.717, far above CaM-Ego's 0.385
・It generalizes well to out-of-distribution UE and real-world scenes, even when the initial view and the anchors share little visual overlap
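For readers unfamiliar with the camera-accuracy metric above, the sketch below shows one common way to compute ATE (absolute trajectory error) as an RMSE over camera positions after rigid alignment. This is an assumption about the metric's usual definition: the summary does not say which alignment or protocol the authors used, and the function names here are hypothetical.

```python
import numpy as np


def align_rigid(est, gt):
    """Kabsch alignment: rotation R and translation t minimizing ||R @ est_i + t - gt_i||.

    est, gt: arrays of shape (N, 3) holding camera positions in metres.
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t


def ate_rmse(est, gt):
    """Root-mean-square position error after rigid alignment (no scale correction)."""
    R, t = align_rigid(est, gt)
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Under this definition, an estimated trajectory that differs from the ground truth only by a global rotation and translation scores an ATE of approximately zero, so the metric measures the shape of the camera path rather than its coordinate frame.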
#WorldModel #EmbodiedAI