cv usk (@cv_usk): 🕶️ Walk through a first-person world with your own body motion, and explicitly specify what exists at a given location with an image and pose, including how it evolves over time. Meet AnchorWorld, an embodied egocentric world model.

Title: AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization
URL: https://t.co/F36wEct9qj

📝 Overview
AnchorWorld generates first-person video controlled by full-body human motion. With "anchor views," it lets you explicitly specify what exists at a given 3D location and how it changes over time.

❓ Challenges Solved
Existing world models struggle to supervise full-body motion from egocentric video alone, and they define environments only implicitly. As a result, they lack both natural embodied control and localized world customization.

💡 Methodology & Proposed Approach
・Since most of the body is invisible in first person, it uses third-person video as auxiliary supervision to learn body-environment positioning
・An anchor has three parts: an RGB image, a 6-DoF viewpoint pose, and an evolution prompt, which together specify local appearance and temporal change
・3D RoPE spatially distinguishes multiple anchors, and masked cross-attention enables anchor-specific text control
・It trains in four stages (third-person, first-person, static anchors, dynamic evolution), built on Wan 2.2 TI2V 5B

🎯 Use Cases
It applies to embodied VR apps, first-person game environment design, embodied-AI training scenarios, and interactive video generation with localized control.

📊 Experimental Results
・On egocentric static scenes it reaches CLIP-V 0.885 and camera accuracy ATE 0.112 m, beating PlayerOne and others
・On egocentric dynamic scenes, text alignment (VideoAlign-TA) is 0.717, far above CaM-Ego's 0.385
・It generalizes strongly to out-of-distribution UE and real-world scenes with little visual overlap between the initial view and anchors

#WorldModel #EmbodiedAI
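
To make the anchor idea from the methodology bullets more concrete, here is a rough Python sketch of an anchor (RGB image, 6-DoF pose, evolution prompt) and of a mask that lets each anchor's visual tokens attend only to its own prompt tokens. This is an illustration under assumptions, not the paper's implementation: the `Anchor` class, the pose layout, and the functions `build_mask` and `masked_cross_attention` are hypothetical names, and the block-diagonal mask is just one plausible reading of "masked cross-attention".

```python
import math
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple


@dataclass
class Anchor:
    """Hypothetical container for the three anchor parts named in the post."""
    image: Any  # RGB image, e.g. an H x W x 3 array
    pose: Tuple[float, float, float, float, float, float]  # assumed layout: x, y, z, roll, pitch, yaw
    evolution_prompt: str  # text describing how the local scene changes over time


def build_mask(query_counts: Sequence[int], prompt_counts: Sequence[int]) -> List[List[bool]]:
    """Block-diagonal mask: mask[q][k] is True when visual token q and text
    token k belong to the same anchor (an assumed masking scheme)."""
    assert len(query_counts) == len(prompt_counts)
    assert all(n > 0 for n in query_counts) and all(n > 0 for n in prompt_counts)
    n_k = sum(prompt_counts)
    mask: List[List[bool]] = []
    k_start = 0
    for n_q, n_p in zip(query_counts, prompt_counts):
        row = [k_start <= k < k_start + n_p for k in range(n_k)]
        mask.extend([list(row) for _ in range(n_q)])
        k_start += n_p
    return mask


def masked_cross_attention(q, k, v, mask):
    """Single-head scaled dot-product cross-attention with a boolean mask.
    Masked-out keys get zero weight; each query needs at least one allowed key."""
    scale = math.sqrt(len(q[0]))
    out = []
    for q_i, row in zip(q, mask):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k]
        top = max(s for s, ok in zip(scores, row) if ok)  # for numerical stability
        weights = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, row)]
        total = sum(weights)
        out.append([
            sum(w * v_j[d] for w, v_j in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out
```

With two anchors contributing 2 and 3 visual tokens and prompts of 1 and 2 text tokens, `build_mask([2, 3], [1, 2])` yields a 5 x 3 mask in which the first two rows can see only the first prompt token and the last three rows only the other two.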

2026.06.13 11:29
