cv usk(@cv_usk):
🗺️ Even frontier GPT-5 succeeds on just 14.4% of real-world spatial tasks. A new benchmark goes beyond staring at a static image and exposes how weak AI agents still are at active spatial reasoning.

Title: SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
URL: https://t.co/eKeN8bSua8

📝 Overview
SpatialWorld measures whether multimodal LLMs can solve tasks by actively exploring 3D environments from a vision-only, egocentric viewpoint. It unifies eight different simulators across indoor, outdoor, and digital-game settings under a shared protocol, and evaluates 15 frontier models on 760 human-annotated tasks. The agent gets no prior map and no reference solution; it has to look, move, and decide on its own.

❓ Challenges Solved
Prior spatial-reasoning benchmarks relied on passive evaluation via static VQA or pre-recorded video. That can't capture the interactive spatial understanding the real world demands, where an agent must move its own viewpoint to gather visual evidence and replan on the fly under partial observability. There was a large gap between recognizing a static scene and actually moving through an unfamiliar space to get a task done.
💡 Methodology & Proposed Approach
・The task is framed as a vision-only POMDP (Partially Observable Markov Decision Process)
・The agent receives only a natural-language goal and a single native-resolution egocentric RGB image, with no depth, maps, or semantic metadata
・Actions are issued through a high-level text interface covering navigation, viewpoint control, object interaction, and task completion
・It integrates eight backends: indoor (AI2-THOR, ProcTHOR, VirtualHome), outdoor (CARLA, EmbodiedCity), and digital games (Block3D, Snake3D, Rubik's Cube)
・Success is judged by whether the final terminal state satisfies the goal, not by matching the trajectory, and is validated by human annotators
・Beyond success rate, it measures step efficiency against human reference trajectories to surface inefficient behavior

🎯 Use Cases
It offers a unified, fair way to evaluate the spatial abilities of home robots and autonomous agents before real-world deployment. It can systematically diagnose where long-horizon tasks that combine navigation and manipulation break down, serving as a rigorous testbed for improving spatial-reasoning models.

📊 Experimental Results
・Across 15 frontier models, physical-task success was 14.4% for GPT-5, 12.2% for Qwen-3.5-397B, 9.2% for Gemini-3.1-Pro, and 9.2% for Kimi-K2.5
・On digital games, Gemini-3.1-Pro led at 39.0%, followed by GPT-5 at 36.4%
・By complexity, interaction-only tasks averaged 50.2%, navigation-only dropped to 8.6%, and combined navigation-and-interaction collapsed to just 4.2%
・Models with similar success rates showed very different efficiency scores, revealing heavy reliance on trial-and-error exploration
・Model rankings shifted dramatically across environments, with no single model dominating every category

#AIAgents #SpatialReasoning
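
To make the evaluation protocol concrete, here is a minimal Python sketch of an observe-act loop scored by terminal state and step efficiency. Everything in it is an assumption for illustration: `ToyCorridor`, the action strings, and the efficiency formula (human reference steps divided by agent steps, capped at 1) are hypothetical stand-ins, not SpatialWorld's actual interface or metric, which the post does not spell out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Episode:
    success: bool
    steps: int


class ToyCorridor:
    """Hypothetical stand-in for a simulator backend: a 1-D corridor."""

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.pos = 0

    def observe(self) -> str:
        # A real backend would return an egocentric RGB frame; a string stands in here.
        return f"frame@{self.pos}"

    def step(self, action: str) -> None:
        # Hypothetical high-level text actions; unknown actions are ignored.
        if action == "move_forward":
            self.pos = min(self.pos + 1, self.length)
        elif action == "move_back":
            self.pos = max(self.pos - 1, 0)

    def goal_satisfied(self) -> bool:
        return self.pos == self.length


Policy = Callable[[str, str], str]


def run_episode(env: ToyCorridor, policy: Policy, goal: str, max_steps: int = 50) -> Episode:
    """Observe, act, repeat; success depends only on the terminal state."""
    steps = 0
    while steps < max_steps:
        action = policy(goal, env.observe())
        if action == "done":
            break
        env.step(action)
        steps += 1
    return Episode(success=env.goal_satisfied(), steps=steps)


def step_efficiency(ep: Episode, human_steps: int) -> float:
    """Assumed formula: human steps / agent steps, capped at 1, and 0 on failure."""
    if not ep.success or ep.steps == 0:
        return 0.0
    return min(1.0, human_steps / ep.steps)


def direct_policy(goal: str, obs: str) -> str:
    return "done" if obs == "frame@5" else "move_forward"


def make_wandering_policy() -> Policy:
    calls = 0

    def policy(goal: str, obs: str) -> str:
        nonlocal calls
        if obs == "frame@5":
            return "done"
        calls += 1
        # Every third action backtracks, mimicking trial-and-error exploration.
        return "move_back" if calls % 3 == 0 else "move_forward"

    return policy


direct = run_episode(ToyCorridor(), direct_policy, "reach the end of the corridor")
wander = run_episode(ToyCorridor(), make_wandering_policy(), "reach the end of the corridor")
print(direct, step_efficiency(direct, human_steps=5))
print(wander, step_efficiency(wander, human_steps=5))
```

Both toy policies reach the goal, so they tie on success rate, but the wandering one needs more than twice as many steps. That is the kind of gap an efficiency score can surface when success rates look similar.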

2026.06.12 01:38

