Search results for SpatialReasoning
Tweets including SpatialReasoning
Making AI "reason about space in words" might be backfiring 🧭 Here's a new approach that lets it imagine unseen viewpoints instead.

Title: Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
URL:

🧭 Overview
This work proposes Imaginative Perception Tokens (IPT) to strengthen spatial reasoning in vision language models (VLMs). Rather than forcing spatial logic through language, it keeps "what could be perceived under a different arrangement" as an intermediate perceptual representation.

❓ Challenges Solved
VLMs struggle with spatial reasoning: inferring unobserved viewpoints, reasoning through occluded paths, and integrating partial observations. Prior work pushed this into textual chain-of-thought, but forcing visual reasoning through language alone hit a ceiling.

💡 Methodology & Proposed Approach
・Uses the unified VLM backbone BAGEL, trained with IPT supervision
・Formulates three tasks: Perspective Taking (PET), Path Tracing (PT), Multiview Counting (MVC)
・Builds a ~20,000-example dataset with ground truth, answers, and metrics
The core idea is treating the perception itself ("if I moved here, I'd see this") as an intermediate representation.

📊 Experimental Results
・IPT improved Multiview Counting (MVC) accuracy by 3.4%
・Path Tracing (PT) reached performance competitive with closed-source models
・IPT supervision outperformed textual chain-of-thought training
・Conversely, textual CoT substantially degraded spatial reasoning

#SpatialReasoning# #MultimodalLLM#
🗺️ Even frontier GPT-5 succeeds on just 14.4% of real-world spatial tasks. A new benchmark goes beyond staring at a static image and exposes how weak AI agents still are at active spatial reasoning.

Title: SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
URL:

📝 Overview
SpatialWorld measures whether multimodal LLMs can solve tasks by actively exploring 3D environments from a vision-only, egocentric viewpoint. It unifies eight different simulators across indoor, outdoor, and digital-game settings under a shared protocol, and evaluates 15 frontier models on 760 human-annotated tasks. The agent gets no prior map and no reference solution; it has to look, move, and decide on its own.

❓ Challenges Solved
Prior spatial-reasoning benchmarks relied on passive evaluation via static VQA or pre-recorded video. That can't capture the interactive spatial understanding the real world demands, where an agent must move its own viewpoint to gather visual evidence and replan on the fly under partial observability. There was a large gap between recognizing a static scene and actually moving through an unfamiliar space to get a task done.

💡 Methodology & Proposed Approach
・The task is framed as a vision-only POMDP (Partially Observable Markov Decision Process)
・The agent receives only a natural-language goal and a single native-resolution egocentric RGB image, with no depth, maps, or semantic metadata
・Actions are issued through a high-level text interface covering navigation, viewpoint control, object interaction, and task completion
・It integrates eight backends: indoor (AI2-THOR, ProcTHOR, VirtualHome), outdoor (CARLA, EmbodiedCity), and digital games (Block3D, Snake3D, Rubik's Cube)
・Success is judged by whether the final terminal state satisfies the goal, not by matching the trajectory, and is validated by human annotators
・Beyond success rate, it measures step efficiency against human reference trajectories to surface inefficient behavior

🎯 Use Cases
It offers a unified, fair way to evaluate the spatial abilities of home robots and autonomous agents before real-world deployment. It can systematically diagnose where long-horizon tasks that combine navigation and manipulation break down, serving as a rigorous testbed for improving spatial-reasoning models.

📊 Experimental Results
・Across 15 frontier models, physical-task success was 14.4% for GPT-5, 12.2% for Qwen-3.5-397B, 9.2% for Gemini-3.1-Pro, and 9.2% for Kimi-K2.5
・On digital games, Gemini-3.1-Pro led at 39.0%, followed by GPT-5 at 36.4%
・By complexity, interaction-only tasks averaged 50.2%, navigation-only dropped to 8.6%, and combined navigation-and-interaction collapsed to just 4.2%
・Models with similar success rates showed very different efficiency scores, revealing heavy reliance on trial-and-error exploration
・Model rankings shifted dramatically across environments, with no single model dominating every category

#AIAgents# #SpatialReasoning#
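
To make the evaluation protocol in the SpatialWorld post more concrete, here is a minimal Python sketch of a partially observable episode loop with text actions, terminal-state success, and a step-efficiency score. Everything in it is illustrative: `ToyCorridor`, `run_episode`, `greedy_policy`, and the action strings are hypothetical names, and the efficiency formula (human reference steps divided by agent steps, capped at 1 and set to 0 on failure) is an assumption, since the post does not state how the benchmark computes it.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool
    steps: int


class ToyCorridor:
    """Hypothetical 1-D corridor; the agent observes only its current cell."""

    def __init__(self, length=5, goal=4):
        self.length, self.goal, self.pos = length, goal, 0

    def observe(self):
        # Partial observation: no map, only what is visible at this position.
        return "goal here" if self.pos == self.goal else "empty cell"

    def step(self, action):
        # High-level text actions, clamped to the corridor bounds.
        if action == "move forward":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "move backward":
            self.pos = max(self.pos - 1, 0)

    def goal_satisfied(self):
        return self.pos == self.goal


def run_episode(env, policy, max_steps=20):
    steps = 0
    while steps < max_steps:
        action = policy(env.observe())
        if action == "done":  # the agent declares task completion
            break
        env.step(action)
        steps += 1
    # Success depends only on the final state, not on the path taken.
    return Episode(success=env.goal_satisfied(), steps=steps)


def step_efficiency(episode, human_steps):
    # Assumed metric: 0 on failure, otherwise human/agent step ratio capped at 1.
    if not episode.success:
        return 0.0
    return min(1.0, human_steps / max(episode.steps, 1))


def greedy_policy(observation):
    return "done" if observation == "goal here" else "move forward"


ep = run_episode(ToyCorridor(), greedy_policy)
print(ep, step_efficiency(ep, human_steps=4))
```

Under this assumed metric, two agents that both reach the goal can still score very differently if one wanders, which is the kind of trial-and-error behavior the post says the efficiency score is meant to surface.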