cv usk @cv_usk

2026.06.16 21:39

🧊 Turning one image into 3D used to force a choice: "accurate on the visible surface but no backside" or "complete but misaligned with the input." World Tracing stacks 3D points per pixel into layers, capturing visible and hidden surfaces at once.

Title: World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible
URL:

🔍 Overview
World Tracing represents geometry as an ordered stack of L camera-space 3D points per pixel. Layer 0 is the visible surface, deeper layers record front-to-back intersections with surfaces hidden behind the foreground, unifying faithful reconstruction and generative completion as one layered problem.

❓ Challenges Solved
Image-to-3D carried a fundamental trade-off.
・Depth estimators are pixel-accurate but stop at the visible surface
・Generative 3D models are complete but work in canonical frames, so they misalign with the input
World Tracing frames this as faithful generation: accurately reconstruct the visible surface while plausibly generating the invisible.

💡 Methodology & Proposed Approach
At its core is WT-DiT, a 1.7B-parameter diffusion transformer.
・Three-way factorized attention (layer-wise, ray-wise, global) preserves depth ordering and front-to-back coherence
・A mixed noise schedule handles the asymmetry between layer 0 (image-constrained, reconstruction-like) and deeper, generative layers by varying noise per layer
・Mix-training lets multilayer (3D assets) and single-layer (RGBD photos) supervision train together

🎯 Use Cases
・Text-driven 3D scene editing (training-free closed-form compositing thanks to pixel alignment)
・Geometry-conditioned novel-view video synthesis using complete hidden geometry as memory
・A TRELLIS hybrid that yields faithful meshes which reproject correctly to the input

📊 Experimental Results
It outperforms prior work on object, scene, and dynamic benchmarks.
・Object visible-depth MAE 0.0149 (VGGT 0.0257)
・Complete-shape F-score@0.05 0.549 (TRELLIS 0.204)
・Scene MAE 0.0102, and best dynamic-clip Chamfer L2 at 0.0105

#3DGeneration# #ComputerVision#
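
🧪 A toy sketch of the layered representation
To make "an ordered stack of L camera-space 3D points per pixel" concrete, here is a minimal sketch. It is not WT-DiT, which generates the layers with a diffusion transformer. It only illustrates the target data structure by tracing one pixel's ray through analytic spheres and keeping the first L intersections front to back. The pinhole intrinsics, the sphere scene, the function names (pixel_ray, sphere_hits, trace_layers), and the None padding for missing layers are all assumptions made for illustration.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction in camera space for pixel (u, v), pinhole model (assumed)."""
    d = ((u - cx) / fx, (v - cy) / fy, 1.0)
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)

def sphere_hits(direction, center, radius):
    """Distances t > 0 where the ray t * direction (from the camera origin) meets a sphere."""
    # Solve |t * d - center|^2 = radius^2 for a unit direction d.
    b = sum(d * c for d, c in zip(direction, center))
    c = sum(x * x for x in center) - radius * radius
    disc = b * b - c
    if disc < 0:
        return []  # the ray misses this sphere
    root = math.sqrt(disc)
    return [t for t in (b - root, b + root) if t > 0]

def trace_layers(u, v, spheres, intrinsics, num_layers):
    """Return num_layers camera-space points for one pixel, ordered front to back.

    Layer 0 is the visible surface; deeper layers are hidden intersections.
    Missing layers are padded with None (a hypothetical convention).
    """
    d = pixel_ray(u, v, *intrinsics)
    ts = sorted(t for center, radius in spheres for t in sphere_hits(d, center, radius))
    points = [tuple(t * c for c in d) for t in ts[:num_layers]]
    return points + [None] * (num_layers - len(points))

# Toy scene: a unit sphere in front of a smaller one, both on the optical axis.
intrinsics = (100.0, 100.0, 32.0, 32.0)  # fx, fy, cx, cy (assumed values)
spheres = [((0.0, 0.0, 3.0), 1.0), ((0.0, 0.0, 6.0), 0.5)]

layers = trace_layers(32, 32, spheres, intrinsics, num_layers=3)
print([p[2] for p in layers])  # z per layer: [2.0, 4.0, 5.5]
```

Because every layer lives on the same camera ray as its pixel, the points stay aligned with the input image by construction. A depth estimator in this picture would return only layer 0.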
