2026.06.16 21:39

🧊 Turning one image into 3D used to force a choice: "accurate on the visible surface but no backside" or "complete but misaligned with the input." World Tracing stacks 3D points per pixel into layers, capturing visible and hidden surfaces at once.

Title: World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible
URL:

🔍 Overview
World Tracing represents geometry as an ordered stack of L camera-space 3D points per pixel. Layer 0 is the visible surface; deeper layers record front-to-back intersections with surfaces hidden behind the foreground. This unifies faithful reconstruction and generative completion as one layered problem.

❓ Challenges Solved
Image-to-3D carried a fundamental trade-off.
・Depth estimators are pixel-accurate but stop at the visible surface
・Generative 3D models are complete but work in canonical frames, so they misalign with the input
World Tracing frames this as faithful generation: accurately reconstruct the visible surface while plausibly generating the invisible.

💡 Methodology & Proposed Approach
At its core is WT-DiT, a 1.7B-parameter diffusion transformer.
・Three-way factorized attention (layer-wise, ray-wise, global) preserves depth ordering and front-to-back coherence
・A mixed noise schedule varies noise per layer to handle the asymmetry between layer 0 (image-constrained, reconstruction-like) and the deeper, generative layers
・Mix-training lets multilayer (3D assets) and single-layer (RGBD photos) supervision train together

🎯 Use Cases
・Text-driven 3D scene editing (training-free closed-form compositing thanks to pixel alignment)
・Geometry-conditioned novel-view video synthesis using complete hidden geometry as memory
・A TRELLIS hybrid that yields faithful meshes which reproject correctly to the input

📊 Experimental Results
It outperforms prior work on object, scene, and dynamic benchmarks.
・Object visible-depth MAE 0.0149 (VGGT 0.0257)
・Complete-shape F-score@0.05 0.549 (TRELLIS 0.204)
・Scene MAE 0.0102, and best dynamic-clip Chamfer L2 at 0.0105

#3DGeneration# #ComputerVision#
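
To make the layered representation concrete, here is a minimal sketch of what "an ordered stack of L camera-space 3D points per pixel" could look like for a single pixel ray. It is an illustration only, not the paper's method: the scene made of spheres, the function name `trace_layers`, and the choice to count both entry and exit hits as layers are all assumptions introduced here.

```python
import math

def trace_layers(ray_dir, spheres, num_layers):
    """Collect up to num_layers hit points along one pixel ray, front to back.

    Hypothetical helper for illustration. The camera sits at the origin,
    ray_dir is a unit vector, and spheres is a list of (center, radius).
    Missing deeper layers are padded with None.
    """
    hits = []
    for center, radius in spheres:
        # Solve |t * d - c|^2 = r^2 for t, with d a unit direction.
        b = sum(d * c for d, c in zip(ray_dir, center))
        c2 = sum(c * c for c in center) - radius * radius
        disc = b * b - c2
        if disc < 0:
            continue  # the ray misses this sphere
        root = math.sqrt(disc)
        for t in (b - root, b + root):
            if t > 0:  # keep only hits in front of the camera
                hits.append(t)
    hits.sort()  # front-to-back ordering along the ray
    points = [tuple(t * d for d in ray_dir) for t in hits[:num_layers]]
    return points + [None] * (num_layers - len(points))

# One ray looking down +z through two spheres, one hidden behind the other.
spheres = [((0.0, 0.0, 5.0), 1.0), ((0.0, 0.0, 9.0), 1.0)]
layers = trace_layers((0.0, 0.0, 1.0), spheres, num_layers=5)
for index, point in enumerate(layers):
    print(index, point)
```

In this toy setup, layer 0 is the point a depth estimator would see, and the deeper entries are the hidden intersections that World Tracing is described as generating.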

cv usk@cv_usk

2026.06.15 08:51

🪑 Insert an object into an image while specifying its exact 3D orientation and position. DIRECT tackles the 3D pose control that text leaves ambiguous and parametric methods struggle with, by decomposing visual proxies.

Title: Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
URL:

📝 Overview
DIRECT is a diffusion-based method that inserts a reference object into an image with explicit control over its 3D pose and position. It decomposes the insertion condition into geometry, appearance, and context, injected through independent pathways.

❓ Challenges Solved
Existing insertion methods formulate the task as 2D inpainting and can't control 3D pose. Text guidance is spatially ambiguous, and parametric 3D methods can't translate abstract parameters into correct geometric projections.

💡 Methodology & Proposed Approach
・A user-manipulated 3D proxy rendered at the target pose provides geometry guidance
・Appearance (the reference's high-fidelity look) and context (background semantics) are injected independently via separate LoRA adapters and positional embeddings to avoid feature entanglement
・TRELLIS lifts the image into a coarse 3D shape, refined with VGGT and 3D Gaussian Splatting
・Built on FLUX.1-Fill, it uses shape-decomposed mask augmentation and progressive-resolution training to avoid overfitting

🎯 Use Cases
It fits virtual staging, e-commerce product photography, creative work needing precise spatial control, and photorealistic AR/VR content generation.

📊 Experimental Results
・On the FLUX backbone it reaches PSNR 23.09, LPIPS 0.147, and matching error 17.8, beating baselines on all metrics
・It stays stable across large 0-180 degree pose changes and preserves fine details even under 3D-reconstruction degradation
・Hybrid-data training raised CLIP-I from 0.904 to 0.943
・For symmetric object orientation, RGB geometry guidance outperformed normal maps

#3DGeneration# #ImageEditing#
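
The idea of a 3D proxy "rendered at the target pose" can be sketched as a plain pinhole projection: rotate the proxy's vertices, move them in front of the camera, and project to pixels. This is only a rough illustration of why a rendered proxy pins down the geometry; the function name `project_proxy`, the yaw-only rotation, and the intrinsics (`focal`, `cx`, `cy`) are assumptions made here and do not come from the DIRECT paper.

```python
import math

def project_proxy(vertices, yaw_deg, translation, focal=500.0, cx=256.0, cy=256.0):
    """Project proxy vertices to pixel coordinates at a given pose.

    Hypothetical helper for illustration. The camera sits at the origin and
    looks along +z; the proxy is rotated about the vertical (y) axis by
    yaw_deg and then shifted by translation.
    """
    yaw = math.radians(yaw_deg)
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    pixels = []
    for x, y, z in vertices:
        # Rotate about the y axis, then translate into camera space.
        xc = cos_y * x + sin_y * z + tx
        yc = y + ty
        zc = -sin_y * x + cos_y * z + tz
        if zc <= 0:
            raise ValueError("vertex is behind the camera")
        # Perspective divide followed by the intrinsics.
        pixels.append((focal * xc / zc + cx, focal * yc / zc + cy))
    return pixels

# Corners of a unit cube centered at the origin, used as a stand-in proxy.
cube = [(x, y, z) for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]
for yaw in (0.0, 45.0, 90.0):
    print(yaw, project_proxy(cube, yaw, (0.0, 0.0, 3.0))[:2])
```

Changing `yaw_deg` moves the projected corners in a way a text prompt cannot specify, which is the kind of signal a rendered proxy image carries.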

cv usk@cv_usk

2026.06.13 08:29

🏠 Just specify furniture with text or images, and get a style-consistent 3D indoor scene generated automatically, about 85% faster than MMGDreamer.

Title: FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
URL:

📝 Overview
FlowScene generates high-fidelity 3D indoor scenes from a multimodal scene graph that fuses text and images. It produces layout, shape, and texture in three branches via a straight-line rectified flow, keeping style consistent across the whole scene.

❓ Challenges Solved
Language-driven retrieval methods lack object-level control and style coherence, while graph-based methods struggle with high-quality textures. FlowScene resolves both weaknesses at once.

💡 Methodology & Proposed Approach
・It takes a multimodal graph where nodes fuse text descriptions and image features (text-only, image-only, or mixed)
・An InfoExchangeUnit densely exchanges node information during sampling to satisfy both individual and holistic conditions
・Layout (3D boxes), shape (VQ-VAE latents), and texture (anchored to geometry) are generated by independent denoisers
・Texture is denoised with geometry fixed, so even text-only nodes get style-consistent textures through information exchange

🎯 Use Cases
It fits interactive scene design for interior design and manufacturing, VR/AR content creation, and building simulation environments for robotics.

📊 Experimental Results
・Bedroom FID improves from 42.38 to 35.01, 17.4% better than MMGDreamer
・CLIPScore of 0.2386 is the best of all methods, and users rate style consistency 8.72/10
・Inference without textures takes 6.83s, about 85% faster than MMGDreamer's 45.34s
・Object quality also improves, e.g. a 43.90% better minimum matching distance on nightstands

#3DGeneration# #GenerativeAI#
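
For readers unfamiliar with the "straight-line rectified flow" mentioned above, here is a minimal sketch of the generic formulation: samples are moved from noise to data along straight paths, the training target is the constant velocity along that path, and sampling is a few Euler steps. This is the textbook recipe under simplifying assumptions, not FlowScene's implementation; the function names and the oracle velocity field standing in for a trained denoiser are hypothetical.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t = 0) to data (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(noise, data)]

def velocity_target(noise, data):
    """Regression target for the velocity network: constant along the path."""
    return [b - a for a, b in zip(noise, data)]

def euler_sample(noise, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check: an oracle that already knows the straight-line velocity stands in
# for a trained network (an assumption made purely for illustration).
noise = [0.0, 2.0]
data = [1.0, -2.0]
oracle = lambda x, t: velocity_target(noise, data)
print(euler_sample(noise, oracle, num_steps=4))
```

Because the path is straight, a velocity field that matches the target stays accurate even with very few Euler steps, which is the usual argument for why rectified-flow samplers can be fast.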
