Search ImageEditing on X

33 minutes ago

🖼 Test-time scaling for image editing tends to hand every edit the same compute budget, wasting a lot of it. By allocating budget by difficulty and pruning with edit-specific verification, this work hits up to 2.2x speedup while preserving quality.

Title: From Scale to Speed: Adaptive Test-Time Scaling for Image Editing
URL:

📝 Overview
ADE-CoT is a test-time scaling method tailored to goal-directed image editing. Instead of reusing Image-CoT methods built for text-to-image generation, it combines three strategies, difficulty-aware allocation, edit-specific early verification, and opportunistic stopping, to cut compute substantially while preserving quality.

❓ Challenges Solved
Prior methods had three mismatches.
・Fixed sampling budgets waste compute on easy edits that barely improve
・General MLLM scores wrongly prune about 40% of samples that start low but ultimately score high
・Large-scale sampling produces redundant identical correct outputs, adding needless compute

💡 Methodology & Proposed Approach
・It reads edit difficulty, giving easy edits a minimal budget and expanding the search for hard ones
・A one-step preview estimates clean latents from noisy intermediates without extra denoising, making early verification reliable
・Grounded SAM2 checks that only the intended region changed, and DINOv2 embeddings remove redundant candidates
・It generates candidates sequentially and stops, via depth-first opportunistic stopping, once enough intent-aligned results are found

🎯 Use Cases
It fits complex pose changes, multi-object removal or replacement, fine-grained regional edits, multi-turn editing, and high-quality editing under compute constraints, and is especially valuable where inference cost matters, like a production image-editing API.

📊 Experimental Results
・On GEdit-Bench, FLUX.1 Kontext is 2.2x, BAGEL 1.8x, and Step1X-Edit 2.0x faster than Best-of-N
・Reasoning efficiency more than doubles on a fixed 32-sample budget, and outcome efficiency rises 4.9x, 2.7x, and 2.9x across three benchmarks
・On hard multi-object edits like "remove the person standing next to the lady in white," it fixes the baseline's misidentification

#ImageEditing #DiffusionModels
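The control flow summarized in the ADE-CoT post (difficulty-aware budget, early pruning, redundancy removal, opportunistic stopping) can be sketched roughly as below. This is a toy illustration, not the paper's algorithm: every function name, parameter, and threshold here is hypothetical, and the `preview_score`, `final_score`, and `embed` callables are stand-ins for whatever verifier and embedding model a real system would plug in.

```python
import math
from typing import Any, Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def budget_for(difficulty: float, min_budget: int = 2, max_budget: int = 32) -> int:
    """Hypothetical linear mapping from difficulty in [0, 1] to a sample budget."""
    d = min(max(difficulty, 0.0), 1.0)
    return min_budget + round(d * (max_budget - min_budget))


def adaptive_search(
    generate: Callable[[int], Any],            # produces the i-th candidate edit
    preview_score: Callable[[Any], float],     # cheap early check (assumed in [0, 1])
    final_score: Callable[[Any], float],       # full verification (assumed in [0, 1])
    embed: Callable[[Any], Sequence[float]],   # embedding used for de-duplication
    difficulty: float,
    prune_below: float = 0.3,   # assumed threshold, not from the paper
    accept_at: float = 0.8,     # assumed threshold, not from the paper
    needed: int = 2,            # accepted results required before stopping
    dup_sim: float = 0.95,      # similarity above which a candidate is redundant
) -> Tuple[Optional[Any], int]:
    """Return (best candidate or None, number of candidates generated)."""
    kept: List[Tuple[float, Any]] = []
    kept_embs: List[Sequence[float]] = []
    used = 0
    for i in range(budget_for(difficulty)):
        used += 1
        cand = generate(i)
        # Early verification: drop candidates the cheap preview already rejects.
        if preview_score(cand) < prune_below:
            continue
        # Redundancy removal: skip near-duplicates of candidates already kept.
        emb = embed(cand)
        if any(cosine(emb, k) >= dup_sim for k in kept_embs):
            continue
        kept.append((final_score(cand), cand))
        kept_embs.append(emb)
        # Opportunistic stopping: quit once enough candidates pass verification.
        if sum(1 for s, _ in kept if s >= accept_at) >= needed:
            break
    if not kept:
        return None, used
    return max(kept, key=lambda t: t[0])[1], used
```

With a generator that yields distinct, high-scoring candidates, the loop stops after `needed` samples even when the difficulty-derived budget is much larger, which is where the compute saving in this sketch comes from.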


cv usk@cv_usk

2026.06.15 08:51

🪑 Insert an object into an image while specifying its exact 3D orientation and position. DIRECT solves the 3D-pose control that text leaves ambiguous and parameters struggle with, by decomposing visual proxies.

Title: Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
URL:

📝 Overview
DIRECT is a diffusion-based method that inserts a reference object into an image with explicit control over its 3D pose and position. It decomposes the insertion condition into geometry, appearance, and context, injected through independent pathways.

❓ Challenges Solved
Existing insertion methods formulate the task as 2D inpainting and can't control 3D pose. Text guidance is spatially ambiguous, and parametric 3D methods can't translate abstract parameters into correct geometric projections.

💡 Methodology & Proposed Approach
・A user-manipulated 3D proxy rendered at the target pose provides geometry guidance
・Appearance (the reference's high-fidelity look) and context (background semantics) are injected independently via separate LoRA adapters and positional embeddings to avoid feature entanglement
・TRELLIS lifts the image into a coarse 3D shape, refined with VGGT and 3D Gaussian Splatting
・Built on FLUX.1-Fill, it uses shape-decomposed mask augmentation and progressive-resolution training to avoid overfitting

🎯 Use Cases
It fits virtual staging, e-commerce product photography, creative work needing precise spatial control, and photorealistic AR/VR content generation.

📊 Experimental Results
・On the FLUX backbone it reaches PSNR 23.09, LPIPS 0.147, and matching error 17.8, beating baselines on all metrics
・It stays stable across large 0-180 degree pose changes and preserves fine details even under 3D-reconstruction degradation
・Hybrid-data training raised CLIP-I from 0.904 to 0.943
・For symmetric object orientation, RGB geometry guidance outperformed normal maps

#3DGeneration #ImageEditing
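For readers unfamiliar with the PSNR figure quoted in the DIRECT results, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), over flattened pixel values. The post does not say how the paper computed it (resolution, color space, or whether only the inserted region is scored), so the function name, the flat-list input, and the 8-bit `max_value` default are assumptions for illustration only.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], result: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(result) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - o) ** 2 for r, o in zip(reference, result)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better: each additional 10 dB corresponds to a tenfold reduction in mean squared error against the reference.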


Agnes AI@agnesai_sapiens

2026.05.31 14:17

𝗔𝗴𝗻𝗲𝘀 𝟮.𝟬 𝗶𝘀 𝗳𝗿𝗲𝗲. 𝗜𝗻𝗱𝗲𝗳𝗶𝗻𝗶𝘁𝗲𝗹𝘆. 𝗡𝗼 𝘄𝗮𝗶𝘁𝗹𝗶𝘀𝘁. Text. Image. Video. One model series. We built Agnes-2.0 for the developers who got priced out — high token costs, API limits, geography. That ends today. 𝗪𝗵𝗮𝘁'𝘀 𝗹𝗶𝘃𝗲: → 𝗔𝗴𝗻𝗲𝘀-𝟮.𝟬-𝗙𝗹𝗮𝘀𝗵 — text and agentic, top 10 on Claw-Eval, ahead of Gemini and MiniMax → 𝗔𝗴𝗻𝗲𝘀-𝗜𝗺𝗮𝗴𝗲-𝟮.𝟬-𝗙𝗹𝗮𝘀𝗵 — image editing, top 10 on Artificial Analysis → 𝗔𝗴𝗻𝗲𝘀-𝗩𝗶𝗱𝗲𝗼-𝗩𝟮.𝟬 — text-to-video, image-to-video, native audio, top 10 on Artificial Analysis These are not demo models. Not stripped-down trials. The same production models that benchmark globally alongside OpenAI, Google, Anthropic, KlingAI, and ByteDance. Trained in Singapore. Ranked globally. Free today. 𝗻𝗼 𝘀𝘂𝗯𝘀𝗰𝗿𝗶𝗽𝘁𝗶𝗼𝗻 𝗻𝗲𝗲𝗱𝗲𝗱.


Mustafa Suleyman@mustafasuleyman

2026.06.02 18:38

Super excited to announce seven new world-class MAI models today. They represent what we consider a new era in AI designed to keep you in control and on the frontier.

First is our text foundation model, MAI-Thinking-1, exceptionally strong on reasoning and SWE tasks.
- It’s a 35B active parameter MoE with a 256K context window. Independent human raters on Surge prefer it for overall quality in blind side-by-sides versus Sonnet 4.6, and it’s achieved 97% on AIME 2025, the key measure of its general-purpose reasoning abilities.
- It's at 53% on SWE Bench Pro, placing it right alongside Opus 4.6 on one of the toughest coding benchmarks.
- And since we co-designed our models with our own silicon, MAI-Thinking-1 is optimized on our MAIA 200 chip. Benchmarking head-to-head against the GB200, we see 30% better performance per dollar as well as a 1.4x performance-per-watt gain when running our MAI models on the MAIA 200 end-to-end.

Next is MAI-Image-2.5 and its Flash variant. Two super strong models now at #2 on the leaderboards, surpassing the score of Nano Banana 2 on image editing.

Last for now is MAI-Code-1-Flash, our new inference efficient coding model, especially tuned for VS Code and GitHub Copilot CLI.
- Code-1-Flash achieves 51% on SWE Bench Pro, despite having just 5B parameters, putting it closer to Haiku in size but cheaper in cost.

All of this is the foundation for Microsoft Frontier Tuning. It lets you customize our models to create custom, company-specific agents that only you control. You can make our model, your model. Your data. Your agents. Your moat.

Early adopters are already seeing a difference. When we tuned our models for McKinsey’s tasks, MAI delivered the highest win rate, outperforming GPT-5.5 on quality, while being 10x lower on cost. Also really excited to be collaborating with the amazing team at Mayo Clinic to jointly train a new frontier AI model for healthcare.

Our announcements today mark another milestone on the road to humanist superintelligence. You can learn more, including about our other new models, in our latest blog:
