Search Multimodal on X — X Web Viewer

2026.02.26 07:21

Supporting @MITEECS and @nlp_mit’s Multimodal Machine Learning course (Spring 2026). 🎓 Students are leveraging the multimodal capabilities of Kimi K2.5 to power their final research projects. We look forward to seeing the innovative applications that will emerge this semester. 🔗 Happy coding! ✨


cv usk@cv_usk

2026.06.15 15:31

Understand entire hour-long videos, use tools, and search: an efficient multimodal model with 30B total params but only 3B active at inference 🎬

Title: Kwai Keye-VL-2.0 Technical Report
URL:

🎬 Overview
An open-source multimodal foundation model from Kuaishou, built for long-video understanding and agentic intelligence. It's a Mixture-of-Experts (MoE) model with 30B total parameters but only 3B activated at inference.

❓ Challenges Solved
Processing hour-level videos demands enormous compute.
・Many frames make long-range temporal dependencies hard to capture
・The challenge was to ease that compute constraint while keeping strong performance across diverse tasks

💡 Methodology & Proposed Approach
・Long-context: adapts DeepSeek Sparse Attention (DSA) to GQA-based architectures for lossless 256K context processing, capturing key frames and long-range temporal dependencies
・Infrastructure: scalable video I/O, heterogeneous ViT-LM parallelism, custom DSA kernels
・Training: Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) with Context-RL and Video-RL to address catastrophic forgetting during multi-task alignment

📊 Experimental Results
・State-of-the-art among models of similar scale
・Especially strong on fine-grained temporal localization (TimeLens)
・Excels at long-video comprehension on Video-MME-v2 and LongVideoBench
・Also capable at multimodal agent collaboration across Code, Tool, and Search, with self-correction

🌍 Use Cases
It fits long-video understanding, search, and moderation, plus backbones for video-handling autonomous agents. As the first application of sparse attention to multimodal models at this scale, its big strength is making hour-level video processing cost-realistic.

#VideoUnderstanding #Multimodal
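To make the "30B total, 3B active" point concrete, here is a minimal sketch of top-k expert routing, the general mechanism by which a Mixture-of-Experts layer runs only a few experts per token. The expert count (8), the top-k value (2), and the toy scalar experts are assumptions for illustration only; the post does not describe Keye-VL-2.0's actual router.

```python
import math
import random


def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, routes


# Hypothetical toy setup: 8 scalar "experts", 2 of them active per token.
random.seed(0)
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=scale: scale * x for scale in range(1, NUM_EXPERTS + 1)]
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

y, routes = moe_forward(1.0, experts, router_logits, TOP_K)
print("experts used:", [i for i, _ in routes])

# Figures from the post: 3B of 30B parameters are active at inference.
active_fraction = 3e9 / 30e9
print(f"active parameter fraction: {active_fraction:.0%}")
```

The unselected experts are never called, which is why per-token compute tracks the active parameter count rather than the total.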


jibber@jibberswrld

2026.06.13 08:20

AI just got a lot more useful for laptops. Google released Gemma 4 12B, a multimodal model that can run locally with 16GB of memory. More capable AI is moving onto your device, not just the cloud. Follow me to stay updated with the latest AI news.


Google Gemma@googlegemma

2026.06.03 16:00

Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to your laptop, and released under an Apache 2.0 license. Bridging the gap between edge efficiency and advanced reasoning. Here is what’s new with Gemma 4 12B: 👇


Kimi.ai@Kimi_Moonshot

2026.02.03 14:46

We're introducing WorldVQA, a new benchmark to measure atomic vision-centric world knowledge in Multimodal Large Language Models. Current evaluations often conflate visual knowledge retrieval with reasoning. In contrast, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark consists of 3,500 VQA pairs across 9 categories, with careful attention to linguistic and cultural diversity:


cv usk@cv_usk

2026.06.13 08:29

🏠 Just specify furniture with text or images and get a style-consistent 3D indoor scene generated automatically, about 85% faster than MMGDreamer.

Title: FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
URL:

📝 Overview
FlowScene generates high-fidelity 3D indoor scenes from a multimodal scene graph that fuses text and images. It produces layout, shape, and texture in three branches via a straight-line rectified flow, keeping style consistent across the whole scene.

❓ Challenges Solved
Language-driven retrieval methods lack object-level control and style coherence, while graph-based methods struggle with high-quality textures. FlowScene resolves both weaknesses at once.

💡 Methodology & Proposed Approach
・It takes a multimodal graph where nodes fuse text descriptions and image features (text-only, image-only, or mixed)
・An InfoExchangeUnit densely exchanges node information during sampling to satisfy both individual and holistic conditions
・Layout (3D boxes), shape (VQ-VAE latents), and texture (anchored to geometry) are generated by independent denoisers
・Texture is denoised with geometry fixed, so even text-only nodes get style-consistent textures through information exchange

🎯 Use Cases
It fits interactive scene design for interior design and manufacturing, VR/AR content creation, and building simulation environments for robotics.

📊 Experimental Results
・Bedroom FID improves from 42.38 to 35.01, 17.4% better than MMGDreamer
・CLIPScore of 0.2386 is the best of all methods, and users rate style consistency 8.72/10
・Inference without textures takes 6.83s, about 85% faster than MMGDreamer's 45.34s
・Object quality also improves, e.g. a 43.90% better minimum matching distance on nightstands

#3DGeneration #GenerativeAI
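The FID and speed percentages quoted above can be reproduced with a short calculation. This sketch assumes both figures are relative reductions against the MMGDreamer value, that is (baseline - new) / baseline; the function name is illustrative.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100.0


# Numbers taken from the post; lower is better for both FID and latency.
fid_gain = relative_reduction(42.38, 35.01)
time_gain = relative_reduction(45.34, 6.83)

print(f"Bedroom FID reduction: {fid_gain:.1f}%")        # 17.4%
print(f"Inference time reduction: {time_gain:.1f}%")    # 84.9%
```

Under that reading, "about 85% faster" means the inference time drops by roughly 85%, which corresponds to a speedup of about 6.6x (45.34 / 6.83).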


cv usk@cv_usk

2026.06.12 10:37

For agent memory, the real question isn't "how to store" but "what to remember" 🧠 A fresh take that learns what to memorize via reinforcement learning.

Title: Task-Focused Memorization for Multimodal Agents
URL:

🧠 Overview
This work proposes TaskMem, which treats long-term memory for multimodal agents as a learnable policy optimized with reinforcement learning, focused on deciding what to memorize. From an unbounded stream of observations, it selectively retains only the content relevant to the agent's role and task.

❓ Challenges Solved
A multimodal agent operating in the real world continuously receives an unbounded stream of observations.
・Most prior work focused on how to store memories (designing memory modules)
・But the essential problem is what to memorize: without a principled way to select role-relevant content from an endless stream, memory simply fails
This work starts from that shift in perspective.

💡 Methodology & Proposed Approach
TaskMem treats memorization as a learnable policy, optimized in two phases.
・Phase 1: learn high-quality memorization under fidelity requirements
・Phase 2: post-deployment fine-tuning that uses task rewards to align memorization with the environment's demands
・It builds on the MLLM Qwen3-VL-30B-A3B and optimizes the policy lightly via adapter tuning
・Reward models derived from real tasks steer the policy toward selecting relevant content

🌍 Use Cases / Experimental Results
On reformulated streaming benchmarks, it delivered clear accuracy gains.
・VideoMME: 67.9% VQA accuracy (+6.3%)
・EgoLife: 45.4% VQA accuracy (+7.0%)
・EgoTempo: 27.6% VQA accuracy (+5.3%)
・Strong precision across all benchmarks (80.5-85.6%)
It charts a practical path for long-running, always-on agents to selectively remember the right things while keeping context bloat in check.

#AIAgents #Memory
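The "what to memorize" framing can be illustrated with a toy selective-memory filter. Everything here is hypothetical: the class name, the threshold, and especially the keyword-overlap score, which is only a stand-in for the learned, RL-optimized policy the post describes. It is not TaskMem's method, just a sketch of deciding per observation whether to keep it.

```python
from dataclasses import dataclass, field


@dataclass
class SelectiveMemory:
    """Keeps only observations that a scoring policy judges task-relevant."""

    task_keywords: set
    threshold: float = 0.5
    store: list = field(default_factory=list)

    def score(self, observation: str) -> float:
        # Hypothetical stand-in for a learned policy:
        # the fraction of task keywords that the observation mentions.
        words = set(observation.lower().split())
        return len(words & self.task_keywords) / len(self.task_keywords)

    def observe(self, observation: str) -> bool:
        # Decide at write time; irrelevant observations never enter memory.
        keep = self.score(observation) >= self.threshold
        if keep:
            self.store.append(observation)
        return keep


memory = SelectiveMemory(task_keywords={"keys", "door"})
stream = [
    "the user put the keys on the table",
    "a bird flew past the window",
    "the user locked the door with the keys",
]
for obs in stream:
    memory.observe(obs)

print(memory.store)
```

In this toy run the first and third observations are kept and the second is dropped, so memory grows only with task-relevant content rather than with the length of the stream.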


cv usk@cv_usk

2026.06.12 08:22

Making AI "reason about space in words" might be backfiring 🧭 Here's a new approach that lets it imagine unseen viewpoints instead.

Title: Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
URL:

🧭 Overview
This work proposes Imaginative Perception Tokens (IPT) to strengthen spatial reasoning in vision language models (VLMs). Rather than forcing spatial logic through language, it keeps "what could be perceived under a different arrangement" as an intermediate perceptual representation.

❓ Challenges Solved
VLMs struggle with spatial reasoning: inferring unobserved viewpoints, reasoning through occluded paths, and integrating partial observations. Prior work pushed this into textual chain-of-thought, but forcing visual reasoning through language alone hit a ceiling.

💡 Methodology & Proposed Approach
・Uses the unified VLM backbone BAGEL, trained with IPT supervision
・Formulates three tasks: Perspective Taking (PET), Path Tracing (PT), Multiview Counting (MVC)
・Builds a ~20,000-example dataset with ground truth, answers, and metrics
The core idea is treating the perception itself ("if I moved here, I'd see this") as an intermediate representation.

📊 Experimental Results
・IPT improved Multiview Counting (MVC) accuracy by 3.4%
・Path Tracing (PT) reached performance competitive with closed-source models
・IPT supervision outperformed textual chain-of-thought training
・Conversely, textual CoT substantially degraded spatial reasoning

#SpatialReasoning #MultimodalLLM
