Understands entire hour-long videos and uses tools and search: an efficient multimodal model with 30B total parameters but only 3B active at inference 🎬
Title: Kwai Keye-VL-2.0 Technical Report
URL:
🎬 Overview
An open-source multimodal foundation model from Kuaishou, built for long-video understanding and agentic intelligence. It's a Mixture-of-Experts (MoE) model with 30B total parameters but only 3B activated at inference.
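To make "30B total, 3B active" concrete, here is a minimal sketch of top-k expert routing, the usual mechanism behind MoE sparsity. It is not taken from the report: the expert count, the value of k, and the names `top_k_route` and `moe_forward` are assumptions for illustration. The real active share also includes always-on layers such as attention and embeddings, so it is not simply k divided by the number of experts.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    m = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - m) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the routed experts' outputs; unrouted experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy setup (hypothetical): 8 scalar "experts", 2 active per token.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]
routes = top_k_route(logits, k=2)
y = moe_forward(1.0, experts, logits, k=2)

# Parameter counts from the summary above: 3B active out of 30B total.
active_fraction = 3e9 / 30e9
```

Only the two routed experts are evaluated for this token, which is why per-token compute tracks the active parameter count rather than the total.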
❓ Challenges Solved
Processing hour-level videos demands enormous compute.
・Many frames make long-range temporal dependencies hard to capture
・The goal is to ease that compute constraint while keeping strong performance across diverse tasks
💡 Methodology & Proposed Approach
・Long-context: adapts DeepSeek Sparse Attention (DSA) to GQA-based architectures for lossless 256K context processing, capturing key frames and long-range temporal dependencies
・Infrastructure: scalable video I/O, heterogeneous ViT-LM parallelism, custom DSA kernels
・Training: Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) with Context-RL and Video-RL to address catastrophic forgetting during multi-task alignment
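The long-context bullet above can be sketched roughly as follows. This is a toy illustration of top-k sparse attention under grouped-query attention (GQA), not the report's kernels or exact design. The assumptions: a cheap "indexer" scores every cached token, the top k tokens are selected once per KV group, and every query head in that group attends only to those tokens. All function names and the toy vectors are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_top_k(index_scores, k):
    """Indices of the k highest scores, returned in original (temporal) order."""
    top = sorted(range(len(index_scores)),
                 key=lambda i: index_scores[i], reverse=True)[:k]
    return sorted(top)

def sparse_gqa_attention(queries, keys, values, index_query, index_keys, k):
    """queries: query-head vectors of one KV group at the current position.
    keys/values: the shared KV head's cached vectors, one per past token.
    Assumption: token selection is shared by all heads in the group."""
    scores = [dot(index_query, ik) for ik in index_keys]
    chosen = select_top_k(scores, min(k, len(keys)))
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Attention is computed over the k selected tokens only.
        weights = softmax([dot(q, keys[i]) * scale for i in chosen])
        out = [sum(w * values[i][d] for w, i in zip(weights, chosen))
               for d in range(len(values[0]))]
        outputs.append(out)
    return chosen, outputs

# Toy cache of 4 tokens, 2 query heads sharing one KV head, k = 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
index_keys = [[0.0], [5.0], [1.0], [3.0]]
chosen, outputs = sparse_gqa_attention(
    queries=[[1.0, 0.0], [0.0, 1.0]], keys=keys, values=values,
    index_query=[1.0], index_keys=index_keys, k=2)
```

In this sketch the attention cost per query scales with k rather than with the full context length, which is the property that matters for long frame sequences.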
📊 Experimental Results
・State-of-the-art among models of similar scale
・Especially strong on fine-grained temporal localization (TimeLens)
・Excels at long-video comprehension on Video-MME-v2 and LongVideoBench
・Also capable of multimodal agent collaboration across Code, Tool, and Search, with self-correction
🌍 Use Cases
It suits long-video understanding, search, and moderation, and can serve as a backbone for autonomous agents that handle video. As the first application of sparse attention to multimodal models at this scale, its main strength is making hour-level video processing realistic in cost.
#VideoUnderstanding #Multimodal