cv usk(@cv_usk):
Understand entire hour-long videos and wield tools and search: an efficient multimodal model with 30B total params but only 3B active at inference

🎬 Title: Kwai Keye-VL-2.0 Technical Report
URL: https://t.co/EbW8InZjgz

🎬 Overview
An open-source multimodal foundation model from Kuaishou, built for long-video understanding and agentic intelligence. It is a Mixture-of-Experts (MoE) model with 30B total parameters, of which only 3B are activated at inference.

❓ Challenges Solved
Processing hour-level videos demands enormous compute.
・Many frames make long-range temporal dependencies hard to capture
・The challenge was easing that compute constraint while keeping strong performance across diverse tasks

💡 Methodology & Proposed Approach
・Long-context: adapts DeepSeek Sparse Attention (DSA) to GQA-based architectures for lossless 256K context processing, capturing key frames and long-range temporal dependencies
・Infrastructure: scalable video I/O, heterogeneous ViT-LM parallelism, custom DSA kernels
・Training: Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) with Context-RL and Video-RL to address catastrophic forgetting during multi-task alignment

📊 Experimental Results
・State-of-the-art among models of similar scale
・Especially strong on fine-grained temporal localization (TimeLens)
・Excels at long-video comprehension on Video-MME-v2 and LongVideoBench
・Also capable of multimodal agent collaboration across Code, Tool, and Search, with self-correction

🌍 Use Cases
It suits long-video understanding, search, and moderation, and can serve as a backbone for autonomous agents that handle video. As the first application of sparse attention to multimodal models at this scale, its big strength is making hour-level video processing cost-realistic.

#VideoUnderstanding #Multimodal
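
To make the long-context point in the post above concrete, here is a minimal sketch of top-k sparse attention for a single query, plus a rough count of how many query-key pairs it saves. It is a generic illustration, not the report's DSA kernels: the function names, the selection rule, the reading of 256K as 262,144 tokens, and the budget of 2,048 keys per query are all assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, k):
    """Attend to only the k highest-scoring keys for one query.

    Assumption: selection reuses the full scaled dot-product scores for
    simplicity. A practical design would pick the keys with a cheaper
    scorer, otherwise the selection step itself costs as much as dense
    attention.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    # Indices of the k largest scores; ties keep their original order.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, keep)) for d in range(dim)]
    return out, sorted(keep)


def attention_pairs(context_len, k=None):
    """Query-key pairs scored in the main attention step (causal mask ignored)."""
    per_query = context_len if k is None else min(k, context_len)
    return context_len * per_query


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    out, kept = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
    print("kept keys:", kept, "output:", out)

    context = 262_144  # assumed reading of "256K"
    budget = 2_048     # hypothetical per-query key budget
    print("dense / sparse pairs:", attention_pairs(context) // attention_pairs(context, budget))
```

Under these assumed numbers the main attention step scores 128 times fewer pairs than dense attention, which gives a feel for why sparse attention can make hour-level video more affordable. The actual savings depend on the selector's own cost and on the model's real configuration.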

2026.06.15 15:31

