
cv usk
@cv_usk
AI / Software Research Notes AI Agent, LLMOps, MLOps, Software Architecture
Joined May 2026
240 Following    207 Followers
🎼 Solve hard tasks that mix text, image, audio, and video by decomposing them across specialized sub-agents running in parallel. This work shows a well-matched team beats one giant monolithic model.

Title: Orchestra-o1: Omnimodal Agent Orchestration
URL:

💡 Overview
Orchestra-o1 is a hierarchical agent framework that separates high-level orchestration from low-level tool execution to handle tasks where multiple modalities coexist. It specializes sub-agents per modality and runs independent sub-tasks in parallel for efficiency.

⚠️ The problem
Existing orchestration frameworks handle only limited modalities and fail to generalize to scenarios where text, image, audio, and video coexist and interact at once.

🛠 Approach
・Represent each backend with a skill vector plus a cost-latency profile, and select via cost-aware matching
・Assign perception tools (image/audio/video analysis) and action tools (search, page visit, code execution)
・Build a latent dependency graph over sub-goals and run independent tasks in parallel
・Train with DA-GRPO: score step-level decisions, not just final answers, using a multi-dimensional rubric that weights decision quality at 0.6

📊 Results (their OmniGAIA benchmark)
・Orchestra-o1-GPT-5 hits 72.8%, beating the second-best Gemini-3-Pro by 10.3 points
・The open-source Orchestra-o1-8B (from Qwen3-8B) reaches 30.0%, best among open omnimodal agents
・It reaches 72.8% at cost 341.6, cheaper and far more accurate than weaker baselines
・By difficulty: Easy 80.3%, Medium 75.0%, Hard 56.4%

#AIAgents# #Multimodal#
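
🧪 A rough sketch of two of the Approach ideas (cost-aware matching, then running independent sub-goals in parallel). This is my own illustration, not the paper's code. The `Backend` fields, the dot-product scoring rule, the penalty weights `cost_w` and `latency_w`, and the helper names `select_backend` and `run_in_waves` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Backend:
    name: str
    skills: tuple[float, ...]  # hypothetical skill vector, one entry per capability
    cost: float                # hypothetical relative cost per call
    latency: float             # hypothetical seconds per call


def select_backend(need, backends, cost_w=0.1, latency_w=0.05):
    """Pick the backend whose skill match, minus a cost/latency penalty, is highest.

    The linear scoring rule is an assumption; the post only says "cost-aware matching".
    """
    def score(backend):
        match = sum(n * s for n, s in zip(need, backend.skills))
        return match - cost_w * backend.cost - latency_w * backend.latency

    return max(backends, key=score)


def run_in_waves(deps, run):
    """Run sub-goals in dependency order.

    `deps` maps each sub-goal to the set of sub-goals it depends on.
    Sub-goals that become ready together are executed in parallel.
    Returns the list of waves, each wave sorted by name.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            list(pool.map(run, ready))  # blocks until the whole wave has finished
            sorter.done(*ready)
            waves.append(ready)
    return waves
```

With a cheap image-only backend and an expensive all-modality one, a sub-goal that only needs image skill should land on the cheap backend under these weights, while one that needs both skills should land on the expensive one. A graph where `answer` depends on `image` and `audio` runs those two together first, then `answer`.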