cv usk(@cv_usk):
🎼 Solve hard tasks that mix text, image, audio, and video by decomposing them across specialized sub-agents running in parallel. This work shows a well-matched team beats one giant monolithic model.

Title: Orchestra-o1: Omnimodal Agent Orchestration
URL: https://t.co/etRxth6EnS

💡 Overview
Orchestra-o1 is a hierarchical agent framework that separates high-level orchestration from low-level tool execution to handle tasks where multiple modalities coexist. It specializes sub-agents per modality and runs independent sub-tasks in parallel for efficiency.

⚠️ The problem
Existing orchestration frameworks handle only limited modalities and fail to generalize to scenarios where text, image, audio, and video coexist and interact at once.

🛠 Approach
・Represent each backend with a skill vector plus a cost-latency profile, and select via cost-aware matching
・Assign perception tools (image/audio/video analysis) and action tools (search, page visit, code execution)
・Build a latent dependency graph over sub-goals and run independent tasks in parallel
・Train with DA-GRPO: score step-level decisions, not just final answers, using a multi-dimensional rubric that weights decision quality at 0.6

📊 Results (their OmniGAIA benchmark)
・Orchestra-o1-GPT-5 hits 72.8%, beating the second-best Gemini-3-Pro by 10.3 points
・The open-source Orchestra-o1-8B (from Qwen3-8B) reaches 30.0%, best among open omnimodal agents
・It reaches 72.8% at cost 341.6, cheaper and far more accurate than weaker baselines
・By difficulty: Easy 80.3%, Medium 75.0%, Hard 56.4%

#AIAgents #Multimodal
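
A rough Python sketch of two Approach bullets above: cost-aware matching over skill vectors, and running independent sub-goals in parallel from a dependency graph. This is not the paper's code. The backend names, skill values, costs, the cosine-minus-cost score, and the `cost_weight` parameter are all assumptions for illustration, and the dependency graph is given explicitly here rather than inferred as a latent graph.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical backends. Skill axes are (image, audio, text); numbers are made up.
BACKENDS = {
    "vision-large": {"skill": [0.9, 0.1, 0.2], "cost": 5.0},
    "audio-small": {"skill": [0.1, 0.9, 0.1], "cost": 1.0},
    "generalist": {"skill": [0.6, 0.6, 0.6], "cost": 3.0},
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_backend(need, cost_weight=0.05):
    """Pick the backend with the best skill match minus a cost penalty.

    The scoring rule is an assumed stand-in for "cost-aware matching".
    """
    def score(name):
        backend = BACKENDS[name]
        return cosine(need, backend["skill"]) - cost_weight * backend["cost"]

    return max(BACKENDS, key=score)


def run_in_waves(deps, run_task, max_workers=4):
    """Run sub-goals wave by wave; tasks within a wave have no unmet dependencies.

    deps maps each task to the set of tasks it depends on.
    Returns the list of waves and a dict of task results.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves, results = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            # Tasks in the same wave are independent, so they run concurrently.
            for task, out in zip(ready, pool.map(run_task, ready)):
                results[task] = out
            sorter.done(*ready)
            waves.append(ready)
    return waves, results


if __name__ == "__main__":
    print(select_backend([1.0, 0.0, 0.0]))  # image-heavy sub-goal
    deps = {
        "describe_image": set(),
        "transcribe_audio": set(),
        "answer": {"describe_image", "transcribe_audio"},
    }
    waves, _ = run_in_waves(deps, lambda task: f"done:{task}")
    print(waves)
```

In this toy setup the two perception sub-goals land in the same wave and run concurrently, while the final answer waits for both. Raising `cost_weight` shifts selection toward cheaper backends.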

2026.06.15 21:36

