🎼 Solve hard tasks that mix text, image, audio, and video by decomposing them across specialized sub-agents running in parallel. This work shows a well-matched team beats one giant monolithic model.
Title: Orchestra-o1: Omnimodal Agent Orchestration
URL:
💡 Overview
Orchestra-o1 is a hierarchical agent framework that separates high-level orchestration from low-level tool execution to handle tasks where multiple modalities coexist. It specializes sub-agents per modality and runs independent sub-tasks in parallel for efficiency.
⚠️ The problem
Existing orchestration frameworks handle only limited modalities and fail to generalize to scenarios where text, image, audio, and video coexist and interact at once.
🛠 Approach
・Represent each backend with a skill vector plus a cost-latency profile, and select via cost-aware matching
・Assign perception tools (image/audio/video analysis) and action tools (search, page visit, code execution)
・Build a latent dependency graph over sub-goals and run independent tasks in parallel
・Train with DA-GRPO: score step-level decisions, not just final answers, using a multi-dimensional rubric that weights decision quality at 0.6
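A minimal sketch of the three mechanisms listed above, to make them concrete. It is not the authors' implementation. Every name here (`Backend`, `match_score`, `select_backend`, `parallel_waves`, `step_reward`) is hypothetical, and so are the linear cost and latency penalties. Only the 0.6 weight on decision quality comes from the post; splitting the remaining 0.4 evenly across the other rubric dimensions is an assumption.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Backend:
    """Hypothetical backend description: skill vector plus cost-latency profile."""
    name: str
    skills: tuple[float, ...]
    cost: float
    latency: float


def match_score(task_needs, backend, cost_weight=0.1, latency_weight=0.1):
    """Skill fit (dot product) minus linear cost and latency penalties (assumed form)."""
    fit = sum(need * skill for need, skill in zip(task_needs, backend.skills))
    return fit - cost_weight * backend.cost - latency_weight * backend.latency


def select_backend(task_needs, backends, **weights):
    """Pick the backend with the highest cost-aware match score."""
    return max(backends, key=lambda b: match_score(task_needs, b, **weights))


def parallel_waves(dependencies):
    """Group sub-goals into waves; sub-goals in the same wave can run in parallel.

    `dependencies` maps each sub-goal to the set of sub-goals it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()    # every sub-goal whose prerequisites are done
        waves.append(sorted(ready))   # sorted only to make the output deterministic
        sorter.done(*ready)
    return waves


def step_reward(decision_quality, other_scores, decision_weight=0.6):
    """Step-level rubric score: 0.6 on decision quality, the rest averaged (assumption)."""
    rest = sum(other_scores) / len(other_scores) if other_scores else 0.0
    return decision_weight * decision_quality + (1 - decision_weight) * rest


if __name__ == "__main__":
    backends = [
        Backend("vision_specialist", (0.9, 0.1), cost=2.0, latency=1.0),
        Backend("audio_specialist", (0.1, 0.9), cost=1.0, latency=1.0),
        Backend("cheap_generalist", (0.8, 0.8), cost=0.5, latency=0.5),
    ]
    # Skill vector layout assumed here: (image, audio).
    print(select_backend((1.0, 0.0), backends).name)

    deps = {
        "transcribe_audio": set(),
        "describe_image": set(),
        "search_web": {"transcribe_audio"},
        "answer": {"describe_image", "search_web"},
    }
    print(parallel_waves(deps))
```

In this toy setup the cost and latency penalties are what tip an image task toward the cheaper generalist. With both weights set to zero, the vision specialist wins on skill fit alone. The wave grouping shows why independent perception steps (audio transcription and image description here) can start together, while the final answer waits on everything upstream.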
📊 Results (their OmniGAIA benchmark)
・Orchestra-o1-GPT-5 hits 72.8%, beating the second-best Gemini-3-Pro by 10.3 points
・The open-source Orchestra-o1-8B (from Qwen3-8B) reaches 30.0%, best among open omnimodal agents
・Orchestra-o1-GPT-5 reaches its 72.8% at a cost of 341.6, which is cheaper than weaker baselines while being far more accurate
・By difficulty: Easy 80.3%, Medium 75.0%, Hard 56.4%
#AIAgents #Multimodal