cv usk(@cv_usk): 🧮 Are you just letting your MoE router train on vibes? This paper proposes a mathematically grounded design principle: align router rows with the principal singular direction of their expert matrices.

Title: Redesign Mixture-of-Experts Routers with Manifold Power Iteration
URL: https://t.co/G0maku31hs

📝 Overview
MoE efficiently activates only a subset of experts per input, and the router decides which experts to use. This paper argues that aligning each router row with the principal singular direction of its expert matrix better represents token-expert affinity.

❓ Challenges Solved
Each router row acts as an "expert proxy" computing similarity, but there was no principled guideline for how to design that proxy vector. There was no clear principle for condensing expert information into a representative vector.

💡 Methodology & Proposed Approach
・The proposed Manifold Power Iteration (MPI) adopts a "Power-then-Retract" paradigm
・It runs power iteration on the router weights to converge toward the principal singular direction
・A retraction operation imposes norm constraints, balancing computational efficiency and training stability
・It also provides a theoretical proof that router rows converge to the principal singular directions

🎯 Use Cases
It gives the routing design of large MoE LLMs a principled guideline rather than heuristics, useful when you want to improve expert utilization, such as reducing skew toward particular experts.

📊 Experimental Results
・The authors pretrained MoE models across scales from 1B to 11B parameters and verified that alignment improves effectiveness
・Aligning to the principal singular direction makes expert-activation decisions more effective

As MoE becomes a standard component of large LLMs, this is a foundational contribution answering why routing should be designed a certain way.

#MoE #LLM

2026.06.12 10:59

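A rough sketch of what "Power-then-Retract" could look like, since the post does not give the actual update rule. It uses textbook power iteration on W^T W as the power step and rescaling to a fixed-radius sphere as the retraction. Everything here is an assumption for illustration, not the paper's implementation: the names `power_then_retract` and `alignment`, the `radius` parameter, the choice of one matrix per expert, and the choice of the right singular vector.

```python
import numpy as np

def power_then_retract(router, experts, radius=1.0, steps=1):
    """Hypothetical sketch: router is (E, d); experts is a list of E (h, d) matrices."""
    new_rows = []
    for r, W in zip(router, experts):
        for _ in range(steps):
            # Power step: one multiplication by W^T W pulls r toward the
            # top right singular vector of W.
            r = W.T @ (W @ r)
            # Retract step: rescale back onto the sphere of the given radius
            # (assumed form of the norm constraint).
            r = radius * r / np.linalg.norm(r)
        new_rows.append(r)
    return np.stack(new_rows)

def alignment(router, experts):
    """|cosine| between each router row and its expert's top right singular vector."""
    scores = []
    for r, W in zip(router, experts):
        _, _, vt = np.linalg.svd(W, full_matrices=False)
        scores.append(abs(r @ vt[0]) / np.linalg.norm(r))
    return np.array(scores)

rng = np.random.default_rng(0)
E, h, d = 4, 32, 16  # toy sizes, chosen only for the demo

# Toy experts: small noise plus a strong rank-1 component, so each matrix
# has a clear dominant singular direction.
experts = []
for _ in range(E):
    u = rng.standard_normal(h)
    v = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    v /= np.linalg.norm(v)
    experts.append(0.1 * rng.standard_normal((h, d)) + 5.0 * np.outer(u, v))

router = rng.standard_normal((E, d))
aligned = power_then_retract(router, experts, steps=50)

print("alignment before:", alignment(router, experts))
print("alignment after: ", alignment(aligned, experts))
```

In this toy setup the power step costs only two matrix-vector products per expert, and the retraction keeps the row norms fixed. How the paper interleaves such steps with gradient training is not stated in the post.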