cv usk(@cv_usk):

🎯 The fixed clipping in PPO, long taken for granted in LLM reinforcement learning, may have been quietly crushing exploration diversity. A new method resolves that weakness on solid theoretical ground.

Title: BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
URL: https://t.co/mL1saycIxd

🔍 Overview
BandPO replaces PPO's ratio clipping with a unified operator called Band. It projects a trust region defined by f-divergences into dynamic, probability-aware clipping intervals, so the bounds adapt to each action's probability rather than staying fixed.

❓ Challenges Solved
PPO's fixed clipping bounds carry a structural weakness.
・They overly constrain the upward update margin of low-probability actions (tokens)
・Advantageous tail strategies that deserve to be reinforced get suppressed
・Exploration shrinks, leading to entropy collapse where the policy becomes deterministic too early
A single fixed bound applied uniformly was breaking the explore-exploit balance.

💡 Methodology & Proposed Approach
BandPO formulates the mapping from trust region to clipping interval as a convex optimization problem, guaranteeing globally optimal solutions.
・For specific divergences it derives closed-form solutions, keeping it computationally tractable
・It relaxes the constraint for low-probability, high-advantage actions so they can update properly
The novelty is bridging two lineages, PPO's ratio clipping and TRPO-style trust regions, through probability-aware bounds.

🎯 Use Cases
It fits LLM RL broadly, including RLHF and RLVR, wherever you want training stability while preserving exploration diversity. It is a practical drop-in replacement for existing PPO pipelines plagued by entropy collapse.

📊 Experimental Results
Across diverse models and datasets, BandPO consistently outperforms canonical clipping and Clip-Higher. It also robustly mitigates entropy collapse, maintaining policy diversity throughout training.

The code is released at OpenMOSS/BandPO.

#ReinforcementLearning #LLM
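
To make the "probability-aware bounds" idea concrete, here is a minimal sketch in Python. It is not the paper's Band operator or the code in OpenMOSS/BandPO. It assumes the simplest case one could pick: a total-variation trust region on the two-outcome view of a token (this token vs. everything else), where the constraint reduces to |p_new - p_old| <= delta. Dividing by p_old gives a ratio interval that widens as p_old shrinks. The function names, `delta`, and `eps` are hypothetical, chosen only for illustration.

```python
import numpy as np


def fixed_clip_bounds(p_old, eps=0.2):
    """Standard PPO ratio interval [1 - eps, 1 + eps], identical for every action."""
    p_old = np.asarray(p_old, dtype=float)
    return np.full_like(p_old, 1.0 - eps), np.full_like(p_old, 1.0 + eps)


def tv_band_bounds(p_old, delta=0.05):
    """Ratio interval implied by |p_new - p_old| <= delta.

    Illustrative assumption, not the paper's operator. With r = p_new / p_old:
      lower bound: max(0, 1 - delta / p_old)
      upper bound: min(1 / p_old, 1 + delta / p_old)   (p_new cannot exceed 1)
    """
    p_old = np.asarray(p_old, dtype=float)
    lo = np.maximum(0.0, 1.0 - delta / p_old)
    hi = np.minimum(1.0 / p_old, 1.0 + delta / p_old)
    return lo, hi


def clipped_surrogate(ratio, advantage, lo, hi):
    """PPO-style pessimistic surrogate, but with per-action clipping bounds."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    return np.minimum(ratio * advantage, np.clip(ratio, lo, hi) * advantage)


if __name__ == "__main__":
    p_old = np.array([0.01, 0.5, 0.99])  # rare, mid, and dominant tokens
    print("fixed bounds:", fixed_clip_bounds(p_old))
    print("band bounds: ", tv_band_bounds(p_old))
```

Under these assumptions, a rare token (p_old = 0.01) gets far more upward headroom than the fixed 1 + eps allows, while a dominant token (p_old = 0.99) is held tighter than the fixed bound, which is the qualitative behavior the post describes.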

2026.06.16 22:32
