🎯 PPO's "fixed clipping", a default choice in reinforcement learning for LLMs, may be quietly suppressing exploration diversity. A new method has been proposed that addresses this weakness on theoretical grounds.
Title: BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
URL:
Overview
BandPO replaces PPO's ratio clipping with a unified operator called "Band". It projects a trust region defined by an f-divergence onto a dynamic, probability-aware clipping interval, so the bounds adapt to the probability of each action.
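Problem addressed
PPO's fixed clipping bounds have a structural weakness.
・They over-constrain how far the probability of a low-probability action (token) can be pushed upward
・"Tail strategies" that actually carry high advantage get suppressed
・Exploration narrows and the policy becomes deterministic early, which leads to entropy collapse
A single fixed bound applied uniformly to every action was upsetting the balance between exploration and exploitation.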
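The first point above can be checked with simple arithmetic. The sketch below is not taken from the paper; it only assumes the standard PPO upper clip at 1 + eps with eps = 0.2 (a common default, chosen here for illustration) and a positive advantage, in which case the gradient vanishes once the ratio exceeds 1 + eps. The function name `max_prob_after_clip` is hypothetical.

```python
def max_prob_after_clip(p_old: float, eps: float = 0.2) -> float:
    """Largest new probability that still receives gradient when the
    probability ratio is clipped at 1 + eps (positive-advantage case)."""
    return min(1.0, p_old * (1.0 + eps))


if __name__ == "__main__":
    # eps = 0.2 is an assumed value, not one reported in the source.
    for p_old in (0.001, 0.01, 0.5):
        p_max = max_prob_after_clip(p_old)
        headroom = p_max - p_old  # absolute probability mass the token may gain
        print(f"p_old={p_old:<6} p_max={p_max:.4f} headroom={headroom:.4f}")
```

The relative bound is the same for every token, so the absolute headroom scales with `p_old`: a rare token can gain far less probability mass per update than a common one.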
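💡 Methodology and proposed method
BandPO formulates the mapping from a trust region to a clipping interval as a convex optimization problem, which guarantees that a globally optimal solution is obtained.
・For specific divergences, a closed-form solution is derived, which keeps the computation tractable
・For low-probability, high-advantage actions, the constraint is relaxed so that they can be updated appropriately
What is new is that it bridges two lineages, PPO's ratio clipping and TRPO-style trust regions, through probability-aware bounds.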
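To give a feel for what "probability-aware bounds" can mean, here is a minimal sketch. It is not BandPO's Band operator and not the paper's derivation. It assumes a simplified setting in which the trust region is a KL budget `delta` on the two-outcome distribution (this token vs. everything else), with the KL taken from the new to the old distribution, and it finds the ratio interval by bisection. The names `binary_kl` and `ratio_bounds` and the value `delta = 0.01` are all hypothetical.

```python
import math


def binary_kl(q: float, p: float) -> float:
    """KL(Bernoulli(q) || Bernoulli(p)) for 0 < p < 1, with 0 * log 0 := 0."""
    def term(a: float, b: float) -> float:
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)


def ratio_bounds(p_old: float, delta: float, iters: int = 60) -> tuple[float, float]:
    """Interval [lo, hi] of ratios r = q / p_old with binary_kl(q, p_old) <= delta."""
    def solve(inside: float, outside: float) -> float:
        # binary_kl(., p_old) is convex with its minimum at p_old, so it is
        # monotone on each side and bisection finds the feasible boundary.
        for _ in range(iters):
            mid = 0.5 * (inside + outside)
            if binary_kl(mid, p_old) <= delta:
                inside = mid
            else:
                outside = mid
        return inside

    q_hi = solve(p_old, 1.0)  # largest feasible new probability
    q_lo = solve(p_old, 0.0)  # smallest feasible new probability
    return q_lo / p_old, q_hi / p_old


if __name__ == "__main__":
    delta = 0.01  # assumed trust-region budget, for illustration only
    for p_old in (0.01, 0.5):
        lo, hi = ratio_bounds(p_old, delta)
        print(f"p_old={p_old}: ratio interval = [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the upper ratio bound is wider for the low-probability token than for the high-probability one, in contrast to a fixed 1 + eps clip that treats both the same.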
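🎯 Use cases
BandPO applies across LLM reinforcement learning in general, including RLHF and RLVR, wherever training stability must be preserved while exploration diversity is maintained. It is a practical replacement for existing PPO pipelines that suffer from entropy collapse.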
Experimental results
Across a variety of models and datasets, BandPO consistently outperformed both standard clipping and Clip-Higher. It was also confirmed to robustly mitigate entropy collapse and to maintain policy diversity throughout training. The code is available at OpenMOSS/BandPO.
#ReinforcementLearning #LLM