alphaXiv
@askalphaxiv
High fidelity research
Joined November 2023
49 Following    43.1K
A new paper from the Anthropic Fellows Program! "Model Spec Midtraining: Improving How Alignment Training Generalizes" A lot of alignment training teaches models what to say, but not why those behaviors are right. So before the normal alignment fine-tuning (AFT), this research adds a Model Spec Midtraining (MSM) stage: it trains the model on synthetic documents that discuss its Model Spec, including its values, rules, and reasoning, and then does the usual supervised alignment fine-tuning. On agentic misalignment evals, MSM + AFT cuts misalignment from 68% -> 5% on Qwen2.5-32B and from 54% -> 7% on Qwen3-32B, beating the baseline. The gain shows up especially out of distribution, where normal "say the aligned thing" training can look good in QA but break in harder scenarios.