alphaXiv
@askalphaxiv
High fidelity research
Joined November 2023
49 Following    43.1K Followers
A new paper from the Anthropic Fellows Program! "Model Spec Midtraining: Improving How Alignment Training Generalizes" A lot of alignment training teaches models what to say, but not why those behaviors are right. So before the usual alignment fine-tuning (AFT), this research adds Model Spec Midtraining (MSM): it trains the model on synthetic documents that discuss its Model Spec, including its values, rules, and reasoning, and only then runs the usual supervised alignment fine-tuning. On agentic misalignment evals, MSM + AFT cuts misalignment from 68% -> 5% on Qwen2.5-32B and 54% -> 7% on Qwen3-32B, beating the baseline. The gain shows up especially out of distribution, where normal “say the aligned thing” training can look good in QA but break in harder scenarios.