alphaXiv
@askalphaxiv
High fidelity research
Joined November 2023
49 Following    43.1K Followers
A new paper from the Anthropic Fellows Program! "Model Spec Midtraining: Improving How Alignment Training Generalizes" A lot of alignment training teaches models what to say, but not why those behaviors are right. So before normal alignment fine-tuning, this research trains the model on synthetic documents that discuss its Model Spec, including its values, rules, and reasoning, and then does the usual supervised alignment fine-tuning. On agentic misalignment evals, MSM + AFT cuts misalignment from 68% -> 5% on Qwen2.5-32B and 54% -> 7% on Qwen3-32B, beating the baseline. The gain shows up especially out of distribution, where normal “say the aligned thing” training can look good in QA but break in harder scenarios.