A new paper from the Anthropic Fellows Program!
"Model Spec Midtraining: Improving How Alignment Training Generalizes"
A lot of alignment training teaches models what to say, but not why those behaviors are right.
So before normal alignment fine-tuning (AFT), this research adds a midtraining stage (MSM) that trains the model on synthetic documents discussing its Model Spec, including its values, rules, and reasoning, and then runs the usual supervised alignment training.
On agentic misalignment evals, MSM + AFT cuts misalignment from 68% to 5% on Qwen2.5-32B and from 54% to 7% on Qwen3-32B, beating the baseline.
The gain shows up especially out of distribution, where normal “say the aligned thing” training can look good in QA but break in harder scenarios.
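To put the quoted drops in relative terms, here is a quick sketch. It uses only the four percentages from the post; the helper name `relative_reduction` is made up for illustration and is not from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original misalignment rate that was removed."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Misalignment rates quoted in the post, in percent: (before, after).
reported = {
    "Qwen2.5-32B": (68.0, 5.0),
    "Qwen3-32B": (54.0, 7.0),
}

for model, (before, after) in reported.items():
    drop = before - after  # absolute drop in percentage points
    rel = relative_reduction(before, after)
    print(f"{model}: {drop:.0f} points absolute, {rel:.1%} relative")
```

That works out to roughly a 93% relative reduction on Qwen2.5-32B and 87% on Qwen3-32B.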