alphaXiv(@askalphaxiv):a new paper from Anthropic Fellows Program! "Model Spec Midtraining: Improving How Alignment Training Generalizes" A lot of alignment training teaches models what to say, but not why those behaviors are right. So before normal alignment fine-tuning, this research trains the model on synthetic documents that discuss its Model Spec, including its values, rules, and reasoning, then do the usual supervised alignment. On agentic misalignment evals, MSM + AFT cuts misalignment from 68% -> 5% on Qwen2.5-32B and 54% -> 7% on Qwen3-32B, beating the baseline. The gain shows up especially out of distribution, where normal “say the aligned thing” training can look good in QA but break in harder scenarios.

2026.05.07 06:03

a new paper from Anthropic Fellows Program! "Model Spec Midtraining: Improving How Alignment Training Generalizes" A lot of alignment training teaches models what to say, but not why those behaviors are right. So before normal alignment fine-tuning, this research trains the model on synthetic documents that discuss its Model Spec, including its values, rules, and reasoning, then do the usual supervised alignment. On agentic misalignment evals, MSM + AFT cuts misalignment from 68% -> 5% on Qwen2.5-32B and 54% -> 7% on Qwen3-32B, beating the baseline. The gain shows up especially out of distribution, where normal “say the aligned thing” training can look good in QA but break in harder scenarios.