Thinking Machines(@thinkymachines ):Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost.

Thinking Machines

@thinkymachines

Thinking, beeping, and booping. @tinkerapi

Joined February 2025

1 Following 149.1K Followers

Thinking Machines@thinkymachines

2025.10.27 17:05

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost.