Scaling up RL training with more data often runs into performance saturation, wasting compute.
We find that a precisely crafted entropy curve is all you need to avoid this saturation, and we achieve it purely through rejection sampling.
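The mechanism is not spelled out in this summary, so the sketch below is only one plausible reading of "shaping the entropy curve via rejection sampling": score each sampled rollout by its mean token entropy and keep only rollouts near a scheduled target. All names here (`mean_token_entropy`, `entropy_rejection_filter`, `target_entropy`, `tol`) are hypothetical illustrations, not the authors' API.

```python
import numpy as np

def mean_token_entropy(logprobs: np.ndarray) -> float:
    """Mean per-token Shannon entropy of one rollout.
    `logprobs` has shape (T, vocab): per-step log-probabilities."""
    probs = np.exp(logprobs)
    return float(-(probs * logprobs).sum(axis=-1).mean())

def entropy_rejection_filter(rollout_logprobs, target_entropy, tol=0.1):
    """Rejection sampling on rollouts: keep the indices whose mean
    token entropy lies within `tol` of the scheduled target,
    discarding the rest. (Illustrative criterion, assumed here.)"""
    return [
        i for i, lp in enumerate(rollout_logprobs)
        if abs(mean_token_entropy(lp) - target_entropy) <= tol
    ]

# Toy usage: three fake rollouts of length 16 over a 4-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 16, 4))  # (rollouts, T, vocab)
logprobs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
print(entropy_rejection_filter(logprobs, target_entropy=1.2, tol=0.2))
```

In this reading, `target_entropy` would follow the prescribed entropy schedule over the course of training, so the accepted batches trace out the desired entropy curve without any auxiliary entropy loss term.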