Do more capable AI models produce better drug candidates?
Most teams in AI drug discovery assume they do, so most effort goes into better architectures, more training data, and higher benchmark scores.
But in practice, the same pipeline can produce completely different outcomes depending on the biological target.
When one of the strongest peptide design pipelines available was benchmarked across multiple targets, hit rates ranged from 0% to 67% using the same underlying system.
The key finding was that the computational score used to rank designs was not a reliable predictor of experimental binding affinity. A separate analysis across more than 1,400 peptide inputs confirmed the same result: structure prediction confidence metrics showed negligible correlation with experimental outcomes.
The implication is important: a pipeline's usefulness depends less on raw model capability and more on whether it was ever validated against biology where the answer is already known.
Confidence scores can be a decent binary signal (binds vs. does not bind), but they are often poor predictors of actual affinity.
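To make that distinction concrete, here is a minimal sketch of the two separate checks: AUROC for the binary binds/does-not-bind question, and Spearman correlation for affinity ranking among binders. All data here is synthetic and hypothetical, constructed only to mirror the pattern described above.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200

# Hypothetical experimental outcomes: 1 = binds, 0 = does not bind.
binds = rng.integers(0, 2, size=n)

# Hypothetical confidence scores: loosely tied to the binary outcome,
# mimicking a score that is a decent binary signal.
score = 0.3 * binds + rng.normal(0.5, 0.15, size=n)

# Hypothetical measured affinities (Kd in nM, lower = stronger), drawn
# independently of the score, so the score carries no ranking information.
kd_nM = rng.lognormal(3.0, 1.0, size=n)

# Check 1: binary separation. A decent AUROC means the score can
# distinguish binders from non-binders.
print(f"AUROC (binds vs. not): {roc_auc_score(binds, score):.2f}")

# Check 2: affinity ranking among binders. Near-zero Spearman rho means
# the score says nothing about how strongly a binder binds.
mask = binds == 1
rho, p = spearmanr(score[mask], kd_nM[mask])
print(f"Spearman rho (score vs. Kd, binders only): {rho:.2f} (p = {p:.2g})")
```

A score can pass the first check and fail the second, which is exactly the failure mode that makes unvalidated ranking dangerous.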
Yet many autonomous discovery pipelines evaluate novel candidates without first confirming that they can reliably separate known binders from known non-binders on the same target class.
At @peptai_, every novel candidate first passes through a calibration stage.
Known binders and known non-binders from public datasets are run through the full computational pipeline, and the resulting score distributions become the baseline for interpreting new designs.
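A minimal sketch of what that calibration stage can look like. The `run_pipeline` function is a hypothetical stand-in for the real scoring system, and the sequences and thresholds are illustrative, not @peptai_'s actual implementation.

```python
import numpy as np

def run_pipeline(sequence: str) -> float:
    # Hypothetical stand-in so the sketch runs end to end;
    # replace with the real pipeline's scoring function.
    return (sum(ord(c) for c in sequence) % 100) / 100.0

def calibrate(known_binders, known_nonbinders):
    """Score controls with known outcomes; their distributions become the baseline."""
    pos = np.array([run_pipeline(s) for s in known_binders])
    neg = np.array([run_pipeline(s) for s in known_nonbinders])
    return pos, neg

def interpret(new_score, pos, neg):
    """Read a novel design's score against the calibrated control distributions."""
    # If the controls themselves do not separate, the score is uninformative
    # for this target class and no novel design should be trusted on it.
    if np.median(pos) <= np.median(neg):
        return "uncalibrated: known binders and non-binders do not separate"
    pct_neg = (neg < new_score).mean() * 100  # above this % of non-binders
    pct_pos = (pos < new_score).mean() * 100  # above this % of known binders
    return f"above {pct_neg:.0f}% of non-binders, {pct_pos:.0f}% of known binders"

# Usage with toy peptide sequences (hypothetical data):
binders = ["ACDEFGHIK", "LMNPQRSTV", "WYACDEFGH"]
nonbinders = ["KIHGFEDCA", "VTSRQPNML", "HGFEDCAYW"]
pos, neg = calibrate(binders, nonbinders)
print(interpret(run_pipeline("NEWPEPTIDE"), pos, neg))
```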
If a platform cannot recover signal on biology we already understand, there is no basis for trusting what it says about novel sequences.