Was speaking to a few researchers and they were talking about how much harder voice-to-voice is to train at the frontier compared to other modalities. There’s no test-time compute, which makes distillation really difficult.
These speech-to-speech models are getting insanely good, but the pricing still makes most real-world apps unrealistic.
I have so many ideas for this tech, but at current costs I’d need to charge ~$200/user just to make it work.