All roboticists - if you are building VLAs, please stop what you are doing, take a break, get a coffee, and read the LeWorldModel paper by @lucasmaes_, @randall_balestr, @ylecun and collaborators. Then read it several more times - the same general approach can be directly mapped to other key problems in robotics, including dealing with multimodal inputs such as vision and speech.