Vision-language-action (VLA) models add action to the equation.
They are trained by fine-tuning vision-language models on both Internet-scale vision-language tasks and robotic trajectory data. By expressing robot actions as text tokens and incorporating them into the training set alongside natural language tokens, these models learn to output robot actions the same way LLMs output text.
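
To make the action-as-token idea concrete, here is a minimal sketch of how a continuous action vector can be discretized into bins and rendered as token strings that sit in the same training stream as text. The bin count, action range, and token names (e.g. `<act_137>`) are illustrative assumptions, not the scheme of any specific model; an actual system would reuse or extend its text tokenizer's vocabulary.

```python
import numpy as np

NUM_BINS = 256                       # assumed bin count; real models pick their own
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> list[str]:
    """Discretize each action dimension into a bin and render it as a token."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    # Map [ACTION_LOW, ACTION_HIGH] to integer bin indices in [0, NUM_BINS - 1].
    scale = (NUM_BINS - 1) / (ACTION_HIGH - ACTION_LOW)
    bins = np.round((clipped - ACTION_LOW) * scale).astype(int)
    return [f"<act_{b}>" for b in bins]

def tokens_to_action(tokens: list[str]) -> np.ndarray:
    """Invert the discretization: parse bin indices, map back to floats."""
    bins = np.array([int(t.removeprefix("<act_").removesuffix(">")) for t in tokens])
    return bins / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Example: a 7-DoF action (e.g. end-effector deltas plus gripper) becomes a
# short token sequence the model can emit exactly like ordinary text tokens.
action = np.array([0.12, -0.40, 0.88, 0.0, 0.05, -0.97, 1.0])
tokens = action_to_tokens(action)
print(tokens)                    # ['<act_143>', '<act_77>', ...]
print(tokens_to_action(tokens))  # approximately recovers the original action
```

Because the actions live in the same vocabulary as text, no new output head is needed: the model decodes an action the same way it decodes a sentence, and the discretization is inverted at execution time to drive the robot.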