Recently, Large Language Models (LLMs) have achieved remarkable success with in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities of off-the-shelf LLMs to directly predict robot actions remains largely unexplored. In this paper, we introduce RoboPrompt, a framework that enables off-the-shelf text-only LLMs to directly predict robot actions through ICL without any training. Our approach first heuristically identifies keyframes that capture important moments in an episode. We then extract end-effector actions at these keyframes along with the estimated initial object poses, and convert both into textual descriptions. Finally, we combine these textual descriptions with a task instruction in a structured template to form ICL demonstrations, which enable an LLM to directly predict robot actions at test time. Through extensive experiments and analysis, RoboPrompt outperforms zero-shot and ICL baselines in both simulated and real-world settings.
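To make the keyframe step concrete, below is a minimal sketch of one common heuristic for manipulation episodes: a frame counts as a keyframe when the gripper open/close state flips, or when the arm first comes to rest (near-zero joint velocities). The field names `gripper_open` and `joint_velocities` and the velocity threshold are illustrative assumptions, not necessarily the paper's exact implementation.

```python
import numpy as np

def extract_keyframes(episode, vel_eps=1e-3):
    """Pick keyframe indices from a recorded episode.

    Each frame is assumed to be a dict with:
      - 'joint_velocities': np.ndarray of arm joint velocities
      - 'gripper_open':     bool open/closed state
    A frame is a keyframe when the gripper state flips, or when the
    arm first comes to rest (all joint velocities below `vel_eps`).
    """
    keyframes = []
    prev_at_rest = False
    for t in range(1, len(episode)):
        gripper_changed = (
            episode[t]['gripper_open'] != episode[t - 1]['gripper_open']
        )
        at_rest = bool(np.all(np.abs(episode[t]['joint_velocities']) < vel_eps))
        if gripper_changed or (at_rest and not prev_at_rest):
            keyframes.append(t)
        prev_at_rest = at_rest
    return keyframes
```

Filtering on the first at-rest frame (rather than every at-rest frame) keeps consecutive stationary frames from flooding the keyframe set.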
We introduce RoboPrompt, a framework that enables an off-the-shelf text-only LLM to directly predict robot actions through in-context learning (ICL) without any additional training. Our method first identifies keyframes where critical robot actions occur. We then estimate the initial object poses and extract the robot actions at these keyframes, converting both into textual descriptions. Using this textual information along with the given task instruction, we construct a structured prompt of ICL demonstrations, enabling the LLM to directly predict robot actions for an unseen test sample.
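As a rough illustration of the prompt construction, the sketch below renders each demonstration as text (task instruction, initial object poses, then keyframe actions) and appends the test query with an empty action list for the LLM to complete. The textual format, the pose/action tuples, and the helper names `format_demo` and `build_prompt` are assumptions for illustration, not the paper's verbatim template.

```python
def format_demo(instruction, object_poses, actions):
    """Render one episode as plain text: the task instruction, the
    estimated initial object poses, then the keyframe actions."""
    lines = [f"Task: {instruction}"]
    for name, pose in object_poses.items():
        lines.append(f"{name}: {tuple(round(v, 3) for v in pose)}")
    lines.append("Actions:")
    for action in actions:
        lines.append(str(tuple(round(v, 3) for v in action)))
    return "\n".join(lines)

def build_prompt(demos, test_instruction, test_object_poses):
    """Stack the ICL demonstrations, then append the test query with an
    empty action list; the LLM continues the text with predicted actions."""
    blocks = [format_demo(*demo) for demo in demos]
    blocks.append(format_demo(test_instruction, test_object_poses, actions=[]))
    return "\n\n".join(blocks)

# Hypothetical usage with one demonstration; actions here are
# (x, y, z, roll, pitch, yaw, gripper_open) tuples at keyframes.
demos = [(
    "put the red block on the green plate",
    {"red block": (0.42, -0.13, 0.75), "green plate": (0.55, 0.20, 0.75)},
    [(0.42, -0.13, 0.80, 0.0, 3.14, 0.0, 1),
     (0.42, -0.13, 0.76, 0.0, 3.14, 0.0, 0)],
)]
prompt = build_prompt(demos, "put the red block on the green plate",
                      {"red block": (0.38, 0.05, 0.75),
                       "green plate": (0.60, -0.10, 0.75)})
```

The resulting string can be sent to any off-the-shelf chat LLM, and the continuation is parsed back into action tuples (e.g., with `ast.literal_eval`).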
@article{yin2024incontextlearningenablesrobot,
  title={In-Context Learning Enables Robot Action Prediction in LLMs},
  author={Yida Yin and Zekai Wang and Yuvan Sharma and Dantong Niu and Trevor Darrell and Roei Herzig},
  journal={arXiv preprint arXiv:2406.11815},
  year={2024},
}