Zhancun Mu

Undergraduate student

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting


Journal article


Shaofei Cai, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, Yitao Liang

arXiv Project
Cite

APA
Cai, S., Wang, Z., Lian, K., Mu, Z., Ma, X., Liu, A., & Liang, Y. (n.d.). ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting.


Chicago/Turabian
Cai, Shaofei, Zihao Wang, Kewei Lian, Zhancun Mu, Xiaojian Ma, Anji Liu, and Yitao Liang. “ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting” (n.d.).


MLA
Cai, Shaofei, et al. ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting.


BibTeX

@article{shaofei-a,
  title = {ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting},
  author = {Cai, Shaofei and Wang, Zihao and Lian, Kewei and Mu, Zhancun and Ma, Xiaojian and Liu, Anji and Liang, Yitao}
}

Abstract

Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges. A key issue is the difficulty in smoothly connecting individual entities in low-level observations with abstract concepts required for planning. A common approach to address this problem is through the use of hierarchical agents, where VLMs serve as high-level reasoners that break down tasks into executable sub-tasks, typically specified using language and imagined observations. However, language often fails to effectively convey spatial information, while generating future images with sufficient accuracy remains challenging. To address these limitations, we propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from both past and present observations to guide policy-environment interactions. Using this approach, we train ROCKET-1, a low-level policy that predicts actions based on concatenated visual observations and segmentation masks, with real-time object tracking provided by SAM-2. Our method unlocks the full potential of VLMs’ visual-language reasoning abilities, enabling them to solve complex creative tasks, especially those heavily reliant on spatial understanding. Experiments in Minecraft demonstrate that our approach allows agents to accomplish previously unattainable tasks, highlighting the effectiveness of visual-temporal context prompting in embodied decision-making.
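Illustrative sketch

Below is a minimal, hypothetical sketch of the input format described in the abstract, assuming a PyTorch-style setup: the low-level policy consumes RGB observations concatenated channel-wise with binary segmentation masks of the prompted object (masks that a tracker such as SAM-2 would supply in practice) and predicts an action from the temporal context. The module names, network sizes, and discrete action space here are illustrative assumptions, not the paper's actual architecture.

# Toy illustration of visual-temporal context prompting: the policy sees
# RGB frames stacked with object-segmentation masks and predicts an action.
# All names and dimensions are assumptions for illustration only.

import torch
import torch.nn as nn


class Rocket1StylePolicy(nn.Module):
    """Per-frame CNN encoder over (RGB + mask), then a GRU that aggregates
    temporal context before predicting a discrete action."""

    def __init__(self, num_actions: int = 10, hidden_dim: int = 256):
        super().__init__()
        # 3 RGB channels + 1 segmentation-mask channel per frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim), nn.ReLU(),
        )
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) RGB observations
        # masks:  (B, T, 1, H, W) binary masks of the prompted object,
        #         e.g. produced by an off-the-shelf tracker such as SAM-2
        #         (a real tracker is not included in this sketch).
        b, t = frames.shape[:2]
        x = torch.cat([frames, masks], dim=2)           # (B, T, 4, H, W)
        feats = self.encoder(x.flatten(0, 1)).view(b, t, -1)
        out, _ = self.temporal(feats)                   # temporal context
        return self.action_head(out[:, -1])             # logits for latest step


if __name__ == "__main__":
    policy = Rocket1StylePolicy()
    frames = torch.rand(2, 8, 3, 128, 128)              # batch of 2, 8-frame context
    masks = torch.randint(0, 2, (2, 8, 1, 128, 128)).float()
    print(policy(frames, masks).shape)                  # torch.Size([2, 10])

In this toy setup the VLM's role is reduced to choosing which object to segment and prompt; the policy itself never receives language, which is the point of the visual-temporal context protocol described above.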

