Few-shot Object Reasoning for Robot Instruction Following

Representation learning models are notoriously opaque and difficult to extend. This particularly limits embodied agents that follow instructions in continuously changing environments. This talk focuses on the problem of extending an instruction-following robot's reasoning to new objects, including both how they appear in observations and how they are referred to in text. We define the problem of few-shot language-conditioned object segmentation, and propose an approach that explicitly aligns object references with the space those objects occupy in the world. We train this method using large-scale, automatically generated augmented-reality data. We use the segmentation output, together with the instruction, to formalize and construct a world map that captures the desired behavior around objects while abstracting away object-specific details. We show how integrating this map into a control policy for natural language instruction following with a quadcopter drone allows the robot to follow instructions involving previously unseen objects without any additional training.
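To give a rough sense of what few-shot language-conditioned object segmentation involves, the sketch below scores each pixel of a feature map against a prototype built from a handful of support examples and a text embedding of the object reference. All names, shapes, and the prototype-matching scheme are illustrative assumptions, not the model described in the talk.

    # Illustrative sketch only: toy few-shot, language-conditioned segmentation
    # via cosine similarity between per-pixel features and a fused prototype.
    import numpy as np

    def cosine(a, b, eps=1e-8):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

    def segment_reference(pixel_feats, text_feat, support_feats, threshold=0.5):
        """Return a boolean H x W mask selecting pixels that match the reference.

        pixel_feats:   H x W x D visual features of the current observation.
        text_feat:     D-dim embedding of the object reference in the instruction.
        support_feats: list of D-dim features from the few support examples.
        """
        # Fuse the language embedding with the few-shot visual prototype (assumed scheme).
        prototype = np.mean(support_feats + [text_feat], axis=0)
        h, w, _ = pixel_feats.shape
        mask = np.zeros((h, w), dtype=bool)
        for i in range(h):
            for j in range(w):
                mask[i, j] = cosine(pixel_feats[i, j], prototype) > threshold
        return mask

    # Toy usage with random vectors standing in for a learned encoder.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 8, 16))
    text = rng.normal(size=16)
    support = [rng.normal(size=16) for _ in range(3)]
    print(segment_reference(feats, text, support).sum(), "pixels selected")

In the approach discussed in the talk, such a mask would then feed into the object-abstracted world map that conditions the control policy.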