Interested in NLP and vision for virtual robotic agent tasks? Here’s a talk on my “no-vision vision” Findings of EMNLP 2020 paper, “Visually-Grounded Planning Without Vision: Language Models Infer Detailed Plans from High-Level Instructions”. Recorded for the Spatial Language Understanding (SpLU 2020) workshop at EMNLP 2020.
TL;DR: Language models can successfully reconstruct 26% of long visual semantic plans in the ALFRED virtual robotic agent task using only text, i.e. without any visual input. If the language model is also told where the agent starts in the environment, this performance increases to 56%! But in the remaining 44% of cases the plans go wrong, and the robot agent sometimes microwaves forks.
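To make the setup concrete, here is a minimal sketch of the text-only idea: prompt a language model with a high-level instruction (and, optionally, a line giving the agent's starting location) and decode a step-by-step plan. It uses an off-the-shelf GPT-2 via HuggingFace transformers; the prompt format, the example task, and the decoding settings are illustrative assumptions, not the fine-tuned configuration from the paper.

```python
# Sketch only: prompt a language model with a high-level instruction plus an
# optional starting location, then decode a step-by-step plan as plain text.
# The prompt wording and decoding settings are assumptions for illustration.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# High-level instruction in the style of an ALFRED task; the starting-location
# line is the optional extra context mentioned in the TL;DR above.
prompt = (
    "Task: Put a clean fork in a drawer.\n"
    "Starting location: kitchen, facing the counter.\n"
    "Plan:\n1."
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=False,  # greedy decoding, for a deterministic sketch
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

An off-the-shelf GPT-2 will ramble here; the paper's results come from a model fine-tuned on ALFRED plan data, but the input/output framing is the same.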
Paper, code, data, output, and analyses are available at the ALFRED-GPT-2 GitHub repository (cognitiveailab).