Also available on Google Scholar.

DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
Jansen, Côté, Khot, Bransom, Dalvi, Majumder, Tafjord, Clark — Arxiv
[project website] [code on github]

Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent’s capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent’s ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of different challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks, suggesting that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that DISCOVERYWORLD may help accelerate near-term development and assessment of scientific discovery competency in agents.

Can Language Models Serve as Text-Based World Simulators?
Wang, Todd, Xiao, Yuan, Côté, Clark, Jansen — ACL 2024
[benchmark and code on github]

Virtual environments play a key role in benchmarking advances in complex planning and decision-making tasks but are expensive and complicated to build by hand. Can current language models themselves serve as world simulators, correctly predicting how actions change different world states, thus bypassing the need for extensive manual coding? Our goal is to answer this question in the context of text-based simulators. Our approach is to build and use a new benchmark, called ByteSized32-State-Prediction, containing a dataset of text game state transitions and accompanying game tasks. We use this to directly quantify, for the first time, how well LLMs can serve as text-based world simulators. We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations. This work thus contributes both new insights into current LLM’s capabilities and weaknesses, as well as a novel benchmark to track future progress as new models appear.

PDDLEGO: Iterative Planning in Textual Environments
Zhang, Jansen, Zhang, Clark, Callison-Burch, Tandon — *SEM 2024
[code on github]

Planning in textual environments have been shown to be a long-standing challenge even for current models. A recent, promising line of work uses LLMs to generate a formal representation of the environment that can be solved by a symbolic planner. However, existing methods rely on a fully-observed environment where all entity states are initially known, so a one-off representation can be constructed, leading to a complete plan. In contrast, we tackle partially-observed environments where there is initially no sufficient information to plan for the end-goal. We propose PDDLEGO that iteratively construct a planning representation that can lead to a partial plan for a given sub-goal. By accomplishing the sub-goal, more information is acquired to augment the representation, eventually achieving the end-goal. We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation, with strong performance (98%) on the more complex Cooking World simulation where end-to-end LLMs fail to generate coherent plans (4%).

CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Majumder, Dalvi, Jansen, Tafjord, Tandon, Zhang, Callison-Burch, Clark — Arxiv
[project website] [code on github]

Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time beyond performance refinement on a specific task. Here we present CLIN, the first language-based agent to achieve this, so that it continually improves over multiple trials, including when both the environment and task are varied, and without requiring parameter updates. Our approach is to use a persistent, dynamic, textual memory centered on causal abstractions (rather than general “helpful hints”) that is regularly updated after each trial so that the agent gradually learns useful knowledge for new trials. In the ScienceWorld benchmark, CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 absolute points. CLIN can also transfer its learning to new environments (or new tasks), improving its zero-shot performance by 4 points (13 for new tasks) and can further improve performance there through continual memory updates, enhancing performance by an additional 17 points (7 for new tasks). This suggests a new architecture for agents built on frozen models that can still continually and rapidly improve over time.

Self-Supervised Behavior Cloned Transformers are Path Crawlers for Text Games
Wang, Jansen — EMNLP 2023 (Findings)
[code on github]

In this work, we introduce a self-supervised behavior cloning transformer for text games, which are challenging benchmarks for multi-step reasoning in virtual environments. Traditionally, Behavior Cloning Transformers excel in such tasks but rely on supervised training data. Our approach auto-generates training data by exploring trajectories (defined by common macro-action sequences) that lead to reward within the games, while determining the generality and utility of these trajectories by rapidly training small models then evaluating their performance on unseen development games. Through empirical analysis, we show our method consistently uncovers generalizable training data, achieving about 90\% performance of supervised systems across three benchmark text games.

ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
Wang, Todd, Yuan, Xiao, Côté, Jansen — EMNLP 2023
[corpus and code on github]

In this work we examine the ability of language models to generate explicit world models of scientific and common-sense reasoning tasks by framing this as a problem of generating text-based games. To support this, we introduce ByteSized32, a corpus of 32 highly-templated text games written in Python totaling 24k lines of code, each centered around a particular task, and paired with a set of 16 unseen text game specifications for evaluation. We propose a suite of automatic and manual metrics for assessing simulation validity, compliance with task specifications, playability, winnability, and alignment with the physical world. In a single-shot evaluation of GPT-4 on this simulation-as-code-generation task, we find it capable of producing runnable games in 27% of cases, highlighting the difficulty of this challenge task. We discuss areas of future improvement, including GPT-4’s apparent capacity to perform well at simulating near canonical task solutions, with performance dropping off as simulations include distractors or deviate from canonical solutions in the action space.

From Words to Wires: Generating Functioning Electronic Devices from Natural Language Descriptions
Jansen — EMNLP 2023 (Findings)
[code on github] [demo video (2m)]

In this work, we show that contemporary language models have a previously unknown skill — the capacity for electronic circuit design from high-level textual descriptions, akin to code generation. We introduce two benchmarks: Pins100, assessing model knowledge of electrical components, and Micro25, evaluating a model’s capability to design common microcontroller circuits and code in the Arduino ecosystem that involve input, output, sensors, motors, protocols, and logic — with models such as GPT-4 and Claude-V1 achieving between 60% to 96% Pass@1 on generating full devices. We include six case studies of using language models as a design assistant for moderately complex devices, such as a radiation-powered random number generator, an emoji keyboard, a visible spectrometer, and several assistive devices, while offering a qualitative analysis performance, outlining evaluation challenges, and suggesting areas of development to improve complex circuit design and practical utility. With this work, we aim to spur research at the juncture of natural language processing and electronic design.

Behavior Cloned Transformers are Neurosymbolic Reasoners
Wang, Jansen, Côté, Ammanabrolu — EACL 2023
[code on github]

In this work, we explore techniques for augmenting interactive agents with information from symbolic modules, much like humans use tools like calculators and GPS systems to assist with arithmetic and navigation. We test our agent’s abilities in text games — challenging benchmarks for evaluating the multi-step reasoning abilities of game agents in grounded, language-based environments. Our experimental study indicates that injecting the actions from these symbolic modules into the action space of a behavior cloned transformer agent increases performance on four text game benchmarks that test arithmetic, navigation, sorting, and common sense reasoning by an average of 22%, allowing an agent to reach the highest possible performance on unseen games. This action injection technique is easily extended to new agents, environments, and symbolic modules.

TextWorldExpress: Simulating Text Games at One Million Steps Per Second
Jansen, Côté — EACL 2023
[code on github] [system demo video (18m)]

Text-based games offer a challenging test bed to evaluate virtual agents at language understanding, multi-step problem-solving, and common-sense reasoning. However, speed is a major limitation of current text-based games, capping at 300 steps per second, mainly due to the use of legacy tooling. In this work we present TextWorldExpress, a high-performance implementation of three common text game benchmarks that increases simulation throughput by approximately three orders of magnitude, reaching over one million steps per second on common desktop hardware. This significantly reduces experiment runtime, enabling billion-step-scale experiments in about one day.

ScienceWorld: Is your Agent Smarter than a 5th Grader?
Wang, Jansen, Côté, Ammanabrolu — EMNLP 2022
[ai2 website] [code on github] [talk (30m)]

This paper presents a new benchmark, ScienceWorld, to test agents’ scientific reasoning abilities in a new interactive text environment at the level of a standard elementary school science curriculum. Despite the recent transformer-based progress seen in adjacent fields such as question-answering, scientific text processing, and the wider area of natural language processing, we find that current state-of-the-art models are unable to reason about or explain learned science concepts in novel contexts. For instance, models can easily answer what the conductivity of a previously seen material is but struggle when asked how they would conduct an experiment in a grounded, interactive environment to find the conductivity of an unknown material. This begs the question of whether current models are simply retrieving answers by way of seeing a large number of similar input examples or if they have learned to reason about concepts in a reusable manner. We hypothesize that agents need to be grounded in interactive environments to achieve such reasoning capabilities. Our experiments provide empirical evidence supporting this hypothesis — showing that a 1.5 million parameter agent trained interactively for 100k steps outperforms a 11 billion parameter model statically trained for scientific question-answering and reasoning via millions of expert demonstrations.

Picard understanding Darmok: A Dataset and Model for Metaphor-Rich Translation in a Constructed Language
Jansen, Boyd-Graber — FigLang 2022
[code and data github] [talk (8m)]

Tamarian, a fictional language introduced in the Star Trek episode Darmok, communicates meaning through utterances of metaphorical references, such as “Darmok and Jalad at Tanagra” instead of “We should work together.” This work assembles a Tamarian-English dictionary of utterances from the original episode and several follow-on novels, and uses this to construct a parallel corpus of 456 English-Tamarian utterances. A machine translation system based on a large language model (T5) is trained using this parallel corpus, and is shown to produce an accuracy of 76% when translating from English to Tamarian on known utterances.

A Systematic Survey of Text Worlds as Embodied Natural Language Environments
Jansen — WordPlay 2022

Text Worlds are virtual environments for embodied agents that, unlike 2D or 3D environments, are rendered exclusively using textual descriptions. These environments offer an alternative to higher-fidelity 3D environments due to their low barrier to entry, providing the ability to study semantics, compositional inference, and other high-level tasks with rich high-level action spaces while controlling for perceptual input. This systematic survey outlines recent developments in tooling, environments, and agent modeling for Text Worlds, while examining recent trends in knowledge graphs, common sense reasoning, transfer learning of Text World performance to higher-fidelity environments, as well as near-term development targets that, once achieved, make Text Worlds an attractive general research paradigm for natural language processing.

Extracting Space Situational Awareness Events from News Text
Xie, Kwak, George, Dozal, Van, Jah, Furfaro, Jansen — LREC 2022
[data] [talk (10m)]

Space situational awareness typically makes use of physical measurements from radar, telescopes, and other assets to monitor satellites and other spacecraft for operational, navigational, and defense purposes. In this work we explore using textual input for the space situational awareness task. We construct a corpus of 48.5k news articles spanning all known active satellites between 2009 and 2020. Using a dependency-rule-based extraction system designed to target three high-impact events — spacecraft launches, failures, and decommissionings, we identify 1,787 space-event sentences that are then annotated by humans with 15.9k labels for event slots. We empirically demonstrate a state-of-the-art neural extraction system achieves an overall F1 between 53 and 91 per slot for event extraction in this low-resource, high-impact domain.

Explaining Answers with Entailment Trees
Dalvi*, Jansen*, Tafjord, Xie, Smith, Pipatanangkura, Clark — EMNLP 2021
[data] [EntailmentBank book — desk reference for explanation trees] [talk (12m)]

Our goal, in the context of open-domain textual question-answering (QA), is to explain answers by not just listing supporting textual evidence (“rationales”), but also showing how such evidence leads to the answer in a systematic way. If this could be done, new opportunities for understanding and debugging the system’s reasoning would become possible. Our approach is to generate explanations in the form of entailment trees, namely a tree of entailment steps from facts that are known, through intermediate conclusions, to the final answer. To train a model with this skill, we created ENTAILMENTBANK, the first dataset to contain multistep entailment trees. At each node in the tree (typically) two or more facts compose together to produce a new conclusion. Given a hypothesis (question + answer), we define three increasingly difficult explanation tasks: generate a valid entailment tree given (a) all relevant sentences (the leaves of the gold entailment tree), (b) all relevant and some irrelevant sentences, or (c) a corpus. We show that a strong language model only partially solves these tasks, and identify several new directions to improve performance. This work is significant as it provides a new type of dataset (multistep entailments) and baselines, offering a new avenue for the community to generate richer, more systematic explanations.

On the Challenges of Evaluating Compositional Explanations in Multi-Hop Inference: Relevance, Completeness, and Expert Ratings
Jansen, Smith, Moreno, Ortiz — EMNLP 2021
[data] [talk (16m)]

Building compositional explanations requires models to combine two or more facts that, together, describe why the answer to a question is correct. Typically, these “multi-hop” explanations are evaluated relative to one (or a small number of) gold explanations. In this work, we show these evaluations substantially underestimate model performance, both in terms of the relevance of included facts, as well as the completeness of model-generated explanations, because models regularly discover and produce valid explanations that are different than gold explanations. To address this, we construct a large corpus of 126k domain-expert (science teacher) relevance ratings that augment a corpus of explanations to standardized science exam questions, discovering 80k additional relevant facts not rated as gold. We build three strong models based on different methodologies (generation, ranking, and schemas), and empirically show that while expert-augmented ratings provide better estimates of explanation quality, both original (gold) and expert-augmented automatic evaluations still substantially underestimate performance by up to 36% when compared with full manual expert judgements, with different models being disproportionately affected. This poses a significant methodological challenge to accurately evaluating explanations produced by compositional reasoning models.

TextGraphs 2021 Shared Task on Multi-Hop Inference for Explanation Regeneration
Thayaparan, Valentino, Jansen and Ustalov — Textgraphs 2021
[codalab competition] [participant kit]

The Shared Task on Multi-Hop Inference for Explanation Regeneration asks participants to compose large multi-hop explanations to questions by assembling large chains of facts from a supporting knowledge base. While previous editions of this shared task aimed to evaluate explanatory completeness – finding a set of facts that form a complete inference chain, without gaps, to arrive from question to correct answer, this 2021 instantiation concentrates on the subtask of determining relevance in large multi-hop explanations. To this end, this edition of the shared task makes use of a large set of approximately 250k manual explanatory relevancy ratings that augment the 2020 shared task data. In this summary paper, we describe the details of the explanation regeneration task, the evaluation data, and the participating systems. Additionally, we perform a detailed analysis of participating systems, evaluating various aspects involved in the multi-hop inference process. The best performing system achieved an NDCG of 0.82 on this challenging task, substantially increasing performance over baseline methods by 32%, while also leaving significant room for future improvement.

TextGraphs 2020 Shared Task on Multi-Hop Inference for Explanation Regeneration
Jansen and Ustalov — Textgraphs 2020
[slides] [talk (short 15m)] [talk (long 30m)]

The 2020 Shared Task on Multi-Hop Inference for Explanation Regeneration tasks participants with regenerating large detailed multi-fact explanations for standardized science exam questions. Given a question, correct answer, and knowledge base, models must rank each fact in the knowledge base such that facts most likely to appear in the explanation are ranked highest. Explanations consist of an average of 6 (and as many as 16) facts that span both core scientific knowledge and world knowledge, and form an explicit lexically-connected “explanation graph” describing how the facts interrelate. In this second iteration of the explanation regeneration shared task, participants are supplied with more than double the training and evaluation data of the first shared task, as well as a knowledge base nearly double in size, both of which expand into more challenging scientific topics that increase the difficulty of the task. In total 10 teams participated, and 5 teams submitted system description papers. The best-performing teams significantly increased state-of-the-art performance both in terms of ranking (mean average precision) and inference speed on this challenge task.

COSATA: A Constraint Satisfaction Solver and Interpreted Language for Semi-Structured Tables of Sentences
Jansen — EMNLP 2020
[Code and Tool] [talk and demo video (youtube)]

This work presents CoSaTa, an intuitive constraint satisfaction solver and interpreted language for knowledge bases of semi-structured tables expressed as text. The stand-alone CoSaTa solver allows easily expressing complex compositional “inference patterns” for how knowledge from different tables tends to connect to support inference and explanation construction in question answering and other downstream tasks, while including advanced declarative features and the ability to operate over multiple representations of text (words, lemmas, or part-of-speech tags). CoSaTa also includes a hybrid imperative/declarative interpreted language for expressing simple models through minimally-specified simulations grounded in constraint patterns, helping bridge the gap between question answering, question explanation, and model simulation. The solver and interpreter are released as open source.

Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
Jansen — Findings of EMNLP 2020
[data and code] [slides] [talk (youtube)]

The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as “put a hot piece of bread on a plate”. Currently, the best-performing models are able to complete less than 1% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases. When a small amount of visual information, the starting location in the virtual environment, is incorporated, our best-performing GPT-2 model successfully generates gold command sequences in 58% of cases, suggesting contextualized language models may provide strong planning modules for grounded virtual agents.

WorldTree V2: A Corpus of Science-Domain Structured Explanations and Inference Patterns supporting Multi-Hop Inference
Xie, Thiem, Martin, Wainwright, Marmorstein, Jansen — LREC 2020
[data and tool] [WorldTree Book — Desk Reference of Explanation Graphs]

Explainable question answering for complex questions often requires combining large numbers of facts to answer a question while providing a human-readable explanation for the answer, a process known as multi-hop inference. Standardized science questions require combining an average of 6 facts, and as many as 16 facts, in order to answer and explain, but most existing datasets for multi-hop reasoning focus on combining only two facts, significantly limiting the ability of multi-hop inference algorithms to learn to generate large inferences. In this work we present the second iteration of the WorldTree project, a corpus of 5,114 standardized science exam questions paired with large detailed multi-fact explanations that combine core scientific knowledge and world knowledge. Each explanation is represented as a lexically-connected “explanation graph” that combines an average of 6 facts drawn from a semi-structured knowledge base of 9,216 facts across 66 tables. We use this explanation corpus to author a set of 344 high-level science domain inference patterns similar to semantic frames supporting multi-hop inference. Together, these resources provide training data and instrumentation for developing many-hop multi-hop inference models for question answering.

ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition
Smith, Zhang, Culnan, Jansen — LREC 2020
[data and code]

Named entity recognition identifies common classes of entities in text, but these entity labels are generally sparse, limiting utility to downstream tasks. In this work we present ScienceExamCER, a densely-labeled semantic classification corpus of 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms. Semantic class labels are drawn from a manually-constructed fine-grained typology of 601 classes generated through a data-driven analysis of 4,239 science exam questions. We show an off-the-shelf BERT-based named entity recognition model modified for multi-label classification achieves an accuracy of 0.85 F1 on this task, suggesting strong utility for downstream tasks in science domain question answering requiring densely-labeled semantic classification.

Multi-class Hierarchical Question Classification for Multiple Choice Science Exams
Xu, Jansen, Martin, Xie, Yadav, Madabushi, Tafjord, Clark — LREC 2020

Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately. However, developing strong QC algorithms has been hindered by the limited size and complexity of annotated data available. To address this, we present the largest challenge dataset for QC, containing 7,787 science exam questions paired with detailed classification labels from a fine-grained hierarchical taxonomy of 406 problem domains. We then show that a BERT-based model trained on this dataset achieves a large (+0.12 MAP) gain compared with previous methods, while also achieving state-of-the-art performance on benchmark open-domain and biomedical QC datasets. Finally, we show that using this model’s predictions of question topic significantly improves the accuracy of a question answering system by +1.7% P@1, with substantial future gains possible as QC performance improves.

QASC: A Dataset for Question Answering via Sentence Composition
Khot, Clark, Guerquin, Jansen, Sabharwal — AAAI 2020

Composing knowledge from multiple pieces of texts is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.

Extracting Common Inference Patterns from Semi-Structured Explanations
Thiem and Jansen — COIN 2019
[data] [visualization]

Complex questions often require combining multiple facts to correctly answer, particularly when generating detailed explanations for why those answers are correct. Combining multiple facts to answer questions is often modeled as a “multi-hop” graph traversal problem, where a given solver must find a series of interconnected facts in a knowledge graph that, taken together, answer the question and explain the reasoning behind that answer. Multi-hop inference currently suffers from semantic drift, or the tendency for chains of reasoning to “drift”‘ to unrelated topics, and this semantic drift greatly limits the number of facts that can be combined in both free text or knowledge base inference. In this work we present our effort to mitigate semantic drift by extracting large high-confidence multi-hop inference patterns, generated by abstracting large-scale explanatory structure from a corpus of detailed explanations. We represent these inference patterns as sets of generalized constraints over sentences represented as rows in a knowledge base of semi-structured tables. We present a prototype tool for identifying common inference patterns from corpora of semi-structured explanations, and use it to successfully extract 67 inference patterns from a “matter” subset of standardized elementary science exam questions that span scientific and world knowledge.

TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration
Jansen and Ustalov— TextGraphs 2019
[shared task participant kit] [slides]

While automated question answering systems are increasingly able to retrieve answers to natural language questions, their ability to generate detailed human-readable explanations for their answers is still quite limited. The Shared Task on Multi-Hop Inference for Explanation Regeneration tasks participants with regenerating detailed gold explanations for standardized elementary science exam questions by selecting facts from a knowledge base of semi-structured tables. Each explanation contains between 1 and 16 interconnected facts that form an “explanation graph” spanning core scientific knowledge and detailed world knowledge. It is expected that successfully combining these facts to generate detailed explanations will require advancing methods in multi-hop inference and information combination, and will make use of the supervised training data provided by the WorldTree explanation corpus. The top-performing system achieved a mean average precision (MAP) of 0.56, substantially advancing the state-of-the-art over a baseline information retrieval model. Detailed extended analyses of all submitted systems showed large relative improvements in accessing the most challenging multi-hop inference problems, while absolute performance remains low, highlighting the difficulty of generating detailed explanations through multi-hop reasoning.

Multi-hop Inference for Sentence-level TextGraphs: How Challenging is Meaningfully Combining Information for Science Question Answering?
Jansen — TextGraphs 2018
[data] [slides]

Question Answering for complex questions is often modeled as a graph construction or traversal task, where a solver must build or traverse a graph of facts that answer and explain a given question. This “multi-hop” inference has been shown to be extremely challenging, with few models able to aggregate more than two facts before being overwhelmed by “semantic drift”, or the tendency for long chains of facts to quickly drift off topic. This is a major barrier to current inference models, as even elementary science questions require an average of 4 to 6 facts to answer and explain. In this work we empirically characterize the difficulty of building or traversing a graph of sentences connected by lexical overlap, by evaluating chance sentence aggregation quality through 9,784 manually-annotated judgments across knowledge graphs built from three free-text corpora (including study guides and Simple Wikipedia). We demonstrate semantic drift tends to be high and aggregation quality low, at between 0.04% and 3%, and highlight scenarios that maximize the likelihood of meaningfully combining information.

WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions supporting Multi-hop Inference
Jansen, Wainwright, Marmorstein, Morrison — LREC 2018
[data, code, and tool] [talk on this project]

Developing methods of automated inference that are able to provide users with compelling human-readable justifications for why the answer to a question is correct is critical for domains such as science and medicine, where user trust and detecting costly errors are limiting factors to adoption. One of the central barriers to training question answering models on explainable inference tasks is the lack of gold explanations to serve as training data. In this paper we present a corpus of explanations for standardized science exams, a recent challenge task for question answering. We manually construct a corpus of detailed explanations for nearly all publicly available standardized elementary science question (approximately 1,680 3rd through 5th grade questions) and represent these as “explanation graphs” – sets of lexically overlapping sentences that describe how to arrive at the correct answer to a question through a combination of domain and world knowledge. We also provide an explanation-centered tablestore, a collection of semi-structured tables that contain the knowledge to construct these elementary science explanations. Together, these two knowledge resources map out a substantial portion of the knowledge required for answering and explaining elementary science exams, and provide both structured and free-text training data for the explainable inference task.

Controlling Information Aggregation for Complex Question Answering
Kwon, Trivedi, Jansen, Surdeanu and Balasubramanian — ECIR 2018

Complex question answering, the task of answering complex natural language questions that rely on inference, requires the aggregation of information from multiple sources. Automatic aggregation often fails because it combines se-mantically unrelated facts leading to bad inferences. This paper proposes methods to address this inference drift problem. In particular, the paper develops unsupervised and supervised mechanisms to control random walks on Open Information Extraction (OIE) knowledge graphs. Empirical evaluation on an elementary science exam benchmark shows that the proposed methods enables effective aggregation even over larger graphs and demonstrates the complementary value of information aggregation for answering complex questions.

A Study of Automatically Acquiring Explanatory Inference Patterns from Corpora of Explanations: Lessons from Elementary Science Exams
Jansen — AKBC 2017
[data] [slides]

Our long term interest is in building inference algorithms capable of answering questions and producing human-readable explanations by aggregating information from multiple sources and knowledge bases. Currently information aggregation (also referred to as “multi-hop inference”) is challenging for more than two facts due to “semantic drift”, or the tendency for natural language inference algorithms to quickly move off-topic when assembling long chains of knowledge. In this paper we explore the possibility of generating large explanations with an average of six facts by automatically extracting common explanatory patterns from a corpus of manually authored elementary science explanations represented as lexically-connected explanation graphs grounded in a semi-structured knowledge base of tables. We empirically demonstrate that there are sufficient common explanatory patterns in this corpus that it is possible in principle to reconstruct unseen explanation graphs by merging multiple explanatory patterns, then adapting and/or adding to their knowledge. This may ultimately provide a mechanism to allow inference algorithms to surpass the two-fact “aggregation horizon” in practice by using common explanatory patterns as constraints to limit the search space during information aggregation.

Tell Me Why: Using Question Answering as Distant Supervision for Answer Justification
Sharp, Surdeanu, Jansen, Valenzuela-Escarcega, Clark, and Hammond — CoNLL 2017

For many applications of question answer- ing (QA), being able to explain why a given model chose an answer is critical. However, the lack of labeled data for answer justifications makes learning this difficult and expensive. Here we propose an approach that uses answer ranking as distant supervision for learning how to select informative justifications, where justifications serve as inferential connections be- tween the question and the correct answer while often containing little lexical over- lap with either. We propose a neural net- work architecture for QA that reranks answer justifications as an intermediate (and human-interpretable) step in answer selection. Our approach is informed by a set of features designed to combine both learned representations and explicit features to capture the connection between questions, answers, and answer justifications. We show that with this end-to-end approach we are able to significantly improve upon a strong IR baseline in both justification ranking (+9% rated highly relevant) and answer selection (+6% P@1).

Framing Question Answering as Building and Ranking Answer Justifications
Jansen, Sharp, Surdeanu, and Clark — Computational Linguistics 2017

We propose a question answering (QA) approach that both identifies correct answers and produces compelling human-readable justifications for why those answers are correct. Our method first identifies the actual information need in a question using psycholinguistic concreteness norms, then uses this information need to construct answer justifications by aggregating multiple sentences from different knowledge bases using syntactic and lexical information. We then jointly rank answers and their justifications using a reranking perceptron that treats justification quality as a latent variable. We evaluate our method on 1,000 multiple-choice questions from elementary school science exams, and empirically demonstrate that it performs better than several strong baselines. Our best configuration answers 44% of the questions correctly, where the top justifications for 57% of these correct answers contain a compelling human-readable justification that explains the inference required to arrive at the correct answer. We include a detailed characterization of the justification quality for both our method and a strong information retrieval baseline, and show that information aggregation is key to addressing the information need in complex questions.

What’s in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams
Jansen, Balasubramanian, Surdeanu, and Clark — COLING 2016
[data and tool] [slides]

QA systems have been making steady advances in the challenging elementary science exam domain. In this work, we develop an explanation-based analysis of knowledge and inference requirements, which supports a fine-grained characterization of the challenges. In particular, we model the requirements based on appropriate sources of evidence to be used for the QA task. We create requirements by first identifying suitable sentences in a knowledge base that support the correct answer, then use these to build explanations, filling in any necessary missing information. These explanations are used to create a fine-grained categorization of the requirements. Using these requirements, we compare a retrieval and an inference solver on 212 questions. The analysis validates the gains of the inference solver, demonstrating that it answers more questions requiring complex inference, while also providing insights into the relative strengths of the solvers and knowledge sources. We release the annotated questions and explanations as a resource with broad utility for science exam QA, including determining knowledge base construction targets, as well as supporting information aggregation in automated inference.

Creating Causal Embeddings for Question Answering with Minimal Supervision
Sharp, Surdeanu, Jansen, Clark, Hammond — EMNLP 2016

A common model for question answering (QA) is that a good answer is one that is closely related to the question, where relatedness is often determined using general-purpose lexical models such as word embeddings. We argue that a better approach is to look for answers that are related to the question in a relevant way, according to the information need of the question, which may be determined through task-specific embeddings. With causality as a use case, we implement this insight in three steps. First, we generate causal embeddings cost-effectively by bootstrapping cause-effect pairs extracted from free text using a small set of seed patterns. Second, we train dedicated embeddings over this data, by using task-specific contexts, i.e., the context of a cause is its effect. Finally, we extend a state-of-the-art reranking approach for QA to incorporate these causal embeddings. We evaluate the causal embedding models both directly with a casual implication task, and indirectly, in a downstream causal QA task using data from Yahoo! Answers. We show that explicitly modeling causality improves performance in both tasks. In the QA task our best model achieves 37.3% P@1, significantly outperforming a strong baseline by 7.7% (relative).

Spinning Straw into Gold: Using Free Text to Train Monolingual Alignment Models for Non-factoid Question Answering
Sharp, Jansen, Surdeanu, and Clark — NAACL 2015
[code and data]

Monolingual alignment models have been shown to boost the performance of question answering systems by “bridging the lexical chasm” between questions and answers. The main limitation of these approaches is that they require semi-structured training data in the form of question-answer pairs, which is diffi cult to obtain in specialized domains or low-resource languages. We propose two inexpensive methods for training alignment models solely using free text, by generating artificial question-answer pairs from discourse structures. Our approach is driven by two representations of discourse: a shallow sequential representation, and a deep one based on Rhetorical Structure Theory. We evaluate the proposed model on two corpora from different genres and domains: one from Yahoo! Answers and one from the biology domain, and two types of non-factoid questions: manner and reason. We show that these alignment models trained directly from discourse structures imposed on free text improve performance considerably over an information retrieval baseline and a neural network language model trained on the same data.

Higher-order Lexical Semantic Models for Non-factoid Answer Reranking
Fried, Jansen, Hahn-Powell, Surdeanu, and Clark — Transactions of the ACL 2015

Lexical semantic models provide robust performance for question answering, but, in general, can only capitalize on direct evidence seen during training. For example, monolingual alignment models acquire term alignment probabilities from semi-structured data such as question-answer pairs; neural network language models learn term embeddings from unstructured text. All this knowledge is then used to estimate the semantic similarity between question and answer candidates. We introduce a higher-order formalism that allows all these lexical semantic models to chain direct evidence to construct indirect associations between question and answer texts, by casting the task as the traversal of graphs that encode direct term associations. Using a corpus of 10,000 questions from Yahoo! Answers, we experimentally demonstrate that higher-order methods are broadly applicable to alignment and language models, across both word and syntactic representations. We show that an important criterion for success is controlling for the semantic drift that accumulates during graph traversal. All in all, the proposed higher-order approach improves five out of the six lexical semantic models investigated, with relative gains of up to +13\% over their first-order variants.

Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
Jansen, Surdeanu, and Clark — ACL 2014
[code and data]

We propose a robust answer reranking model for non-factoid questions that integrates lexical semantics with discourse information, driven by two representations of discourse: a shallow representation centered around discourse markers, and a deep one based on Rhetorical Structure Theory. We evaluate the proposed model on two corpora from different genres and domains: one from Yahoo! Answers and one from the biology domain, and two types of non-factoid questions: manner and reason. We experimentally demonstrate that the discourse structure of nonfactoid answers provides information that is complementary to lexical semantic similarity between question and answer, improving performance up to 24% (relative) over a state-of-the-art model that exploits lexical semantic similarity alone. We further demonstrate excellent domain transfer of discourse information, suggesting these discourse features have general utility to non-factoid question answering.

Transmitting Narrative: An Interactive Shift-Summarization Tool for Improving Nurse Communication.
Forbes, Surdeanu, Jansen, and Carrington — IEEE Interactive Visual Text Analytics Workshop 2013

This paper describes an ongoing visualization project that aims to improve nurse communication. In particular, we in- vestigate the transmission of information that is related to potentially life-threatening clinical events. Currently these events may remain unnoticed or are misinterpreted by nurses, or most unfortunately, are simply not communicated clearly between nurses during a shift change, leading in some cases to catastrophic results. Our visualization system is based on a novel application of machine learning and natural language processing algorithms. Results are presented in the form of an interactive shift-summarization tool which augments existing Electronic Health Records (EHRs). This tool provides a high level overview of the patient’s health that is generated through an analysis of heterogeneous data: verbal summarizations de- scribing the patient’s health provided by the nurse in charge of the patient, the various monitored vital signs of the patient, and historical information of patients that had unexpected ad- verse reactions that were not foreseen by the receiving nurse despite being indicated by the responding nurse. In this pa- per, we introduce the urgent need for such a tool, describe the various components of our heterogeneous data analysis system, and present proposed enhancements to EHRs via the shift-summarization tool. This interactive, visual tool clearly indicates potential clinical events generated by our automated inferencing system; lets a nurse quickly verify the likelihood of these events; provides a mechanism for annotating the gen- erated events; and finally, makes it easy for a nurse to navigate the temporal aspects of patient data collected during a shift. This temporal data can then be used to interactively articu- late a narrative that more effectively transmits pertinent data to other nurses.

Adaptive feature-specific spectral imaging
Jansen, Dunlop, Golish, Gehm — SPIE 2012

We present an architecture for rapid spectral classification in spectral imaging applications. By making use of knowledge gained in prior measurements, our spectral imaging system is able to design adaptive feature-specific measurement kernels that selectively attend to the portions of a spectrum that contain useful classification information. With measurement kernels designed using a probabilistically-weighted version of principal component analysis, simulations predict an orders-of- magnitude reduction in classification error rates. We report on our latest simulation results, as well as an experimental prototype currently under construction.

Development of a scalable image formation pipeline for multiscale gigapixel photography.
Golish, Vera, Kelly, Gong, Jansen, Hughes, Kittle, Brady, and Gehm — Optics Express 2012

We report on the image formation pipeline developed to efficiently form gigapixel-scale imagery generated by the AWARE-2 multiscale camera. The AWARE-2 camera consists of 98 “microcameras” imaging through a shared spherical objective, covering a 120° x 50° field of view with approximately 40 microradian instantaneous field of view (the angular extent of a pixel). The pipeline is scalable, capable of producing imagery ranging in scope from “live” one megapixel views to full resolution gigapixel images. Architectural choices that enable trivially parallelizable algorithms for rapid image formation and on-the-fly microcamera alignment compensation are discussed.

Multiscale gigapixel photography
(High performance computing work in acknowledgements) Brady et al. — Nature 2012

Pixel count is the ratio of the solid angle within a camera’s field of view to the solid angle covered by a single detector element. Because the size of the smallest resolvable pixel is proportional to aperture diameter and the maximum field of view is scale independent, the diffraction-limited pixel count is proportional to aperture area. At present, digital cameras operate near the fundamental limit of 1 to 10 megapixels for millimetre-scale apertures, but few approach the corresponding limits of 1 to 100 gigapixels for centimetre-scale apertures. Barriers to high-pixel-count imaging include scale-dependent geometric aberrations, the cost and complexity of gigapixel sensor arrays, and the computational and communications challenge of gigapixel image management. Here we describe the AWARE-2 camera, which uses a 16-mm entrance aperture to capture snapshot, one-gigapixel images at three frames per minute. AWARE-2 uses a parallel array of microcameras to reduce the problems of gigapixel imaging to those of megapixel imaging, which are more tractable. In cameras of conventional design, lens speed and field of view decrease as lens scale increases, but with the experimental system described here we confirm previous theoretical results suggesting that lens speed and field of view can be scale independent in microcamera-based imagers resolving up to 50 gigapixels. Ubiquitous gigapixel cameras may transform the central challenge of photography from the question of where to point the camera to that of how to mine the data.

Strong systematicity through sensorimotor conceptual grounding: an unsupervised, developmental approach to connectionist sentence processing.
Jansen and Watter — Connection Science 2012

Connectionist language modelling typically has difficulty with syntactic systematicity, or the ability to generalise language learning to untrained sentences. This work develops an unsupervised connectionist model of infant grammar learning. Following the semantic boostrapping hypothesis, the network distils word category using a developmentally plausible infant-scale database of grounded sensorimotor conceptual representations, as well as a biologically plausible semantic co-occurrence activation function. The network then uses this knowledge to acquire an early benchmark clausal grammar using correlational learning, and further acquires separate conceptual and grammatical category representations. The network displays strongly systematic behaviour indicative of the general acquisition of the combinatorial systematicity present in the grounded infant-scale language stream, outperforms previous contemporary models that contain primarily noun and verb word categories, and successfully generalises broadly to novel untrained sensorimotor grounded sentences composed of unfamiliar nouns and verbs. Limitations as well as implications to later grammar learning are discussed.

A computational vector-map model of neonate saccades: Modulating the externality effect through refraction periods.
Jansen, Fiacconi, and Gibson — Vision Research 2010

The present study develops an explicit and predictive computational model of neonate saccades based on the interaction of several simple mechanisms, including the tendency to fixate towards areas of high contrast, and the decay and recovery of a world-centered contrast representation simulating a low-level inhibition of return mechanism. Emergent properties similar to early visual behaviors develop, including the externality effect (or tendency to focus on external then internal features). The age-associated progression of this effect is modulated by the decay period of the model’s contrast representation, where the high-level behavior of either scanning broadly or locally is modulated by a single decay parameter.

SayWhen: an automated method for high-accuracy speech onset detection.
Jansen and Watter — Behavioral Research Methods 2008
[SayWhen website, tutorial, and tool download]

Many researchers across many experimental domains utilize the latency of spoken responses as a dependent measure. These measurements are typically made using a voice key, an electronic device that monitors the amplitude of a voice signal, and detects when a predetermined threshold is crossed. Unfortunately, voice keys have been repeatedly shown to be alarmingly errorful and biased in accurately detecting speech onset latencies. We present SayWhen–an easy-to-use software system for offline speech onset latency measurement that (1) automatically detects speech onset latencies with high accuracy, well beyond voice key performance, (2) automatically detects and flags a subset of trials most likely to have mismeasured onsets, for optional manual checking, and (3) implements a graphical user interface that greatly speeds and facilitates the checking and correction of this flagged subset of trials. This automatic-plus-selective-checking method approaches the gold standard performance of full manual coding in a small fraction of the time.