WorldTree V2: A Corpus of Science-Domain Structured Explanations and Inference Patterns supporting Multi-Hop Inference
Xie, Thiem, Martin, Wainwright, Marmorstein, Jansen — accepted to LREC 2020
[data and code coming soon]
Explainable question answering for complex questions often requires combining large numbers of facts to answer a question while providing a human-readable explanation for the answer, a process known as multi-hop inference. Standardized science questions require combining an average of 6 facts, and as many as 16 facts, in order to answer and explain, but most existing datasets for multi-hop reasoning focus on combining only two facts, significantly limiting the ability of multi-hop inference algorithms to learn to generate large inferences. In this work we present the second iteration of the WorldTree project, a corpus of 5,114 standardized science exam questions paired with large detailed multi-fact explanations that combine core scientific knowledge and world knowledge. Each explanation is represented as a lexically-connected “explanation graph” that combines an average of 6 facts drawn from a semi-structured knowledge base of 9,216 facts across 66 tables. We use this explanation corpus to author a set of 344 high-level science domain inference patterns similar to semantic frames supporting multi-hop inference. Together, these resources provide training data and instrumentation for developing many-hop multi-hop inference models for question answering.
ScienceExamCER: A High-Density Fine-Grained Science-Domain Corpus for Common Entity Recognition
Smith, Zhang, Culnan, Jansen — accepted to LREC 2020
[data and code]
Named entity recognition identifies common classes of entities in text, but these entity labels are generally sparse, limiting utility to downstream tasks. In this work we present ScienceExamCER, a densely-labeled semantic classification corpus of 133k mentions in the science exam domain where nearly all (96%) of content words have been annotated with one or more fine-grained semantic class labels including taxonomic groups, meronym groups, verb/action groups, properties and values, and synonyms. Semantic class labels are drawn from a manually-constructed fine-grained typology of 601 classes generated through a data-driven analysis of 4,239 science exam questions. We show an off-the-shelf BERT-based named entity recognition model modified for multi-label classification achieves an accuracy of 0.85 F1 on this task, suggesting strong utility for downstream tasks in science domain question answering requiring densely-labeled semantic classification.
Multi-class Hierarchical Question Classification for Multiple Choice Science Exams
Xu, Jansen, Martin, Xie, Yadav, Madabushi, Tafjord, Clark — accepted to LREC 2020
Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately. However, developing strong QC algorithms has been hindered by the limited size and complexity of annotated data available. To address this, we present the largest challenge dataset for QC, containing 7,787 science exam questions paired with detailed classification labels from a fine-grained hierarchical taxonomy of 406 problem domains. We then show that a BERT-based model trained on this dataset achieves a large (+0.12 MAP) gain compared with previous methods, while also achieving state-of-the-art performance on benchmark open-domain and biomedical QC datasets. Finally, we show that using this model’s predictions of question topic significantly improves the accuracy of a question answering system by +1.7% P@1, with substantial future gains possible as QC performance improves.
QASC: A Dataset for Question Answering via Sentence Composition
Khot, Clark, Guerquin, Jansen, Sabharwal — AAAI 2020
Composing knowledge from multiple pieces of texts is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.
Extracting Common Inference Patterns from Semi-Structured Explanations
Thiem and Jansen — COIN 2019
Complex questions often require combining multiple facts to correctly answer, particularly when generating detailed explanations for why those answers are correct. Combining multiple facts to answer questions is often modeled as a “multi-hop” graph traversal problem, where a given solver must find a series of interconnected facts in a knowledge graph that, taken together, answer the question and explain the reasoning behind that answer. Multi-hop inference currently suffers from semantic drift, or the tendency for chains of reasoning to “drift” to unrelated topics, and this semantic drift greatly limits the number of facts that can be combined in both free text and knowledge base inference. In this work we present our effort to mitigate semantic drift by extracting large high-confidence multi-hop inference patterns, generated by abstracting large-scale explanatory structure from a corpus of detailed explanations. We represent these inference patterns as sets of generalized constraints over sentences represented as rows in a knowledge base of semi-structured tables. We present a prototype tool for identifying common inference patterns from corpora of semi-structured explanations, and use it to successfully extract 67 inference patterns from a “matter” subset of standardized elementary science exam questions that span scientific and world knowledge.
TextGraphs 2019 Shared Task on Multi-Hop Inference for Explanation Regeneration
Jansen and Ustalov — TextGraphs 2019
[shared task participant kit] [slides]
While automated question answering systems are increasingly able to retrieve answers to natural language questions, their ability to generate detailed human-readable explanations for their answers is still quite limited. The Shared Task on Multi-Hop Inference for Explanation Regeneration tasks participants with regenerating detailed gold explanations for standardized elementary science exam questions by selecting facts from a knowledge base of semi-structured tables. Each explanation contains between 1 and 16 interconnected facts that form an “explanation graph” spanning core scientific knowledge and detailed world knowledge. It is expected that successfully combining these facts to generate detailed explanations will require advancing methods in multi-hop inference and information combination, and will make use of the supervised training data provided by the WorldTree explanation corpus. The top-performing system achieved a mean average precision (MAP) of 0.56, substantially advancing the state-of-the-art over a baseline information retrieval model. Detailed extended analyses of all submitted systems showed large relative improvements in accessing the most challenging multi-hop inference problems, while absolute performance remains low, highlighting the difficulty of generating detailed explanations through multi-hop reasoning.
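The mean average precision (MAP) figure reported above can be illustrated with a minimal sketch. It assumes each system returns a ranked list of fact identifiers per question and that the gold explanation is a set of such identifiers; the function names and the identifiers `f1`, `f3`, and so on are hypothetical and are not taken from the shared task kit.

```python
from typing import Sequence, Set


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average precision of one ranked list against a set of gold items."""
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            # Precision at this rank, counted only where a gold item appears.
            precision_sum += hits / rank
    return precision_sum / len(gold)


def mean_average_precision(rankings, golds) -> float:
    """Mean of the per-question average precision scores."""
    scores = [average_precision(r, g) for r, g in zip(rankings, golds)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical fact identifiers for a single question.
ranked_facts = ["f3", "f7", "f1", "f9"]
gold_facts = {"f3", "f1"}
ap = average_precision(ranked_facts, gold_facts)
map_score = mean_average_precision([ranked_facts], [gold_facts])
```

Gold facts that a system never retrieves still count in the denominator, so missing part of a long explanation lowers the score even when the retrieved facts are ranked well.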
Multi-hop Inference for Sentence-level TextGraphs: How Challenging is Meaningfully Combining Information for Science Question Answering?
Jansen — TextGraphs 2018
Question Answering for complex questions is often modeled as a graph construction or traversal task, where a solver must build or traverse a graph of facts that answer and explain a given question. This “multi-hop” inference has been shown to be extremely challenging, with few models able to aggregate more than two facts before being overwhelmed by “semantic drift”, or the tendency for long chains of facts to quickly drift off topic. This is a major barrier to current inference models, as even elementary science questions require an average of 4 to 6 facts to answer and explain. In this work we empirically characterize the difficulty of building or traversing a graph of sentences connected by lexical overlap, by evaluating chance sentence aggregation quality through 9,784 manually-annotated judgments across knowledge graphs built from three free-text corpora (including study guides and Simple Wikipedia). We demonstrate semantic drift tends to be high and aggregation quality low, at between 0.04% and 3%, and highlight scenarios that maximize the likelihood of meaningfully combining information.
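A graph of sentences connected by lexical overlap, as described above, can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list, the whitespace tokenization, and the example sentences are all hypothetical, and the preprocessing used in the paper is not specified in this abstract.

```python
from itertools import combinations

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "that"}


def content_words(sentence):
    """Lowercased tokens with edge punctuation stripped and stop words removed."""
    tokens = (tok.strip(".,;:!?").lower() for tok in sentence.split())
    return {t for t in tokens if t and t not in STOP_WORDS}


def overlap_graph(sentences):
    """Map each pair of sentence indices to the content words they share."""
    words = [content_words(s) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = words[i] & words[j]
        if shared:
            edges[(i, j)] = shared
    return edges


# Hypothetical example sentences.
sentences = [
    "A magnet attracts iron.",
    "A nail is made of iron.",
    "The sun is a star.",
]
edges = overlap_graph(sentences)
```

Here only the first two sentences are linked, through the shared word "iron". In a large corpus most such links join sentences that do not combine meaningfully, which is the effect the paper quantifies.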
WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions supporting Multi-hop Inference
Jansen, Wainwright, Marmorstein, Morrison — LREC 2018
[data, code, and tool] [talk on this project]
Developing methods of automated inference that are able to provide users with compelling human-readable justifications for why the answer to a question is correct is critical for domains such as science and medicine, where user trust and detecting costly errors are limiting factors to adoption. One of the central barriers to training question answering models on explainable inference tasks is the lack of gold explanations to serve as training data. In this paper we present a corpus of explanations for standardized science exams, a recent challenge task for question answering. We manually construct a corpus of detailed explanations for nearly all publicly available standardized elementary science questions (approximately 1,680 3rd through 5th grade questions) and represent these as “explanation graphs” – sets of lexically overlapping sentences that describe how to arrive at the correct answer to a question through a combination of domain and world knowledge. We also provide an explanation-centered tablestore, a collection of semi-structured tables that contain the knowledge to construct these elementary science explanations. Together, these two knowledge resources map out a substantial portion of the knowledge required for answering and explaining elementary science exams, and provide both structured and free-text training data for the explainable inference task.
Controlling Information Aggregation for Complex Question Answering
Kwon, Trivedi, Jansen, Surdeanu and Balasubramanian — ECIR 2018
Complex question answering, the task of answering complex natural language questions that rely on inference, requires the aggregation of information from multiple sources. Automatic aggregation often fails because it combines semantically unrelated facts, leading to bad inferences. This paper proposes methods to address this inference drift problem. In particular, the paper develops unsupervised and supervised mechanisms to control random walks on Open Information Extraction (OIE) knowledge graphs. Empirical evaluation on an elementary science exam benchmark shows that the proposed methods enable effective aggregation even over larger graphs and demonstrates the complementary value of information aggregation for answering complex questions.
A Study of Automatically Acquiring Explanatory Inference Patterns from Corpora of Explanations: Lessons from Elementary Science Exams
Jansen — AKBC 2017
Our long term interest is in building inference algorithms capable of answering questions and producing human-readable explanations by aggregating information from multiple sources and knowledge bases. Currently information aggregation (also referred to as “multi-hop inference”) is challenging for more than two facts due to “semantic drift”, or the tendency for natural language inference algorithms to quickly move off-topic when assembling long chains of knowledge. In this paper we explore the possibility of generating large explanations with an average of six facts by automatically extracting common explanatory patterns from a corpus of manually authored elementary science explanations represented as lexically-connected explanation graphs grounded in a semi-structured knowledge base of tables. We empirically demonstrate that there are sufficient common explanatory patterns in this corpus that it is possible in principle to reconstruct unseen explanation graphs by merging multiple explanatory patterns, then adapting and/or adding to their knowledge. This may ultimately provide a mechanism to allow inference algorithms to surpass the two-fact “aggregation horizon” in practice by using common explanatory patterns as constraints to limit the search space during information aggregation.
Tell Me Why: Using Question Answering as Distant Supervision for Answer Justification
Sharp, Surdeanu, Jansen, Valenzuela-Escarcega, Clark, and Hammond — CoNLL 2017
For many applications of question answering (QA), being able to explain why a given model chose an answer is critical. However, the lack of labeled data for answer justifications makes learning this difficult and expensive. Here we propose an approach that uses answer ranking as distant supervision for learning how to select informative justifications, where justifications serve as inferential connections between the question and the correct answer while often containing little lexical overlap with either. We propose a neural network architecture for QA that reranks answer justifications as an intermediate (and human-interpretable) step in answer selection. Our approach is informed by a set of features designed to combine both learned representations and explicit features to capture the connection between questions, answers, and answer justifications. We show that with this end-to-end approach we are able to significantly improve upon a strong IR baseline in both justification ranking (+9% rated highly relevant) and answer selection (+6% P@1).
Framing Question Answering as Building and Ranking Answer Justifications
Jansen, Sharp, Surdeanu, and Clark — Computational Linguistics 2017
We propose a question answering (QA) approach that both identifies correct answers and produces compelling human-readable justifications for why those answers are correct. Our method first identifies the actual information need in a question using psycholinguistic concreteness norms, then uses this information need to construct answer justifications by aggregating multiple sentences from different knowledge bases using syntactic and lexical information. We then jointly rank answers and their justifications using a reranking perceptron that treats justification quality as a latent variable. We evaluate our method on 1,000 multiple-choice questions from elementary school science exams, and empirically demonstrate that it performs better than several strong baselines. Our best configuration answers 44% of the questions correctly, where the top justifications for 57% of these correct answers contain a compelling human-readable justification that explains the inference required to arrive at the correct answer. We include a detailed characterization of the justification quality for both our method and a strong information retrieval baseline, and show that information aggregation is key to addressing the information need in complex questions.
What’s in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams
Jansen, Balasubramanian, Surdeanu, and Clark — COLING 2016
[data and tool] [slides]
QA systems have been making steady advances in the challenging elementary science exam domain. In this work, we develop an explanation-based analysis of knowledge and inference requirements, which supports a fine-grained characterization of the challenges. In particular, we model the requirements based on appropriate sources of evidence to be used for the QA task. We create requirements by first identifying suitable sentences in a knowledge base that support the correct answer, then use these to build explanations, filling in any necessary missing information. These explanations are used to create a fine-grained categorization of the requirements. Using these requirements, we compare a retrieval and an inference solver on 212 questions. The analysis validates the gains of the inference solver, demonstrating that it answers more questions requiring complex inference, while also providing insights into the relative strengths of the solvers and knowledge sources. We release the annotated questions and explanations as a resource with broad utility for science exam QA, including determining knowledge base construction targets, as well as supporting information aggregation in automated inference.
Creating Causal Embeddings for Question Answering with Minimal Supervision
Sharp, Surdeanu, Jansen, Clark, Hammond — EMNLP 2016
A common model for question answering (QA) is that a good answer is one that is closely related to the question, where relatedness is often determined using general-purpose lexical models such as word embeddings. We argue that a better approach is to look for answers that are related to the question in a relevant way, according to the information need of the question, which may be determined through task-specific embeddings. With causality as a use case, we implement this insight in three steps. First, we generate causal embeddings cost-effectively by bootstrapping cause-effect pairs extracted from free text using a small set of seed patterns. Second, we train dedicated embeddings over this data, by using task-specific contexts, i.e., the context of a cause is its effect. Finally, we extend a state-of-the-art reranking approach for QA to incorporate these causal embeddings. We evaluate the causal embedding models both directly with a causal implication task, and indirectly, in a downstream causal QA task using data from Yahoo! Answers. We show that explicitly modeling causality improves performance in both tasks. In the QA task our best model achieves 37.3% P@1, significantly outperforming a strong baseline by 7.7% (relative).
Spinning Straw into Gold: Using Free Text to Train Monolingual Alignment Models for Non-factoid Question Answering
Sharp, Jansen, Surdeanu, and Clark — NAACL 2015
[code and data]
Monolingual alignment models have been shown to boost the performance of question answering systems by “bridging the lexical chasm” between questions and answers. The main limitation of these approaches is that they require semi-structured training data in the form of question-answer pairs, which is difficult to obtain in specialized domains or low-resource languages. We propose two inexpensive methods for training alignment models solely using free text, by generating artificial question-answer pairs from discourse structures. Our approach is driven by two representations of discourse: a shallow sequential representation, and a deep one based on Rhetorical Structure Theory. We evaluate the proposed model on two corpora from different genres and domains: one from Yahoo! Answers and one from the biology domain, and two types of non-factoid questions: manner and reason. We show that these alignment models trained directly from discourse structures imposed on free text improve performance considerably over an information retrieval baseline and a neural network language model trained on the same data.
Higher-order Lexical Semantic Models for Non-factoid Answer Reranking
Fried, Jansen, Hahn-Powell, Surdeanu, and Clark — Transactions of the ACL 2015
Lexical semantic models provide robust performance for question answering, but, in general, can only capitalize on direct evidence seen during training. For example, monolingual alignment models acquire term alignment probabilities from semi-structured data such as question-answer pairs; neural network language models learn term embeddings from unstructured text. All this knowledge is then used to estimate the semantic similarity between question and answer candidates. We introduce a higher-order formalism that allows all these lexical semantic models to chain direct evidence to construct indirect associations between question and answer texts, by casting the task as the traversal of graphs that encode direct term associations. Using a corpus of 10,000 questions from Yahoo! Answers, we experimentally demonstrate that higher-order methods are broadly applicable to alignment and language models, across both word and syntactic representations. We show that an important criterion for success is controlling for the semantic drift that accumulates during graph traversal. All in all, the proposed higher-order approach improves five out of the six lexical semantic models investigated, with relative gains of up to +13% over their first-order variants.
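The idea of chaining direct term associations into indirect ones can be sketched as a two-step graph traversal. This is one plausible reading of the abstract, not the paper's formalism: it assumes first-order associations are stored as weighted edges and that a two-hop association is the sum, over intermediate terms, of the product of the two edge weights. The terms and weights below are hypothetical.

```python
def second_order(assoc):
    """Two-hop association weights: sum over intermediates of w(src, mid) * w(mid, dst)."""
    result = {}
    for src, mids in assoc.items():
        for mid, w1 in mids.items():
            for dst, w2 in assoc.get(mid, {}).items():
                row = result.setdefault(src, {})
                row[dst] = row.get(dst, 0.0) + w1 * w2
    return result


# Hypothetical first-order term associations (for example, alignment probabilities).
first_order = {
    "rain": {"cloud": 0.5, "water": 0.5},
    "cloud": {"sky": 1.0},
    "water": {"drink": 0.5, "river": 0.5},
}
indirect = second_order(first_order)
```

In this toy graph "rain" gains an indirect link to "sky", which looks useful, and to "drink", which looks less so. That spread of weight to loosely related terms is a small instance of the semantic drift the abstract says must be controlled.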
Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
Jansen, Surdeanu, and Clark — ACL 2014
[code and data]
We propose a robust answer reranking model for non-factoid questions that integrates lexical semantics with discourse information, driven by two representations of discourse: a shallow representation centered around discourse markers, and a deep one based on Rhetorical Structure Theory. We evaluate the proposed model on two corpora from different genres and domains: one from Yahoo! Answers and one from the biology domain, and two types of non-factoid questions: manner and reason. We experimentally demonstrate that the discourse structure of non-factoid answers provides information that is complementary to lexical semantic similarity between question and answer, improving performance up to 24% (relative) over a state-of-the-art model that exploits lexical semantic similarity alone. We further demonstrate excellent domain transfer of discourse information, suggesting these discourse features have general utility to non-factoid question answering.
Transmitting Narrative: An Interactive Shift-Summarization Tool for Improving Nurse Communication.
Forbes, Surdeanu, Jansen, and Carrington — IEEE Interactive Visual Text Analytics Workshop 2013
This paper describes an ongoing visualization project that aims to improve nurse communication. In particular, we investigate the transmission of information that is related to potentially life-threatening clinical events. Currently these events may remain unnoticed or be misinterpreted by nurses, or most unfortunately, are simply not communicated clearly between nurses during a shift change, leading in some cases to catastrophic results. Our visualization system is based on a novel application of machine learning and natural language processing algorithms. Results are presented in the form of an interactive shift-summarization tool which augments existing Electronic Health Records (EHRs). This tool provides a high level overview of the patient’s health that is generated through an analysis of heterogeneous data: verbal summarizations describing the patient’s health provided by the nurse in charge of the patient, the various monitored vital signs of the patient, and historical information of patients that had unexpected adverse reactions that were not foreseen by the receiving nurse despite being indicated by the responding nurse. In this paper, we introduce the urgent need for such a tool, describe the various components of our heterogeneous data analysis system, and present proposed enhancements to EHRs via the shift-summarization tool. This interactive, visual tool clearly indicates potential clinical events generated by our automated inferencing system; lets a nurse quickly verify the likelihood of these events; provides a mechanism for annotating the generated events; and finally, makes it easy for a nurse to navigate the temporal aspects of patient data collected during a shift. This temporal data can then be used to interactively articulate a narrative that more effectively transmits pertinent data to other nurses.
Adaptive feature-specific spectral imaging
Jansen, Dunlop, Golish, Gehm — SPIE 2012
We present an architecture for rapid spectral classification in spectral imaging applications. By making use of knowledge gained in prior measurements, our spectral imaging system is able to design adaptive feature-specific measurement kernels that selectively attend to the portions of a spectrum that contain useful classification information. With measurement kernels designed using a probabilistically-weighted version of principal component analysis, simulations predict an orders-of-magnitude reduction in classification error rates. We report on our latest simulation results, as well as an experimental prototype currently under construction.
Development of a scalable image formation pipeline for multiscale gigapixel photography.
Golish, Vera, Kelly, Gong, Jansen, Hughes, Kittle, Brady, and Gehm — Optics Express 2012
We report on the image formation pipeline developed to efficiently form gigapixel-scale imagery generated by the AWARE-2 multiscale camera. The AWARE-2 camera consists of 98 “microcameras” imaging through a shared spherical objective, covering a 120° x 50° field of view with approximately 40 microradian instantaneous field of view (the angular extent of a pixel). The pipeline is scalable, capable of producing imagery ranging in scope from “live” one megapixel views to full resolution gigapixel images. Architectural choices that enable trivially parallelizable algorithms for rapid image formation and on-the-fly microcamera alignment compensation are discussed.
Multiscale gigapixel photography
(High performance computing work in acknowledgements) Brady et al. — Nature 2012
Pixel count is the ratio of the solid angle within a camera’s field of view to the solid angle covered by a single detector element. Because the size of the smallest resolvable pixel is proportional to aperture diameter and the maximum field of view is scale independent, the diffraction-limited pixel count is proportional to aperture area. At present, digital cameras operate near the fundamental limit of 1 to 10 megapixels for millimetre-scale apertures, but few approach the corresponding limits of 1 to 100 gigapixels for centimetre-scale apertures. Barriers to high-pixel-count imaging include scale-dependent geometric aberrations, the cost and complexity of gigapixel sensor arrays, and the computational and communications challenge of gigapixel image management. Here we describe the AWARE-2 camera, which uses a 16-mm entrance aperture to capture snapshot, one-gigapixel images at three frames per minute. AWARE-2 uses a parallel array of microcameras to reduce the problems of gigapixel imaging to those of megapixel imaging, which are more tractable. In cameras of conventional design, lens speed and field of view decrease as lens scale increases, but with the experimental system described here we confirm previous theoretical results suggesting that lens speed and field of view can be scale independent in microcamera-based imagers resolving up to 50 gigapixels. Ubiquitous gigapixel cameras may transform the central challenge of photography from the question of where to point the camera to that of how to mine the data.
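The definition of pixel count as a ratio of solid angles can be checked with a rough calculation. The sketch below takes the 120° x 50° field of view and the roughly 40 microradian instantaneous field of view quoted for AWARE-2 in the image formation pipeline abstract in this list. It also makes two assumptions that are not from either paper: the field of view is treated as a band on a sphere centered on the horizon, and each pixel is treated as a small square whose solid angle is the instantaneous field of view squared.

```python
import math


def patch_solid_angle(h_fov_deg, v_fov_deg):
    """Solid angle (steradians) of a band h_fov wide in azimuth and
    v_fov tall in elevation, centered on the equator of a sphere."""
    h = math.radians(h_fov_deg)
    half_v = math.radians(v_fov_deg) / 2
    return h * 2 * math.sin(half_v)


def pixel_count(h_fov_deg, v_fov_deg, ifov_rad):
    """Field-of-view solid angle divided by the solid angle of one pixel."""
    pixel_solid_angle = ifov_rad ** 2  # small-angle approximation (assumption)
    return patch_solid_angle(h_fov_deg, v_fov_deg) / pixel_solid_angle


estimate = pixel_count(120, 50, 40e-6)
```

Under these assumptions the estimate comes out on the order of one gigapixel, which is consistent with the one-gigapixel images described above.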
Strong systematicity through sensorimotor conceptual grounding: an unsupervised, developmental approach to connectionist sentence processing.
Jansen and Watter — Connection Science 2012
Connectionist language modelling typically has difficulty with syntactic systematicity, or the ability to generalise language learning to untrained sentences. This work develops an unsupervised connectionist model of infant grammar learning. Following the semantic bootstrapping hypothesis, the network distils word category using a developmentally plausible infant-scale database of grounded sensorimotor conceptual representations, as well as a biologically plausible semantic co-occurrence activation function. The network then uses this knowledge to acquire an early benchmark clausal grammar using correlational learning, and further acquires separate conceptual and grammatical category representations. The network displays strongly systematic behaviour indicative of the general acquisition of the combinatorial systematicity present in the grounded infant-scale language stream, outperforms previous contemporary models that contain primarily noun and verb word categories, and successfully generalises broadly to novel untrained sensorimotor grounded sentences composed of unfamiliar nouns and verbs. Limitations as well as implications to later grammar learning are discussed.
A computational vector-map model of neonate saccades: Modulating the externality effect through refraction periods.
Jansen, Fiacconi, and Gibson — Vision Research 2010
The present study develops an explicit and predictive computational model of neonate saccades based on the interaction of several simple mechanisms, including the tendency to fixate towards areas of high contrast, and the decay and recovery of a world-centered contrast representation simulating a low-level inhibition of return mechanism. Emergent properties similar to early visual behaviors develop, including the externality effect (or tendency to focus on external then internal features). The age-associated progression of this effect is modulated by the decay period of the model’s contrast representation, where the high-level behavior of either scanning broadly or locally is modulated by a single decay parameter.
SayWhen: an automated method for high-accuracy speech onset detection.
Jansen and Watter — Behavior Research Methods 2008
[SayWhen website, tutorial, and tool download]
Many researchers across many experimental domains utilize the latency of spoken responses as a dependent measure. These measurements are typically made using a voice key, an electronic device that monitors the amplitude of a voice signal, and detects when a predetermined threshold is crossed. Unfortunately, voice keys have been repeatedly shown to be alarmingly errorful and biased in accurately detecting speech onset latencies. We present SayWhen, an easy-to-use software system for offline speech onset latency measurement that (1) automatically detects speech onset latencies with high accuracy, well beyond voice key performance, (2) automatically detects and flags a subset of trials most likely to have mismeasured onsets, for optional manual checking, and (3) implements a graphical user interface that greatly speeds and facilitates the checking and correction of this flagged subset of trials. This automatic-plus-selective-checking method approaches the gold standard performance of full manual coding in a small fraction of the time.
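The voice key behaviour described above, detecting when signal amplitude crosses a predetermined threshold, can be sketched as follows. This illustrates only the baseline that the abstract criticizes. It is not SayWhen's detection method, which is not described here. The function name, sample values, sample rate, and threshold are hypothetical.

```python
def voice_key_onset(samples, sample_rate_hz, threshold):
    """Return the time in milliseconds of the first sample whose absolute
    amplitude reaches the threshold, or None if it is never reached."""
    for index, value in enumerate(samples):
        if abs(value) >= threshold:
            return 1000.0 * index / sample_rate_hz
    return None


# Hypothetical recording: quiet lead-in followed by speech.
signal = [0.0, 0.01, -0.02, 0.05, 0.4, -0.6, 0.5]
onset_ms = voice_key_onset(signal, sample_rate_hz=1000, threshold=0.3)
never = voice_key_onset(signal, sample_rate_hz=1000, threshold=0.9)
```

A fixed threshold like this fires late on words that begin quietly and can fire early on non-speech noise, which is consistent with the errors and bias the abstract attributes to voice keys.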