N = 1000 Evaluation Samples

Non-parametric Bootstrap Resampling Primer for NLP (in Scala)

While reviewing a recent batch of papers for NAACL, I was surprised by how many claimed to produce state-of-the-art models or to introduce novel mechanisms to existing models, but failed to include any inferential statistics supporting those claims.  The story of these papers typically takes one of the following forms:

  1. The State-Of-The-Art: The paper claims to achieve state-of-the-art performance on a competitive task (i.e. achieving the best performance on a task leaderboard), but the difference between the current-top model and the proposed model was small (typically under 0.5%).
  2. The New Mechanism: The paper spends several pages developing a compelling narrative for how a complex feature or mechanism will substantially increase performance compared with existing methods, but shows small gains compared to existing methods, and/or does not provide a statistical comparison with those methods.
  3. The Unverified Ablation: A new feature is proposed, but the ablation study suggests the overall contribution to model performance is small, so small that a model without the new feature may not be significantly different from the full model.

Being an interdisciplinary researcher I’ve transitioned fields several times.  My graduate work was done in computational cognitive modeling in a psychology and neuroscience department, where most of the graduate courses one takes are statistics and research methods to learn how to run experiments in progressively more challenging experimental scenarios where the noise may be high, and it’s unclear if you’re measuring what you think you’re measuring.  As a field, natural language processing regularly publishes work in top venues without including inferential statistics, and this can place strong limits on what can be inferred from the claims of a research article, which only hurts the field as a whole, and adds noise to the science of understanding the phenomena that we invest a great deal of resources into studying.

This recent reviewing experience suggested to me that it might be helpful to write a short primer on a common and easy-to-implement statistical method (non-parametric bootstrap resampling), in the hopes that others find it useful.  The disclaimers are that I’m a natural language processing researcher, not a statistician, and consider myself an end-user rather than a researcher of statistical methods.  It’s always critical to understand your data and measurement scenarios, and how those interact with the assumptions of the statistics that you choose.

The Experiment Setup
The technique discussed here compares the performance of two systems — the baseline system, and the experimental system. In NLP, the common scenarios this tends to apply to are:

  1. The Better Model: You have created a new model that achieves higher accuracy on some task than the previous state-of-the-art system. Here, the previous state-of-the-art is the baseline system, and your new model is the experimental system.
  2. The Better Feature: You have created a new feature (or, group of features) and added this to an existing model. Here, the model without the feature (or group of features) is the baseline system, and the model with the features is the experimental system.
  3. The Ablation Study: You have a model that contains many different components, and are running an ablation study to determine the relative contribution of each component to the overall performance of the system. Ablation studies are commonly reported as comparing the performance of removing a single component of the system, versus the full system. Here, the baseline is the system with the component removed (which should be exhibiting lower performance than the full system). The experimental system is the full system.
  4. The Simpler Model Story: You have a model that achieves slightly lower than state-of-the-art performance, but that is simpler, more interpretable, faster to compute, or some other similar story. The performance of your new, simpler, faster, more interpretable system is so close to the state-of-the-art that you believe the two models are likely not significantly different. Here, the system with lower performance is the baseline, and the higher performing system is the experimental system. Unlike the other cases, here you are looking for the p-value to be large: traditionally greater than 0.05, and ideally much larger (i.e. not significantly different).

What you need: Scores from each model
To perform a statistical comparison, you will need two sets of scores — one from the baseline system, and one from the experimental system. The scores should be ordered — that is, kept in two parallel arrays, one containing the scores for the baseline system, one for the experimental system, where the index n for each array represents the baseline or experimental performance for the same piece of input data.

To make this a bit more concrete, in my field of question answering, model performance is typically measured in terms of accuracy (the proportion of questions answered correctly by a model). If I were evaluating on a hypothetical set of 10 questions, the baseline score and experimental score arrays would each have 10 elements. BaselineScores(4) and ExperimentScores(4) would represent the scores of both models on the same question. An example of this is listed in the table below, where a score of “1” means a given method answered the question correctly, and a score of “0” means the question was answered incorrectly. The average performance of the baseline model is 50% accuracy, and that of the experimental model is 60% accuracy.

Question #   Question Text       Baseline Score   Experimental Score   Difference   Desc.
0            Which of…           0                1                    +1           Exp. helps
1            In what year…       1                1                     0           Both correct
2            The largest…        1                0                    -1           Exp. hurts
3            Which city…         0                1                    +1           Exp. helps
4            In what country…    0                1                    +1           Exp. helps
5            When the…           1                0                    -1           Exp. hurts
6            How might…          0                1                    +1           Exp. helps
7            What is one…        1                1                     0           Both correct
8            Which person…       0                0                     0           Both incorrect
9            The amount of…      1                0                    -1           Exp. hurts
Average Accuracy                 50%              60%

The intuition here is that the performance difference is large (50% vs 60%, a +10% gain), so whatever clever experimental system we developed is clearly superior to the baseline method. Unfortunately this intuition is incorrect: many factors affect whether the difference between an experimental model and a baseline model is statistically significant, including:

  • the sample size (here 10 samples, from 10 questions)
  • the total number of questions the experimental model benefits or helps over the baseline model (here 4)
  • the total number of questions that the experimental model hurts compared to the baseline model (here 3).

The overall difference between the number of questions helped and hurt (here, 4 – 3 or +1) is the difference score, expressed in number of samples. Here in this question answering example with 10 samples, we can convert this to the difference in accuracy between baseline and experimental models by dividing the difference by the number of samples, +1/10, to find the +10% performance benefit over baseline.
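
As a small illustration of the bookkeeping described above, the following sketch computes the per-question difference scores from the two parallel arrays in the table, then counts the questions helped and hurt. The object and method names (DifferenceScores, deltas) are hypothetical and chosen only for this example.

```scala
// Hypothetical helper object, for illustration only.
object DifferenceScores {
  // Per-item difference score: experimental score minus baseline score.
  // Both arrays must be parallel (index i refers to the same question).
  def deltas(baseline: Array[Double], experimental: Array[Double]): Array[Double] = {
    require(baseline.length == experimental.length, "score arrays must have the same length")
    baseline.zip(experimental).map { case (b, e) => e - b }
  }

  def main(args: Array[String]): Unit = {
    // Scores from the 10-question table above
    val baseline     = Array[Double](0, 1, 1, 0, 0, 1, 0, 1, 0, 1)
    val experimental = Array[Double](1, 1, 0, 1, 1, 0, 1, 1, 0, 0)
    val d = deltas(baseline, experimental)

    val numHelped    = d.count(_ > 0)        // questions the experimental model helps (4)
    val numHurt      = d.count(_ < 0)        // questions the experimental model hurts (3)
    val netGain      = d.sum                 // helped minus hurt, in number of questions (+1)
    val accuracyGain = netGain / d.length    // net gain as a proportion (0.10, i.e. +10%)

    println(s"helped=$numHelped hurt=$numHurt net=$netGain accuracyGain=$accuracyGain")
  }
}
```

Keeping the difference scores per question (rather than only the two average accuracies) is what makes the paired resampling in the next section possible.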

To determine whether the baseline and experimental models are significantly different, we can use a variety of statistical tests. Here we’ll look at a popular statistic, non-parametric bootstrap resampling.

Non-parametric Bootstrap Resampling

The procedure for non-parametric bootstrap resampling is as follows:
For some large number N, which represents the number of bootstrap resamples being taken:

  1. Randomly draw K samples from the difference scores, with replacement (i.e. the same sample can be included in the distribution more than once). K should be equal to the original number of samples (i.e. in the example above with 10 questions, K should be equal to 10)
  2. Sum these randomly sampled difference scores, to determine whether (on the balance) this resampled distribution shows the experimental model as helping (sum > 0) or not helping (sum <= 0). Record this outcome (helped or hurt), regardless of the magnitude.
  3. Repeat this procedure N times, recording the proportion of runs that show the experimental model not helping versus the total number of runs. This proportion is the p-value.

Given that this resampling procedure can happen very quickly, and an accurate estimate of p is important, a typical value of N might be 10,000 samples. Unless the number of samples is large, this can usually be computed in a few seconds. To give a concrete example, if (with a given set of data) the experimental model helps in 9,900 resamples out of 10,000 resamples total, then the p-value would be (10,000 – 9,900) / 10,000 or p = 0.01 .
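
The procedure above can also be written compactly as a formula. The notation here is introduced only for illustration: d^{(b)}_k is the k-th difference score drawn in resample b, K is the number of samples, N is the number of resamples, and the bracket equals 1 when its condition holds and 0 otherwise.

```latex
\hat{p} \;=\; \frac{1}{N} \sum_{b=1}^{N} \mathbf{1}\!\left[\, \sum_{k=1}^{K} d^{(b)}_{k} \le 0 \,\right]
\qquad\text{e.g.}\qquad
\hat{p} \;=\; \frac{10{,}000 - 9{,}900}{10{,}000} \;=\; 0.01
```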

Bootstrap Resampling Code (in Scala)

The following Scala code implements the bootstrap resampling procedure, and includes two examples:

  1. Example 1: The 10-sample question answering data example from above, with the scores manually entered.
  2. Example 2: A function that generates artificial data to play with, and gain intuition about how differently sized data with different numbers of questions helped and hurt by the experimental model generate different p-values.

The normal use case is that you typically will run your baseline and experimental models, and save the scores for each sample (e.g. question) from each model to a separate text file in the same order (e.g. baseline.txt, experimental.txt).  You can then load these scores into separate arrays, and use the code provided below.
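
A minimal sketch of the loading step is shown below. It assumes each file holds one numeric score per line, in the same order in both files; that file format, and the names ScoreLoader, parseScores and loadScores, are assumptions made for this example.

```scala
import scala.io.Source

// Hypothetical loader, assuming one numeric score per line.
object ScoreLoader {
  // Convert lines of text into scores, skipping blank lines.
  def parseScores(lines: Iterator[String]): Array[Double] =
    lines.map(_.trim).filter(_.nonEmpty).map(_.toDouble).toArray

  // Read all scores from a text file, closing the file afterwards.
  def loadScores(filename: String): Array[Double] = {
    val source = Source.fromFile(filename)
    try parseScores(source.getLines()) finally source.close()
  }

  def main(args: Array[String]): Unit = {
    val baseline     = loadScores("baseline.txt")
    val experimental = loadScores("experimental.txt")
    // The arrays must be parallel: line i of each file is the same question.
    require(baseline.length == experimental.length, "score files have different lengths")
    println("Loaded " + baseline.length + " paired scores")
  }
}
```

The two arrays returned here can be passed directly to the bootstrap resampling function.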

import collection.mutable.ArrayBuffer

/**
  * Quick tool to compute statistical significance using bootstrap resampling
  * User: peter
  * Date: 9/18/13
  */
object BootstrapResampling {

  def computeBootstrapResampling(baseline:Array[Double], experimental:Array[Double], numSamples:Int = 10000): Double = {
    val rand = new java.util.Random()

    // Step 1: Check input sizes of baseline and experimental arrays are the same
    if (baseline.size != experimental.size) throw new RuntimeException("BootstrapResampling.computeBootstrapResampling(): ERROR: baseline and experimental have different lengths")
    val numDataPoints = baseline.size

    // Step 2: compute difference scores
    val deltas = new ArrayBuffer[Double]
    for (i <- 0 until baseline.size) {
      val delta = experimental(i) - baseline(i)
      deltas.append(delta)
    }

    // Step 3: Resample 'numSamples' times, computing the mean each time.  Store the results.
    val means = new ArrayBuffer[Double]
    for (i <- 0 until numSamples) {
      var mean: Double = 0.0
      for (j <- 0 until numDataPoints) {
        val randIdx = rand.nextInt(numDataPoints)
        mean = mean + deltas(randIdx)
      }
      mean = mean / numDataPoints
      means.append(mean)
    }

    // Step 4: Compute proportion of means at or below 0 (the null hypothesis)
    var proportionBelowZero: Double = 0.0
    for (i <- 0 until numSamples) {
      println ("bootstrap: mean: " + means(i))
      if (means(i) <= 0) proportionBelowZero += 1
    }
    proportionBelowZero = proportionBelowZero / numSamples

    // debug
    println("Proportion below zero: " + proportionBelowZero)

    // Return the p value
    proportionBelowZero
  }

  // Create artificial baseline and experimental arrays that contain a certain number of samples (numSamples),
  // with some number that are helped by the experimental model (numHelped), and some that are hurt (numHurt).
  def makeArtificialData(numSamples:Int, numHelped:Int, numHurt:Int):(Array[Double], Array[Double]) = {
    val baseline = Array.fill[Double](numSamples)(0.0)
    val experimental = Array.fill[Double](numSamples)(0.0)

    // Add helped
    for (i <- 0 until numHelped) {
      experimental(i) = 1     // Answered correctly by the experimental model (but not the baseline model)
    }

    // Add hurt
    for (i <- numHelped until (numHelped + numHurt)) {
      baseline(i) = 1         // Answered correctly by the baseline model (but not the experimental model)
    }

    // Return
    (baseline, experimental)
  }

  def main(args:Array[String]): Unit = {
    // Example 1: Manually entering data
    val baseline = Array[Double](0, 1, 1, 0, 0, 1, 0, 1, 0, 1)
    val experimental = Array[Double](1, 1, 0, 1, 1, 0, 1, 1, 0, 0)
    val p = computeBootstrapResampling(baseline, experimental)
    println ("p-value: " + p)

    // Example 2: Generating artificial data
    val (baseline1, experimental1) = makeArtificialData(numSamples = 500, numHelped = 10, numHurt = 7)
    val p1 = computeBootstrapResampling(baseline1, experimental1)
    println ("p-value: " + p1)
  }
}


And the example output:

Example 1:

bootstrap: mean: -0.2
bootstrap: mean: -0.6
bootstrap: mean: -0.1
bootstrap: mean: -0.1
bootstrap: mean: -0.3
bootstrap: mean: -0.3
bootstrap: mean: -0.2
bootstrap: mean: 0.4
bootstrap: mean: 0.1
bootstrap: mean: 0.3
bootstrap: mean: 0.1
bootstrap: mean: -0.4
Proportion below zero: 0.4316
p-value: 0.4316

The code includes debug output that displays both the mean of the difference scores for each bootstrap resample iteration (allowing you to visually see the proportion above or below zero), as well as a summary display showing the overall proportion of bootstrap resampling iterations at or below zero.

This shows that in spite of there being a +10% performance difference between the two question answering models in the toy example, roughly 43% of the bootstrap resamples show the experimental model doing no better than the baseline, so the observed gain could easily be due to chance rather than a real difference between the models.  This is likely owing both to the small number of samples, as well as the relative proportion of questions helped vs hurt (many questions are aided by the experimental model, but a nearly equal number are hurt by it).  Here, we would say that the performance of the two models is not significantly different (p = 0.43, well above the 0.05 threshold).

Additional Examples of the Behavior of Non-parametric Bootstrap Resampling

Using the code below, we can generate artificial data that investigates the expected p-values for specific effect sizes paired with datasets of different sizes.  We can also investigate the effect of having a larger number of samples helped/hurt (i.e. changed) compared to the baseline, while controlling for the same effect size, to help gain an intuition for how this statistic behaves under specific scenarios.  Here is the code example, added to main(), for a given effect size and sample size:

// Example 3: Generating artificial data
val effectSize:Double = 2.0     // Net gain of the experimental model, in percent
val sampleSize:Double = 500     // Number of samples in the evaluation set

val pValues = new ArrayBuffer[(Double, Double)]
for (i <- 0 until 20) {
  // i is the percentage of samples hurt; helped = hurt + effect size
  val numHelped = (((i.toDouble/100.0) * sampleSize) + ((effectSize/100.0) * sampleSize)).toInt
  val numHurt = ((i.toDouble/100.0) * sampleSize).toInt

  val (baseline2, experimental2) = makeArtificialData(numSamples = sampleSize.toInt, numHelped = numHelped, numHurt = numHurt)
  val p2 = computeBootstrapResampling(baseline2, experimental2)
  println("p-value: " + p2)

  pValues.append( (i.toDouble, p2) )
}

println ("")
println ("Effect Size: " + effectSize)
println ("Sample Size: " + sampleSize)
println ("pValues:")
for (pair <- pValues) {
  println (pair._1.toString + "\t" + "%3.5f".format(pair._2))
}


Example: The Small Evaluation Dataset (100 Samples)

Having an evaluation set (i.e. a development or test set) that contains only 100 items is not uncommon in a number of situations:

  • New Task: You have started working on a new task, and are doing exploratory analyses with little data to work with.  (I remember when we started investigating question answering on standardized science exams and had collected few exams, our evaluation set was only 68 questions.  Now it’s approximately 3,000).
  • Low Resource Domain: You are operating in a domain with few readily-available resources, and you are not able to easily collect more.
  • Existing Dataset: You are comparing against an existing dataset that has small development and/or test folds.  For example, the TREC QA sets can have as few as 65 questions for evaluating, meaning answering a single additional question correctly increases QA accuracy by 1.5% (!).
  • Expensive Collection: You are collecting data, but have a task where data is expensive to collect.  For example, you need to collect or transcribe sensitive data in the medical domain.
  • Expensive Annotation: Your task requires annotating data, and this annotation is particularly difficult or time-consuming, limiting the amount of annotation you are able to generate for training or evaluation.

Here with 100 samples, an effect size of 1% means that (for example) an experimental question answering system would answer only 1 more question correctly than the baseline system. As the plot below shows, with such small evaluation sets it’s very hard to detect that an experimental system is significantly different from a baseline system unless the effect size is very large and few samples are hurt by the experimental system. Here, even for an effect size of 2% (e.g. a baseline accuracy of 70%, and an experimental accuracy of 72%), and a best-case scenario where the experimental system only helps (and never hurts) samples in the dataset, the expected p-value is still approximately 0.13.  Even with an effect size of 5%, if the proportion of samples hurt by the experimental system exceeds 2% (i.e. the experimental system answers 7 more questions correctly over the baseline system, but also answers 2 questions incorrectly that the baseline answered correctly), then the p-value begins to exceed the generally accepted threshold of p = 0.05.
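
The best-case figure of approximately 0.13 can be checked by hand. When no samples are hurt, every difference score is either 0 or +1, so a resample sums to zero or less only if it happens to contain none of the helped samples. Each of the n draws misses the helped samples with probability (1 - helped/n), giving an expected p-value of (1 - helped/n)^n. The sketch below evaluates this; the object and method names are hypothetical.

```scala
// Hypothetical helper, for illustration only.
object BestCasePValue {
  // Expected bootstrap p-value when numHelped samples are helped and none are hurt:
  // the probability that all numSamples draws (with replacement) miss every helped sample.
  def pValueNoHurt(numSamples: Int, numHelped: Int): Double =
    math.pow(1.0 - numHelped.toDouble / numSamples, numSamples.toDouble)

  def main(args: Array[String]): Unit = {
    println(pValueNoHurt(100, 1))   // 1% effect size, 100 samples: roughly 0.366
    println(pValueNoHurt(100, 2))   // 2% effect size, 100 samples: roughly 0.133
    println(pValueNoHurt(100, 5))   // 5% effect size, 100 samples: below 0.01
  }
}
```

This closed form only covers the case with zero samples hurt; once some samples are hurt, the resampling code is the simpler way to get the p-value.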

Cautionary Tale:  With small datasets where it can be challenging to achieve statistical significance, after a promising experiment that doesn’t quite meet the threshold it’s often tempting to become overconfident in p-values “trending towards significance” (<0.10 to 0.20) but that don’t quite meet the threshold.  Many folks have experience with an experiment such as this, where a large effect that seemed promising disappears after more investigation.  It’s important to stress that in the business of doing science, it’s critical to find this out sooner rather than later, so that you can make your mistakes cheaply, and not invest time and resources in efforts that ultimately are unlikely to pan out.  For example, in our early days of science exam question answering (circa 2013) when we had collected only 68 questions for our test set, a question answering model that included word embeddings (a now popular technique, but up-and-coming in 2013) showed a large +5% gain in accuracy over a model without embeddings.  This was extremely promising, and I personally spent a great deal of time pursuing this only to find after great effort that the effect ultimately disappeared with more data, and that word embedding models were notoriously challenging to train for domains with little data (like elementary science).  It ended up being years later before methods of training word embeddings for this domain were demonstrated, but not before I spent quite a bit of time on a promising-seeming effect that ultimately was due to small sample sizes rather than a genuine benefit of the model.  One of my mentors in graduate school used to say that science is about making your mistakes cheaply, and he would often quote Nobel laureate Richard Feynman, who famously said (somewhat paraphrased) that “science is about not fooling yourself, when you’re the easiest one to fool”.  
Since that experience, I always try to err on the side of collecting more data when viable, since a week collecting more data is broadly more useful than a week wasted on an effect that isn’t there.

Example: Evaluation Set Size of 500 samples

Expected p-values for an evaluation set of 500 samples are shown below.   Here, it’s generally possible to detect that groups are significantly different with high confidence if the difference between baseline and experimental groups is greater than +5%.  Differences of +2% may also fall below the p<0.05 threshold if the experimental model doesn’t hurt too many samples (i.e. incorrectly answer questions that the baseline model answers correctly).


Example: Evaluation Set Size of 2,000 samples

At an evaluation set size of 2,000 samples, it’s generally possible to detect that an experimental model is significantly better than a baseline model when the effect size is approximately +2% or greater.  It’s also possible to detect this difference if the effect size is on the order of 1%, as long as the experimental model doesn’t also hurt questions that the baseline model was answering correctly.


Example: Evaluation Set Size of 10,000 samples

Large-scale datasets are becoming increasingly common due to crowdsourcing and large-scale data collection — for example, in question answering, the MS Machine Reading Comprehension (MS MARCO) dataset now contains over 1 million (!) questions.  These large datasets enable even very small differences between baseline and experimental models to be detected with high confidence.  With a hypothetical evaluation set of 10,000 samples, even groups with very small differences in performance (i.e. greater than +0.50%) may be significantly different, unless the experimental model is also hurting a large number of questions that the baseline model is answering correctly.
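
To compare evaluation set sizes side by side, one option is to hold the effect size and the proportion of samples hurt fixed, and vary only the number of samples. The sketch below does this with a compact version of the resampling loop. The names (SizeSweep, bootstrapP), the fixed seed, the choice of a +2% effect with 2% of samples hurt, and the use of 2,000 resamples (to keep the run short) are all assumptions made for this example.

```scala
import scala.util.Random

// Hypothetical sketch, for illustration only.
object SizeSweep {
  // Bootstrap p-value for artificial data described only by its counts.
  def bootstrapP(numSamples: Int, numHelped: Int, numHurt: Int,
                 numResamples: Int, rand: Random): Double = {
    // Difference scores: +1 for helped samples, -1 for hurt samples, 0 otherwise.
    val deltas = Array.tabulate(numSamples) { i =>
      if (i < numHelped) 1 else if (i < numHelped + numHurt) -1 else 0
    }
    var notHelping = 0
    for (_ <- 0 until numResamples) {
      // Draw numSamples difference scores with replacement and sum them.
      var sum = 0
      for (_ <- 0 until numSamples) sum += deltas(rand.nextInt(numSamples))
      if (sum <= 0) notHelping += 1
    }
    // Proportion of resamples where the experimental model does not help.
    notHelping.toDouble / numResamples
  }

  def main(args: Array[String]): Unit = {
    val rand = new Random(42)     // fixed seed so repeated runs print the same values
    val effectPct = 2             // net gain of the experimental model, in percent
    val hurtPct = 2               // percentage of samples hurt by the experimental model
    val numResamples = 2000
    for (n <- Seq(100, 500, 2000, 10000)) {
      val numHurt = n * hurtPct / 100
      val numHelped = numHurt + n * effectPct / 100
      val p = bootstrapP(n, numHelped, numHurt, numResamples, rand)
      println("n=" + n + "\thelped=" + numHelped + "\thurt=" + numHurt + "\tp=" + p)
    }
  }
}
```

With the effect size held constant, the printed p-values should shrink as the evaluation set grows, which is the pattern described in the examples above.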



Non-parametric bootstrap resampling provides an easy-to-implement method of testing whether two groups (a baseline model and an experimental model) are significantly different in natural language processing experiments.  The sample data provided help build an intuition for what p-values one might expect for specific effect sizes (here, in terms of accuracy) when using evaluation sets of specific sizes.   An experimental model rarely helps on every sample in a large dataset without hurting some, and the proportion of questions helped vs hurt can substantially reduce the likelihood that the two groups are found to be significantly different.


Papers in Plain Language: What’s in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams (COLING 2016)

One of my extracurricular hobbies is science pedagogy, largely in the context of making physical sensing devices that make physical science easier to understand — like my open source science tricorder project, or the open source computed tomography scanner.  I’ve been very much interested in starting a series of posts on explaining my recent past (and, future) research papers in plain language, both to make them more accessible for general readers, but also to make short, relatively quick, and equally accessible reads for my colleagues who (if they’re anything like me) have a long list of papers they’d like to read, and many demands on their precious research time.

With that, the inaugural post of Papers in Plain Language:

Papers in Plain Language: What’s in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams.  (Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark). Published at COLING 2016.  The PDF is available here, and the associated data for the paper is available here.

Context: Why study science exams?

My primary research interests are in studying how we can teach computers enough language and inference to perform fairly complicated inference tasks (like solving science questions), and to do this in ways that are explainable — that is, that generate human-readable explanations for why the answers are correct.  Unlike most AI researchers who are primarily interested in figuring out ways to get computers to perform adult-level tasks, I’m biased to believe that a better way of achieving cognitive-like abilities in artificial intelligence is to model how humans learn from the earliest ages.   That’s why I went to graduate school to learn how to computationally model language development and knowledge representation developmentally, which is a cognitive term meaning as children learn to do.

Elementary science exams (to me) have this great capacity to study how we can teach computers to perform complex inference, because both the language understanding and complex inference tasks tend to be much simpler than what we see in adult-level tasks.  That means we’re better able to distill the essence of the inference task, like peeling back the layers of an onion to get closer to looking at how the problem works, how well we’re doing currently, and what we don’t yet understand that would help us do better.  That’s essentially what this paper is about — a detailed analysis of the knowledge and inference requirements to successfully answer (and explain the reasoning behind solving) hundreds of elementary science questions, using both existing methodologies (the “top-down” methodologies we’ll examine in a minute), and a new methodology — looking at the same questions “bottom-up” by building a large number of real and detailed explanations, and examining them to learn more about the nuts and bolts of these inference problems.  We show that these old and new methods produce radically different results, and that you should definitely use the explanation-centered method if you’re able to devote the time to the analysis.

Top-Down Analysis

Before we (humans) can solve a question, we have to understand what the question is asking.  Most of the time with automated methods of question answering, we do almost the opposite (we apply the same method of solving to all questions), because the science of understanding what a question is asking is still being developed.  One of my favorite papers (A Study of the Knowledge Base Requirements for Passing An Elementary Science Test, Clark et al., AKBC 2013) works on this problem by analyzing about 50 elementary science questions from the New York Regents 4th grade science exam, and uncovering 7 broad classes that those questions fall into, as well as what proportion of questions belong to each class.  Their figure with these classes and proportions is shown below:

Here we see the 7 broad categories of questions that Clark et al. discovered: is-a questions, definition questions, questions that ask about properties of objects, questions that test examples of situations, questions that address causality or a knowledge of processes, and the last (and arguably most complex), domain specific model questions.  But what do these mean?  Clark et al.’s AKBC2013 paper is full of examples, easily digestible, and highly worth a read. Here are a few of those examples.

  • Taxonomic (is-a) questions test a knowledge that X is a kind of Y.  For example,  Q: Sleet, rain, snow, and hail are forms of: (A) erosion (B) evaporation (C) groundwater (D) precipitation
  • Definition questions test a knowledge of the definition of concepts. For example,  Q: The movement of soil by wind or water is called (A) condensation (B) evaporation (C) erosion (D) friction
  • Property questions test knowledge that might be found in property databases.  For example, part-of knowledge is tested by the following question,  Q: Which part of a plant produces the seeds? (A) flower (B) leaves (C) stem (D) roots

The above question types can be thought of as “retrieval types”, in that they could be successfully answered by looking up the knowledge in a ready-made knowledge base.  A more interesting (to me) subset of questions are those that appear to require forms of general or model-based inference, as in the examples below:

  • Examples of situations, such as  Q: Which example describes an organism taking in nutrients? (A) A dog burying a bone (B) A girl eating an apple (C) An insect crawling on a leaf (D) A boy planting tomatoes in the garden
  • Causality, as in  Q: What is one way to change water from a liquid to a solid? (A) decrease the temperature (B) increase the temperature (C) decrease the mass (D) increase the mass
  • Simple Processes, which often appear related to causality questions, such as  Q: One way animals usually respond to a sudden drop in temperature is by (A) sweating (B) shivering (C) blinking (D) salivating
  • Domain Specific Models, that require reasoning over domain-specific representations and machinery to solve.  For example,  Q: When a baby shakes a rattle, it makes a noise. Which form of energy was changed to sound energy? (A) electrical (B) light (C) mechanical (D) heat

A Larger Top-Down Analysis

This analysis is really fascinating, because it gives a set of specific question types in the dataset, each of which requires specific kinds of knowledge and specific solving methods.  In a way it helps serve as the beginnings of a recipe book for the problem of building question answering systems for this dataset. Before reading the paper you’re likely in the paradigm of applying the same model to each question, but after reading Clark et al.’s analysis you can begin to plot out how you might make specific solving machinery for each of these 7 question types, and how you need to get to work collecting or building specific knowledge resources (like a large taxonomy, or a part-of database, or a database that represents causal relationships) before you can make headway on certain classes of problems.

I wanted to do exactly that — and also, to automatically detect which question type a given question was, so that I could go about building a system to intelligently solve them.  So I set out to perform a larger analysis on more questions, and make a much larger set of training data for this question classification task.  That’s when things started to get especially interesting. 

Above is the beginning of that analysis, essentially repeating the Clark et al. (AKBC 2013) analysis, but on all 432 training questions from the AI2 Elementary questions set (a smaller subset of what is now known as ARC, the AI2 Reasoning Challenge), including 3rd to 5th grade questions drawn from standardized exams in 14 US states.  By and large the proportions here are very similar to the original analysis (except that there appear to be more domain-specific model questions in the larger set).  The really interesting part was the labeling process itself.

The trouble with top-down categories: They’re perfectly obvious, except when they’re not. 

The New York Regents standardized exam is known for being a very well constructed exam, and each of the problems appears to be carefully designed to test very specific student knowledge and inference capacities.  When looking at real questions from different exams, things tend to get a little murkier.  For example, consider the following question:

Q: Many bacteria are decomposer organisms.  Which of the following statements best describes how these bacteria help make soil more fertile?

  • (A) The bacteria break down water into food.
  • (B) The bacteria change sunlight into minerals.
  • (C) The bacteria combine with sand to form rocks.
  • (D) The bacteria break down dead plant and animal matter (correct answer).

Now ask yourself: Which of the 7 knowledge types does this question fit into?

Thankfully, the answer is perfectly obvious.  Unfortunately, it’s obviously one category to one person, and obviously a different category to the next person.

The question might be labeled causal, because it asks “how … bacteria help make [cause] soil [to become] more fertile?“.  But similarly, it might be thought of as a process question, because decomposers are a stage in the life cycle process, which is the curriculum topic this question would be found under.  Except that decomposers are part of an ecosystem model of recycling nutrients back into the soil.  But the question might more simply be solved by just looking up the definition of the word decomposer, which dictionary.com describes as “an organism, usually a bacterium or fungus, that breaks down the cells of dead plants and animals into simpler substances”.  The high number of overlapping words between this definition and the question and correct answer would make it particularly amenable to simple word matching methods.
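
To make the word matching idea concrete, here is a minimal sketch that counts the distinct words each answer candidate shares with the dictionary definition quoted above. The tokenization (lowercasing, splitting on non-letters, no stemming or stop word removal) and all names are assumptions made for this example, not the method used in the paper. The question text is left out because it is the same for every candidate.

```scala
// Hypothetical word-overlap scorer, for illustration only.
object WordOverlap {
  // Lowercase the text and split it into a set of distinct words.
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSet

  // Number of distinct words shared between two texts.
  def overlap(a: String, b: String): Int = (tokens(a) & tokens(b)).size

  val definition: String = "an organism, usually a bacterium or fungus, that breaks down " +
    "the cells of dead plants and animals into simpler substances"

  val answers: Seq[(String, String)] = Seq(
    "A" -> "The bacteria break down water into food.",
    "B" -> "The bacteria change sunlight into minerals.",
    "C" -> "The bacteria combine with sand to form rocks.",
    "D" -> "The bacteria break down dead plant and animal matter."
  )

  // Label of the answer candidate sharing the most words with the definition.
  def best: String = answers.maxBy { case (_, text) => overlap(definition, text) }._1

  def main(args: Array[String]): Unit = {
    for ((label, text) <- answers) println(label + ": " + overlap(definition, text))
    println("best: " + best)
  }
}
```

With this naive tokenization, answer D shares the most words with the definition, but only narrowly (4 words versus 3 for answer A), and several of the matches are function words like “the” and “and”. Without stemming, “plant” and “plants” do not match, so a practical word matching method would likely want stemming and stop word filtering.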

The problem, unfortunately, is that all of these labels are essentially equally correct.  They’re each different ways of solving the problem, using different solving algorithms.

Changing the problem: Building explanations that answer questions and explain the reasoning behind the answer.

When reaching a stumbling block like this (compounded by the depressingly low performance of a question classification system that I built using these labels, which wasn’t reported in the paper), it’s sometimes helpful to take a step back and see if the analysis can be reframed to be more mechanical and less open to interpretation.  Ultimately we’re not studying science exams so that we can make the best multiple choice science exam solver on Earth.  We’re studying them because we’re interested in taking apart complex inference problems to understand how they work, how we can build question answering systems that automatically build explanations for their answers, and what kinds of knowledge and inference capacities we would need to make this happen.  So we decided to flip the analysis upside down: instead of looking at a question and trying to figure out which of the 7 AKBC2013 question types it might be, we would build very detailed explanations for each question, and perform our analyses on those explanations instead.

Discretizing explanations so they can be taken apart and analyzed

There are many challenges with building a corpus of explanations in this context.  One of the central issues is that we’d like to analyze these explanations for their knowledge and inference requirements, so they need to be amenable to automated analysis in some way.  In order to make this happen, we “discretized” explanations into sets of (roughly) simple (atomic) facts about the world, and some collection of these facts together would then form the explanation for why the answer to a given question is correct.  We also imposed a few further constraints, to make an automated analysis of the knowledge requirements possible:

  • Simple sentences: Each “fact” in an explanation was expressed as a short, simple sentence in elementary grade-appropriate language.  We tried to simplify sentences from existing resources for solving the exam (such as study guides, or simple wikipedia), but much of the time these weren’t available and we had to manually author simple sentences to suit.
  • Reuse: So that we could keep track of how often the same knowledge was used across different questions, we added the requirement that if the same knowledge (fact) was used in the explanations to multiple questions, it had to be written in the exact same way, so a simple string matching algorithm could pick up each of the questions a given fact was used in.  This made the annotation much more challenging to author, but is critical for the explanation-centered analysis.  (In later papers, such as the WorldTree Explanation Corpus (LREC 2018), we developed tools to make this go much more quickly).
  • Explicit linking between sentences in an explanation: To investigate how knowledge connects in an explanation, we also added the requirement that each fact in an explanation must explicitly “connect” to other facts in the explanation, or to the question or answer text.  We did this primarily because we’re interested in studying how to combine multiple facts to perform inference, a problem sometimes called information aggregation or multi-hop inference, and a major barrier to solving inference problems.  Here, in order to add this explicit linking, each sentence/fact in an explanation must share at least one word in common with the question, answer, or another sentence in the explanation.  This allows the explanations to act as “explanation graphs” connected on shared words, which is the foundation upon which the more recent (and much larger) WorldTree Explanation Graph Corpus for Multi-Hop Inference (LREC 2018) was built.
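The reuse and explicit linking constraints above can be sketched in a few lines.  This is a minimal illustration rather than the tooling we actually used: the stop-word list, the exact-match tokenization, and the example facts are all hypothetical, chosen only to show how shared words turn an explanation into a graph and how exact string matching tracks reuse.

```scala
object ExplanationGraphSketch {
  // Illustrative stop-word list (an assumption): function words don't count as links.
  val stopWords: Set[String] =
    Set("the", "a", "an", "of", "is", "are", "to", "in", "and", "which", "what", "for", "by")

  def contentWords(text: String): Set[String] =
    text.toLowerCase.split("[^a-z]+").filter(w => w.nonEmpty && !stopWords(w)).toSet

  // Undirected edges (i, j, sharedWords) with i < j, between any two texts
  // that share at least one content word.
  def sharedWordEdges(texts: Seq[String]): Seq[(Int, Int, Set[String])] = {
    val words = texts.map(contentWords)
    for {
      i <- words.indices
      j <- (i + 1) until words.length
      shared = words(i) intersect words(j)
      if shared.nonEmpty
    } yield (i, j, shared)
  }

  // Linking constraint: every fact must share a word with the question,
  // the answer, or another fact. Returns the facts that violate it.
  def unlinkedFacts(question: String, answer: String, facts: Seq[String]): Seq[String] = {
    val nodes = Seq(question, answer) ++ facts // facts start at node index 2
    val linked = sharedWordEdges(nodes).flatMap { case (i, j, _) => Seq(i, j) }.toSet
    facts.zipWithIndex.collect { case (fact, k) if !linked(k + 2) => fact }
  }

  // Reuse constraint: because a fact is always written the exact same way,
  // exact string matching counts how many questions each fact is used in.
  def factReuse(explanations: Map[String, Seq[String]]): Map[String, Int] =
    explanations.values.flatMap(_.distinct).groupBy(identity)
      .map { case (fact, uses) => fact -> uses.size }

  // Hypothetical example explanation for the decomposer question.
  val question = "Which of the following statements best describes how these bacteria help make soil more fertile?"
  val answer = "The bacteria break down dead plant and animal matter."
  val facts: Seq[String] = Seq(
    "bacteria are a kind of decomposer",
    "a decomposer breaks down dead organisms",
    "nutrients make soil fertile"
  )

  def main(args: Array[String]): Unit = {
    sharedWordEdges(Seq(question, answer) ++ facts).foreach { case (i, j, shared) =>
      println(s"node $i -- node $j via ${shared.mkString(", ")}")
    }
    println(s"Unlinked facts: ${unlinkedFacts(question, answer, facts)}")
  }
}
```

With these made-up facts every sentence connects to something else, so nothing is flagged.  Adding an unrelated sentence such as "the sun is a star" would be reported as unlinked, which is the kind of check that keeps an explanation graph connected.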

Some examples of these simple explanations are shown below:

Here we can see that each simple sentence in an explanation appears to embody one kind of knowledge.  I find it easier to think about these explanations visually (as explanation graphs), with the knowledge types labeled, and overlap between explanation sentences explicitly labeled.  Here’s such an example, from a slide used in the talk:

We performed a detailed examination of the explanations for 212 of these questions (approximately 1000 explanation sentences) by annotating the different kinds of relations we observed, and these are summarized in the table below:


Fine-Grained Explanatory Knowledge and Inference Methods

The table above provides a fine-grained list of knowledge and inference requirements to build detailed explanations for science exam questions, and is (to me) the most interesting part of the paper.  Just like the Clark et al. AKBC2013 paper, I read this table like a recipe book — if I want to be able to make a question answering system capable of answering and explaining the answers to elementary science questions, three quarters of which require some form of complex inference, these are the kinds of knowledge I have to have in my system, as well as a hint at the inference capacities required for combining that knowledge together.

There are a few parts of the table that are striking to me, that I’ll highlight:

  • Proportions are wildly different when you look at an explanation-centered analysis versus a top-down analysis:  The first relation in this table is taxonomic (kind-of) knowledge, which is found in 83% of explanations.  But when we performed the analysis top-down (the pie-chart above), we found only about 2% of questions were testing this same taxonomic knowledge.  That’s because the top-down analysis obscures many of the details of knowledge and inference requirements: it’s relatively easy to conjecture about how a question might be solved (the top-down method), but it’s also very misleading.  Requiring an annotator to specify all of the knowledge required to answer a question and explain its answer forces a rather detailed exposition of knowledge, and that’s where the most informative content appears to be.  Put another way: Using the top-down method, one might believe taxonomic knowledge unimportant, because it’s only central to answering 2% of questions.  In reality taxonomic knowledge is the most prevalent form of knowledge used on this explanation-centered inference task, and likely absolutely critical to having a complete and functioning inference system.
  • 21 fine-grained types: While many of the 7 AKBC types are easily visible, performing the analysis in this way, we’re able to identify much more fine-grained knowledge and inference types, as well as types (like coupled relationships, requirements, and transfers) that remained hidden in the earlier analysis.
  • N-ary relations: Relations are most often extracted from text and used in question answering as triples — sets of 2-argument (X – relation – Y) tuples, as in X-is a kind of-Y (such as that a cat is a kind of mammal).  What we observe here is that many relation types naturally have more arguments, as in the 5-argument “change” relation “melting (arg: who) changes a substance (arg: what) from a solid (arg: from) to a liquid (arg: to) by adding heat energy (arg: method)”, a sentence broadly applicable in questions about changes of states of matter.
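To illustrate the difference between triples and n-ary relations, here is a small sketch of how the two might be represented.  The type names and the matches helper are hypothetical, invented for this example; only the slot names (who, what, from, to, method) and the example sentences come from the discussion above.

```scala
object NaryRelationSketch {
  // A 2-argument triple: X - relation - Y.
  final case class Triple(x: String, relation: String, y: String)

  // An n-ary relation: a relation name plus an ordered list of named argument slots.
  final case class NaryRelation(relation: String, args: Seq[(String, String)]) {
    def arity: Int = args.size

    // Value filling a given slot, if the relation has that slot.
    def slot(name: String): Option[String] =
      args.collectFirst { case (n, v) if n == name => v }

    // True when every slot named in the pattern holds the requested value.
    def matches(pattern: Map[String, String]): Boolean =
      pattern.forall { case (name, value) => slot(name).contains(value) }
  }

  // "a cat is a kind of mammal" fits comfortably in a triple.
  val kindOf: Triple = Triple("cat", "is a kind of", "mammal")

  // The 5-argument "change" relation from the melting sentence does not.
  val melting: NaryRelation = NaryRelation("change", Seq(
    "who"    -> "melting",
    "what"   -> "a substance",
    "from"   -> "a solid",
    "to"     -> "a liquid",
    "method" -> "adding heat energy"
  ))

  def main(args: Array[String]): Unit = {
    println(s"${kindOf.x} ${kindOf.relation} ${kindOf.y}")
    println(s"'${melting.relation}' has ${melting.arity} arguments")
    // Which change takes a solid to a liquid?
    println(melting.matches(Map("from" -> "a solid", "to" -> "a liquid")))
  }
}
```

Keeping the slots named means a question about changes of state can be matched on the "from" and "to" arguments together, which is awkward to do once the same sentence has been split into independent 2-argument triples.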


This analysis is interesting, but can we use it to show some question answering models are solving more complex questions than others?

One of the natural questions one has when spending many months working on developing a new question answering model is:

“Is my ‘inference’ model actually answering more of the complex questions correctly, or is it simply doing better than the last model by answering more of the simpler questions correctly?”

Unfortunately, it’s traditionally been very challenging to answer this question, especially quickly, in an automated fashion.  Here we can use the two types of annotation we’ve generated to compare a “simple” question answering system (a model that answers questions by looking at term frequency, i.e. a tf.idf model), and a particular inference solver, the TableILP solver by Khashabi et al. (2016).  We can look at question answering accuracy broken down using two methods: one top-down, using the 7 AKBC2013 question types, and the other bottom-up, using the 21 fine-grained knowledge types from the detailed explanations to questions.

QA Performance: Top-down

Here, the L2R model is the simpler question answering system that tries to answer questions by looking up a pre-made answer in a database.  The ILP model is the TableILP inference model by Khashabi et al. (2016).  The simpler model answers about 43% of questions correctly, whereas the ILP inference model answers about 54% of questions correctly (a gain of +11 percentage points of accuracy over the simpler model, using the same knowledge).  When looking at the top-down analysis (middle columns), we see that this performance gain isn’t simply from the ILP inference model answering more of the simpler (“retrieval”) questions correctly: it’s also making substantial gains (up to +22%) on 3 out of the 4 complex inference type questions.  This gives us a method of validating that the inference method is doing what it claims to do, which is answering more of the harder, inference-type questions.

QA Performance: Bottom-up

Except that I just spent quite a bit of time convincing you that the top-down analysis is much less informative than the bottom-up analysis, so let’s have a look. There’s a lot going on in this table, so let’s spend a moment to orient.  The rows represent specific knowledge types identified in the fine-grained analysis.  The columns represent specific models (L2R, the simpler fact-retrieval model, and ILP, the inference model) paired with specific knowledge resources (a “corpus” of study guides and simple wikipedia, or the “tablestore”, a collection of science-relevant facts).  (Note that Stitch is another inference algorithm, but for simplicity I’ll focus on ILP — please see the paper for more details).  The numbers in the cells of this table represent the accuracy of a given question answering system on questions whose explanation requires a given knowledge type.  For example, in the first row, we see that for questions that require Taxonomic knowledge, the L2R model using the Tablestore knowledge resource answers 46% of these questions correctly.  The TableILP model answers 56% of these same questions correctly, meaning the inference model shows moderate gains for these taxonomic questions.   The easiest way to read this table is to look at the “Inference Advantage” column — if it’s pointing towards ILP, it means the inference model helped more on questions requiring a given knowledge type.
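The cells of a table like this are straightforward to compute once each question is tagged with the knowledge types in its explanation.  Below is a minimal sketch of that bookkeeping; the QuestionResult type, the function names, and the four toy questions are all made up for illustration, and the numbers it produces are not the paper's results.

```scala
object KnowledgeTypeAccuracySketch {
  // One evaluated question: the knowledge types its explanation requires, and
  // whether each model (keyed by name) answered it correctly.
  final case class QuestionResult(knowledgeTypes: Set[String], correct: Map[String, Boolean])

  // Accuracy of `model` on the questions whose explanation requires `knowledgeType`.
  // Returns None when no question requires that knowledge type.
  def accuracyFor(results: Seq[QuestionResult], model: String, knowledgeType: String): Option[Double] = {
    val relevant = results.filter(_.knowledgeTypes.contains(knowledgeType))
    if (relevant.isEmpty) None
    else Some(relevant.count(_.correct.getOrElse(model, false)).toDouble / relevant.size)
  }

  // "Inference advantage": positive values mean the inference model helped more
  // than the baseline on questions requiring this knowledge type.
  def advantage(results: Seq[QuestionResult], baseline: String, inference: String,
                knowledgeType: String): Option[Double] =
    for {
      b <- accuracyFor(results, baseline, knowledgeType)
      i <- accuracyFor(results, inference, knowledgeType)
    } yield i - b

  // Toy data (hypothetical): four questions and two models.
  val toyResults: Seq[QuestionResult] = Seq(
    QuestionResult(Set("taxonomic", "change"),  Map("L2R" -> true,  "ILP" -> true)),
    QuestionResult(Set("taxonomic"),            Map("L2R" -> false, "ILP" -> true)),
    QuestionResult(Set("taxonomic", "coupled"), Map("L2R" -> false, "ILP" -> true)),
    QuestionResult(Set("taxonomic", "coupled"), Map("L2R" -> false, "ILP" -> false))
  )

  def main(args: Array[String]): Unit = {
    val types = toyResults.flatMap(_.knowledgeTypes).distinct.sorted
    types.foreach { t =>
      val l2r = accuracyFor(toyResults, "L2R", t)
      val ilp = accuracyFor(toyResults, "ILP", t)
      println(s"$t: L2R=$l2r ILP=$ilp advantage=${advantage(toyResults, "L2R", "ILP", t)}")
    }
  }
}
```

Note that a question counts toward every knowledge type that appears in its explanation, so the rows overlap and do not sum to the overall accuracy.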

The take-away summary points from this table are:

  • The inference model provides a substantial performance boost to the questions requiring inference knowledge.  The highest gains are found in the “Inference Supporting” knowledge types, but “Complex Inference” types also show substantial gains.
  • While relative gains are high, absolute performance is still low in many areas.  The inference model helps almost all questions, but some questions requiring challenging kinds of inference still have a very low performance.  For example, for questions requiring coupled relationships, the simpler L2R model answers only 28% of these correctly — slightly higher than chance (25% on a 4-choice multiple choice exam).  The ILP inference model increases performance on these questions to 44%, which is much higher, but still one of the lowest performances.  In contrast, questions requiring some types of knowledge (examples, definitions, durations) achieve between 63-70% accuracy, highlighting the relative difficulty of these questions, and providing a solid area to target in future work to boost performance.


The Overall Take-away: Inference is challenging, but we can instrument it using detailed, explanation-centered analyses

Answering (and explaining the answers to) elementary science questions is easy for most 9-year-olds, but it is still largely beyond the capacity of current automated methods.  Here we show that top-down methods for analyzing the knowledge and inference requirements have many challenges, limitations, and inaccuracies, and that bottom-up explanation-centered methods can provide more detailed, fine-grained analyses.  This data can also be used to instrument question answering models to determine the kinds of knowledge and inference that a model performs well at, as well as identify particularly challenging knowledge and inference requirements that can be targeted to increase overall question answering performance.

Context in terms of contributions to subsequent work

This paper was critical to identifying the central kinds of knowledge and inference in standardized science exams, and has been extensively used in our subsequent work — most notably on the WorldTree Explanation Graph Corpus (LREC 2018), whose knowledge base contains many of the same types as this COLING2016 corpus, but extended to approximately 60 very fine-grained types.  The paper also piloted the explanation construction analysis methodology, which has been used and refined in subsequent work.  In the context of multi-hop inference, this paper also first quantified the average number of facts that need to be combined to build an explanation to an elementary science question (4 facts/question), where subsequent analyses on the larger WorldTree explanation corpus refined this to an average of 6 facts per explanation, when building explanations targeted at the explanatory detail required to be meaningful to a young child.

Postdoctoral Position Available

I have a position open for a postdoctoral scholar in my lab, primarily centered around a project in explanation-centered inference (more details below). Folks with interdisciplinary backgrounds (for example, but not limited to: cognitive science) are encouraged to apply — the most important qualifications are that you’re comfortable writing software, that you’re fascinated by the research problem, and that you feel you have tools in your toolbox (that you’ll enjoy expanding after joining the lab) to make significant progress on the task.

The start date is flexible, and we’ll review applications as they come in until the position is filled. If you have any questions, please feel free to get in touch: pajansen@email.arizona.edu

April 2019 Note: We are still actively seeking applicants for this position. The start date is flexible. Interdisciplinary applicants comfortable writing software are encouraged to apply.

Postdoctoral Research Associate I

Position Summary
The Cognitive Artificial Intelligence Laboratory ( http://www.cognitiveai.org ) in the School of Information at the University of Arizona invites applications for a Postdoctoral Research Associate for projects specializing in natural language processing and explanation-centered inference.

Natural language processing systems are steadily increasing performance on inference tasks like question answering, but few systems are able to provide explanations describing why their answers are correct. These explanations are critical in domains like science or medicine, where user trust is paramount and the cost of making errors is high. Our work has shown that one of the main barriers to increasing inference and explanation capability is the ability to combine information – for example, elementary science questions generally require combining between 6 and 12 different facts to answer and explain, but state-of-the-art systems generally struggle integrating more than two facts together. The successful candidate will combine novel methods in data collection, annotation, representation, and algorithmic development to exceed this limitation in combining information, and apply these methods to answering and explaining science questions.

A talk on our recent work in this area is available here: https://www.youtube.com/watch?v=EneqL2sr6cQ

Minimum Qualifications
– A Ph.D. in Computer Science, Information Science, Computational Linguistics, or a related field.
– Demonstrated interest in natural language processing, machine learning, or related techniques.
– Excellent verbal and written communication skills

Duties and Responsibilities
– Engage in innovative natural language processing research
– Write and publish scientific articles describing methods and findings in high-quality venues (e.g. ACL, EMNLP, NAACL, etc.)
– Assist in mentoring graduate and undergraduate students, and the management of ongoing projects
– Support writing grant proposals for external funding opportunities
– Serve as a collaborative member of a team of interdisciplinary researchers

Preferred Qualifications (One or more of the following, note not required for application)
– Knowledge of computational approaches to semantic knowledge representation, graph-based inference, and/or rule-based systems
– Experience applying machine learning methods to question answering tasks
– Knowledge of or interest in graphical visualization and/or user interface design
– Strong scholarly writing skills and publication record

Full Posting/To Apply

Contact Information for Candidates Questions
Peter Jansen ( pajansen@email.arizona.edu )

Other Information
Tucson has been rated “the most affordable large city in the U.S.” and was the first city in the US to be designated as a World City of Gastronomy by the United Nations Educational, Scientific, and Cultural Organization (UNESCO). With easy access to both a vibrant arts and culture scene and outdoor activities ranging from hiking to rock climbing to bird watching, Tucson offers a bit of something for everyone.

The University of Arizona is committed to meeting the needs of its multi-varied communities by recruiting diverse faculty, staff, and students. The University of Arizona is an EEO/AA-M/W/D/V Employer. As an equal opportunity and affirmative action employer, the University of Arizona recognizes the power of a diverse community and encourages applications from individuals with varied experiences, perspectives, and backgrounds.

Outstanding UA benefits include health, dental, vision, and life insurance; paid vacation, sick leave, and holidays; UA/ASU/NAU tuition reduction for the employee and qualified family members; access to UA recreation and cultural activities; and more!

The University of Arizona has been listed by Forbes as one of America’s Best Employers in the United States and WorldatWork and the Arizona Department of Health Services have recognized us for our innovative work-life programs.

AI2 Talk: What’s in an Explanation? Toward Explanation-centered Inference for Science Exams

I recently gave a talk at the Allen Institute for Artificial Intelligence (AI2) on my work in explanation-centered inference for solving standardized science exams. This talk is a good high-level introduction to three recent papers on understanding the kinds of knowledge and inference required to build explanations (COLING 2016), our work in building a very large corpus of semi-structured explanations (the largest I’m aware of) to help us learn how to combine large amounts of information for inference (LREC 2018), and examining this corpus for common explanatory patterns that would help us make the task of building new explanations easier (AKBC 2017).

We’re very excited about this new knowledge resource that’s been two years in the making, and its potential for exploring explanation-centered inference. The Worldtree corpus is available here, at the Explanation Bank!