In a recent batch of papers I reviewed for NAACL, I was surprised at the large number of papers that I reviewed that claimed to be producing state-of-the-art models or introducing novel mechanisms to existing models, but failed to include any inferential statistics supporting those claims. The story of these papers typically takes one of the following forms:

**The State-Of-The-Art:**The paper claims to achieve state-of-the-art performance on a competitive task (i.e. achieving the best performance on a task leaderboard), but the difference between the current-top model and the proposed model was small (typically under 0.5%).**The New Mechanism:**The paper spends several pages developing a compelling narrative for how a complex feature or mechanism will substantially increase performance compared with existing methods, but shows small gains compared to existing methods, and/or does not provide a statistical comparison with those methods.**The Unverified Ablation:**A new feature is proposed, but the ablation study suggests the overall contribution to model performance is small — so small that a model with the new feature may not be significantly different than the full model.

Being an interdisciplinary researcher I’ve transitioned fields several times. My graduate work was done in computational cognitive modeling in a psychology and neuroscience department, where most of the graduate courses one takes are statistics and research methods to learn how to run experiments in progressively more challenging experimental scenarios where the noise may be high, and it’s unclear if you’re measuring what you think you’re measuring. As a field, natural language processing regularly publishes work in top venues without including inferential statistics, and this can place strong limits on what can be inferred from the claims of a research article, which only hurts the field as a whole, and adds noise to the science of understanding the phenomena that we invest a great deal of resources into studying.

This recent reviewing experience suggested to me that it might be helpful to write a short primer on a common and easy-to-implement statistical method (non-parametric bootstrap resampling), in the hopes that others find it useful. The disclaimers are that I’m a natural language processing researcher, not a statistician, and consider myself an end-user rather than a researcher of statistical methods. It’s always critical to understand your data and measurement scenarios, and how those interact with the assumptions of the statistics that you choose.

**The Experiment Setup**

The technique discussed here compares the performance of two systems — the *baseline* system, and the *experimental* system. In NLP, the common scenarios this tends to apply to are:

**The Better Model:**You have created a new model that achieves higher accuracy on some task than the previous state-of-the-art system. Here, the previous state-of-the-art is the baseline system, and your new model is the experimental system.**The Better Feature:**You have created a new feature (or, group of features) and added this to an existing model. Here, the model without the feature (or group of features) is the baseline system, and the model with the features is the experimental system.**The Ablation Study:**You have a model that contains many different components, and are running an ablation study to determine the relative contribution of each component to the overall performance of the system. Ablation studies are commonly reported as comparing the performance of removing a single component of the system, versus the full system. Here, the baseline is the system with the component removed (which should be exhibiting lower performance than the full system). The experimental system is the full system.**The Simpler Model Story:**You have a model that achieves slightly lower than state-of-the-art performance, but using a model that’s simpler, more interpretable, faster to compute, or some other similar story. The performance of your new, simpler, faster, more interpretable system is so close to the state-of-the-art that you believe the difference between the two models is likely not significantly different. Here, the system with lower performance is the baseline, and the higher performing system is the experimental system. Unlike the other cases, here you are looking for the p-value to be large — traditionally greater than 0.05 (and ideally much larger, i.e. not significantly different).

**What you need: Scores from each model**

To perform a statistical comparison, you will need two sets of scores — one from the baseline system, and one from the experimental system. The scores should be ordered — that is, kept in two parallel arrays, one containing the scores for the baseline system, one for the experimental system, where the index *n* for each array represents the baseline or experimental performance for the same piece of input data.

To make this a bit more concrete, in my field of question answering model performance is typically measured in terms of accuracy (the proportion of questions answered correctly by a model). If I was evaluating on a hypothetical set of 10 questions, the baseline score and experimental score arrays would each have 10 elements. BaselineScores(4) and ExperimentScores(4) would represent the scores of both models on the same question. An example of this is listed in the table below, where a score of “1” represents a given method answered the question correctly, and a score of “0” means the question was answered incorrectly. The average performance of the baseline model is 50% accuracy, and the experiment model is 60% accuracy.

Question # | Question Text | Baseline Score | Experimental Score | Difference | Desc. |
---|---|---|---|---|---|

0 | Which of… | 0 | 1 | +1 | Exp. helps |

1 | In what year… | 1 | 1 | 0 | Both correct |

2 | The largest… | 1 | 0 | -1 | Exp. hurts |

3 | Which city… | 0 | 1 | +1 | Exp. helps |

4 | In what country… | 0 | 1 | +1 | Exp. helps |

5 | When the… | 1 | 0 | -1 | Exp. hurts |

6 | How might… | 0 | 1 | +1 | Exp. helps |

7 | What is one… | 1 | 1 | 0 | Both correct |

8 | Which person… | 0 | 0 | 0 | Both incorrect |

9 | The amount of… | 1 | 0 | -1 | Exp. hurts |

Average Accuracy |
50% |
60% |

The intuition here is that the performance difference is large — 50% vs 60%, or a +10% gain, so whatever clever experimental system we developed is *clearly superior* to the baseline method. Unfortunately this intuition is incorrect, and there are many factors that affect whether an experimental model is statistically significant over a baseline model, including:

- the sample size (here 10 samples, from 10 questions)
- the total number of questions the experimental model benefits or helps over the baseline model (here 4)
- the total number of questions that the experimental model hurts compared to the baseline model (here 3).

The overall difference between the number of questions helped and hurt (here, 4 – 3 or +1) is the *difference score*, expressed in number of samples. Here in this question answering example with 10 samples, we can convert this to the difference in accuracy between baseline and experimental models by dividing the difference by the number of samples, +1/10, to find the +10% performance benefit over baseline.

To determine whether the baseline and experimental models are significantly different, we can use a variety of statistical tests. Here we’ll look at a popular statistic, non-parametric bootstrap resampling.

**Non-parametric Bootstrap Resampling**

The procedure for non-parametric bootstrap resampling is as follows:

For some large number *N*, which represents the number of bootstrap resamples being taken:

- Randomly draw K samples from the difference scores, with replacement (i.e. the same sample can be included in the distribution more than once). K should be equal to the original number of samples (i.e. in the example above with 10 questions, K should be equal to 10)
- Sum these randomly sampled difference scores, to determine whether (on the balance) this resampled distribution shows the experimental model as helping (sum > 0) or not helping (sum <= 0). Record this outcome (helped or hurt), regardless of the magnitude.
- Repeat this procedure N times, recording the proportion of runs that show the experimental model not helping versus the total number of runs. This proportion is the p-value.

Given that this resampling procedure can happen very quickly, and an accurate estimate of *p* is important, a typical value of N might be 10,000 samples. Unless the number of samples is large, this can usually be computed in a few seconds. To give a concrete example, if (with a given set of data) the experimental model helps in 9,900 resamples out of 10,000 resamples total, then the p-value would be (10,000 – 9,900) / 10,000 or p = 0.01 .

**Bootstrap Resampling Code (in Scala)**

The following Scala code implements the bootstrap resampling procedure, and includes two examples:

**Example 1:**The 10-sample question answering data example from above, with the scores manually entered.**Example 2:**A function that generates artificial data to play with, and gain intuition about how differently sized data with different numbers of questions helped and hurt by the experimental model generate different p-values.

The normal use case is that you typically will run your baseline and experimental models, and save the scores for each sample (e.g. question) from each model to a separate text file in the same order (e.g. baseline.txt, experimental.txt). You can then load these scores into separate arrays, and use the code provided below.

import collection.mutable.ArrayBuffer /** * Quick tool to compute statistical significance using bootstrap resampling * User: peter * Date: 9/18/13 */ object BootstrapResampling { def computeBootstrapResampling(baseline:Array[Double], experimental:Array[Double], numSamples:Int = 10000): Double = { val rand = new java.util.Random() // Step 1: Check input sizes of baseline and experimental arrays are the same if (baseline.size != experimental.size) throw new RuntimeException("BootstrapResampling.computeBootstrapResampling(): ERROR: scoresBefore and scoresAfter have different lengths") val numDataPoints = baseline.size // Step 2: compute difference scores val deltas = new ArrayBuffer[Double] for (i <- 0 until baseline.size) { val delta = experimental(i) - baseline(i) deltas.append(delta) } // Step 3: Resample 'numSample' times, computing the mean each time. Store the results. val means = new ArrayBuffer[Double] for (i <- 0 until numSamples) { var mean: Double = 0.0 for (j <- 0 until numDataPoints) { val randIdx = rand.nextInt(numDataPoints) mean = mean + deltas(randIdx) } mean = mean / numDataPoints means.append(mean) } // Step 4: Compute proportion of means at or below 0 (the null hypothesis) var proportionBelowZero: Double = 0.0 for (i <- 0 until numSamples) { println ("bootstrap: mean: " + means(i)) if (means(i) <= 0) proportionBelowZero += 1 } proportionBelowZero = proportionBelowZero / numSamples // debug println("Proportion below zero: " + proportionBelowZero) // Return the p value proportionBelowZero } // Create artificial baseline and experimental arrays that contain a certain number of samples (numSamples), // with some number that are helped by the experimental model (numHelped), and some that are hurt (numHurt). def makeArtificialData(numSamples:Int, numHelped:Int, numHurt:Int):(Array[Double], Array[Double]) = { val baseline = Array.fill[Double](numSamples)(0.0) val experimental = Array.fill[Double](numSamples)(0.0) // Add helped for (i <- 0 until numHelped) { experimental(i) = 1 // Answered correct by the experimental model (but not the baseline model) } // Add hurt for (i <- numHelped until (numHelped + numHurt)) { baseline(i) = 1 // Answered correctly by the baseline model (but not the experimental model) } // Return (baseline, experimental) } def main(args:Array[String]): Unit = { // Example 1: Manually entering data val baseline = Array[Double](0, 1, 1, 0, 0, 1, 0, 1, 0, 1) val experimental = Array[Double](1, 1, 0, 1, 1, 0, 1, 1, 0, 0) val p = computeBootstrapResampling(baseline, experimental) println ("p-value: " + p) // Example 2: Generating artificial data val (baseline1, experimental1) = makeArtificialData(numSamples = 500, numHelped = 10, numHurt = 7) val p1 = computeBootstrapResampling(baseline1, experimental1) println ("p-value: " + p1) } }

And the example output:

Example 1:…

bootstrap: mean: -0.2

bootstrap: mean: -0.6

bootstrap: mean: -0.1

bootstrap: mean: -0.1

bootstrap: mean: -0.3

bootstrap: mean: -0.3

bootstrap: mean: -0.2

bootstrap: mean: 0.4

bootstrap: mean: 0.1

bootstrap: mean: 0.3

bootstrap: mean: 0.1

bootstrap: mean: -0.4

Proportion below zero: 0.4316

p-value: 0.4316

The code includes debug output that displays both the mean of the difference scores for each boostrap resample iteration (allowing you to visually see the proportion above or below zero), as well as a summary display showing the overall proportion of bootstrap resampling iterations below zero.

This shows that in spite of there being a +10% performance difference between the two question answering models in the toy example, this difference has a 43% chance of being due to random chance rather than a real difference between these groups. This is likely owing both to the small number of samples, as well as the relative proportion of questions helped vs hurt (many questions are aided by the experimental model, but a nearly equal number are hurt by it). Here, we would say that the performance of the two models are not significantly different (p < 0.05).

**Additional Examples of the Behavior of Non-parametric Bootstrap Resampling**

Using the code below, we can generate artificial data that investigates the expected p-values for specific effect sizes paired with datasets of different sizes. We can also investigate the effect of having a larger number of samples helped/hurt (i.e. changed) compared to the baseline, while controlling for the same effect size, to help gain an intuition for how this statistic behaves under specific scenarios. Here is the code example, added to main(), for a given effect size and sample size:

// Example 3: Generating artificial data val effectSize:Double = 2.0 val sampleSize:Double = 500 val pValues = new ArrayBuffer[(Double, Double)] for (i <- 0 until 20) { val numHelped = (((i.toDouble/100.0) * sampleSize) + ((effectSize/100.0) * sampleSize)).toInt val numHurt = ((i.toDouble/100.0) * sampleSize).toInt val (baseline2, experimental2) = makeArtificialData(numSamples = sampleSize.toInt, numHelped, numHurt) val p2 = computeBootstrapResampling(baseline2, experimental2) println("p-value: " + p2) pValues.append( (i, p2) ) } println ("") println ("Effect Size: " + effectSize) println ("Sample Size: " + sampleSize) println ("pValues:") for (pair <- pValues) { println (pair._1 + "\t" + pair._2.formatted("%3.5f")) }

**Example: The Small Evaluation Dataset (100 Samples)**

Having an evaluation set (i.e. a development or test set) that contains only 100 items is not uncommon in a number of situations:

**New Task:**You have started working on a new task, and are doing exploratory analyses with little data to work with. (I remember when we started investigating question answering on standardized science exams and had collected few exams, our evaluation set was only 68 questions. Now it’s approximately 3,000).**Low Resource Domain:**You are operating in a domain with few readily-available resources, and you are not able to easily collect more.**Existing Dataset:**You are comparing against an existing dataset that has small development and/or test folds. For example, the TREC QA sets can have as few as 65 questions for evaluating, meaning answering a single additional question correct increases QA accuracy by 1.5% (!).**Expensive Collection:**You are collecting data, but have a task where data is expensive to collect. For example, you need to collect or transcribe sensitive data in the medical domain.**Expensive Annotation:**Your task requires annotating data, and you this annotation is particularly difficult or time-consuming, limiting the amount of annotation you are able to generate for training or evaluation.

Here with 100 samples, an effect size of 1% means that (for example) an experimental question answering system would answer only 1 more question correctly than the baseline system. As the plot below shows, with such small evaluation sets it’s very hard to detect that an experimental system is significantly different from a baseline system without very large effect sizes, and where few samples are hurt by the experimental system. Here, even for an effect size of 2% (e.g. a baseline accuracy of 70%, and an experimental accuracy of 72%), and a best-case scenario where the experimental system only helps (and not hurts) any of the samples in the dataset, the expected p-value is still approximately 0.13. Even with an effect size of 5%, if the proportion of samples hurt by the experimental system exceeds 2% (i.e. the experimental system answers 7 more questions correctly over the baseline system, but also answers 2 questions incorrectly that the baseline answered correctly), then the p-value begins to exceed the generally accepted threshold of p = 0.05.

**Cautionary Tale: ** With small datasets where it can be challenging to achieve statistical significance, after a promising experiment that doesn’t quite meet the threshold it’s often tempting to become overconfident in p-values “trending towards significance” (<0.10 to 0.20) but that don’t quite meet the threshold. Many folks have experience with an experiment such as this, where a large effect that seemed promising disappears after more investigation. It’s important to stress that in the business of doing science, it’s critical to find this out sooner rather than later, so that you can make your mistakes cheaply, and not invest time and resources in efforts that ultimately are unlikely to pan out. For example, in our early days of science exam question answering (circa 2013) when we had collected only 68 questions for our test set, a question answering model that included word embeddings (a now popular technique, but up-and-coming in 2013) showed a large +5% gain in accuracy over a model without embeddings. This was extremely promising, and I personally spent a great deal of time pursuing this only to find after great effort that the effect ultimately disappeared with more data, and that word embedding models were notoriously challenging to train for domains with little data (like elementary science). It ended up being years later before methods of training word embeddings for this domain were demonstrated, but not before I spent quite a bit of time on a promising-seeming effect that ultimately was due to small sample sizes rather than a genuine benefit of the model. One of my mentors in graduate school used to say that science is about making your mistakes cheaply, and he would often quote Nobel laureate Richard Feynman, who famously said (somewhat paraphrased) that “science is about not fooling yourself, when you’re the easiest one to fool”. Since my experience, I always try to err on the side of collecting more data when viable, since a week collecting more data is broadly more useful than a week wasted on an effect that isn’t there.

**Example: Evaluation Set Size of 500 samples**

Expected p-values for an evaluation set of 500 samples is shown below. Here, it’s generally possible to detect that groups are significantly different with high confidence if the difference between baseline and experimental groups is greater than +5%. Differences of +2% may also fall below the p<0.05 threshold if the experimental model doesn’t hurt too many samples — i.e. incorrectly answer questions that the baseline model answers correctly.

**Example: Evaluation Set Size of 2,000 samples**

At an evaluation set size of 2,000 samples, it’s generally possible to detect that an experimental model is significantly better than a baseline model when the effect size is approximately +2% or greater. It’s also possible to detect this difference if the effect size is on the order of 1%, as long as the experiment model doesn’t also hurt questions that the baseline model was answering correctly.

**Example: Evaluation Set Size of 10,000 samples**

Large-scale datasets are becoming increasingly common due to crowdsourcing and large-scale data collection — for example, in question answering, the MS Machine Reading Comprehension (MS MARCO) dataset now contains over 1 million (!) questions. These large datasets enable even very small differences between baseline and experimental models to be detected with high confidence. With a hypothetical evaluation set of 10,000 samples, even groups with very small differences in performance (i.e. greater than +0.50%) may be significantly different, unless the experimental model is also hurting a large number of questions that the baseline model is answering correctly.

**Summary**

Non-parametric bootstrap resampling can provide an easy-to-implement method of comparing whether the difference between two groups (a baseline model and an experimental model) is significantly different in natural language processing experiments. The sample data provided help build an intuition for what p-values one might expect for specific effect sizes (here, in terms of accuracy) when using evaluation sets of specific sizes. Experimental models rarely only help the performance of each sample on large datasets, and the proportion of questions helped vs hurt can substantially reduce the likelihood that the differences between groups are significantly different.