visually-grounded-language-without-vision-findings-emnlp2020-slides

Visually-Grounded Planning Without Vision (Talk)

Interested in NLP and vision for virtual robotic agent tasks? Here’s a talk on my “no-vision vision” Findings of EMNLP 2020 paper, “Visually-Grounded Planning Without Vision: Language Models Infer Detailed Plans from High-Level Instructions”. Recorded for the Spatial Language Understanding (SpLU 2020) workshop at EMNLP 2020.

TL;DR: Language models can successfully reconstruct 26% of long visual semantic plans in the ALFRED virtual robotic agent task using only text — i.e. without visual input. If you allow the language model to know where the virtual robotic agent should start from in the environment, this performance increases to 56%! But, for the other 44% of cases that are not correct, the robot agent sometimes microwaves forks.

Paper, code, data, output, and analyses are available at this ALFRED-GPT-2 Github Repository (cognitiveailab).




Using Tensorflow Ranking Bert (TFR-Bert), an end-to-end example

Recently an interesting paper (Han et al., “Learning-to-Rank with BERT in TF-Ranking”) appeared on arXiv that combines Tensorflow Ranking with BERT to perform ranking (or re-ranking) on BERT-encoded queries and documents. This seems like a major step in ranking algorithms, combining the power of large language model embeddings with ranking tasks. The paper reports using the TFR-BERT model to rank answers in the MS MARCO dataset, achieving high performance.

The code was recently released, but unfortunately without end-to-end examples for getting a model trained and generating predictions on unseen data. Since I’m not a Tensorflow or Tensorflow Ranking expert, or a frequent Python user, getting it running involved many non-obvious steps that took some digging to figure out. As far as I could find, this is the first successful demonstration of using TFR-BERT end-to-end, so I thought I’d put together example code and a short tutorial in the hope that this helps other folks get running quicker.

A word of note: TFR-BERT appears to have large computational requirements, even for large language models. Estimates of this are described at the bottom of this tutorial.

Preconditions

In this example I’m using a conda environment that has Python 3.7.7, and the Tensorflow (GPU) and other supporting dependencies installed. Here is the requirements.txt for my conda environment.

Before starting, I would recommend reading the instructions in the README.md of the Tensorflow Ranking repository, and installing any additional dependencies (like bazel). The TFR-BERT extensions and README are in the tensorflow_ranking/extension path, and are worth reading beforehand, too.

Step 1: Clone the example repository

This tutorial uses wrappers, helpers, and example data I’ve put together in an end-to-end TFR-BERT example forked repository together with the official code base, so I would recommend cloning it to get started. Once you’re an expert, you can clone the official Tensorflow Ranking repository here, which includes the TFR-BERT code.

git clone https://github.com/cognitiveailab/ranking.git

Step 2: Download BERT checkpoints in Tensorflow 2 format.

Google recently released checkpoints for smaller versions of BERT that run faster on more modest hardware. These include BERT-Tiny, BERT-Mini, BERT-Small, and BERT-Medium, to complement the existing BERT-Base and BERT-Large models. TFR-BERT is a bit of a heavy model and requires fairly serious computational resources, so like the TFR team I would recommend debugging/developing with smaller checkpoints until you need to scale.

TFR-BERT requires BERT checkpoints in the Tensorflow 2 (TF2) format, which are (as of this writing) a little challenging to find pre-generated as typically the checkpoints are released in TF1 format. Converting between TF1 and TF2 takes a bit of tinkering with the conversion script, or you’re welcome to use the TF2 checkpoints linked below that I’ve converted (though, again, I’m not an expert on Tensorflow, so if you see something amiss, please send me a note).
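The checkpoint directory names used in the scripts in this tutorial (e.g. uncased_L-12_H-768_A-12_TF2) encode the model size: L is the number of layers, H the hidden size, and A the number of attention heads. As a minimal sketch, the Python below parses such a name and maps it to the familiar model label. The helper name describe_checkpoint is made up for illustration, and the size table is my summary of the published BERT configurations rather than anything read from the checkpoints themselves.

```python
import re

# (layers, hidden size) -> common model name, per the published BERT sizes.
BERT_SIZES = {
    (2, 128): "BERT-Tiny",
    (4, 256): "BERT-Mini",
    (4, 512): "BERT-Small",
    (8, 512): "BERT-Medium",
    (12, 768): "BERT-Base",
    (24, 1024): "BERT-Large",
}


def describe_checkpoint(dir_name):
    """Parse a directory name such as 'uncased_L-12_H-768_A-12_TF2'.

    Hypothetical helper for illustration; not part of TFR-BERT.
    """
    match = re.search(r"L-(\d+)_H-(\d+)_A-(\d+)", dir_name)
    if match is None:
        raise ValueError(f"Unrecognised checkpoint name: {dir_name}")
    layers, hidden, heads = (int(group) for group in match.groups())
    return {
        "name": BERT_SIZES.get((layers, hidden), "unknown"),
        "layers": layers,
        "hidden": hidden,
        "heads": heads,
        # Uncased checkpoints are the ones that need --do_lower_case later on.
        "uncased": dir_name.startswith("uncased"),
    }


base_info = describe_checkpoint("uncased_L-12_H-768_A-12_TF2")
mini_info = describe_checkpoint("uncased_L-4_H-256_A-4_TF2")
print(base_info["name"], mini_info["name"])
```

This is mostly useful as a sanity check when switching between small and large checkpoints during development.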

Step 3: Convert your ranking problems into an appropriate format

Internally, TFR-BERT loads training and evaluation files that are lists of BERT-encoded query-document pairs, converted into the ELWC (ExampleListWithContext) format used by Tensorflow Ranking and then saved as a TFRecord. It’s a little challenging to do this, so I’ve put together a quick utility and set of helper functions to convert between a simple JSON format and their format.

The input JSON format:

{"rankingProblems": [
    {"queryText": "Where can you buy cat food?",
     "documents": [
        {"relevance": 3, "docText": "The pet food store"},
        {"relevance": 1, "docText": "Bicycles have two wheels"},
        {"relevance": 3, "docText": "The grocery store"},
        {"relevance": 2, "docText": "Cats eat cat food"}
        ]
    },
    {"queryText": "Where can you go swimming?",
     "documents": [
        {"relevance": 2, "docText": "At the lake"},
        {"relevance": 3, "docText": "In a swimming pool"}, 
        {"relevance": 1, "docText": "In a cloud"},
        {"relevance": 1, "docText": "On a pile of rocks"},
        {"relevance": 1, "docText": "In a garden"}        
        ]
    },
    {"queryText": "What helps to build a campfire?",
     "documents": [
        {"relevance": 1, "docText": "Rocks"},
        {"relevance": 2, "docText": "Tinder"}, 
        {"relevance": 3, "docText": "Wood"}, 
        {"relevance": 3, "docText": "Match"},
        {"relevance": 1, "docText": "Potato"},
        {"relevance": 1, "docText": "Can of soup"},
        {"relevance": 1, "docText": "Marshmallow"},
        {"relevance": 1, "docText": "Hot dog"},
        {"relevance": 1, "docText": "Rice"},
        {"relevance": 1, "docText": "Pot and pan"}
        ]
    }
    ]
}

Here, in the JSON input format, rankingProblems is a list of query-document pairs that define each of the ranking problems in your train, development, or test set. Each ranking problem has a query string (queryText), and a list of documents. Each document is an object containing the document text (docText), and a gold relevancy score (relevance) represented as an integer. The document list is unordered, and can be stored in any order (as shown). Higher relevancy scores mean the documents are more relevant for the query. 
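If your data lives somewhere other than JSON, a small script can produce this format. The sketch below is a minimal Python example, assuming your data is available as (document text, relevance) pairs per query. The function names make_ranking_problem and validate are hypothetical and not part of the repository; only the field names (rankingProblems, queryText, documents, docText, relevance) come from the format described above.

```python
import json


def make_ranking_problem(query_text, scored_docs):
    """Build one ranking problem from (docText, relevance) pairs."""
    return {
        "queryText": query_text,
        "documents": [
            {"relevance": int(relevance), "docText": text}
            for text, relevance in scored_docs
        ],
    }


def validate(dataset):
    """Raise ValueError if the dataset does not match the JSON input format."""
    problems = dataset.get("rankingProblems")
    if not isinstance(problems, list):
        raise ValueError("'rankingProblems' must be a list")
    for problem in problems:
        if not isinstance(problem.get("queryText"), str):
            raise ValueError("each ranking problem needs a string 'queryText'")
        for doc in problem.get("documents", []):
            if not isinstance(doc.get("docText"), str):
                raise ValueError("each document needs a string 'docText'")
            if not isinstance(doc.get("relevance"), int):
                raise ValueError("each document needs an integer 'relevance'")


dataset = {
    "rankingProblems": [
        make_ranking_problem(
            "Where can you buy cat food?",
            [("The pet food store", 3), ("Bicycles have two wheels", 1)],
        )
    ]
}
validate(dataset)
json_text = json.dumps(dataset, indent=4)
print(json_text)
```

Writing json_text to a file gives you something the conversion step can consume.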

I’ve put toy train and evaluation examples in the repository, to help illustrate how you might convert your own data into this JSON format.

Conversion script (JSON to TFRecord):

Once you have the data in JSON format, you need to convert it into the TFRecord format used by TFR-BERT. A conversion script that runs the tool is available here:

#!/bin/bash
BERT_DIR="/home/peter/github/tensorflow/ranking/uncased_L-12_H-768_A-12_TF2"  && \
python tensorflow_ranking/extension/examples/tfrbert_convert_json_to_elwc.py \
    --vocab_file=${BERT_DIR}/vocab.txt \
    --sequence_length=128 \
    --input_file=/home/peter/github/peter-ranking/ranking/TFRBertExample-eval.json \
    --output_file=eval.toy.elwc.tfrecord \
    --do_lower_case 

The critical bits here are ensuring that BERT_DIR points to the BERT model checkpoint you’re using, that input_file points to your input JSON, and that output_file is the TFRecord file that you’d like generated. sequence_length should be set to the maximum sequence length your model will be trained on (commonly, 128 tokens), and --do_lower_case should be set if you’re using uncased BERT models. Successfully running this script should output something similar to:

(tfranking-bert) peter@neutronium:~/github/peter-ranking/ranking$ ./tfrbert_convert_json_to_elwc.sh  
2020-09-07 17:34:52.464134: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Utility to convert between JSON and ELWC for TFR-Bert

Model Parameters:  
Vocabulary filename: /home/peter/github/tensorflow/ranking/uncased_L-12_H-768_A-12_TF2/vocab.txt
sequence_length: 128
do_lower_case: True

Input file:  /home/peter/github/peter-ranking/ranking/TFRBertExample-eval.json
Output file: eval.toy.elwc.tfrecord
Success.

Run this to convert each file (train, development, and test) in your dataset.
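To avoid editing the shell script for every split, the conversion can be driven from a small loop. This is a minimal Python sketch under a few assumptions: the flags are exactly those shown in the script above, the toy file names are the ones used in this tutorial, and the helper names (conversion_command, convert_all) are made up for illustration. By default it only builds the commands; pass run=True to actually execute them.

```python
import subprocess

# Path of the conversion utility, as used in the shell script above.
CONVERTER = "tensorflow_ranking/extension/examples/tfrbert_convert_json_to_elwc.py"

# Input JSON -> output TFRecord, one entry per split (add your test set here).
SPLITS = {
    "TFRBertExample-train.json": "train.toy.elwc.tfrecord",
    "TFRBertExample-eval.json": "eval.toy.elwc.tfrecord",
}


def conversion_command(bert_dir, input_json, output_tfrecord,
                       sequence_length=128, uncased=True):
    """Build the argument list for one run of the conversion utility."""
    cmd = [
        "python", CONVERTER,
        f"--vocab_file={bert_dir}/vocab.txt",
        f"--sequence_length={sequence_length}",
        f"--input_file={input_json}",
        f"--output_file={output_tfrecord}",
    ]
    if uncased:
        cmd.append("--do_lower_case")
    return cmd


def convert_all(bert_dir, run=False):
    """Build (and optionally run) the conversion command for every split."""
    commands = [conversion_command(bert_dir, src, dst) for src, dst in SPLITS.items()]
    if run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # stops at the first failing split
    return commands


commands = convert_all("/path/to/uncased_L-12_H-768_A-12_TF2")
for cmd in commands:
    print(" ".join(cmd))
```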

Step 4: Train your TFR-BERT model

This step largely proceeds as in the official TFR-BERT documentation, using the tfrbert_example.py training example. Here’s a script to run it:

#!/bin/bash
#BERT_DIR="/home/peter/github/tensorflow/ranking/uncased_L-4_H-256_A-4_TF2"  && \
BERT_DIR="/home/peter/github/tensorflow/ranking/uncased_L-12_H-768_A-12_TF2"  && \
OUTPUT_DIR="/tmp/tfr/model-petertoy-bertbase/" && \
DATA_DIR="/home/peter/github/peter-ranking/ranking" && \
rm -rf "${OUTPUT_DIR}" && \
bazel build -c opt \
tensorflow_ranking/extension/examples:tfrbert_example_py_binary && \
./bazel-bin/tensorflow_ranking/extension/examples/tfrbert_example_py_binary \
   --train_input_pattern=${DATA_DIR}/train.toy.elwc.tfrecord \
   --eval_input_pattern=${DATA_DIR}/eval.toy.elwc.tfrecord \
   --bert_config_file=${BERT_DIR}/bert_config.json \
   --bert_init_ckpt=${BERT_DIR}/bert_model.ckpt \
   --bert_max_seq_length=128 \
   --model_dir="${OUTPUT_DIR}" \
   --list_size=15 \
   --loss=softmax_loss \
   --train_batch_size=1 \
   --eval_batch_size=1 \
   --learning_rate=1e-5 \
   --num_train_steps=500 \
   --num_eval_steps=10 \
   --checkpoint_secs=500 \
   --num_checkpoints=2

There are a lot of knobs to turn here. For the purposes of this example, the critical bits are that train_input_pattern and eval_input_pattern match the TFRecord files that you generated from your JSON dataset above, in Step 3. BERT_DIR should point to your BERT model (it’s helpful to have a few different, smaller, faster models quickly available during development, kept commented out as shown). OUTPUT_DIR is where your model will be saved; note that the example script deletes this directory each time it’s run.

Finally, list_size defines the maximum number of documents in each ranking problem (in the script above, this is set to 15). Increasing this increases the memory requirements of your model (see below for an estimate of memory requirements), so even for modest-sized lists you may find yourself frequently training with a batch size of 1. num_train_steps defines the number of training steps before completion, and this number is often quite high in the few examples I’ve seen (e.g. the example script for running the ANTIQUE dataset in the documentation lists 100000 training steps), so you should expect to need some serious GPU or TPU hours during training.

After training, you should have several trained models in your ${OUTPUT_DIR}/export/ folder: typically the most recent model, as well as the best model (evaluated in terms of lowest loss). Plenty of output will stream by during training, but the end will likely look something like this:
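Since list_size has to cover the longest document list in your data, it can help to measure that before training. The following is a minimal Python sketch that works on the JSON input format from Step 3; the function names max_list_size and queries_exceeding are hypothetical helpers, not part of TFR-BERT.

```python
def max_list_size(dataset):
    """Return the length of the longest document list in the dataset."""
    return max(
        (len(problem["documents"]) for problem in dataset["rankingProblems"]),
        default=0,
    )


def queries_exceeding(dataset, list_size):
    """Return the queries whose document list is longer than list_size."""
    return [
        problem["queryText"]
        for problem in dataset["rankingProblems"]
        if len(problem["documents"]) > list_size
    ]


toy = {
    "rankingProblems": [
        {"queryText": "Where can you buy cat food?",
         "documents": [
             {"relevance": 3, "docText": "The pet food store"},
             {"relevance": 1, "docText": "Bicycles have two wheels"},
         ]},
        {"queryText": "Where can you go swimming?",
         "documents": [
             {"relevance": 2, "docText": "At the lake"},
             {"relevance": 3, "docText": "In a swimming pool"},
             {"relevance": 1, "docText": "In a cloud"},
         ]},
    ]
}

longest = max_list_size(toy)
too_long = queries_exceeding(toy, list_size=2)
print(longest, too_long)
```

With the full toy data from Step 3 the longest list has 10 documents, which fits within the list_size of 15 used in the script.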

INFO:tensorflow:SavedModel written to: /tmp/tfr/model-petertoy-testtrain/export/best_model_by_loss/temp-1599526275/saved_model.pb
I0907 17:51:17.281934 139878410606400 builder_impl.py:426] SavedModel written to: /tmp/tfr/model-petertoy-testtrain/export/best_model_by_loss/temp-1599526275/saved_model.pb

INFO:tensorflow:Loss for final step: 12.917309.
I0907 17:51:17.350415 139878410606400 estimator.py:352] Loss for final step: 12.917309.

If you see something different, such as a bunch of errors (particularly if you’re working off the official examples with the ANTIQUE dataset), you will likely find out-of-memory (OOM) errors when sifting through the output. Ways of dealing with these include reducing the model size (e.g. to BERT-Mini or BERT-Tiny), reducing the list size, or (of course) finding a system with more GPU memory.

After running, the model files are exported into a numbered directory (the “version number” of the model) inside the export directory.

Step 5: Predictions: Set up a Tensorflow Serving prediction server

The training/evaluation procedure does not generate predictions, and there is no official example on how to perform this. Here we’ll set up a prediction mechanism.

Again, prefacing this by noting that I’m not a Tensorflow expert, there appear to be two methods of generating predictions — (1) directly, by loading the model and using the API to call a predict() method on the Estimator, or (2) indirectly, by using the Tensorflow Serving model server to load your model, then sending queries (and receiving prediction scores) over a socket. The latter seems much more common and supported, so that’s the approach described here.

There are a lot of tutorials for setting up a Tensorflow Serving model server, and they vary depending on your serving preference (CPU vs GPU) and whether you prefer the model server to be in a docker container. I preferred to get up and running quickly, and found this tutorial on installing a model server using apt-get on Ubuntu to be the simplest.

Once you set up Tensorflow Serving, assuming you chose the same method as I did (apt-get, no container), the model server can be started with a script such as this one:

#!/bin/bash
export MODEL_DIR=/tmp/tfr/model-petertoy-bertbase/export/latest_model/
tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=tfrbert \
  --model_base_path="${MODEL_DIR}"

A somewhat counter-intuitive step here is that MODEL_DIR should not point directly to the model files (i.e. the version-numbered folder), but rather to the parent folder that contains one or more version number folders that contain the actual model(s). If you run the training step above, an example of this directory structure can be found in ${OUTPUT_DIR}/export/latest_model/ , which is pointed to in the example script.
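A quick way to check that MODEL_DIR points at the right level is to look for version-numbered subdirectories inside it. The Python sketch below does that; find_model_versions is a hypothetical helper, and the demonstration uses a temporary directory with a made-up version folder rather than a real exported model.

```python
import os
import tempfile


def find_model_versions(model_base_path):
    """Return the numeric version subdirectories of model_base_path, sorted."""
    versions = []
    for entry in os.listdir(model_base_path):
        full_path = os.path.join(model_base_path, entry)
        if entry.isdigit() and os.path.isdir(full_path):
            versions.append(int(entry))
    return sorted(versions)


# Demonstration on a throwaway directory laid out like export/latest_model/.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "1599526275"))   # a version-numbered folder
    os.makedirs(os.path.join(base, "variables"))    # not a version folder
    demo_versions = find_model_versions(base)

print(demo_versions)
```

If the returned list is empty for your MODEL_DIR, you are most likely pointing one level too deep (at the version folder itself) or at the wrong directory.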

Step 6: Predictions: Finally generating predictions

With the model server up, we can now connect to it using the client-side prediction example, and generate predictions for our ranking problems. This code takes ranking problems in the JSON format as input, serves each one individually to the model server, and exports a ranked list with document scores added. This example run script shows predictions being generated for both the train and evaluation toy data:

#!/bin/bash
BERT_DIR="/home/peter/github/tensorflow/ranking/uncased_L-12_H-768_A-12_TF2"  && \
python tensorflow_ranking/extension/examples/tfrbert_client_predict_from_json.py \
    --vocab_file=${BERT_DIR}/vocab.txt \
    --sequence_length=128 \
    --input_file=TFRBertExample-train.json \
    --output_file=train.scoresOut.json \
    --do_lower_case 

python tensorflow_ranking/extension/examples/tfrbert_client_predict_from_json.py \
    --vocab_file=${BERT_DIR}/vocab.txt \
    --sequence_length=128 \
    --input_file=TFRBertExample-eval.json \
    --output_file=eval.scoresOut.json \
    --do_lower_case 

And here’s an example of the script running:

(tfranking-bert) peter@neutronium:~/github/peter-ranking/ranking$ ./tfrbert_predict_from_json.sh  
2020-09-07 23:33:13.007881: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
* Running with arguments: Namespace(do_lower_case=True, input_file='TFRBertExample-train.json', output_file='train.scoresOut.json', sequence_length=128, vocab_file='/home/peter/github/tensorflow/ranking/uncased_L-12_H-768_A-12_TF2/vocab.txt')
* Generating predictions for JSON ranking problems (filename: TFRBertExample-train.json)

Predicting 1 / 3 (33.33%)
Predicting 2 / 3 (66.67%)
Predicting 3 / 3 (100.00%)

* exportRankingOutput(): Exporting scores to JSON (train.scoresOut.json)
* Total execution time: 0:00:01.241

2020-09-07 23:33:15.467087: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
* Running with arguments: Namespace(do_lower_case=True, input_file='TFRBertExample-eval.json', output_file='eval.scoresOut.json', sequence_length=128, vocab_file='/home/peter/github/tensorflow/ranking/uncased_L-12_H-768_A-12_TF2/vocab.txt')
* Generating predictions for JSON ranking problems (filename: TFRBertExample-eval.json)

Predicting 1 / 3 (33.33%)
Predicting 2 / 3 (66.67%)
Predicting 3 / 3 (100.00%)

* exportRankingOutput(): Exporting scores to JSON (eval.scoresOut.json)
* Total execution time: 0:00:01.492

The JSON output adds a score for each document generated from the prediction model, and returns the document lists for each query sorted by these predicted scores. Here’s an example output file with predictions on the toy training set, which should of course be very good since we’re training and evaluating on the same data:

 {
    "rankingProblemsOutput": [
        {
            "queryText": "Where can you buy cat food?",
            "documents": [
                {
                    "relevance": 3,
                    "docText": "The grocery store",
                    "score": 0.44388255
                },
                {
                    "relevance": 3,
                    "docText": "The pet food store",
                    "score": 0.40264943
                },
                {
                    "relevance": 2,
                    "docText": "Cats eat cat food",
                    "score": -0.15662411
                },
                {
                    "relevance": 1,
                    "docText": "Bicycles have two wheels",
                    "score": -0.8667503
                }
            ]
        },
        {
            "queryText": "Where can you go swimming?",
            "documents": [
                {
                    "relevance": 3,
                    "docText": "In a swimming pool",
                    "score": 0.48036778
                },
                {
                    "relevance": 2,
                    "docText": "At the lake",
                    "score": -0.10280942
                },
                {
                    "relevance": 1,
                    "docText": "On a pile of rocks",
                    "score": -0.7149895
                },
                {
                    "relevance": 1,
                    "docText": "In a cloud",
                    "score": -0.7245462
                },
                {
                    "relevance": 1,
                    "docText": "In a garden",
                    "score": -0.75645095
                }
            ]
        },
        {
            "queryText": "What helps to build a campfire?",
            "documents": [
                {
                    "relevance": 3,
                    "docText": "Wood",
                    "score": -0.008856705
                },
                {
                    "relevance": 3,
                    "docText": "Match",
                    "score": -0.05323608
                },
                {
                    "relevance": 2,
                    "docText": "Tinder",
                    "score": -0.42123765
                },
                {
                    "relevance": 1,
                    "docText": "Pot and pan",
                    "score": -1.0901607
                },
                {
                    "relevance": 1,
                    "docText": "Rocks",
                    "score": -1.1488856
                },
                {
                    "relevance": 1,
                    "docText": "Rice",
                    "score": -1.1492822
                },
                {
                    "relevance": 1,
                    "docText": "Can of soup",
                    "score": -1.154706
                },
                {
                    "relevance": 1,
                    "docText": "Hot dog",
                    "score": -1.1791945
                },
                {
                    "relevance": 1,
                    "docText": "Potato",
                    "score": -1.2208372
                },
                {
                    "relevance": 1,
                    "docText": "Marshmallow",
                    "score": -1.2553226
                }
            ]
        }
    ]
}

Pre-generated predictions on the toy evaluation set are also available, which illustrate performance on unseen toy data:

 {
    "rankingProblemsOutput": [
        {
            "queryText": "Where can you buy dog food?",
            "documents": [
                {
                    "relevance": 3,
                    "docText": "The pet food store",
                    "score": 0.4045086
                },
                {
                    "relevance": 1,
                    "docText": "Cars are for driving place to place",
                    "score": -0.087259516
                },
                {
                    "relevance": 2,
                    "docText": "Dogs eat dog food",
                    "score": -0.14085332
                },
                {
                    "relevance": 1,
                    "docText": "Red strawberries grow on strawberry plants",
                    "score": -0.97852093
                }
            ]
        },
        {
            "queryText": "Where can you go rock climbing?",
            "documents": [
                {
                    "relevance": 3,
                    "docText": "In a climbing gym",
                    "score": 0.14498216
                },
                {
                    "relevance": 1,
                    "docText": "In a swimming pool",
                    "score": 0.14445521
                },
                {
                    "relevance": 1,
                    "docText": "At the lake",
                    "score": -0.2063725
                },
                {
                    "relevance": 3,
                    "docText": "On a mountain cliff",
                    "score": -0.23254125
                },
                {
                    "relevance": 1,
                    "docText": "On a pile of rocks",
                    "score": -0.59751
                },
                {
                    "relevance": 1,
                    "docText": "In a cloud",
                    "score": -0.73646384
                },
                {
                    "relevance": 1,
                    "docText": "In a garden",
                    "score": -0.77151793
                }
            ]
        },
        {
            "queryText": "What parts are most important for a computer?",
            "documents": [
                {
                    "relevance": 3,
                    "docText": "Hard drive",
                    "score": -0.16578771
                },
                {
                    "relevance": 3,
                    "docText": "CPU",
                    "score": -0.27822962
                },
                {
                    "relevance": 3,
                    "docText": "Keyboard",
                    "score": -0.34836236
                },
                {
                    "relevance": 2,
                    "docText": "Printer",
                    "score": -0.44954214
                },
                {
                    "relevance": 1,
                    "docText": "Soldering iron",
                    "score": -0.44981638
                },
                {
                    "relevance": 2,
                    "docText": "Scanner",
                    "score": -0.512414
                },
                {
                    "relevance": 1,
                    "docText": "Street",
                    "score": -0.68731207
                },
                {
                    "relevance": 1,
                    "docText": "Mouse",
                    "score": -0.696253
                },
                {
                    "relevance": 2,
                    "docText": "Monitor",
                    "score": -0.74959594
                },
                {
                    "relevance": 1,
                    "docText": "Lamp",
                    "score": -0.9156646
                },
                {
                    "relevance": 1,
                    "docText": "Tree",
                    "score": -0.9277943
                },
                {
                    "relevance": 1,
                    "docText": "Couch",
                    "score": -0.9586045
                },
                {
                    "relevance": 1,
                    "docText": "Electronics factory",
                    "score": -1.0005659
                }
            ]
        }
    ]
}

That’s it! Congratulations, if you’ve made it this far, then you’ve successfully taken the first steps in using TFR-BERT to generate ranked lists. You can now translate your own data into the JSON format, and write an importer to use the ranked lists in your downstream tasks.
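As a starting point for such an importer, here is a minimal Python sketch that reads the prediction output and returns the ranked documents for each query. The function name load_ranked_lists and the top_k parameter are my own; only the field names (rankingProblemsOutput, queryText, documents, docText, score) come from the output format shown above. The documents are re-sorted by score defensively, even though the output file is already sorted.

```python
import json


def load_ranked_lists(output_json_text, top_k=None):
    """Map each query to its (docText, score) pairs, best-scoring first."""
    data = json.loads(output_json_text)
    ranked = {}
    for problem in data["rankingProblemsOutput"]:
        docs = sorted(problem["documents"], key=lambda d: d["score"], reverse=True)
        if top_k is not None:
            docs = docs[:top_k]
        ranked[problem["queryText"]] = [(d["docText"], d["score"]) for d in docs]
    return ranked


# Abbreviated version of the toy training output shown earlier.
example_output = json.dumps({
    "rankingProblemsOutput": [
        {"queryText": "Where can you buy cat food?",
         "documents": [
             {"relevance": 3, "docText": "The grocery store", "score": 0.44388255},
             {"relevance": 3, "docText": "The pet food store", "score": 0.40264943},
             {"relevance": 2, "docText": "Cats eat cat food", "score": -0.15662411},
         ]}
    ]
})

top_two = load_ranked_lists(example_output, top_k=2)
print(top_two)
```

In practice you would read the text from a file such as eval.scoresOut.json instead of building it inline.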

Frequently Asked Questions

Q: What are the memory requirements of the model as they relate to list size?

As of writing, I’m not aware of any official requirements. I’ve empirically found that on a 24GB Titan RTX (batch size=1, sequence length=128 tokens, BERT-base-uncased), I’m able to fit a list of approximately 80 documents in GPU memory before receiving out of memory errors during training. Using this as a guide, we can assume:

  • Lists of up to 35 documents per 11GB card (e.g. RTX 2080 Ti)
  • Lists of up to 80 documents per 24GB card (e.g. Titan RTX, RTX 3090)
  • Lists of up to 105 documents per 32GB card (e.g. V100)
  • Lists of up to 130 documents per 40GB card (e.g. A100)
  • Lists of up to 425 documents per 4x32GB cards (e.g. 4xV100)
  • Lists of up to 530 documents per 4x40GB cards (e.g. 4xA100)
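The figures above are consistent with scaling the single 24GB measurement linearly with total GPU memory and rounding down to the nearest 5 documents. The Python sketch below reproduces them under that assumption. Treat it as a rough planning aid only: just the 24GB figure was measured, the linear scaling is an extrapolation, and the function name estimated_max_list_size is hypothetical.

```python
def estimated_max_list_size(total_gpu_memory_gb, ref_gb=24, ref_docs=80, round_to=5):
    """Extrapolate the largest list size that fits in GPU memory.

    Scales linearly from the reference measurement (80 documents on a 24GB
    card, batch size 1, sequence length 128, BERT-base-uncased) and rounds
    down to a multiple of round_to. Integer arithmetic avoids float drift.
    """
    docs = (total_gpu_memory_gb * ref_docs) // ref_gb
    return (docs // round_to) * round_to


for memory_gb in (11, 24, 32, 40, 4 * 32, 4 * 40):
    print(memory_gb, "GB ->", estimated_max_list_size(memory_gb), "documents")
```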

Q: What are the training time requirements of the model as they relate to list size?

There are no official benchmarks as of writing, but empirically the training time appears to scale linearly with list size. The above graph shows total train runtime (training, evaluation for one cycle, and model export) per 100 training cycles for toy data of different list sizes on a Titan RTX (batch size=1, sequence length=128 tokens), on differently sized BERT models, on a workstation with very fast M.2 SSD I/O. Using this, we can roughly gauge that:

  • BERT-Tiny is about 8-10X as fast as BERT-Base
  • BERT-Medium is about 2X as fast as BERT-Base

The Google TFR-BERT paper describes their MS MARCO experiment using Google TPU V3s and reranking lists of size 12 (with BERT-Large), so it’s likely that TFR-BERT’s computational requirements are somewhat steep, particularly for large datasets, model sizes, and list sizes. For reference in estimating possible model runtime, the TFR-BERT team mentioned their MS MARCO experiment was fine-tuned on only 5% of the corpus, and that each experiment took approximately 1 day to 1 week to complete. If this was on a 32-Core TPU V3 pod, we might ballpark the training cost (as of writing) to replicate these models as anywhere from $1k to $8k per experiment, and perhaps $20k-$50k to replicate the entire paper. It’s definitely interesting work on a large dataset with a lot of compute behind it.

Q: How do I generate evaluation scores (MAP, NDCG, etc) for a given train/dev/test set?

You may either wish to use the evaluation metric code built into Tensorflow, or write your own scorer that takes the ranked prediction output generated in this tutorial as input.
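If you go the write-your-own route, a scorer can be quite short. Below is a minimal Python sketch of NDCG and average precision computed from the gold relevance values in ranked order (i.e. the relevance fields of a query’s documents, read top to bottom from the prediction output). Two assumptions are mine rather than the tutorial’s: NDCG uses the common exponential gain (2^relevance - 1) with a log2 discount, and average precision treats a document as relevant when its relevance is at least 3.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)   # rank is 0-based, so +2
        for rank, rel in enumerate(relevances)
    )


def ndcg(ranked_relevances, k=None):
    """NDCG of a ranked list of gold relevances, optionally cut off at k."""
    ideal = sorted(ranked_relevances, reverse=True)
    if k is not None:
        ranked_relevances, ideal = ranked_relevances[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision(ranked_relevances, relevant_threshold=3):
    """Average precision, treating relevance >= threshold as relevant."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevances, start=1):
        if rel >= relevant_threshold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Gold relevances in predicted order, taken from the toy outputs above.
cat_food_train = [3, 3, 2, 1]               # "Where can you buy cat food?"
rock_climbing_eval = [3, 1, 1, 3, 1, 1, 1]  # "Where can you go rock climbing?"

print(ndcg(cat_food_train), average_precision(cat_food_train))
print(ndcg(rock_climbing_eval), average_precision(rock_climbing_eval))
```

Averaging average_precision over all queries in a set gives MAP; averaging ndcg gives mean NDCG.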

Acknowledgements

The TFR-BERT authors have been kind to answer a number of e-mails and GitHub issues in my process of figuring out how to create a working end-to-end example. Alexander Zagniotov has spent quite a bit of time with Tensorflow Ranking, and has kindly posted detailed GitHub issue responses and code snippets that helped create this tutorial.


Non-parametric Bootstrap Resampling Primer for NLP (in Scala)

In a recent batch of papers I reviewed for NAACL, I was surprised at how many claimed to be producing state-of-the-art models or introducing novel mechanisms to existing models, but failed to include any inferential statistics supporting those claims.  The story of these papers typically takes one of the following forms:

  1. The State-Of-The-Art: The paper claims to achieve state-of-the-art performance on a competitive task (i.e. achieving the best performance on a task leaderboard), but the difference between the current-top model and the proposed model was small (typically under 0.5%).
  2. The New Mechanism: The paper spends several pages developing a compelling narrative for how a complex feature or mechanism will substantially increase performance compared with existing methods, but shows small gains compared to existing methods, and/or does not provide a statistical comparison with those methods.
  3. The Unverified Ablation: A new feature is proposed, but the ablation study suggests the overall contribution to model performance is small, so small that a model without the new feature may not be significantly different from the full model.

Being an interdisciplinary researcher I’ve transitioned fields several times.  My graduate work was done in computational cognitive modeling in a psychology and neuroscience department, where most of the graduate courses one takes are statistics and research methods to learn how to run experiments in progressively more challenging experimental scenarios where the noise may be high, and it’s unclear if you’re measuring what you think you’re measuring.  As a field, natural language processing regularly publishes work in top venues without including inferential statistics, and this can place strong limits on what can be inferred from the claims of a research article, which only hurts the field as a whole, and adds noise to the science of understanding the phenomena that we invest a great deal of resources into studying.

This recent reviewing experience suggested to me that it might be helpful to write a short primer on a common and easy-to-implement statistical method (non-parametric bootstrap resampling), in the hopes that others find it useful.  The disclaimers are that I’m a natural language processing researcher, not a statistician, and consider myself an end-user rather than a researcher of statistical methods.  It’s always critical to understand your data and measurement scenarios, and how those interact with the assumptions of the statistics that you choose.

The Experiment Setup
The technique discussed here compares the performance of two systems — the baseline system, and the experimental system. In NLP, the common scenarios this tends to apply to are:

  1. The Better Model: You have created a new model that achieves higher accuracy on some task than the previous state-of-the-art system. Here, the previous state-of-the-art is the baseline system, and your new model is the experimental system.
  2. The Better Feature: You have created a new feature (or, group of features) and added this to an existing model. Here, the model without the feature (or group of features) is the baseline system, and the model with the features is the experimental system.
  3. The Ablation Study: You have a model that contains many different components, and are running an ablation study to determine the relative contribution of each component to the overall performance of the system. Ablation studies are commonly reported as comparing the performance of removing a single component of the system, versus the full system. Here, the baseline is the system with the component removed (which should be exhibiting lower performance than the full system). The experimental system is the full system.
  4. The Simpler Model Story: You have a model that achieves slightly lower than state-of-the-art performance, but using a model that’s simpler, more interpretable, faster to compute, or some other similar story. The performance of your new, simpler, faster, more interpretable system is so close to the state-of-the-art that you believe the two models are likely not significantly different. Here, the system with lower performance is the baseline, and the higher performing system is the experimental system. Unlike the other cases, here you are looking for the p-value to be large: traditionally greater than 0.05, and ideally much larger (i.e. not significantly different).

What you need: Scores from each model
To perform a statistical comparison, you will need two sets of scores — one from the baseline system, and one from the experimental system. The scores should be ordered — that is, kept in two parallel arrays, one containing the scores for the baseline system, one for the experimental system, where the index n for each array represents the baseline or experimental performance for the same piece of input data.

To make this a bit more concrete, in my field of question answering model performance is typically measured in terms of accuracy (the proportion of questions answered correctly by a model). If I was evaluating on a hypothetical set of 10 questions, the baseline score and experimental score arrays would each have 10 elements. BaselineScores(4) and ExperimentScores(4) would represent the scores of both models on the same question. An example of this is listed in the table below, where a score of “1” represents a given method answered the question correctly, and a score of “0” means the question was answered incorrectly. The average performance of the baseline model is 50% accuracy, and the experiment model is 60% accuracy.

Question #   Question Text      Baseline Score   Experimental Score   Difference   Desc.
0            Which of…          0                1                    +1           Exp. helps
1            In what year…      1                1                    0            Both correct
2            The largest…       1                0                    -1           Exp. hurts
3            Which city…        0                1                    +1           Exp. helps
4            In what country…   0                1                    +1           Exp. helps
5            When the…          1                0                    -1           Exp. hurts
6            How might…         0                1                    +1           Exp. helps
7            What is one…       1                1                    0            Both correct
8            Which person…      0                0                    0            Both incorrect
9            The amount of…     1                0                    -1           Exp. hurts
Average Accuracy                50%              60%

The intuition here is that the performance difference is large — 50% vs 60%, or a +10% gain, so whatever clever experimental system we developed is clearly superior to the baseline method. Unfortunately this intuition is incorrect, and there are many factors that affect whether an experimental model is statistically significant over a baseline model, including:

  • the sample size (here 10 samples, from 10 questions)
  • the total number of questions the experimental model benefits or helps over the baseline model (here 4)
  • the total number of questions that the experimental model hurts compared to the baseline model (here 3).

The overall difference between the number of questions helped and hurt (here, 4 – 3 or +1) is the difference score, expressed in number of samples. Here in this question answering example with 10 samples, we can convert this to the difference in accuracy between baseline and experimental models by dividing the difference by the number of samples, +1/10, to find the +10% performance benefit over baseline.
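
As a quick check of the arithmetic above, here is a minimal sketch that computes the helped, hurt, and net difference counts from the two parallel score arrays in the table. The object and function names (DifferenceScores, deltas) are my own for illustration, and are not part of the code later in this post.

```scala
object DifferenceScores {
  // Per-item difference score: experimental score minus baseline score.
  def deltas(baseline: Array[Double], experimental: Array[Double]): Array[Double] = {
    require(baseline.length == experimental.length, "score arrays must be parallel")
    baseline.zip(experimental).map { case (b, e) => e - b }
  }

  def main(args: Array[String]): Unit = {
    // Scores from the 10-question table above
    val baseline     = Array[Double](0, 1, 1, 0, 0, 1, 0, 1, 0, 1)
    val experimental = Array[Double](1, 1, 0, 1, 1, 0, 1, 1, 0, 0)

    val d = deltas(baseline, experimental)
    val numHelped = d.count(_ > 0)   // questions the experimental model helps
    val numHurt   = d.count(_ < 0)   // questions the experimental model hurts
    val net       = numHelped - numHurt

    println(s"helped=$numHelped hurt=$numHurt net=$net")
    println(f"accuracy difference: ${net.toDouble / d.length * 100}%.1f%%")
  }
}
```

For the table above this reports 4 helped, 3 hurt, and a net difference of +1, which is the +10% accuracy difference.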

To determine whether the baseline and experimental models are significantly different, we can use a variety of statistical tests. Here we’ll look at a popular statistic, non-parametric bootstrap resampling.

Non-parametric Bootstrap Resampling

The procedure for non-parametric bootstrap resampling is as follows:
For some large number N, which represents the number of bootstrap resamples being taken:

  1. Randomly draw K samples from the difference scores, with replacement (i.e. the same sample can be included in the distribution more than once). K should be equal to the original number of samples (i.e. in the example above with 10 questions, K should be equal to 10)
  2. Sum these randomly sampled difference scores, to determine whether (on the balance) this resampled distribution shows the experimental model as helping (sum > 0) or not helping (sum <= 0). Record this outcome (helped or hurt), regardless of the magnitude.
  3. Repeat this procedure N times, recording the proportion of runs that show the experimental model not helping versus the total number of runs. This proportion is the p-value.

Given that this resampling procedure can happen very quickly, and an accurate estimate of p is important, a typical value of N might be 10,000 samples. Unless the number of samples is large, this can usually be computed in a few seconds. To give a concrete example, if (with a given set of data) the experimental model helps in 9,900 resamples out of 10,000 resamples total, then the p-value would be (10,000 – 9,900) / 10,000 or p = 0.01 .
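
For readers who prefer notation, the procedure can be written compactly as follows. The symbols are my own shorthand rather than anything standard: d_1 … d_K are the difference scores, i_{b,k} is the index drawn (uniformly, with replacement) for position k of resample b, and S_b is the sum for resample b.

```latex
S_b = \sum_{k=1}^{K} d_{\,i_{b,k}}, \qquad
p = \frac{1}{N} \sum_{b=1}^{N} \mathbf{1}\left[ S_b \le 0 \right]

% Worked example from the text: the experimental model helps in 9,900 of N = 10,000 resamples
p = \frac{10{,}000 - 9{,}900}{10{,}000} = 0.01
```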

Bootstrap Resampling Code (in Scala)

The following Scala code implements the bootstrap resampling procedure, and includes two examples:

  1. Example 1: The 10-sample question answering data example from above, with the scores manually entered.
  2. Example 2: A function that generates artificial data to play with, and gain intuition about how differently sized data with different numbers of questions helped and hurt by the experimental model generate different p-values.

The normal use case is that you typically will run your baseline and experimental models, and save the scores for each sample (e.g. question) from each model to a separate text file in the same order (e.g. baseline.txt, experimental.txt).  You can then load these scores into separate arrays, and use the code provided below.
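
A minimal sketch of that loading step might look like the following. It assumes each file holds one numeric score per line, in the same order in both files; that file format, and the ScoreLoader and parseScores names, are assumptions for illustration.

```scala
import scala.io.Source

object ScoreLoader {
  // Parse one score per line, ignoring surrounding whitespace and blank lines.
  def parseScores(lines: Iterator[String]): Array[Double] =
    lines.map(_.trim).filter(_.nonEmpty).map(_.toDouble).toArray

  // Read all scores from a text file, closing the file afterwards.
  def loadScores(path: String): Array[Double] = {
    val src = Source.fromFile(path)
    try parseScores(src.getLines()) finally src.close()
  }

  // Hypothetical usage, with the file names from the text:
  //   val baseline     = ScoreLoader.loadScores("baseline.txt")
  //   val experimental = ScoreLoader.loadScores("experimental.txt")
  //   val p = BootstrapResampling.computeBootstrapResampling(baseline, experimental)
}
```

Note that the length check in computeBootstrapResampling only catches files with different numbers of scores. It can't detect two files that are the same length but in a different order, so it's worth keeping the ordering fixed when the scores are written out.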

import collection.mutable.ArrayBuffer

/**
  * Quick tool to compute statistical significance using bootstrap resampling
  * User: peter
  * Date: 9/18/13
  */
object BootstrapResampling {

  def computeBootstrapResampling(baseline:Array[Double], experimental:Array[Double], numSamples:Int = 10000): Double = {
    val rand = new java.util.Random()

    // Step 1: Check input sizes of baseline and experimental arrays are the same
    if (baseline.size != experimental.size) throw new RuntimeException("BootstrapResampling.computeBootstrapResampling(): ERROR: baseline and experimental have different lengths")
    val numDataPoints = baseline.size

    // Step 2: compute difference scores
    val deltas = new ArrayBuffer[Double]
    for (i <- 0 until baseline.size) {
      val delta = experimental(i) - baseline(i)
      deltas.append(delta)
    }

    // Step 3: Resample 'numSamples' times, computing the mean each time.  Store the results.
    val means = new ArrayBuffer[Double]
    for (i <- 0 until numSamples) {
      var mean: Double = 0.0
      for (j <- 0 until numDataPoints) {
        val randIdx = rand.nextInt(numDataPoints)
        mean = mean + deltas(randIdx)
      }
      mean = mean / numDataPoints
      means.append(mean)
    }

    // Step 4: Compute proportion of means at or below 0 (the null hypothesis)
    var proportionBelowZero: Double = 0.0
    for (i <- 0 until numSamples) {
      println ("bootstrap: mean: " + means(i))
      if (means(i) <= 0) proportionBelowZero += 1
    }
    proportionBelowZero = proportionBelowZero / numSamples

    // debug
    println("Proportion below zero: " + proportionBelowZero)

    // Return the p value
    proportionBelowZero
  }

  // Create artificial baseline and experimental arrays that contain a certain number of samples (numSamples),
  // with some number that are helped by the experimental model (numHelped), and some that are hurt (numHurt).
  def makeArtificialData(numSamples:Int, numHelped:Int, numHurt:Int):(Array[Double], Array[Double]) = {
    val baseline = Array.fill[Double](numSamples)(0.0)
    val experimental = Array.fill[Double](numSamples)(0.0)

    // Add helped
    for (i <- 0 until numHelped) {
      experimental(i) = 1     // Answered correct by the experimental model (but not the baseline model)
    }

    // Add hurt
    for (i <- numHelped until (numHelped + numHurt)) {
      baseline(i) = 1         // Answered correctly by the baseline model (but not the experimental model)
    }

    // Return
    (baseline, experimental)
  }


  def main(args:Array[String]): Unit = {
    // Example 1: Manually entering data
    val baseline = Array[Double](0, 1, 1, 0, 0, 1, 0, 1, 0, 1)
    val experimental = Array[Double](1, 1, 0, 1, 1, 0, 1, 1, 0, 0)
    val p = computeBootstrapResampling(baseline, experimental)
    println ("p-value: " + p)

    // Example 2: Generating artificial data
    val (baseline1, experimental1) = makeArtificialData(numSamples = 500, numHelped = 10, numHurt = 7)
    val p1 = computeBootstrapResampling(baseline1, experimental1)
    println ("p-value: " + p1)
  }

}

And the example output:

Example 1:

bootstrap: mean: -0.2
bootstrap: mean: -0.6
bootstrap: mean: -0.1
bootstrap: mean: -0.1
bootstrap: mean: -0.3
bootstrap: mean: -0.3
bootstrap: mean: -0.2
bootstrap: mean: 0.4
bootstrap: mean: 0.1
bootstrap: mean: 0.3
bootstrap: mean: 0.1
bootstrap: mean: -0.4
Proportion below zero: 0.4316
p-value: 0.4316

The code includes debug output that displays both the mean of the difference scores for each bootstrap resample iteration (allowing you to visually see the proportion above or below zero), as well as a summary display showing the overall proportion of bootstrap resampling iterations at or below zero.

This shows that in spite of there being a +10% performance difference between the two question answering models in the toy example, about 43% of the bootstrap resamples show the experimental model not helping, so the difference could easily be due to random chance rather than a real difference between the models.  This is likely owing both to the small number of samples and to the relative proportion of questions helped vs hurt (many questions are aided by the experimental model, but a nearly equal number are hurt by it).  Here, we would say that the performance of the two models is not significantly different (p = 0.43, well above the 0.05 threshold).

Additional Examples of the Behavior of Non-parametric Bootstrap Resampling

Using the code below, we can generate artificial data that investigates the expected p-values for specific effect sizes paired with datasets of different sizes.  We can also investigate the effect of having a larger number of samples helped/hurt (i.e. changed) compared to the baseline, while controlling for the same effect size, to help gain an intuition for how this statistic behaves under specific scenarios.  Here is the code example, added to main(), for a given effect size and sample size:

// Example 3: Generating artificial data
val effectSize:Double = 2.0
val sampleSize:Double = 500

val pValues = new ArrayBuffer[(Double, Double)]
for (i <- 0 until 20) {
  val numHelped = (((i.toDouble/100.0) * sampleSize) + ((effectSize/100.0) * sampleSize)).toInt
  val numHurt = ((i.toDouble/100.0) * sampleSize).toInt

  val (baseline2, experimental2) = makeArtificialData(numSamples = sampleSize.toInt, numHelped, numHurt)
  val p2 = computeBootstrapResampling(baseline2, experimental2)
  println("p-value: " + p2)

  pValues.append( (i, p2) )
}

println ("")
println ("Effect Size: " + effectSize)
println ("Sample Size: " + sampleSize)
println ("pValues:")
for (pair <- pValues) {
  println (pair._1 + "\t" + pair._2.formatted("%3.5f"))
}

 

Example: The Small Evaluation Dataset (100 Samples)

Having an evaluation set (i.e. a development or test set) that contains only 100 items is not uncommon in a number of situations:

  • New Task: You have started working on a new task, and are doing exploratory analyses with little data to work with.  (I remember when we started investigating question answering on standardized science exams and had collected few exams, our evaluation set was only 68 questions.  Now it’s approximately 3,000).
  • Low Resource Domain: You are operating in a domain with few readily-available resources, and you are not able to easily collect more.
  • Existing Dataset: You are comparing against an existing dataset that has small development and/or test folds.  For example, the TREC QA sets can have as few as 65 questions for evaluating, meaning answering a single additional question correctly increases QA accuracy by 1.5% (!).
  • Expensive Collection: You are collecting data, but have a task where data is expensive to collect.  For example, you need to collect or transcribe sensitive data in the medical domain.
  • Expensive Annotation: Your task requires annotating data, and this annotation is particularly difficult or time-consuming, limiting the amount of annotation you are able to generate for training or evaluation.

Here with 100 samples, an effect size of 1% means that (for example) an experimental question answering system would answer only 1 more question correctly than the baseline system. As the plot below shows, with such small evaluation sets it’s very hard to detect that an experimental system is significantly different from a baseline system without very large effect sizes, and where few samples are hurt by the experimental system. Here, even for an effect size of 2% (e.g. a baseline accuracy of 70%, and an experimental accuracy of 72%), and a best-case scenario where the experimental system only helps (and never hurts) samples in the dataset, the expected p-value is still approximately 0.13.  Even with an effect size of 5%, if the proportion of samples hurt by the experimental system exceeds 2% (i.e. the experimental system answers 7 more questions correctly than the baseline system, but also answers 2 questions incorrectly that the baseline answered correctly), then the p-value begins to exceed the generally accepted threshold of p = 0.05.
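
The 0.13 figure for the best-case scenario can be sanity-checked without running any resampling. If the experimental system helps h of n samples and hurts none, every difference score is either 0 or +1, so a resample only fails to show a benefit (sum <= 0) when it happens to draw none of the h helped samples. Each draw misses them with probability (1 - h/n), and there are n draws, so the expected p-value is (1 - h/n)^n. Here is a minimal sketch of that calculation; the OnlyHelpsCheck and pOnlyHelps names are mine, and the formula only applies to this no-samples-hurt case.

```scala
object OnlyHelpsCheck {
  // Probability that a bootstrap resample of size n draws none of the `helped`
  // samples, when all of the other (n - helped) difference scores are 0.
  def pOnlyHelps(n: Int, helped: Int): Double =
    math.pow(1.0 - helped.toDouble / n, n)

  def main(args: Array[String]): Unit = {
    // 100 samples, effect size of 2% (2 samples helped, none hurt)
    println(f"expected p-value: ${pOnlyHelps(100, 2)}%.4f")   // about 0.13
  }
}
```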

Cautionary Tale:  With small datasets where it can be challenging to achieve statistical significance, after a promising experiment that doesn’t quite meet the threshold it’s often tempting to become overconfident in p-values “trending towards significance” (<0.10 to 0.20) but that don’t quite meet the threshold.  Many folks have experience with an experiment such as this, where a large effect that seemed promising disappears after more investigation.  It’s important to stress that in the business of doing science, it’s critical to find this out sooner rather than later, so that you can make your mistakes cheaply, and not invest time and resources in efforts that ultimately are unlikely to pan out.  For example, in our early days of science exam question answering (circa 2013) when we had collected only 68 questions for our test set, a question answering model that included word embeddings (a now popular technique, but up-and-coming in 2013) showed a large +5% gain in accuracy over a model without embeddings.  This was extremely promising, and I personally spent a great deal of time pursuing this only to find after great effort that the effect ultimately disappeared with more data, and that word embedding models were notoriously challenging to train for domains with little data (like elementary science).  It ended up being years later before methods of training word embeddings for this domain were demonstrated, but not before I spent quite a bit of time on a promising-seeming effect that ultimately was due to small sample sizes rather than a genuine benefit of the model.  One of my mentors in graduate school used to say that science is about making your mistakes cheaply, and he would often quote Nobel laureate Richard Feynman, who famously said (somewhat paraphrased) that “science is about not fooling yourself, when you’re the easiest one to fool”.  
Since my experience, I always try to err on the side of collecting more data when viable, since a week collecting more data is broadly more useful than a week wasted on an effect that isn’t there.

Example: Evaluation Set Size of 500 samples

Expected p-values for an evaluation set of 500 samples are shown below.  Here, it’s generally possible to detect that groups are significantly different with high confidence if the difference between baseline and experimental groups is greater than +5%.  Differences of +2% may also fall below the p<0.05 threshold if the experimental model doesn’t hurt too many samples (i.e. incorrectly answer questions that the baseline model answers correctly).

 

Example: Evaluation Set Size of 2,000 samples

At an evaluation set size of 2,000 samples, it’s generally possible to detect that an experimental model is significantly better than a baseline model when the effect size is approximately +2% or greater.  It’s also possible to detect this difference if the effect size is on the order of 1%, as long as the experiment model doesn’t also hurt questions that the baseline model was answering correctly.

 

Example: Evaluation Set Size of 10,000 samples

Large-scale datasets are becoming increasingly common due to crowdsourcing and large-scale data collection — for example, in question answering, the MS Machine Reading Comprehension (MS MARCO) dataset now contains over 1 million (!) questions.  These large datasets enable even very small differences between baseline and experimental models to be detected with high confidence.  With a hypothetical evaluation set of 10,000 samples, even groups with very small differences in performance (i.e. greater than +0.50%) may be significantly different, unless the experimental model is also hurting a large number of questions that the baseline model is answering correctly.

 

Summary

Non-parametric bootstrap resampling can provide an easy-to-implement method of testing whether two groups (a baseline model and an experimental model) are significantly different in natural language processing experiments.  The sample data provided help build an intuition for what p-values one might expect for specific effect sizes (here, in terms of accuracy) when using evaluation sets of specific sizes.  Experimental models rarely only help the performance of each sample on large datasets, and the proportion of questions helped vs hurt can substantially reduce the likelihood that the two groups are significantly different.


Papers in Plain Language: What’s in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams (COLING 2016)

One of my extracurricular hobbies is science pedagogy, largely in the context of making physical sensing devices that make physical science easier to understand — like my open source science tricorder project, or the open source computed tomography scanner.  I’ve been very much interested in starting a series of posts on explaining my recent past (and, future) research papers in plain language, both to make them more accessible for general readers, but also to make short, relatively quick, and equally accessible reads for my colleagues who (if they’re anything like me) have a long list of papers they’d like to read, and many demands on their precious research time.

With that, the inaugural post of Papers in Plain Language:

Papers in Plain Language: What’s in an Explanation? Characterizing Knowledge and Inference Requirements for Elementary Science Exams.  (Peter Jansen, Niranjan Balasubramanian, Mihai Surdeanu, and Peter Clark). Published at COLING 2016.  The PDF is available here, and the associated data for the paper is available here.

Context: Why study science exams?

My primary research interests are in studying how we can teach computers enough language and inference to perform fairly complicated inference tasks (like solving science questions), and to do this in ways that are explainable, that is, that generate human-readable explanations for why the answers are correct.  Unlike most AI researchers who are primarily interested in figuring out ways to get computers to perform adult-level tasks, I’m biased to believe that a better way of achieving cognitive-like abilities in artificial intelligence is to model how humans learn from the earliest ages.  That’s why I went to graduate school to learn how to computationally model language development and knowledge representation developmentally, a term from cognitive science meaning in the way that children learn.

Elementary science exams (to me) have this great capacity to study how we can teach computers to perform complex inference, because both the language understanding and complex inference tasks tend to be much simpler than what we see in adult-level tasks.  That means we’re better able to distill the essence of the inference task, like peeling back the layers of an onion to get closer to looking at how the problem works, how well we’re doing currently, and what we don’t yet understand that would help us do better.  That’s essentially what this paper is about — a detailed analysis of the knowledge and inference requirements to successfully answer (and explain the reasoning behind solving) hundreds of elementary science questions, using both existing methodologies (the “top-down” methodologies we’ll examine in a minute), and a new methodology — looking at the same questions “bottom-up” by building a large number of real and detailed explanations, and examining them to learn more about the nuts and bolts of these inference problems.  We show that these old and new methods produce radically different results, and that you should definitely use the explanation-centered method if you’re able to devote the time to the analysis.

Top-Down Analysis

Before we (humans) can solve a question, we have to understand what the question is asking.  Most of the time with automated methods of question answering, we do almost the opposite (we apply the same method of solving to all questions), because the science of understanding what a question is asking is still being developed.  One of my favorite papers (A Study of the Knowledge Base Requirements for Passing An Elementary Science Test, Clark et al., AKBC 2013) works on this problem by analyzing about 50 elementary science questions from the New York Regents 4th grade science exam, and uncovering 7 broad classes that those questions could be, as well as what proportion of questions belong to each class.  Their figure with these classes and proportions is shown below:

Here we see the 7 broad categories of questions that Clark et al. discovered: is-a questions, definition questions, questions that ask about properties of objects, questions that test examples of situations, questions that address causality or a knowledge of processes, and the last (and, arguably most complex), domain specific model questions.  But what do these mean?  Clark et al.’s AKBC2013 paper is full of examples, easily digestible, and highly worth a read.  Here are a few of those examples.

  • Taxonomic (is-a) questions test a knowledge that X is a kind of Y.  For example,  Q: Sleet, rain, snow, and hail are forms of: (A) erosion (B) evaporation (C) groundwater (D) precipitation
  • Definition questions test a knowledge of the definition of concepts. For example,  Q: The movement of soil by wind or water is called (A) condensation (B) evaporation (C) erosion (D) friction
  • Property questions test knowledge that might be found in property databases.  For example, part-of knowledge is tested by the following question,  Q: Which part of a plant produces the seeds? (A) flower (B) leaves (C) stem (D) roots

The above question types can be thought of as “retrieval types”, in that they could be successfully answered by looking up the knowledge in a ready-made knowledge base.  A more interesting (to me) subset of questions are those that appear to require forms of general or model-based inference, as in the examples below:

  • Examples of situations, such as  Q: Which example describes an organism taking in nutrients? (A) A dog burying a bone (B) A girl eating an apple (C) An insect crawling on a leaf (D) A boy planting tomatoes in the garden
  • Causality, as in  Q: What is one way to change water from a liquid to a solid? (A) decrease the temperature (B) increase the temperature (C) decrease the mass (D) increase the mass
  • Simple Processes, which often appear related to causality questions, such as  Q: One way animals usually respond to a sudden drop in temperature is by (A) sweating (B) shivering (C) blinking (D) salivating
  • Domain Specific Models, that require reasoning over domain-specific representations and machinery to solve.  For example,  Q: When a baby shakes a rattle, it makes a noise. Which form of energy was changed to sound energy? (A) electrical (B) light (C) mechanical (D) heat

A Larger Top-Down Analysis

This analysis is really fascinating, because it gives a set of specific question types in the dataset, each of which requires specific kinds of knowledge and specific solving methods.  In a way it helps serve as the beginnings of a recipe book for the problem of building question answering systems for this dataset: before reading the paper you’re likely in the paradigm of applying the same model to each question, but after reading Clark et al.’s analysis you can begin to plot out how you might make specific solving machinery for each of these 7 question types, and how you need to get to work collecting or building specific knowledge resources (like a large taxonomy, or a part-of database, or a database that represents causal relationships) before you can make headway on certain classes of problems.

I wanted to do exactly that — and also, to automatically detect which question type a given question was, so that I could go about building a system to intelligently solve them.  So I set out to perform a larger analysis on more questions, and make a much larger set of training data for this question classification task.  That’s when things started to get especially interesting. 

Above are the beginnings of that analysis, essentially repeating the Clark et al. (AKBC 2013) analysis, but on all 432 training questions from the AI2 Elementary questions set (a smaller subset of what is now known as ARC, the AI2 Reasoning Challenge), including 3rd to 5th grade questions drawn from standardized exams in 14 US states.  By and large the proportions here are very similar to the original analysis (except that there appear to be more domain-specific model questions in the larger set).  The really interesting part was the labeling process itself.

The trouble with top-down categories: They’re perfectly obvious, except when they’re not. 

The New York Regents standardized exam is known for being a very well constructed exam, and each of the problems appears to be carefully designed to test very specific student knowledge and inference capacities.  When looking at real questions from different exams, things tend to get a little murkier.  For example, consider the following question:

Q: Many bacteria are decomposer organisms.  Which of the following statements best describes how these bacteria help make soil more fertile?

  • (A) The bacteria break down water into food.
  • (B) The bacteria change sunlight into minerals.
  • (C) The bacteria combine with sand to form rocks.
  • (D) The bacteria break down dead plant and animal matter (correct answer).

Now ask yourself: Which of the 7 knowledge types does this question fit into?

Thankfully, the answer is perfectly obvious.  Unfortunately, it’s obviously one category to one person, and obviously a different category to the next person.

The question might be labeled causal, because it asks “how … bacteria help make [cause] soil [to become] more fertile?“.  But similarly, it might be thought of as a process question, because decomposers are a stage in the life cycle process, which is the curriculum topic this question would be found under.  Except that decomposers are part of an ecosystem model of recycling nutrients back into the soil.  But the question might more simply be solved by just looking up the definition of the word decomposer, which dictionary.com describes as “an organism, usually a bacterium or fungus, that breaks down the cells of dead plants and animals into simpler substances”.  The high number of overlapping words between this definition and the question and correct answer would make it particularly amenable to simple word matching methods.

The problem, unfortunately, is that all of these labels are essentially equally correct.  They’re each different ways of solving the problem, using different solving algorithms.

Changing the problem: Building explanations that answer questions and explain the reasoning behind the answer.

When reaching a stumbling block like this (and the depressingly low performance of the question classification system I built using these labels, which wasn’t reported in the paper), it’s sometimes helpful to take a step back and see if the analysis can be reframed to be more mechanical and less open to interpretation.  Ultimately we’re not studying science exams so that we can make the best multiple choice science exam solver on Earth; we’re studying them because we’re interested in taking apart complex inference problems to understand how they work, how we can build question answering systems that automatically build explanations for their answers, and what kinds of knowledge and inference capacities we would need to make this happen.  So we decided to flip the analysis upside down: instead of looking at a question and trying to figure out which of the 7 AKBC2013 question types it might be, we would instead build very detailed explanations for each question, and perform our analyses on those explanations instead.

Discretizing explanations so they can be taken apart and analyzed

There are many challenges with building a corpus of explanations in this context.  One of the central issues is that we’d like to analyze these explanations for their knowledge and inference requirements, so they need to be amenable to automated analysis in some way.  In order to make this happen, we “discretized” explanations into sets of (roughly) simple (atomic) facts about the world, and some collection of these facts together would then form the explanation for why the answer to a given question is correct.  We further imposed more constraints, to make an automated analysis of the knowledge requirements possible:

  • Simple sentences: Each “fact” in an explanation was expressed as a short, simple sentence in elementary grade-appropriate language.  We tried to simplify sentences from existing resources for solving the exam (such as study guides, or simple wikipedia), but much of the time these weren’t available and we had to manually author simple sentences to suit.
  • Reuse: So that we could keep track of how often the same knowledge was used across different questions, we added the requirement that if the same knowledge (fact) was used in the explanations to multiple questions, it had to be written in the exact same way, so a simple string matching algorithm could pick up each of the questions a given fact was used in.  This made the annotation much more challenging to author, but is critical for the explanation-centered analysis.  (In later papers, such as the WorldTree Explanation Corpus (LREC 2018), we developed tools to make this go much more quickly).
  • Explicit linking between sentences in an explanation: To investigate how knowledge connects in an explanation, we also added the requirement that each fact in an explanation must explicitly “connect” to other facts in the explanation, or to the question or answer text.  We did this primarily because we’re interested in studying how to combine multiple facts to perform inference, a problem sometimes called information aggregation or multi-hop inference, and a major barrier to solving inference problems.  Here, in order to add this explicit linking, each sentence/fact in an explanation must share at least one word in common with the question, answer, or another sentence in the explanation.  This allows the explanations to act as “explanation graphs” connected on shared words, which is the foundation upon which the more recent (and much larger) WorldTree Explanation Graph Corpus for Multi-Hop Inference (LREC 2018) was built.
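
To make the explicit linking constraint concrete, here is a minimal sketch of how one might check it automatically. This is not the tooling used for the paper: the naive tokenizer, the small stop-word list, and the ExplanationLinking and unlinkedFacts names are all assumptions of mine, and real annotation presumably needs more care (for example, matching different forms of the same word).

```scala
object ExplanationLinking {
  // Tiny stop-word list (an assumption for this sketch, not from the paper),
  // so that words like "a" or "the" don't count as a meaningful link.
  val stopWords: Set[String] =
    Set("a", "an", "the", "is", "are", "of", "to", "and", "in", "by")

  // Naive tokenizer: lowercase, split on anything that isn't a letter.
  def words(text: String): Set[String] =
    text.toLowerCase.split("[^a-z]+").filter(w => w.nonEmpty && !stopWords(w)).toSet

  // Returns the facts that share no word with the question, the answer,
  // or any other fact in the same explanation.
  def unlinkedFacts(question: String, answer: String, facts: Seq[String]): Seq[String] = {
    val qaWords = words(question) ++ words(answer)
    facts.indices.filter { i =>
      val otherWords = facts.indices.filter(_ != i).flatMap(j => words(facts(j))).toSet
      words(facts(i)).intersect(qaWords ++ otherWords).isEmpty
    }.map(i => facts(i))
  }
}
```

For example, given the decomposer question from earlier and two made-up facts (written for this illustration, not taken from the corpus), “decomposers break down dead organisms” and “sunlight is a kind of energy”, the first fact connects to the question and answer on shared words, while the second shares no words with anything else and would be flagged as unlinked.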

Some examples of these simple explanations are shown below:

Here we can see that each simple sentence in an explanation appears to embody one kind of knowledge.  I find it easier to think about these explanations visually (as explanation graphs), with the knowledge types labeled, and overlap between explanation sentences explicitly labeled.  Here’s such an example, from a slide used in the talk:

We performed a detailed examination of the explanations for 212 of these questions (approximately 1000 explanation sentences) by annotating the different kinds of relations we observed, and these are summarized in the table below:

 

Fine-Grained Explanatory Knowledge and Inference Methods

The table above provides a fine-grained list of knowledge and inference requirements to build detailed explanations for science exam questions, and is (to me) the most interesting part of the paper.  Just like the Clark et al. AKBC2013 paper, I read this table like a recipe book — if I want to be able to make a question answering system capable of answering and explaining the answers to elementary science questions, three quarters of which require some form of complex inference, these are the kinds of knowledge I have to have in my system, as well as a hint at the inference capacities required for combining that knowledge together.

There are a few parts of the table that are striking to me, that I’ll highlight:

  • Proportions are wildly different when you look at an explanation-centered analysis versus a top-down analysis:  The first relation in this table is taxonomic (kind-of) knowledge, which is found in 83% of explanations.  But when we performed the analysis top-down (the pie-chart above), we found only about 2% of questions were testing this same taxonomic knowledge.  That’s because the top-down analysis obscures many of the details of knowledge and inference requirements: it’s relatively easy to conjecture about how a question might be solved (the top-down method), but it’s also very misleading.  Requiring an annotator to specify all of the knowledge required to answer a question and explain its answer forces a rather detailed exposition of knowledge, and that’s where the most informative content appears to be.  Put another way: Using the top-down method, one might believe taxonomic knowledge unimportant, because it’s only central to answering 2% of questions.  In reality, taxonomic knowledge is the most prevalent form of knowledge used on this explanation-centered inference task, and likely absolutely critical to having a complete and functioning inference system.
  • 21 fine-grained types: While many of the 7 AKBC types are easily visible, performing the analysis in this way, we’re able to identify much more fine-grained knowledge and inference types, as well as types (like coupled relationships, requirements, and transfers) that remained hidden in the earlier analysis.
  • N-ary relations: Relations are most often extracted from text and used in question answering as triples — sets of 2-argument (X – relation – Y) tuples, as in X-is a kind of-Y (such as that a cat is a kind of mammal).  What we observe here is that many relation types naturally have more arguments, as in the 5-argument “change” relation “melting (arg: who) changes a substance (arg: what) from a solid (arg: from) to a liquid (arg: to) by adding heat energy (arg: method)”, a sentence broadly applicable in questions about changes of states of matter.
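As a rough illustration of the difference between a triple and an n-ary relation, the sketch below represents the taxonomic example as a plain 3-tuple and the 5-argument “change” relation as a small record with named argument roles. The `NaryRelation` class, the `to_triples` helper, and the event identifier are hypothetical names I am introducing here; they are not a format from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A classic 2-argument relation: (X, relation, Y).
Triple = Tuple[str, str, str]
taxonomic: Triple = ("cat", "is a kind of", "mammal")

@dataclass
class NaryRelation:
    """A relation with any number of named argument roles (hypothetical structure)."""
    relation: str
    args: Dict[str, str]

# The 5-argument "change" relation from the melting sentence.
melting = NaryRelation(
    relation="change",
    args={
        "who": "melting",
        "what": "a substance",
        "from": "a solid",
        "to": "a liquid",
        "method": "adding heat energy",
    },
)

def to_triples(rel: NaryRelation, event_id: str) -> List[Triple]:
    """Flatten an n-ary relation into triples by introducing an event node.

    Each argument becomes (event_id, role, value), so the relation can be
    stored in a triple-based system at the cost of one extra node.
    """
    triples = [(event_id, "relation", rel.relation)]
    triples.extend((event_id, role, value) for role, value in rel.args.items())
    return triples

melting_triples = to_triples(melting, "event1")
```

The flattening shows why triples alone are awkward here: a single sentence about melting turns into six linked triples that only make sense when read together.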

 

This analysis is interesting, but can we use it to show some question answering models are solving more complex questions than others?

One of the natural questions one has when spending many months working on developing a new question answering model is:

“Is my ‘inference’ model actually answering more of the complex questions correctly, or is it simply doing better than the last model by answering more of the simpler questions correctly?”

Unfortunately, it’s traditionally been very challenging to answer this question, especially quickly and in an automated fashion.  Here we can use the two types of annotation we’ve generated to compare a “simple” question answering system (a tf.idf model that answers questions by looking at term frequency) and a particular inference solver, the TableILP solver by Khashabi et al. (2016).  We can look at question answering accuracy broken down using two methods: one top-down, using the 7 AKBC2013 question types, and the other bottom-up, using the 21 fine-grained knowledge types from the detailed explanations to questions.

QA Performance: Top-down

Here, the L2R model is the simpler question answering system that tries to answer questions by looking up a pre-made answer in a database.  The ILP model is the TableILP inference model by Khashabi et al. (2016).  The simpler model answers about 43% of questions correctly, whereas the ILP inference model answers about 54% of questions correctly (a gain of 11 percentage points over the simpler model, using the same knowledge).  When looking at the top-down analysis (middle columns), we see that this performance gain isn’t simply from the ILP inference model answering more of the simpler (“retrieval”) questions correctly: it’s making substantial gains (up to +22%) on 3 out of the 4 complex inference type questions as well.  This gives us a method of validating that the inference method is doing what it claims to do, which is answering more of the harder, inference-type questions.

QA Performance: Bottom-up

Except that I just spent quite a bit of time convincing you that the top-down analysis is much less informative than the bottom-up analysis, so let’s have a look. There’s a lot going on in this table, so let’s spend a moment to orient.  The rows represent specific knowledge types identified in the fine-grained analysis.  The columns represent specific models (L2R, the simpler fact-retrieval model, and ILP, the inference model) paired with specific knowledge resources (a “corpus” of study guides and simple wikipedia, or the “tablestore”, a collection of science-relevant facts).  (Note that Stitch is another inference algorithm, but for simplicity I’ll focus on ILP — please see the paper for more details).  The numbers in the cells of this table represent the accuracy of a given question answering system on questions whose explanation requires a given knowledge type.  For example, in the first row, we see that for questions that require Taxonomic knowledge, the L2R model using the Tablestore knowledge resource answers 46% of these questions correctly.  The TableILP model answers 56% of these same questions correctly, meaning the inference model shows moderate gains for these taxonomic questions.   The easiest way to read this table is to look at the “Inference Advantage” column — if it’s pointing towards ILP, it means the inference model helped more on questions requiring a given knowledge type.

The take-away summary points from this table are:

  • The inference model provides a substantial performance boost to the questions requiring inference knowledge.  The highest gains are found in the “Inference Supporting” knowledge types, but “Complex Inference” types also show substantial gains.
  • While relative gains are high, absolute performance is still low in many areas.  The inference model helps almost all questions, but some questions requiring challenging kinds of inference still have very low performance.  For example, for questions requiring coupled relationships, the simpler L2R model answers only 28% of these correctly, slightly higher than chance (25% on a 4-choice multiple choice exam).  The ILP inference model increases performance on these questions to 44%, which is much higher, but still one of the lowest performances.  In contrast, questions requiring some other types of knowledge (examples, definitions, durations) achieve between 63% and 70% accuracy.  This gap highlights the relative difficulty of the harder question types, and provides a solid area to target in future work to boost performance.
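The bottom-up breakdown described above can be sketched in a few lines. In this sketch a question counts toward every knowledge type that appears in its explanation, and the “inference advantage” is simply the difference in accuracy between the two models for each type. The data layout, function names, and the four toy questions are assumptions made up for illustration; they are not the paper’s data or evaluation code:

```python
from collections import defaultdict

def accuracy_by_knowledge_type(questions, model):
    """Accuracy of `model` on the questions whose explanation uses each knowledge type."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for q in questions:
        for ktype in q["knowledge_types"]:
            totals[ktype] += 1
            hits[ktype] += int(q["correct"][model])
    return {ktype: hits[ktype] / totals[ktype] for ktype in totals}

def inference_advantage(questions, baseline, inference):
    """Per-type accuracy difference (inference model minus baseline model)."""
    base = accuracy_by_knowledge_type(questions, baseline)
    inf = accuracy_by_knowledge_type(questions, inference)
    return {ktype: inf[ktype] - base[ktype] for ktype in base}

# Toy, made-up results for four questions (not numbers from the paper).
questions = [
    {"knowledge_types": {"taxonomic", "coupled"}, "correct": {"L2R": True, "ILP": True}},
    {"knowledge_types": {"taxonomic"}, "correct": {"L2R": False, "ILP": True}},
    {"knowledge_types": {"coupled"}, "correct": {"L2R": False, "ILP": False}},
    {"knowledge_types": {"taxonomic", "definition"}, "correct": {"L2R": True, "ILP": True}},
]

l2r_accuracy = accuracy_by_knowledge_type(questions, "L2R")
ilp_accuracy = accuracy_by_knowledge_type(questions, "ILP")
advantage = inference_advantage(questions, "L2R", "ILP")
```

A positive value in `advantage` means the inference model helped on questions requiring that knowledge type. Because one question usually involves several knowledge types, the per-type accuracies are not independent and do not average to the overall accuracy.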

 

The Overall Take-away: Inference is challenging, but we can instrument it using detailed, explanation-centered analyses

Answering (and explaining the answers to) elementary science questions is easy for most 9-year-olds, but it is still largely beyond the capacity of current automated methods.  Here we show that top-down methods for analyzing the knowledge and inference requirements have many challenges, limitations, and inaccuracies, and that bottom-up explanation-centered methods can provide more detailed, fine-grained analyses.  This data can also be used to instrument question answering models to determine the kinds of knowledge and inference that a model performs well at, as well as identify particularly challenging knowledge and inference requirements that can be targeted to increase overall question answering performance.

Context in terms of contributions to subsequent work

This paper was critical to identifying the central kinds of knowledge and inference in standardized science exams, and has been extensively used in our subsequent work — most notably on the WorldTree Explanation Graph Corpus (LREC 2018), whose knowledge base contains many of the same types as this COLING2016 corpus, but extended to approximately 60 very fine-grained types.  The paper also piloted the explanation construction analysis methodology, which has been used and refined in subsequent work.  In the context of multi-hop inference, this paper also first quantified the average number of facts that need to be combined to build an explanation to an elementary science question (4 facts/question), where subsequent analyses on the larger WorldTree explanation corpus refined this to an average of 6 facts per explanation, when building explanations targeted at the explanatory detail required to be meaningful to a young child.

Postdoctoral Position Available

I have a position open for a postdoctoral scholar in my lab, primarily centered around a project in explanation-centered inference (more details below). Folks with interdisciplinary backgrounds (for example, but not limited to: cognitive science) are encouraged to apply — the most important qualifications are that you’re comfortable writing software, that you’re fascinated by the research problem, and that you feel you have tools in your toolbox (that you’ll enjoy expanding after joining the lab) to make significant progress on the task.

The start date is flexible, and we’ll review applications as they come in until the position is filled. If you have any questions, please feel free to get in touch: pajansen@email.arizona.edu

April 2019 Note: We are still actively seeking applicants for this position. The start date is flexible. Interdisciplinary applicants comfortable writing software are encouraged to apply.

Postdoctoral Research Associate I
https://uacareers.com/postings/31213

Position Summary
The Cognitive Artificial Intelligence Laboratory ( http://www.cognitiveai.org ) in the School of Information at the University of Arizona invites applications for a Postdoctoral Research Associate for projects specializing in natural language processing and explanation-centered inference.

Natural language processing systems are steadily increasing performance on inference tasks like question answering, but few systems are able to provide explanations describing why their answers are correct. These explanations are critical in domains like science or medicine, where user trust is paramount and the cost of making errors is high. Our work has shown that one of the main barriers to increasing inference and explanation capability is the ability to combine information – for example, elementary science questions generally require combining between 6 and 12 different facts to answer and explain, but state-of-the-art systems generally struggle integrating more than two facts together. The successful candidate will combine novel methods in data collection, annotation, representation, and algorithmic development to exceed this limitation in combining information, and apply these methods to answering and explaining science questions.

A talk on our recent work in this area is available here: https://www.youtube.com/watch?v=EneqL2sr6cQ

Minimum Qualifications
– A Ph.D. in Computer Science, Information Science, Computational Linguistics, or a related field.
– Demonstrated interest in natural language processing, machine learning, or related techniques.
– Excellent verbal and written communication skills

Duties and Responsibilities
– Engage in innovative natural language processing research
– Write and publish scientific articles describing methods and findings in high-quality venues (e.g. ACL, EMNLP, NAACL, etc.)
– Assist in mentoring graduate and undergraduate students, and the management of ongoing projects
– Support writing grant proposals for external funding opportunities
– Serve as a collaborative member of a team of interdisciplinary researchers

Preferred Qualifications (One or more of the following, note not required for application)
– Knowledge of computational approaches to semantic knowledge representation, graph-based inference, and/or rule-based systems
– Experience applying machine learning methods to question answering tasks
– Knowledge of or interest in graphical visualization and/or user interface design
– Strong scholarly writing skills and publication record

Full Posting/To Apply
https://uacareers.com/postings/31213

Contact Information for Candidates’ Questions
Peter Jansen ( pajansen@email.arizona.edu )

Other Information
Tucson has been rated “the most affordable large city in the U.S.” and was the first city in the US to be designated as a World City of Gastronomy by the United Nations Educational, Scientific, and Cultural Organization (UNESCO). With easy access to both a vibrant arts and culture scene and outdoor activities ranging from hiking to rock climbing to bird watching, Tucson offers a bit of something for everyone.

The University of Arizona is committed to meeting the needs of its multi-varied communities by recruiting diverse faculty, staff, and students. The University of Arizona is an EEO/AA-M/W/D/V Employer. As an equal opportunity and affirmative action employer, the University of Arizona recognizes the power of a diverse community and encourages applications from individuals with varied experiences, perspectives, and backgrounds.

Outstanding UA benefits include health, dental, vision, and life insurance; paid vacation, sick leave, and holidays; UA/ASU/NAU tuition reduction for the employee and qualified family members; access to UA recreation and cultural activities; and more!

The University of Arizona has been listed by Forbes as one of America’s Best Employers in the United States, and WorldatWork and the Arizona Department of Health Services have recognized us for our innovative work-life programs.

AI2 Talk: What’s in an Explanation? Toward Explanation-centered Inference for Science Exams

I recently gave a talk at the Allen Institute for Artificial Intelligence (AI2) on my work in explanation-centered inference for solving standardized science exams. This talk is a good high-level introduction to three recent papers on understanding the kinds of knowledge and inference required to build explanations (COLING 2016), our work in building a very large corpus of semi-structured explanations (the largest I’m aware of) to help us learn how to combine large amounts of information for inference (LREC 2018), and examining this corpus for common explanatory patterns that would help us make the task of building new explanations easier (AKBC 2017).

We’re very excited about this new knowledge resource that’s been two years in the making, and its potential for exploring explanation-centered inference. The Worldtree corpus is available here, at the Explanation Bank!