
A large annotated corpus for learning natural language inference

Samuel R. Bowman*†  [email protected]
Gabor Angeli†‡  [email protected]
Christopher Potts*  [email protected]
Christopher D. Manning*†‡  [email protected]

*Stanford Linguistics  †Stanford NLP Group  ‡Stanford Computer Science

Abstract

Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

1 Introduction

The semantic concepts of entailment and contradiction are central to all aspects of natural language meaning (Katz, 1972; van Benthem, 2008), from the lexicon to the content of entire texts. Thus, natural language inference (NLI) — characterizing and using these relations in computational systems (Fyodorov et al., 2000; Condoravdi et al., 2003; Bos and Markert, 2005; Dagan et al., 2006; MacCartney and Manning, 2009) — is essential in tasks ranging from information retrieval to semantic parsing to commonsense reasoning.

NLI has been addressed using a variety of techniques, including those based on symbolic logic, knowledge bases, and neural networks. In recent years, it has become an important testing ground for approaches employing distributed word and phrase representations. Distributed representations excel at capturing relations based in similarity, and have proven effective at modeling simple dimensions of meaning like evaluative sentiment (e.g., Socher et al. 2013), but it is less clear that they can be trained to support the full range of logical and commonsense inferences required for NLI (Bowman et al., 2015; Weston et al., 2015b; Weston et al., 2015a). In a SemEval 2014 task aimed at evaluating distributed representations for NLI, the best-performing systems relied heavily on additional features and reasoning capabilities (Marelli et al., 2014a).

Our ultimate objective is to provide an empirical evaluation of learning-centered approaches to NLI, advancing the case for NLI as a tool for the evaluation of domain-general approaches to semantic representation. However, in our view, existing NLI corpora do not permit such an assessment. They are generally too small for training modern data-intensive, wide-coverage models, many contain sentences that were algorithmically generated, and they are often beset with indeterminacies of event and entity coreference that significantly impact annotation quality.

To address this, this paper introduces the Stanford Natural Language Inference (SNLI) corpus, a collection of sentence pairs labeled for entailment, contradiction, and semantic independence. At 570,152 sentence pairs, SNLI is two orders of magnitude larger than all other resources of its type. And, in contrast to many such resources, all of its sentences and labels were written by humans in a grounded, naturalistic context.
In a separate validation phase, we collected four additional judgments for each label for 56,941 of the examples. Of these, 98% of cases emerge with a three-annotator consensus, and 58% see a unanimous consensus from all five annotators.
  Premise: A man inspects the uniform of a figure in some East Asian country.
  Gold label: contradiction (labels: C C C C C)
  Hypothesis: The man is sleeping

  Premise: An older and younger man smiling.
  Gold label: neutral (labels: N N E N N)
  Hypothesis: Two men are smiling and laughing at the cats playing on the floor.

  Premise: A black race car starts up in front of a crowd of people.
  Gold label: contradiction (labels: C C C C C)
  Hypothesis: A man is driving down a lonely road.

  Premise: A soccer game with multiple males playing.
  Gold label: entailment (labels: E E E E E)
  Hypothesis: Some men are playing a sport.

  Premise: A smiling costumed woman is holding an umbrella.
  Gold label: neutral (labels: N N E C N)
  Hypothesis: A happy woman in a fairy costume holds an umbrella.

Table 1: Randomly chosen examples from the development section of our new corpus, shown with both the selected gold labels and the full set of labels (abbreviated) from the individual annotators, including (in the first position) the label used by the initial author of the pair.

In this paper, we use this corpus to evaluate a variety of models for natural language inference, including rule-based systems, simple linear classifiers, and neural network-based models. We find that two models achieve comparable performance: a feature-rich classifier model and a neural network model centered around a Long Short-Term Memory network (LSTM; Hochreiter and Schmidhuber 1997). We further evaluate the LSTM model by taking advantage of its ready support for transfer learning, and show that it can be adapted to an existing NLI challenge task, yielding the best reported performance by a neural network model and approaching the overall state of the art.

2 A new corpus for NLI

To date, the primary sources of annotated NLI corpora have been the Recognizing Textual Entailment (RTE) challenge tasks.¹ These are generally high-quality, hand-labeled data sets, and they have stimulated innovative logical and statistical models of natural language reasoning, but their small size (fewer than a thousand examples each) limits their utility as a testbed for learned distributed representations. The data for the SemEval 2014 task called Sentences Involving Compositional Knowledge (SICK) is a step up in terms of size, but only to 4,500 training examples, and its partly automatic construction introduced some spurious patterns into the data (Marelli et al. 2014a, §6). The Denotation Graph entailment set (Young et al., 2014) contains millions of examples of entailments between sentences and artificially constructed short phrases, but it was labeled using fully automatic methods, and is noisy enough that it is probably suitable only as a source of supplementary training data. Outside the domain of sentence-level entailment, Levy et al. (2014) introduce a large corpus of semi-automatically annotated entailment examples between subject–verb–object relation triples, and the second release of the Paraphrase Database (Pavlick et al., 2015) includes automatically generated entailment annotations over a large corpus of pairs of words and short phrases.

¹ http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool

Existing resources suffer from a subtler issue that impacts even projects using only human-provided annotations: indeterminacies of event and entity coreference lead to insurmountable indeterminacy concerning the correct semantic label (de Marneffe et al. 2008, §4.3; Marelli et al. 2014b). For an example of the pitfalls surrounding event coreference, consider the sentence pair A boat sank in the Pacific Ocean and A boat sank in the Atlantic Ocean. The pair could be labeled as a contradiction if one assumes that the two sentences refer to the same single event, but could also be reasonably labeled as neutral if that assumption is not made.
In order to ensure that our labeling scheme assigns a single correct label to every pair, we must select one of these approaches across the board, but both choices present problems. If we opt not to assume that events are coreferent, then we will only ever find contradictions between sentences that make broad universal assertions, but if we opt to assume coreference, new counterintuitive predictions emerge. For example, Ruth Bader Ginsburg was appointed to the US Supreme Court and I had a sandwich for lunch today would unintuitively be labeled as a contradiction, rather than neutral, under this assumption. Entity coreference presents a similar kind of indeterminacy, as in the pair A tourist visited New
York and A tourist visited the city. Assuming coreference between New York and the city justifies labeling the pair as an entailment, but without that assumption the city could be taken to refer to a specific unknown city, leaving the pair neutral. This kind of indeterminacy of label can be resolved only once the questions of coreference are resolved.

With SNLI, we sought to address the issues of size, quality, and indeterminacy. To do this, we employed a crowdsourcing framework with the following crucial innovations. First, the examples were grounded in specific scenarios, and the premise and hypothesis sentences in each example were constrained to describe that scenario from the same perspective, which helps greatly in controlling event and entity coreference.² Second, the prompt gave participants the freedom to produce entirely novel sentences within the task setting, which led to richer examples than we see with the more proscribed string-editing techniques of earlier approaches, without sacrificing consistency. Third, a subset of the resulting sentences were sent to a validation task aimed at providing a highly reliable set of annotations over the same data, and at identifying areas of inferential uncertainty.

² Issues of coreference are not completely solved, but greatly mitigated. For example, with the premise sentence A dog is lying in the grass, a worker could safely assume that the dog is the most prominent thing in the photo, and very likely the only dog, and build contradicting sentences assuming reference to the same dog.

2.1 Data collection

We will show you the caption for a photo. We will not show you the photo. Using only the caption and what you know about the world:

- Write one alternate caption that is definitely a true description of the photo. Example: For the caption "Two dogs are running through a field." you could write "There are animals outdoors."

- Write one alternate caption that might be a true description of the photo. Example: For the caption "Two dogs are running through a field." you could write "Some puppies are running to catch a stick."

- Write one alternate caption that is definitely a false description of the photo. Example: For the caption "Two dogs are running through a field." you could write "The pets are sitting on a couch." This is different from the maybe correct category because it's impossible for the dogs to be both running and sitting.

Figure 1: The instructions used on Mechanical Turk for data collection.

We used Amazon Mechanical Turk for data collection. In each individual task (each HIT), a worker was presented with premise scene descriptions from a pre-existing corpus, and asked to supply hypotheses for each of our three labels—entailment, neutral, and contradiction—forcing the data to be balanced among these classes.

The instructions that we provided to the workers are shown in Figure 1. Below the instructions were three fields for each of three requested sentences, corresponding to our entailment, neutral, and contradiction labels, a fourth field (marked optional) for reporting problems, and a link to an FAQ page. That FAQ grew over the course of data collection. It warned about disallowed techniques (e.g., reusing the same sentence for many different prompts, which we saw in a few cases), provided guidance concerning sentence length and complexity (we did not enforce a minimum length, and we allowed bare NPs as well as full sentences), and reviewed logistical issues around payment timing. About 2,500 workers contributed.
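Each completed HIT therefore contributes three labeled sentence pairs built from a single premise. The corpus release does not include the collection code itself; the snippet below is only a minimal sketch of that bookkeeping, with hypothetical field and function names, using the example sentences from Figure 1.

```python
# Illustrative sketch only: field names and structure are hypothetical,
# not taken from the actual SNLI collection scripts.

def pairs_from_hit(premise, hypotheses):
    """Turn one completed HIT (a premise caption plus one worker-written
    alternate caption per label) into three labeled sentence pairs."""
    pairs = []
    for label in ("entailment", "neutral", "contradiction"):
        pairs.append({
            "sentence1": premise,            # the caption shown to the worker
            "sentence2": hypotheses[label],  # the worker's alternate caption
            "author_label": label,           # label implied by the prompt answered
        })
    return pairs

example_pairs = pairs_from_hit(
    "Two dogs are running through a field.",
    {
        "entailment": "There are animals outdoors.",
        "neutral": "Some puppies are running to catch a stick.",
        "contradiction": "The pets are sitting on a couch.",
    },
)
print(len(example_pairs))  # 3, one pair per label
```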
For the premises, we used captions from the Flickr30k corpus (Young et al., 2014), a collection of approximately 160k captions (corresponding to about 30k images) collected in an earlier crowdsourced effort.³ The captions were not authored by the photographers who took the source images, and they tend to contain relatively literal scene descriptions that are suited to our approach, rather than those typically associated with personal photographs (as in their example: Our trip to the Olympic Peninsula). In order to ensure that the label for each sentence pair can be recovered solely based on the available text, we did not use the images at all during corpus collection.

³ We additionally include about 4k sentence pairs from a pilot study in which the premise sentences were instead drawn from the VisualGenome corpus (under construction; visualgenome.org). These examples appear only in the training set, and have pair identifiers prefixed with vg in our corpus.

Table 2 reports some key statistics about the collected corpus, and Figure 2 shows the distributions of sentence length for both our source premises and our newly collected hypotheses. We observed that while premise sentences varied considerably in length, hypothesis sentences tended to be as short as possible while still providing enough information to yield a clear judgment, clustering at around seven words. We also observed that the bulk of the sentences from both sources were syntactically complete rather than fragments, and the frequency with which the parser produces a parse rooted with an 'S' (sentence) node attests to this.

  Data set sizes:
    Training pairs                      550,152
    Development pairs                    10,000
    Test pairs                           10,000
  Sentence length:
    Premise mean token count               14.1
    Hypothesis mean token count             8.3
  Parser output:
    Premise 'S'-rooted parses             74.0%
    Hypothesis 'S'-rooted parses          88.9%
    Distinct words (ignoring case)       37,026

Table 2: Key statistics for the raw sentence pairs in SNLI. Since the two halves of each pair were collected separately, we report some statistics for both.

Figure 2: The distribution of sentence length.
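The aggregate figures in Table 2 can be recomputed from the raw sentence pairs with a few lines of code. The sketch below is illustrative only: it assumes the pairs are available as plain (premise, hypothesis) strings and uses whitespace tokenization, a simplification that may not exactly reproduce counts derived from the parser's tokenization.

```python
# Minimal sketch for recomputing Table 2-style statistics from raw string pairs.
# Whitespace tokenization is an approximation of the parser's tokenizer.

def corpus_stats(pairs):
    premise_tokens = [p.split() for p, _ in pairs]
    hypothesis_tokens = [h.split() for _, h in pairs]
    vocab = {w.lower() for sent in premise_tokens + hypothesis_tokens for w in sent}
    return {
        "premise_mean_tokens": sum(map(len, premise_tokens)) / len(pairs),
        "hypothesis_mean_tokens": sum(map(len, hypothesis_tokens)) / len(pairs),
        "distinct_words_ignoring_case": len(vocab),
    }

print(corpus_stats([("A dog is lying in the grass.", "An animal is outdoors.")]))
```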
2.2 Data validation

In order to measure the quality of our corpus, and in order to construct maximally useful testing and development sets, we performed an additional round of validation for about 10% of our data. This validation phase followed the same basic form as the Mechanical Turk labeling task used to label the SICK entailment data: we presented workers with pairs of sentences in batches of five, and asked them to choose a single label for each pair. We supplied each pair to four annotators, yielding five labels per pair including the label used by the original author. The instructions were similar to the instructions for initial data collection shown in Figure 1, and linked to a similar FAQ. Though we initially used a very restrictive qualification (based on past approval rate) to select workers for the validation task, we nonetheless discovered (and deleted) some instances of random guessing in an early batch of work, and subsequently instituted a fully closed qualification restricted to about 30 trusted workers.

For each pair that we validated, we assigned a gold label. If any one of the three labels was chosen by at least three of the five annotators, it was chosen as the gold label. If there was no such consensus, which occurred in about 2% of cases, we assigned the placeholder label '-'. While these unlabeled examples are included in the corpus distribution, they are unlikely to be helpful for the standard NLI classification task, and we do not include them in either training or evaluation in the experiments that we discuss in this paper.

The results of this validation process are summarized in Table 3. Nearly all of the examples received a majority label, indicating broad consensus about the nature of the data and categories. The gold-labeled examples are very nearly evenly distributed across the three labels. The Fleiss κ scores (computed over every example with a full five annotations) are likely to be conservative given our large and unevenly distributed pool of annotators, but they still provide insights about the levels of disagreement across the three semantic classes. This disagreement likely reflects not just the limitations of large crowdsourcing efforts but also the uncertainty inherent in naturalistic NLI. Regardless, the overall rate of agreement is extremely high, suggesting that the corpus is sufficiently high quality to pose a challenging but realistic machine learning task.
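The gold-label rule above is simple enough to state directly as code. The sketch below is an illustration, not the authors' validation pipeline: it takes the five labels for a pair (the author's label plus the four validators' labels) and returns the majority label, or the placeholder '-' when no label reaches three votes.

```python
from collections import Counter

def gold_label(labels):
    """Majority label among five annotations, or '-' if no label gets >= 3 votes."""
    assert len(labels) == 5
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 3 else "-"

print(gold_label(["neutral", "neutral", "entailment", "neutral", "neutral"]))           # neutral
print(gold_label(["neutral", "entailment", "contradiction", "neutral", "entailment"]))  # -
```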
2.3 The distributed corpus

Table 1 shows a set of randomly chosen validated examples from the development set with their labels. Qualitatively, we find the data that we collected draws fairly extensively on commonsense knowledge, and that hypothesis and premise sentences often differ structurally in significant ways, suggesting that there is room for improvement beyond superficial word alignment models. We also find the sentences that we collected to be largely
fluent, correctly spelled English, with a mix of full sentences and caption-style noun phrase fragments, though punctuation and capitalization are often omitted.

  General:
    Validated pairs                        56,951
    Pairs w/ unanimous gold label           58.3%
  Individual annotator label agreement:
    Individual label = gold label           89.0%
    Individual label = author's label       85.8%
  Gold label/author's label agreement:
    Gold label = author's label             91.2%
    Gold label ≠ author's label              6.8%
    No gold label (no 3 labels match)        2.0%
  Fleiss κ:
    neutral                                  0.60
    overall                                  0.70

Table 3: Statistics for the validated pairs. The author's label is the label used by the worker who wrote the premise to create the sentence pair. A gold label reflects a consensus of three votes from among the author and the four annotators.

The corpus is available under a Creative Commons Attribution-ShareAlike license, the same license used for the Flickr30k source captions. It can be downloaded at: nlp.stanford.edu/projects/snli/

Partition: We distribute the corpus with a pre-specified train/test/development split. The test and development sets contain 10k examples each. Each original ImageFlickr caption occurs in only one of the three sets, and all of the examples in the test and development sets have been validated.

Parses: The distributed corpus includes parses produced by the Stanford PCFG Parser 3.5.2 (Klein and Manning, 2003), trained on the standard training set as well as on the Brown Corpus (Francis and Kučera, 1979), which we found to improve the parse quality of the descriptive sentences and noun phrases found in the descriptions.

3 Our data as a platform for evaluation

The most immediate application for our corpus is in developing models for the task of NLI. In particular, since it is dramatically larger than any existing corpus of comparable quality, we expect it to be suitable for training parameter-rich models like neural networks, which have not previously been competitive at this task. Our ability to evaluate standard classifier-based NLI models, however, was limited to those which were designed to scale to SNLI's size without modification, so a more complete comparison of approaches will have to wait for future work. In this section, we explore the performance of three classes of models which could scale readily: (i) models from a well-known NLI system, the Excitement Open Platform; (ii) variants of a strong but simple feature-based classifier model, which makes use of both unlexicalized and lexicalized features; and (iii) distributed representation models, including a baseline model and neural network sequence models.

  System                         SNLI   SICK   RTE-3
  Edit Distance Based            71.9   65.4    61.9
  Classifier Based               72.2   71.4    61.5
    + Lexical Resources          75.0   78.8    63.6

Table 4: 2-class test accuracy for two simple baseline systems included in the Excitement Open Platform, as well as SICK and RTE results for a model making use of more sophisticated lexical resources.

3.1 Excitement Open Platform models

The first class of models is from the Excitement Open Platform (EOP, Padó et al. 2014; Magnini et al. 2014)—an open source platform for RTE research. EOP is a tool for quickly developing NLI systems while sharing components such as common lexical resources and evaluation sets. We evaluate on two algorithms included in the distribution: a simple edit-distance based algorithm and a classifier-based algorithm, the latter both in a bare form and augmented with EOP's full suite of lexical resources.

Our initial goal was to better understand the difficulty of the task of classifying SNLI corpus inferences, rather than necessarily the performance of a state-of-the-art RTE system.
We approached this by running the same system on several data sets: our own test set, the SICK test data, and the standard RTE-3 test set (Giampiccolo et al., 2007). We report results in Table 4. Each of the models
was separately trained on the training set of each corpus. All models are evaluated only on 2-class entailment. To convert 3-class problems like SICK and SNLI to this setting, all instances of contradiction and unknown are converted to nonentailment. This yields a most-frequent-class baseline accuracy of 66% on SNLI, and 71% on SICK. This is intended primarily to demonstrate the difficulty of the task, rather than necessarily the performance of a state-of-the-art RTE system. The edit distance algorithm tunes the weight of the three case-insensitive edit distance operations on the training set, after removing stop words. In addition to the base classifier-based system distributed with the platform, we train a variant which includes information from WordNet (Miller, 1995) and VerbOcean (Chklovski and Pantel, 2004), and makes use of features based on tree patterns and dependency tree skeletons (Wang and Neumann, 2007).

3.2 Lexicalized Classifier

Unlike the RTE datasets, SNLI's size supports approaches which make use of rich lexicalized features. We evaluate a simple lexicalized classifier to explore the ability of non-specialized models to exploit these features in lieu of more involved language understanding. Our classifier implements 6 feature types; 3 unlexicalized and 3 lexicalized:

1. The BLEU score of the hypothesis with respect to the premise, using an n-gram length between 1 and 4.
2. The length difference between the hypothesis and the premise, as a real-valued feature.
3. The overlap between words in the premise and hypothesis, both as an absolute count and a percentage of possible overlap, and both over all words and over just nouns, verbs, adjectives, and adverbs.
4. An indicator for every unigram and bigram in the hypothesis.
5. Cross-unigrams: for every pair of words across the premise and hypothesis which share a POS tag, an indicator feature over the two words.
6. Cross-bigrams: for every pair of bigrams across the premise and hypothesis which share a POS tag on the second word, an indicator feature over the two bigrams.

We report results in Table 5, along with ablation studies for removing the cross-bigram features (leaving only the cross-unigram feature) and for removing all lexicalized features.

  System          SNLI Train   SNLI Test   SICK Train   SICK Test
  Lexicalized           99.7        78.2         90.4        77.8
  Unigrams Only         93.1        71.6         88.1        77.0
  Unlexicalized         49.4        50.4         69.9        69.6

Table 5: 3-class accuracy, training on either our data or SICK, including models lacking cross-bigram features (Feature 6), and lacking all lexical features (Features 4–6). We report results both on the test set and the training set to judge overfitting.

On our large corpus in particular, there is a substantial jump in accuracy from using lexicalized features, and another from using the very sparse cross-bigram features. The latter result suggests that there is value in letting the classifier automatically learn to recognize structures like explicit negations and adjective modification. A similar result was shown in Wang and Manning (2012) for bigram features in sentiment analysis.
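To make the lexicalized features concrete, the sketch below extracts cross-unigram and cross-bigram indicator features (features 5 and 6 in the list above) from POS-tagged sentences. It is an illustration under our own naming conventions, not the authors' implementation, and it assumes a POS tagger has already been run over both sentences.

```python
# Illustrative sketch of features 5 and 6. Inputs are lists of (word, POS) pairs;
# the feature-name strings are arbitrary.

def cross_unigrams(premise, hypothesis):
    feats = set()
    for wp, tp in premise:
        for wh, th in hypothesis:
            if tp == th:                      # shared POS tag
                feats.add(f"xuni:{wp}|{wh}")
    return feats

def cross_bigrams(premise, hypothesis):
    feats = set()
    for p1, p2 in zip(premise, premise[1:]):
        for h1, h2 in zip(hypothesis, hypothesis[1:]):
            if p2[1] == h2[1]:                # shared POS tag on the second word
                feats.add(f"xbi:{p1[0]}_{p2[0]}|{h1[0]}_{h2[0]}")
    return feats

premise = [("a", "DT"), ("dog", "NN"), ("is", "VBZ"), ("running", "VBG")]
hypothesis = [("an", "DT"), ("animal", "NN"), ("is", "VBZ"), ("sleeping", "VBG")]
print(sorted(cross_unigrams(premise, hypothesis)))
print(sorted(cross_bigrams(premise, hypothesis)))
```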
It is surprising that the classifier performs as well as it does without any notion of alignment or tree transformations. Although we expect that richer models would perform better, the results suggest that given enough data, cross bigrams with the noisy part-of-speech overlap constraint can produce an effective model.

3.3 Sentence embeddings and NLI

SNLI is suitably large and diverse to make it possible to train neural network models that produce distributed representations of sentence meaning. In this section, we compare the performance of three such models on the corpus. To focus specifically on the strengths of these models at producing informative sentence representations, we use sentence embedding as an intermediate step in the NLI classification task: each model must produce a vector representation of each of the two sentences without using any context from the other sentence, and the two resulting vectors are then passed to a neural network classifier which predicts the label for the pair. This choice allows us to focus on existing models for sentence embedding, and it allows us to evaluate the ability of those models to learn useful representations of meaning (which may be independently useful for subsequent tasks), at the cost of excluding from consideration possible strong neural models for NLI that directly compare the two inputs at the word or phrase level.
Figure 3: The neural network classification architecture: for each sentence embedding model evaluated in Tables 6 and 7, two identical copies of the model are run with the two sentences as input, and their outputs are used as the two 100d inputs shown here. (The diagram shows 100d premise and hypothesis embeddings feeding a stack of three 200d tanh layers topped by a 3-way softmax classifier.)

Our neural network classifier, depicted in Figure 3 (and based on a one-layer model in Bowman et al. 2015), is simply a stack of three 200d tanh layers, with the bottom layer taking the concatenated sentence representations as input and the top layer feeding a softmax classifier, all trained jointly with the sentence embedding model itself.

We test three sentence embedding models, each set to use 100d phrase and sentence embeddings. Our baseline sentence embedding model simply sums the embeddings of the words in each sentence. In addition, we experiment with two simple sequence embedding models: a plain RNN and an LSTM RNN (Hochreiter and Schmidhuber, 1997).

The word embeddings for all of the models are initialized with the 300d reference GloVe vectors (840B token version, Pennington et al. 2014) and fine-tuned as part of training. In addition, all of the models use an additional tanh neural network layer to map these 300d embeddings into the lower-dimensional phrase and sentence embedding space. All of the models are randomly initialized using standard techniques and trained using AdaDelta (Zeiler, 2012) minibatch SGD until performance on the development set stops improving. We applied L2 regularization to all models, manually tuning the strength coefficient λ for each, and additionally applied dropout (Srivastava et al., 2014) to the inputs and outputs of the sentence embedding models (though not to their internal connections) with a fixed dropout rate. All models were implemented in a common framework for this paper.

  Sentence model       Train   Test
  100d Sum of words     79.3   75.3
  100d RNN              73.1   72.2
  100d LSTM RNN         84.8   77.6

Table 6: Accuracy in 3-class classification on our training and test sets for each model.

The results are shown in Table 6. The sum of words model performed slightly worse than the fundamentally similar lexicalized classifier—while the sum of words model can use pretrained word embeddings to better handle rare words, it lacks even the rudimentary sensitivity to word order that the lexicalized model's bigram features provide. Of the two RNN models, the LSTM's more robust ability to learn long-term dependencies serves it well, giving it a substantial advantage over the plain RNN, and resulting in performance that is essentially equivalent to the lexicalized classifier on the test set (LSTM performance near the stopping iteration varies by up to 0.5% between evaluation steps). While the lexicalized model fits the training set almost perfectly, the gap between train and test set accuracy is relatively small for all three neural network models, suggesting that research into significantly higher capacity versions of these models would be productive.
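As a concrete illustration of the Figure 3 architecture, the following is a minimal PyTorch-style sketch of the sum-of-words variant. The dimensions follow the text (300d word embeddings mapped through a tanh layer into a 100d sentence space, three 200d tanh layers, a 3-way softmax), but the module structure and names are our own, and training details such as GloVe initialization, dropout, L2 regularization, and AdaDelta are omitted; the original models were implemented in a different framework.

```python
import torch
import torch.nn as nn

class SumOfWordsEncoder(nn.Module):
    """Sum-of-words sentence encoder: embed each word, project to 100d with tanh, sum."""
    def __init__(self, vocab_size, word_dim=300, sent_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)    # GloVe-initialized in the paper
        self.project = nn.Sequential(nn.Linear(word_dim, sent_dim), nn.Tanh())

    def forward(self, token_ids):                          # (batch, seq_len)
        return self.project(self.embed(token_ids)).sum(dim=1)   # (batch, sent_dim)

class NLIClassifier(nn.Module):
    """Shared encoder for premise and hypothesis, then three 200d tanh layers
    and a final linear layer producing 3-way classification scores."""
    def __init__(self, encoder, sent_dim=100, hidden_dim=200, num_classes=3):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(
            nn.Linear(2 * sent_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),            # fed to softmax / cross-entropy loss
        )

    def forward(self, premise_ids, hypothesis_ids):
        pair = torch.cat([self.encoder(premise_ids),
                          self.encoder(hypothesis_ids)], dim=1)
        return self.mlp(pair)

model = NLIClassifier(SumOfWordsEncoder(vocab_size=37026))
scores = model(torch.randint(0, 37026, (2, 14)), torch.randint(0, 37026, (2, 8)))
print(scores.shape)  # torch.Size([2, 3])
```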
3.4 Analysis and discussion

Figure 4 shows a learning curve for the LSTM and the lexicalized and unlexicalized feature-based models. It shows that the large size of the corpus is crucial to both the LSTM and the lexicalized model, and suggests that additional data would yield still better performance for both. In addition, though the LSTM and the lexicalized model show similar performance when trained on the current full corpus, the somewhat steeper slope for the LSTM hints that its ability to learn arbitrarily structured representations of sentence meaning may give it an advantage over the more constrained lexicalized model on still larger datasets.

Figure 4: A learning curve showing how the baseline classifiers and the LSTM perform when trained to convergence on varied amounts of training data. The y-axis starts near a random-chance accuracy of 33%. The minibatch size of 64 that we used to tune the LSTM sets a lower bound on data for that model. (Axes: % accuracy vs. training pairs used, log scale; one curve each for the unlexicalized classifier, the lexicalized classifier, and the LSTM.)

We were struck by the speed with which the lexicalized classifier outperforms its unlexicalized counterpart. With only 100 training examples, the
cross-bigram classifier is already performing better. Empirically, we find that the top weighted features for the classifier trained on 100 examples tend to be high precision entailments; e.g., playing → outside (most scenes are outdoors), a banana → person eating. If relatively few spurious entailments get high weight—as it appears is the case—then it makes sense that, when these do fire, they boost accuracy in identifying entailments.

There are revealing patterns in the errors common to all the models considered here. Despite the large size of the training corpus and the distributional information captured by GloVe initialization, many lexical relationships are still misanalyzed, leading to incorrect predictions of independent, even for pairs that are common in the training corpus like beach/surf and sprinter/runner. Semantic mistakes at the phrasal level (e.g., predicting contradiction for A male is placing an order in a deli / A man buying a sandwich at a deli) indicate that additional attention to compositional semantics would pay off. However, many of the persistent problems