Behav Res (2012) 44:991–997
DOI 10.3758/s13428-012-0190-4

Adding part-of-speech information to the SUBTLEX-US word frequencies

Marc Brysbaert · Boris New · Emmanuel Keuleers

Published online: 7 March 2012
© Psychonomic Society, Inc. 2012

Abstract  The SUBTLEX-US corpus has been parsed with the CLAWS tagger, so that researchers have information about the possible word classes (parts of speech, or PoSs) of the entries. Five new columns have been added to the SUBTLEX-US word frequency list: the dominant (most frequent) PoS for the entry, the frequency of the dominant PoS, the frequency of the dominant PoS relative to the entry's total frequency, all PoSs observed for the entry, and the respective frequencies of these PoSs. Because the current definition of lemma frequency does not seem to provide word recognition researchers with useful information (as illustrated by a comparison of the lemma frequencies and the word form frequencies from the Corpus of Contemporary American English), we have not provided a column with this variable. Instead, we hope that the full list of PoS frequencies will help researchers to collectively determine which combination of frequencies is the most informative.

Keywords  SUBTLEX . Word frequency . Part-of-speech information . Subtitles . Lexical decision

M. Brysbaert (*) · E. Keuleers
Department of Experimental Psychology, Ghent University, Henri Dunantlaan 2, 9000 Gent, Belgium
e-mail: [email protected]

B. New
Université René Descartes, Paris, France

Whereas throughout most of the twentieth century, collecting a corpus of texts and tagging it with part-of-speech (PoS) information required a massive investment in time and manpower, nowadays it can be done in a matter of days on the basis of digital archives and automatic parsing algorithms. As a result, researchers in psycholinguistics are becoming more aware of quality differences between word frequency measures (Balota et al., 2007; Brysbaert, Buchmeier, et al., 2011; Brysbaert & New, 2009).
The importance of using an appropriate word frequency measure was demonstrated by comparing the widely used Kučera and Francis (1967) frequency counts to the best available frequency measure, which explained 10% more variance in the naming and lexical decision times of English words. For all languages for which these data are available, word frequency estimates based on a corpus of some 30 million words from film and television subtitles turn out to be the best available predictor of lexical decision and naming times (Brysbaert, Buchmeier, et al., 2011; Brysbaert, Keuleers, & New, 2011; Cai & Brysbaert, 2010; Cuetos, Glez-Nosti, Barbón, & Brysbaert, 2011; Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010; New, Brysbaert, Veronis, & Pallier, 2007).

A second way to improve the available word frequency measures is to add PoS information, or information about the word classes of the entries. Having information on the number of times that a word is observed in a representative corpus is essential, but at the same time it is limited in many respects. For a start, researchers are often interested in a particular type of word (e.g., nouns, verbs, or adjectives). This is the case, for instance, when eye movement researchers want to insert words in carrier sentences. In such cases, all words must be of the same syntactic class, and selection is much more efficient if this information is included in the master list from which the words are selected. The same is true for researchers investigating the cortical regions involved in the processing of different types of words, such as nouns or verbs (e.g., Pulvermüller, 1999; Yang, Tan, & Li, 2011). They, too, would prefer to have syntactic information from the outset, so that they can select on this variable, rather than having to clean lists manually after the initial selection.
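Selection of this kind is straightforward once the five PoS columns listed in the abstract are available. The sketch below derives those columns from per-PoS counts and filters a list down to noun-dominant entries; the records and field names are illustrative, not the actual SUBTLEX-US file format (only the counts for "playing" match those reported later in the article).

```python
from collections import defaultdict

# Illustrative (word, PoS, count) records; only the "playing" counts
# match those reported for SUBTLEX-US later in the article.
records = [
    ("playing", "Verb", 7340), ("playing", "Adjective", 101), ("playing", "Noun", 67),
    ("appalling", "Adjective", 97), ("appalling", "Verb", 3),
    ("thumb", "Noun", 500), ("thumb", "Verb", 12),
]

def pos_profile(records):
    """Derive the five PoS columns for each word: dominant PoS, its
    frequency, its relative frequency, all PoSs, and their frequencies."""
    by_word = defaultdict(lambda: defaultdict(int))
    for word, pos, n in records:
        by_word[word][pos] += n
    profile = {}
    for word, counts in by_word.items():
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        total = sum(counts.values())
        profile[word] = {
            "dom_pos": ranked[0][0],
            "freq_dom": ranked[0][1],
            "rel_freq_dom": round(ranked[0][1] / total, 2),
            "all_pos": [p for p, _ in ranked],
            "all_freqs": [f for _, f in ranked],
        }
    return profile

profile = pos_profile(records)

# Stimulus selection: keep only entries that are predominantly nouns.
noun_stimuli = [w for w, p in profile.items() if p["dom_pos"] == "Noun"]
```

With one line per spelling, a selection like `noun_stimuli` is a single filter over the master list, which is exactly the convenience argued for above.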

Also, when researchers present words in isolation, it is a good idea to match the various conditions on syntactic class; otherwise, syntactic class could turn out to be a confounding variable. For instance, many very-high-frequency words are syntactic function words (articles, determiners, or prepositions). These differ in important respects from content words, because they come from a limited set (new function words in a language are extremely rare) and are often used in many different constructions (which is the reason for their high frequency). Researchers therefore would not like to see these words unequally distributed across conditions.

A related concern is that systematic differences may exist between different types of content words. For instance, Baayen, Feldman, and Schreuder (2006) reported faster lexical decisions to monosyllabic verbs than to monosyllabic nouns. This again suggests that researchers may want to match their words on this variable, even though Sereno and Jongman (1997, Exp. 1) reported exactly the opposite finding (i.e., longer lexical decision times for verbs, both mono- and disyllabic, than for nouns).

Finally, many English words are classified under several different PoSs. For instance, the entries "play" and "plays" may be either nouns or verbs. The same is true for "playing," which in addition can be an adjective. Having access to word frequencies that are disambiguated for PoS would allow researchers not only to better select their stimuli in this respect, but also to do research on the topic. For instance, Baayen et al. (2006) found faster lexical decision times to verbs that were also frequently used as nouns.

Syntactic ambiguities are a particular problem when they involve an inflected form and a lemma form, as is the case for many past and present participles of verbs.
Researchers probably would not be inclined to include such words as "played" and "playing" in a list of base words (e.g., for a word-rating study), because these words are inflected forms of the verb "to play." However, the same is intuitively not true for "appalled" and "appalling": These words seem to be adjectives in the first place. Again, rather than having to rely entirely on intuition, it would be good to have information about the relative PoS frequencies of such words.

Below, we first report how PoS information was obtained for the SUBTLEX-US word frequencies and then present some initial analyses.

Method

The SUBTLEX-US corpus is based on subtitles from films and television programs and contains 51 million word tokens coming from 8,388 different subtitle files (Brysbaert & New, 2009). To extract PoS information, we used the CLAWS ("constituent likelihood automatic word-tagging system") algorithm. This algorithm is a PoS tagger developed at Lancaster University (available online). We chose this tagger because it is one of the few developed by a team of computational linguists over a prolonged period of time and because it is optimized for word frequency research in both written and spoken language. CLAWS was the PoS tagger used and improved in creating the British National Corpus, a major research effort to collect a representative corpus of 100 million words and make this corpus available in tagged form (Garside, 1996).
It is also the tagger used in an equivalent American initiative to make tagged spoken and written language available to researchers (the Corpus of Contemporary American English: Davies, 2008; see also below).

Even though major research efforts have been invested in the CLAWS tagger, it is important to realize that its output is not completely error-free (just like the outputs of its alternatives). Performance checks have indicated that it achieves 96%–97% overall accuracy, or 98.5% accuracy if judgments are limited to the major grammatical categories (Garside, Leech, & McEnery, 1997; see also the more detailed information in the online CLAWS manual). Therefore, users must be aware that, although most of the time CLAWS gives accurate information, it is better to consider the output as useful guidelines rather than as a set of dictionary definitions (below, we describe a few examples of errors that we spotted). As far as we know, at present there are no better alternatives to CLAWS (even human taggers disagree about the correct interpretation of some 2% of instances, and the costs such an effort would involve would be prohibitive).

The CLAWS algorithm parses sentences and assigns the most likely syntactic roles to the words in six steps (Garside, 1996):

1. First, the input text is read in and divided into individual tokens, and sentence breaks are established.
2. A list of possible grammatical tags is assigned to the words on the basis of a lexicon.
3. For the words in the text not found in the lexicon, a sequence of rules is applied to assign a list of suitable tags.
4. Libraries of template patterns are used to adapt the list of word tags from Steps 2 and 3 in light of the immediate context in which each word occurs (e.g., "the play" vs. "I play").
5. The probability of each potential sequence of tags is calculated (as an index of how grammatically well formed the sentence would be), and the sequence with the highest probability is selected.
6. The input text and the associated information about the tags are returned.

The algorithm uses a set of over 160 tags, which we reduced to the following main syntactic categories: noun, verb, adjective, adverb, pronoun, article, preposition, conjunction, determiner, number, letter, name (or proper noun), interjection, and unclassified. For each word in the SUBTLEX-US frequency list, we calculated five values:

- The syntactic category with the highest frequency
- The frequency of this category
- The relative frequency of the dominant category
- The other categories assigned to the word
- The frequencies of these categories

Output

Table 1 shows the outcome of the PoS-tagging process for some entries related to "appal(l)" and "play." It illustrates the ways in which words are used in different roles with different frequencies. For instance, "playing" is used most often as a verb (observed 7,340 times in the corpus), but also as an adjective (101 times) and a noun (67 times). Examples from the corpus are "I was playing [V] with it first!," "I mean, if somehow we could level the playing [A] field, then, um, maybe I could find a way to come back," and "The only person my playing [N] is bothering is you." Table 1 also clearly shows that "appalled" and "appalling" are predominantly used as adjectives, whereas "played" and "playing" are predominantly used as inflected verb forms.

[Table 1. Processed outcome of the CLAWS algorithm for some words related to "appal(l)" and "play." The respective columns contain (1) the word, (2) the most frequent part of speech, (3) the frequency of the dominant part of speech (PoS), (4) the relative frequency of the dominant PoS versus the total frequency as calculated by CLAWS, (5) all PoSs taken by the word, in decreasing order, and (6) the respective frequencies of the PoSs. Frequencies are based on the SUBTLEX-US corpus, which includes 51 million words.]

When reading the figures in the table, it is good to keep in mind that a small number of entries should be considered
This becomes clear when welook at the results of a very-high-frequency word such as“a.” This entry is not only classified as an article (943,636times) and a letter (7 times), but also as an adverb (30,910),a noun (257), a preposition (50), an adjective (2), andunclassified (743). The high number of assignments as anadverb comes from situations in which the article precedes asequence of adjectives, as in the sentences “It feels a [Adv]little familiar.” and “I left it in a [Adv] little longer than Ishould’ve.” The wrong assignments of “a” as an adjectivecome from the sentences “it would be good to start thinkingthe differences between the a [A] posteriori truths . . .” and“Yale preppies reuniting their stupid a [A] capella group.”Whereas assignment errors lead to easily recognizablenoise for high-frequency words, they may result in misclassifications for low-frequency words. One of the most conspicuous examples we found in this respect is the word “horsefly,”which occurred 5 times in the corpus and was consistentlytagged as an adverb instead of as a noun, presumably becausethe word is not present in the CLAWS lexicon and the endletters -ly are interpreted as evidence for an adverbial role.Therefore, researchers using small sets of low-frequencywords are advised to always manually check their stimuli tomake sure that they are not working with materials that aremanifestly parsed in the wrong way (as with “horsefly”).Attentive readers will further notice that the frequencycounts of the CLAWS algorithm do not always fully agreewith those of SUBTLEX-US. This is because the CLAWSWordDom PoSFreq dPoSRel FreqAll PoSAll 1214,6461.00.831.; VerbAdjectiveAdverbVerbVerbVerb; Noun; Name249; 109931214,646; 3,417; NounNounNounNoun; NameNoun; NameVerb; AdjectiveNounNoun; VerbAdjectiveAdverb; NameVerb; Adjective; NounVerb; Noun31452169; 4748; 32,843; 261,926872; 1597; 17,340; 101; 671,163; 356

994algorithm does more than merely count the letter strings: Itimposes some structure on the input. This becomes clearwhen we look at the SUBTLEX-US entries not observed inthe CLAWS output. These are such entries as “gonna,”“gotta,” “wanna,” “cannot,” “gimme,” “dunno,” “isn,” and“hes.” The algorithm automatically corrects these entriesand gives them their proper, full-length transcription. Thealterations are small and mainly involve high-frequencywords, so that for practical purposes they do not matter (i.e.,they do not affect the correlation with RTs in typical wordprocessing tasks). Because the word form frequencies seem tobe most important, at present we advise users to keep using theSUBTLEX-US frequencies, which are based on simply counting letter strings. The CLAWS total frequencies are used tocalculate the relative frequencies of the dominant PoSs.We prefer the format of Table 1 over the morefrequently used format in which words are given separate lines for each PoS. It is our experience that thelatter organization makes the search for good stimuli inpsycholinguistic research harder. As we will argue later,word form frequency is the most important variable forpsycholinguistic research, and therefore, it is good tohave this frequency for a word as a single entry. PoSrelated information is secondary, and this is communicated best by putting it on a single line.Application: Verbs versus nounsAs a first application, we examined whether response times(RTs) to verbs and nouns differ, as had been suggested bySereno and Jongman (1997) and Baayen et al. (2006), butwith opposite results. To this end, we selected the entriesfrom SUBTLEX that only took noun and verb PoS tags andthat were recognized by at least two thirds of the participantsin the lexical decision experiment of the Elexicon Project. Inthis project, lexical decision times and naming times weregathered for over 40,000 English words (Balota et al.,2007). 
The majority of the entries selected were used only as nouns (Table 2). The second most frequent category comprised entries that predominantly served as nouns but in addition acted as verbs. Then followed the entries used only as verbs, and the verbs also used as nouns.

As can be seen in Table 2, the entries serving both as nouns and verbs were responded to faster than the entries serving as a noun or a verb only [F(3, 16909) = 488, MSE = 11,221]. However, the various categories also differed on a series of confounding variables. Therefore, we examined how much of the differences could be predicted on the basis of the SUBTLEX-US word form frequencies (nonlinear regression using cubic splines), word length in number of letters (nonlinear regression using cubic splines), word length in number of phonemes, orthographic Levenshtein distance to the 20 closest words, and phonological Levenshtein distance to the 20 closest words (see Balota et al., 2007, for more information on these variables). All variables had a significant effect, and together they accounted for 54% of the variance in RTs. They also accounted for most of the differences observed between the four categories, as can be seen in the RTpred column of Table 2.

[Table 2. Lexical decision response times (RTs) from the Elexicon Project for verbs and nouns according to the CLAWS part-of-speech information (only entries that were known to two thirds of the participants). The Noun row covers entries whose instances in the corpus were all classified as nouns; Verb covers entries whose instances were all classified as verbs; for Noun-Verb, the majority of instances were classified as nouns, the remainder as verbs; for Verb-Noun, most of the instances were classified as verbs, the remainder as nouns.]
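The variance-accounted-for figures above come from nonlinear regressions of RT on the predictors. The following is a rough sketch of that idea with simulated data, using an ordinary cubic polynomial in log frequency as a stand-in for the restricted cubic splines used in the article; all numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: 500 words with roughly Zipfian frequencies and RTs
# that decrease nonlinearly with log frequency (plus noise).
freq = 10 ** rng.uniform(0, 5, size=500)          # counts from 1 to 100,000
log_freq = np.log10(freq + 1)                      # log(frequency + 1), as in the article
rt = 900 - 60 * log_freq + 5 * (log_freq - 3) ** 2 + rng.normal(0, 20, size=500)

# Cubic fit in log frequency; a simple stand-in for spline regression.
coefs = np.polyfit(log_freq, rt, deg=3)
pred = np.polyval(coefs, log_freq)

# Percentage of variance accounted for (R squared).
ss_res = np.sum((rt - pred) ** 2)
ss_tot = np.sum((rt - np.mean(rt)) ** 2)
r2 = 1 - ss_res / ss_tot
```

The same machinery extends to several predictors at once (length, Levenshtein distances, and so on); `r2` then corresponds to the "percentage of variance accounted for" reported throughout the article.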
Still, the residual scores of the categories differed significantly from each other [F(3, 16909) = 22.9, MSE = 5,543], mainly because the entries primarily used as nouns were processed faster than predicted on the basis of the confounding variables, whereas the entries primarily used as verbs were processed more slowly than predicted. This is in line with the findings of Sereno and Jongman (1997) and different from those of Baayen et al. (2006), possibly because an analysis limited to monosyllabic words does not generalize to the full corpus. The difference between nouns and verbs illustrates, however, that researchers should match their stimuli on PoS information in addition to word form frequency, word length, and similarity to other words.

Does lemma frequency, as currently defined, add much to the prediction of lexical decision times?

Historically, researchers have added PoS information to word frequencies because they believed that a combined frequency measure based on the different word forms belonging to the same PoS category would be informative. Francis and Kučera (1982) were the first to do so. In 1967, they had published a word frequency list on the basis of the Brown corpus, without information about the word classes (Kučera & Francis, 1967). In 1982, they added PoS information and used the notion of lemma frequency. A lemma was defined as "a set of lexical forms having the same stem and belonging to the same major word class, differing only in inflection and/or spelling" (see also Knowles & Don, 2004). In this case, lemma frequency was the summed frequency of a base word and all its inflections. For instance, the lemma frequency of the verb "to play" is the sum of the frequencies of the verb forms "play," "plays," "played," and "playing." Similarly, the lemma frequency of the noun "play" is the sum of the frequencies of the noun forms "play" and "plays." Lemma frequencies gained further attention because of their inclusion in the CELEX lexical database (Baayen, Piepenbrock, & van Rijn, 1993).

Using the CELEX frequencies, Baayen, Dijkstra, and Schreuder (1997) published evidence that lemma frequency may be more informative than word form frequency. They showed that Dutch singular nouns with high-frequency plurals (such as the equivalent of English "cloud") were processed faster than matched singular nouns with low-frequency plurals (such as the equivalent of "thumb"). This seemed to indicate that not the word form frequency of the singular noun, but the combined frequency of the singular and plural forms (i.e., the lemma frequency), was important. This conclusion was called into question for English, however, when Sereno and Jongman (1997) examined the same issue and argued that for English the frequency of the word form was more important than the lemma frequency. Possibly as a result of this finding, American researchers kept on using the Kučera and Francis (1967) word form frequencies rather than the 1982 lemma frequencies, even though New, Brysbaert, Segui, Ferrand, and Rastle (2004) published results for English closer to those of Baayen et al. (1997) than to those of Sereno and Jongman.

Brysbaert and New (2009) addressed the usefulness of word form frequency versus lemma frequency in a more general way by making use of the word-processing times of the English Lexicon Project (Balota et al., 2007).
They observed that, across the 40,000 words, the CELEX word form frequencies accounted for slightly more variance in the RTs than did the CELEX lemma frequencies, and they thus advised researchers to continue working with word form frequencies rather than lemma frequencies. Similar conclusions were reached for Dutch (Keuleers, Brysbaert, & New, 2010) and German (Brysbaert, Buchmeier, et al., 2011).

To further assess the usefulness of lemma frequencies versus word form frequencies for general psycholinguistic research, we turned to a new, independent source of information. In recent years, Davies has compiled a Corpus of Contemporary American English (e.g., Davies, 2008; available online). This corpus is based on five different sources with equal weight: transcriptions of TV and radio talk shows, fiction (short stories, books, and movie scripts), popular magazines, newspapers, and academic journals. It is regularly updated, and at the time of purchase (fall 2011) it contained 425 million words. Frequencies can be downloaded or purchased for word forms (depending on the level of detail wanted) and purchased for lemmas; these norms are known as the COCA word frequencies.

We used the lemma frequency list provided by COCA and added the word form frequencies from COCA and SUBTLEX-US. Frequencies of homographs were summed. Thus, the lemma frequency of the word "play" was the sum of the lemma frequencies of "play" as a verb (197,153 counts) and "play" as a noun (43,818 counts). Similarly, the COCA word form frequency of the word "play" was the sum of the frequencies of the word "play" classified as a verb (78,621), a noun (36,201), an adjective (36), a name (9), and a pronoun (5). For the SUBTLEX-US word form frequency, we simply took the number of times that the letter sequence "play" had been counted in Brysbaert and New (2009).
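The homograph summing just described is plain addition over PoS-disambiguated counts. A sketch using the COCA counts for "play" quoted above (the dictionary labels are our own shorthand):

```python
# COCA counts for "play," as quoted in the text.
lemma_counts = {"verb lemma": 197_153, "noun lemma": 43_818}
word_form_counts = {"verb": 78_621, "noun": 36_201,
                    "adjective": 36, "name": 9, "pronoun": 5}

# Summing over homographs gives one frequency per spelling.
lemma_frequency = sum(lemma_counts.values())          # 240,971
word_form_frequency = sum(word_form_counts.values())  # 114,872
```

The verb lemma already folds in the inflected forms "plays," "played," and "playing," which is why the lemma total is so much larger than the word form total (a point taken up below).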
We correlated the various frequencies with the standardized lexical decision times and the accuracy levels of the English Lexicon Project (Balota et al., 2007) and the British Lexicon Project (Keuleers, Lacey, Rastle, & Brysbaert, 2012). Some basic cleaning was done to get rid of questionable entries: Only entries accepted by the Microsoft Office spell checker (American spellings) were included. This excluded most names (which are not accepted if they do not start with a capital) and British spellings. All in all, the analysis based on the English Lexicon Project included 26,073 words; the analysis based on the British Lexicon Project comprised 14,765 words. Entries not observed in the SUBTLEX-US lists were given a frequency of 0. The analyses were based on log(frequency count + 1) and consisted of nonlinear regressions (as in Brysbaert & New, 2009).

As can be seen in Table 3, for the COCA frequencies we replicated the finding that lemma frequencies in general are not more informative than word form frequencies for typical psycholinguistic research, such as matching words in lexical decision experiments. This is surprising, given the results of Baayen et al. (1997) and New et al. (2004). Some further scrutiny suggests why the lemma frequencies, as currently defined, perform as they do.

[Table 3. Percentages of variance accounted for by the COCA lemma frequencies and the word form frequencies in lexical decision performance in the Elexicon Project and the British Lexicon Project. Rows: COCA lemma, COCA word form, SUBTLEX word form. Nonlinear regression analysis on entries accepted by the Microsoft Office spell checker (American English, 2007 version). All values are statistically significant (N = 26,073 for the Elexicon Project, and N = 14,765 for the British Lexicon Project). zRT, response time z score; Acc, percentage accuracy.]

The main differences between lemma frequencies and word form frequencies have to do with such words as "playing." In the COCA lemma frequencies, in line with the linguistic definition, the counts are

limited to those of the noun "playing" (in both singular and plural forms) and the adjective "playing," for a total of 2,686 counts. In contrast, the frequency of the word form "playing" does not include the plural noun "playings," but it does include the verb form "playing," giving a total of 53,512 counts. A similar situation occurs for the word "played" (COCA lemma frequency of 306 vs. word form frequency of 50,724). Because the verb forms "playing" and "played" are added to the verb lemma "play," the lemma frequency of this word (240,971) is much higher than the word form frequency (114,872). Also worth mentioning is the fact that the word "plays" does not figure in the COCA lemma list, because it is part of either the verb lemma "play" or the noun lemma "play."

It is clear that the contributions of base words and inflected forms require further scrutiny. On the one hand, good evidence exists that the frequencies of inflected forms affect the recognition of base words in at least one case (Baayen et al., 1997; New et al., 2004). On the other hand, it is also clear that lemma frequencies as currently defined are, in general, not very helpful for selecting the stimuli for word recognition experiments (Table 3). One way to improve the situation may be to try out different definitions of lemma frequency and see which one best predicts lexical decision times for various types of words (and in different languages). Another approach may be to use other measures of inflectional and morphological complexity, as proposed by Martín, Kostić, and Baayen (2004).

[Table 4. Percentages of variance accounted for (zRT and Acc, Elexicon Project and British Lexicon Project) by the various language registers included in the COCA corpus, based on lemma frequencies.]
However, it is clear that the issue is unlikely to be settled in a single study such as this one. Therefore, we felt that including a single lemma frequency in our database would send the wrong signal. It seemed more in line with current knowledge to limit the PoS information to the various frequencies provided by the CLAWS algorithm, so that researchers can collectively sink their teeth into the issue and try out different combinations of word frequencies. Hopefully, over time, convergent evidence will emerge about which equivalent to lemma frequency (if any) provides the best information for word recognition research. This could then be added to the SUBTLEX-US database.

Of further interest in Table 3 is the finding that the COCA frequencies, despite being based on a larger and more diverse corpus, do not predict word-processing times better than the SUBTLEX-US frequencies do (although they are better at predicting which words are known). This once again illustrates the importance of the language register. Further evidence is obtained when we look at the performance of the various frequency sources used in COCA (Table 4). Unfortunately, we have this information only for lemma frequencies, but it still shows that word frequencies based on academic journals in particular tend to predict the least amount of variance.

Attentive readers may wonder why the COCA spoken frequencies are not equivalent to the SUBTLEX-US frequencies, given that both are based on transcriptions of spoken materials. To answer this question, it is important to keep in mind that the language registers of the two corpora differ.

[Table 5. Relative frequencies of the words "the," "I," and "you" in various language registers (SUBTLEX-US and the COCA subcorpora). The more social the language register, the more frequently the pronouns "you" and "I" appear; the more descriptive the register, the more frequently the article "the" appears.]
In the COCA corpus, the spoken sources are talk shows on radio and television, whereas in the SUBTLEX corpus they are subtitles from films and television series, which typically refer to social interactions. This difference can clearly be shown by looking at the frequencies of the words "I," "you," and "the." In a recent Internet discussion about the most frequent word in English (held on the Corpora List), it became clear that the relative frequencies of these three words differ systematically between corpora. Whereas the word "the" is the most frequent in all corpora that include descriptions, "I" and "you" tend to be more prevalent in corpora centered on social interactions, such as SUBTLEX-US (and some of Shakespeare's plays). Table 5 lists the frequencies of the three words in SUBTLEX-US and the various COCA subcorpora. As can be seen, the "I"/"the" and "you"/"the" ratios decrease the less socially oriented a source is, and (critically) they also differ between the SUBTLEX-US corpus and the COCA spoken corpus.

Summary and availability

We parsed the SUBTLEX-US corpus with the CLAWS tagger
