Transcription

DBpedia: A Multilingual Cross-Domain Knowledge Base

Pablo N. Mendes (1), Max Jakob (2), Christian Bizer (1)
(1) Web Based Systems Group, Freie Universität Berlin, Germany
(2) Neofonie GmbH, Berlin, Germany
[email protected], [email protected]

Abstract

The DBpedia project extracts structured information from Wikipedia editions in 97 different languages and combines this information into a large multilingual knowledge base covering many specific domains and general world knowledge. The knowledge base contains textual descriptions (titles and abstracts) of concepts in up to 97 languages. It also contains structured knowledge that has been extracted from the infobox systems of Wikipedias in 15 different languages and is mapped onto a single consistent ontology by a community effort. The knowledge base can be queried using a structured query language, and all its data sets are freely available for download. In this paper, we describe the general DBpedia knowledge base and extended data sets that specifically aim at supporting computational linguistics tasks. These tasks include Entity Linking, Word Sense Disambiguation, Question Answering, Slot Filling and Relationship Extraction. These use cases are outlined, pointing at the added value that the structured data of DBpedia provides.

Keywords: Knowledge Base, Semantic Web, Ontology

1. Introduction

Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. Wikipedia articles consist mostly of natural language text, but also contain different types of structured information, such as infobox templates, categorization information, images, geo-coordinates, and links to external Web pages. The DBpedia project (Bizer et al., 2009) extracts various kinds of structured information from Wikipedia editions in multiple languages through an open source extraction framework, and combines all this information into a multilingual, multidomain knowledge base. For every page in Wikipedia, a Uniform Resource Identifier (URI) is created in DBpedia to identify the entity or concept described by the corresponding Wikipedia page. During the extraction process, structured information from the wiki, such as infobox fields, categories and page links, is extracted as RDF triples and added to the knowledge base as properties of the corresponding URI (a small sketch of such output is given at the end of this section).

In order to homogenize the description of information in the knowledge base, a community effort has been initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values. As of March 2012, there are mapping communities for 23 languages (see http://mappings.dbpedia.org/index.php/Mapping_Statistics). The English Language Wikipedia, as well as the Greek, Polish, Portuguese and Spanish language editions, have mapped (to the DBpedia Ontology) templates covering approximately 80% of template occurrences. Other languages such as Catalan, Slovenian, German, Georgian and Hungarian have covered nearly 60% of template occurrences. As a consequence, most of the facts displayed in Wikipedia pages via templates are being extracted and mapped to a unified schema.

In this paper, we describe the DBpedia knowledge base and extended data sets that specifically aim at supporting computational linguistics tasks. These include the Lexicalization, Topic Signatures, Topical Concepts and Grammatical Gender data sets.
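For illustration, the snippet below sketches the kind of RDF triples the extraction framework produces for a single Wikipedia page, written in the prefixed-name style used by the data snippets later in this paper. The properties follow DBpedia naming conventions, but the exact triples and values are illustrative, not a verified excerpt from the data set:

    dbpedia:Berlin rdf:type dbpedia-owl:City .
    dbpedia:Berlin rdfs:label "Berlin"@en .
    dbpedia:Berlin dbpedia-owl:country dbpedia:Germany .
    dbpedia:Berlin dbpedia-owl:populationTotal "3450889"^^xsd:integer .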
2. Resources

2.1. The DBpedia Ontology

The DBpedia Ontology organizes the knowledge in Wikipedia in 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. It features labels and abstracts for 3.64 million things in up to 97 different languages, of which 1.83 million are classified in a consistent ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations, 183,000 species and 5,400 diseases. Additionally, there are 6,300,000 links to external web pages, 2,724,000 links to images, 740,000 Wikipedia categories and 690,000 geographic coordinates for places.

The alignment between Wikipedia infoboxes and the ontology is done via community-provided mappings that help to normalize name variation in properties and classes. Heterogeneities in the Wikipedia infobox system, like using different infoboxes for the same type of entity (class) or using different property names for the same property, can be alleviated in this way. For example, ‘date of birth’ and ‘birth date’ are both mapped to the same property birthDate, and the infoboxes ‘Infobox Person’ and ‘Infobox Founding Person’ have been mapped by the DBpedia community to the class Person. DBpedia mappings currently exist for 23 languages, which means that infobox properties in other languages, such as ‘data de nascimento’ or ‘Geburtstag’ – date of birth in Portuguese and German, respectively – also get mapped to the global identifier birthDate.
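As a minimal sketch of how this normalized schema is queried, the following SPARQL query retrieves birth dates via the unified birthDate property, regardless of which infobox field or language edition the value was extracted from. The namespace prefixes follow the conventions used in the figures later in this paper:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    SELECT ?person ?birthDate
    WHERE {
      ?person rdf:type dbpedia-owl:Person .
      ?person dbpedia-owl:birthDate ?birthDate .
    }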

That means, in turn, that information from all these language versions of DBpedia can be merged. Knowledge bases for smaller languages can therefore be augmented with knowledge from larger sources such as the English edition. Conversely, the larger DBpedia editions can benefit from the more specialized knowledge of localized editions (Tacchini et al., 2009).

2.2. The Lexicalization Data Set

DBpedia also provides extended data sets explicitly created to support natural language processing tasks. The DBpedia Lexicalization Data Set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. Currently, it contains 6.6 million scores for alternative names.

Three DBpedia data sets are used as sources of name variation: Titles, Redirects and Disambiguation Links. Labels of the DBpedia resources are created from Wikipedia page titles, which can be seen as community-approved surface forms. Redirects to URIs indicate synonyms or alternative surface forms, including common misspellings and acronyms. As redirects may point to other redirects, we compute the transitive closure of a graph built from redirects. Their labels also become surface forms. Disambiguation Links provide ambiguous surface forms that are “confusable” with all resources they link to. Their labels become surface forms for all target resources in the disambiguation page. Note that we erase trailing parentheses from the labels when constructing surface forms. For example, the label ‘Copyright (band)’ produces the surface form ‘Copyright’. This means that labels of resources and of redirects can also introduce ambiguous surface forms, in addition to the labels coming from titles of disambiguation pages. The collection of surface forms created as a result of this step constitutes an initial set of name variations for the target resources.

We augment the name variations extracted from titles, redirects and disambiguations by collecting the anchor texts of page links on Wikipedia. Anchor texts are the visible, clickable text of wiki page links that are specified after a pipe symbol in the MediaWiki syntax (e.g. [[Apple Inc.|Apple]]). By collecting all occurrences of page links, we can create co-occurrence statistics for entities and their name variations. We perform this task by counting how many times a certain surface form sf has been used to link to a page uri. We calculate the conditional probabilities p(uri|sf) and p(sf|uri) using maximum likelihood estimates (MLE). The pointwise mutual information pmi(sf, uri) is also given as a measure of association strength. Finally, as a measure of the prominence of a DBpedia resource within Wikipedia, p(uri) is estimated by the normalized count of incoming page links of a uri in Wikipedia. (A sketch of these estimators is given below.)
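The paper does not spell out the estimators explicitly; assuming c(sf, uri) denotes the number of page links with anchor text sf pointing to uri, the maximum likelihood estimates and the pointwise mutual information take the usual form:

\[
p(uri \mid sf) = \frac{c(sf,\,uri)}{\sum_{uri'} c(sf,\,uri')},
\qquad
p(sf \mid uri) = \frac{c(sf,\,uri)}{\sum_{sf'} c(sf',\,uri)},
\]
\[
pmi(sf,\,uri) = \log \frac{p(sf,\,uri)}{p(sf)\;p(uri)}
\]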
This data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to cite a few examples. The DBpedia Lexicalization Data Set has been used as one of the data sources for DBpedia Spotlight, a general-purpose entity disambiguation system (Mendes et al., 2011b).

By analyzing the DBpedia Lexicalization Data Set, one can note that approximately 4.4 million surface forms are unambiguous and 392,000 are ambiguous. The overall average ambiguity per surface form is 1.22 – i.e. the average number of possible disambiguations per surface form. Considering only the ambiguous surface forms, the average ambiguity per surface form is 2.52. Each DBpedia resource has an average of 2.32 alternative names. These statistics were obtained from Wikipedia dumps using a script written in Pig Latin (Olston et al., 2008), distributed as part of the pignlproc project, which allows its execution in a distributed environment using Hadoop (http://hadoop.apache.org).

2.3. The Topic Signatures Data Set

The Topic Signatures Data Set enables the description of DBpedia resources in a more unstructured fashion, as compared to the structured factual data provided by the mapping-based properties. We extract paragraphs that contain wiki links to the corresponding Wikipedia page of each DBpedia entity or concept. We consider each paragraph as contextual information to model the semantics of that entity under the Distributional Hypothesis (Harris, 1954). The intuition behind this hypothesis is that entities or concepts that occur in similar contexts tend to have similar meanings. We tokenize and aggregate all paragraphs in a Vector Space Model (Salton et al., 1975) of terms weighted by their co-occurrence with the target entity. In our VSM, each entity is represented by a vector, and each term is a dimension of this vector. Term scores are computed using the TF*IDF weight (a standard formulation is sketched below).

We use those weights to select the most strongly related terms for each entity and build topic signatures (Lin and Hovy, 2000). Figure 1 shows examples of topic signatures in our data set.

    dbpedia:Alkane          carbon alkanes atoms
    dbpedia:Astronaut       space nasa
    dbpedia:Apollo_8        first moon week
    dbpedia:Actinopterygii  fish species genus
    dbpedia:Anthophyta      forests temperate plants

Figure 1: A snippet of the Topic Signatures Data Set.

Topic signatures can be useful in tasks such as Query Expansion and Document Summarization (Nastase, 2008). An earlier version of this data set has been successfully employed to classify ambiguously described images as good depictions of DBpedia entities (García-Silva et al., 2011).
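The paper does not specify which TF*IDF variant is used; a common formulation, stated here as an assumption, weights a term t for an entity e as follows, where tf(t, e) is the frequency of t in the paragraphs aggregated for e, N is the total number of entities, and df(t) is the number of entities whose aggregated paragraphs contain t:

\[
w(t, e) = \mathrm{tf}(t, e) \cdot \log \frac{N}{\mathrm{df}(t)}
\]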

2.4. The Thematic Concepts Data Set

Wikipedia relies on a category system to capture the idea of a ‘theme’, a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are ‘thematic’, that is, they are the center of discussion for a category.

A simple SPARQL query can retrieve all DBpedia resources within a given Wikipedia category (Figure 2). A variation of this query can use the Thematic Concepts Data Set to retrieve other DBpedia resources related to a certain theme (Figure 3). The two queries can be combined with trivial use of SPARQL UNION (a sketch follows below). This set of resources can be used, for instance, for creating a corpus from Wikipedia to be used as training data for topic classifiers. (Note that the wikiPageLinks data set, which is used in Figures 3 and 6, is not loaded in the public SPARQL endpoint, but is available for download and local usage.)

    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?resource
    WHERE {
      ?resource dcterms:subject <http://dbpedia.org/resource/Category:Biology> .
    }

Figure 2: SPARQL query demonstrating how to retrieve entities and concepts under a certain category.

    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    SELECT ?resource
    WHERE {
      ?resource dbpedia-owl:wikiPageWikiLink dbpedia:Biology .
    }

Figure 3: SPARQL query demonstrating how to retrieve pages linking to topical concepts.
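As a sketch of the combination mentioned above, assuming the same namespaces as in Figures 2 and 3, such a UNION query could look as follows:

    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    SELECT ?resource
    WHERE {
      { ?resource dcterms:subject <http://dbpedia.org/resource/Category:Biology> . }
      UNION
      { ?resource dbpedia-owl:wikiPageWikiLink dbpedia:Biology . }
    }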
2.5. The Grammatical Gender Data Set

DBpedia contains 416,000 instances of the class Person. We have created a DBpedia extractor that uses a simple heuristic to decide on a grammatical gender for each person extracted. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbpedia-owl:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine).

    dbpedia:Aristotle foaf:gender "male"@en .
    dbpedia:Abraham_Lincoln foaf:gender "male"@en .
    dbpedia:Ayn_Rand foaf:gender "female"@en .
    dbpedia:Andre_Agassi foaf:gender "male"@en .
    dbpedia:Anna_Kournikova foaf:gender "female"@en .
    dbpedia:Agatha_Christie foaf:gender "female"@en .

Figure 4: A snippet of the Grammatical Gender Data Set.

We assert the grammatical gender for the instance being extracted if the number of occurrences of masculine pronouns exceeds the occurrence of feminine pronouns by a margin, and vice versa. In order to increase the confidence in the extracted grammatical gender, the current version of the data set requires that the difference in frequency is 200%. Furthermore, we experimented with a minimum occurrence of gender-specific pronouns on one page of 5, 4 and 3. The resulting data covers 68%, 75% and 81%, respectively, of the known instances of persons in DBpedia. Our extraction process assigned the grammatical gender "male" to roughly 85% and "female" to roughly 15% of the people. Figure 4 shows example data.

2.6. RDF Links to other Data Sets

DBpedia provides 6.2 million RDF links pointing at records in other data sets. For instance, links to WordNet synsets (Fellbaum, 1998) were generated by relating Wikipedia infobox templates and WordNet synsets and adding a corresponding link to each entity that uses a specific template. DBpedia also includes links to other ontologies and knowledge bases, including Cyc (Lenat, 1995), Umbel.org, Schema.org and Freebase.com (see the sketch at the end of this section).

Other useful linked sources are Project Gutenberg (http://www.gutenberg.org/), which offers thousands of free e-books, and the New York Times, which began to publish its inventory of articles collected over the past 150 years; as of January 2010, 10,000 subject headings had been shared. The links from DBpedia to authors and texts in Project Gutenberg could be used for backing author identification methods, for instance. Meanwhile, the links to concepts in the New York Times database enable its usage as an evaluation corpus (Sandhaus, 2008) for Named Entity Recognition and Disambiguation algorithms, amongst others.
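As noted above, these cross-data-set links are themselves RDF triples. The snippet below is an illustrative sketch of their shape; owl:sameAs is the property commonly used for such identity links, while the freebase: and opencyc: prefixes stand in for the namespaces of the respective data sets and the local names are hypothetical, not verified excerpts:

    dbpedia:Berlin owl:sameAs freebase:Berlin .
    dbpedia:Berlin owl:sameAs opencyc:Berlin .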

3. Use Cases

In this section, we outline four use cases of the DBpedia knowledge base in tasks related to computational linguistics and natural language processing.

3.1. Reference Knowledge Base for Disambiguation Tasks

The existence of a homogenized schema for describing data in DBpedia, coupled with its origins in the largest source of multilingual encyclopaedic text available, makes this knowledge base particularly interesting as a resource for natural language processing. DBpedia can be used, for instance, as a reference knowledge base for Entity Linking (McNamee et al., 2010) and other Word Sense Disambiguation-related tasks.

For example, the Entity Linking task at TAC-KBP 2011 (Ji et al., 2011) uses a target knowledge base that can be automatically mapped to DBpedia via Wikipedia links. It has been shown that simple entity linking algorithms can leverage this mapping to obtain a µAVG of 0.827 on the TAC-KBP 2010 and 0.727 on the TAC-KBP 2011 data sets (Mendes et al., 2011a). A number of academic and commercial projects already perform Entity Linking directly to DBpedia (Mendes et al., 2011b; Zemanta Ltd., 2009; Thomson Reuters, 2008), and others can be mapped to DBpedia via Wikipedia (Orchestr8 LLC, 2009; Giuliano et al., 2009; Ferragina and Scaiella, 2010; Ratinov et al., 2011; Han and Sun, 2011).

One advantage of using DBpedia over Wikipedia as target knowledge base for evaluations is the DBpedia Ontology. By providing a hierarchical classification of concepts, DBpedia allows one to select a subset of classes on which to focus a particular disambiguation task. With a simple Web query (Figure 5) one can obtain a list of entities of type Person or Organization (or even more specific types such as Politician or School).

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?person
    WHERE {
      ?person rdf:type dbpedia-owl:Person.
    }

Figure 5: SPARQL query demonstrating how to select all instances of type Person.

Simple extensions to those queries can also retrieve a list of all Wikipedia pages that link to entities matching those queries. An example of such a query is shown in Figure 6. These pages, along with the in-text links, can be used as training data for Named Entity Recognition or Entity Linking algorithms, for example. A similar approach is used by DBpedia Spotlight.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?person ?link
    WHERE {
      ?person rdf:type dbpedia-owl:Person .
      ?person dbpedia-owl:wikiPageWikiLink ?link .
    }

Figure 6: SPARQL query demonstrating how to select all pages linking at entities of type Person.

3.2. Question Answering: World Knowledge

Automatic answering of natural language questions gains importance as the information needs of non-technical users grow in complexity. Complex questions have traditionally been approached through the usage of databases and query languages. However, such query languages may not be a viable option for non-technical users. Moreover, alongside structured information in databases, the amount of information available in natural language increases at a fast pace. The complexity of retrieving the required information and the complexity of interpreting the results call for more than classical document retrieval.

DBpedia contains structured information about a variety of fields and domains from Wikipedia. This information can be leveraged in question answering systems, for example, to map natural language to a target query language. The QALD-1 Challenge (qal, 2011) was an evaluation campaign where natural language questions were translated to SPARQL queries, aiming at retrieving factual answers for those questions. As part of this task, it is necessary to constrain certain ontology properties (e.g. the gender and age of a person), and it can be beneficial to use the DBpedia ontology. For example, Figure 7 shows the SPARQL query for the question ‘Who is widow to a politician that died in Texas?’.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    SELECT DISTINCT ?widow
    WHERE {
      ?politician rdf:type dbpedia-owl:Person.
      ?politician dbpedia-owl:occupation dbpedia:Politician.
      ?politician dbpedia-owl:deathPlace dbpedia:Texas.
      ?politician dbpedia-owl:spouse ?widow.
    }

Figure 7: SPARQL query for the question ‘Who is widow to a politician that died in Texas?’.

3.3. Slot Filling and Relationship Extraction

Since the DBpedia knowledge base also contains structured information extracted from infoboxes, it can be used as a reference knowledge base for other tasks such as slot filling and relationship extraction. Through mappings of several infobox fields to one ontology property, a more harmonized view of the data is provided, allowing researchers to exploit Wikipedia to a larger extent, e.g. attempting multilingual relationship extraction.

3.4. Information Retrieval: Query Expansion

Understanding keyword queries is a difficult task, especially due to the fact that such queries usually contain very few keywords that could be used for disambiguating ambiguous words. While users are typing keywords, current search engines offer a drop-down box with suggestions of common keyword combinations that relate to what the user is typing.

For an ontology-based system that interfaces with its users through keyword searches, such ‘auto-suggest’ functionality can be achieved through the use of the data in the DBpedia Lexicalization Data Set. Figure 8 shows how to retrieve all resources that are candidate disambiguations for a surface form, along with a score of association strength. The available scores are described in Section 2.2.

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?resource ?score WHERE {
      GRAPH ?g {
        ?resource skos:altLabel ?label.
      }
      ?g <http://dbpedia.org/spotlight/score> ?score.
      FILTER (REGEX(?label, "apple", "i"))
    }

Figure 8: SPARQL query for retrieving candidate disambiguations for the string ‘apple’.

4. Conclusion

DBpedia is a multilingual, multidomain knowledge base that can be directly used in many tasks in natural language processing. The knowledge base as well as the extended DBpedia data sets are freely available under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License and can be downloaded from the project website (http://dbpedia.org). Furthermore, through the use of W3C-recommended Web technologies, a subset of the DBpedia knowledge base is also available for online usage through Web queries against the public SPARQL endpoint (http://dbpedia.org/sparql).

5. Acknowledgements

We wish to thank Robert Isele and the developers of the DBpedia Extraction Framework, Paul Kreis and the international team of DBpedia Mapping Editors, as well as Dimitris Kontokostas and the DBpedia Internationalization team for their invaluable work on the DBpedia project. This work was partially funded by the European Commission through FP7 grants LOD2 – Creating Knowledge out of Interlinked Data (Grant No. 257943) and DICODE – Mastering Data-Intensive Collaboration and Decision Making (Grant No. 257184).

6. References

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia – A crystallization point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 7:154–165.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA; London, May.

Paolo Ferragina and Ugo Scaiella. 2010. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1625–1628, New York, NY, USA. ACM.

Andrés García-Silva, Max Jakob, Pablo N. Mendes, and Christian Bizer. 2011. Multipedia: Enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA. ACM.

Claudio Giuliano, Alfio Massimiliano Gliozzo, and Carlo Strapparava. 2009. Kernel methods for minimally supervised WSD. Computational Linguistics, 35:513–528, December.

Xianpei Han and Le Sun. 2011. A generative entity-mention model for linking entities with knowledge base. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 945–954, Portland, Oregon, USA, June. Association for Computational Linguistics.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Heng Ji, Ralph Grishman, and Hoa Dang. 2011. Overview of the TAC2011 Knowledge Base Population Track. In Proceedings of the Text Analysis Conference (TAC 2011).

Douglas Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, November.

Chin-Yew Lin and Eduard H. Hovy. 2000. The automated acquisition of topic signatures for text summarization. In COLING, pages 495–501.

Zemanta Ltd. 2009. Zemanta API overview. http://www.zemanta.com/api/.

Paul McNamee, Hoa Trang Dang, Heather Simpson, Patrick Schone, and Stephanie Strassel. 2010. An evaluation of technologies for knowledge base population. In LREC.

Pablo N. Mendes, Joachim Daiber, Max Jakob, and Christian Bizer. 2011a. Evaluating DBpedia Spotlight for the TAC-KBP entity linking task. In Proceedings of the TAC-KBP 2011 Workshop.

Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011b. DBpedia Spotlight: Shedding light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics).

Vivi Nastase. 2008. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 763–772, Stroudsburg, PA, USA. Association for Computational Linguistics.

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1099–1110, New York, NY, USA. ACM.

Orchestr8 LLC. 2009. AlchemyAPI. http://www.alchemyapi.com/, retrieved on 11.12.2010.

2011. Proceedings of the 1st Workshop on Question Answering over Linked Data (QALD-1), collocated with the 8th Extended Semantic Web Conference (ESWC 2011), Heraklion, Greece, June.

Lev-Arie Ratinov, Dan Roth, Doug Downey, and Mike Anderson. 2011. Local and global algorithms for disambiguation to Wikipedia. In ACL, pages 1375–1384.

Thomson Reuters. 2008. OpenCalais: Connect. Everything. http://www.opencalais.com/about, retrieved on 11.12.2010.

G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Communications of the ACM, 18:613–620, November.

Evan Sandhaus. 2008. The New York Times Annotated Corpus.

Eugenio Tacchini, Andreas Schultz, and Christian Bizer. 2009. Experiments with Wikipedia cross-language data fusion. Volume 449 of CEUR Workshop Proceedings, ISSN 1613-0073.
