
Methods in biomedical text mining

Raul Rodriguez-Esteban

Submitted in partial fulfillment of the
requirements for the degree
of Doctor of Philosophy
in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2008
© 2008
Raul Rodriguez-Esteban
All Rights Reserved
Abstract

Methods in biomedical text mining

Raul Rodriguez-Esteban

Methods to improve text mining of molecular biology interactions are needed to capture a richer information space and qualify the quality of extraction. Simple interaction models fail to describe contextual and confidence information that would help with more fine-grained analyses. Herein a method is presented to streamline curation of text-mined data, along with a way to improve text mining of biomedical terms that can be adapted to other domains using different machine learning techniques. These advances can be integrated into more powerful text-mining systems to meet user demand and to further promote the adoption of text-mining tools. Additionally, three studies on the nature of biomedical publications are presented; their novelty hinges on the fact that each asks questions that had not been posed before. They cover the phenomenon of retraction, ways to improve the impact of research, and the writing style used in the biomedical literature. Retraction is a hot topic in recent times, but it has not been examined in an analytical fashion. Measuring the impact of scientific publications has brought heated debate about which measures best describe it; we propose a method not to measure impact, but to improve it. Finally, we analyze the influence of scientific writing style on the priming of its reader from a sensorial point of view.
Contents

1 Overview of text mining of biomedical interactions
  1.1 Text mining
  1.2 Biomedical text mining
  1.3 Interactions from text
    1.3.1 GENIES
    1.3.2 GeneWays
  1.4 Curation and evaluation

2 Automatic curation of text-mined facts
  2.1 Introduction
  2.2 Approach
  2.3 Methods
    2.3.1 Training data
  2.4 Mathematical background
    2.4.1 Machine-learning algorithms
    2.4.2 Features used in our analysis
    2.4.3 Separating data into training and testing: Cross-validation
    2.4.4 Comparison of methods: Receiver operating characteristic (ROC) scores
  2.5 Results
  2.6 Discussion

3 Overview of biomedical term recognition and classification
  3.1 Introduction
    3.1.1 Term recognition
    3.1.2 Named entity expressions
    3.1.3 Term classification
    3.1.4 Biomedical term recognition and classification
  3.2 Mathematical Background
    3.2.1 Recognition and classification framework
    3.2.2 Conditional random fields

4 Biomedical term recognition and classification using large corpora and search engines
  4.1 Introduction
  4.2 Term recognition
    4.2.1 Text pre-processing and indexing
    4.2.2 Syntactical model
    4.2.3 Term recognition process
  4.3 Term classification
    4.3.1 Features
    4.3.2 Local, regional, global: Word sense disambiguation

5 Six senses: the bleak sensory landscape of biomedical texts

6 A recipe for high impact
  6.1 Ingredients of a scholarly study
  6.2 Information flow through publication-type niches
  6.3 Additional information
    6.3.1 Data
    6.3.2 Analysis

7 How many scientific papers should be retracted?
  7.1 Analyzing retraction patterns
  7.2 Mathematical model to calculate the number of articles that should have been retracted
  7.3 Retraction rates are on the rise

8 Future work and conclusions
  8.1 Future work
  8.2 Conclusion
  8.3 Papers that resulted from the work in this thesis

9 Bibliography
List of Figures

2.1 Cocaine: the predicted accuracy of individual text-mined facts involving semantic relation stimulate.
2.2 Accuracy of the raw (non-curated) extracted relations in the GeneWays 6.0 database.
2.3 Accuracy and abundance of the extracted and automatically curated relations.
2.4 Sentence Evaluation Tool.
2.5 The correlation matrix for the features used by the classification algorithms.
2.6 A hypothetical three-layered feed-forward neural network.
2.7 Receiver-operating characteristic (ROC) curves for the classification methods that we used in the present study.
2.8 Correlation matrix.
2.9 Ranks of all classification methods used in this study in 10 cross-validation experiments.
2.10 Values of precision, recall and accuracy of the MaxEnt 2 classifier plotted against the corresponding log-scores provided by the classifier.
5.1 Analysis of the frequencies of sensory words in six large corpora.
6.1 Contributions of topic- and method-specific estimates of temperature and novelty to a journal's impact factor.
6.2 Number of articles, MeSH terms and chemical names mentioned in PubMed since 1950.
7.1 Dataset, model and estimation of the number of flawed articles in scientific literature.
7.2 Number of articles and the percentage of articles retracted since 1950 as recorded in Medline.
List of Tables

1.1 Criteria for Evaluating Performance of NLP Systems.
2.1 Sentence examples.
2.2 List of annotation choices available to the evaluators.
2.3 Parameter values used for various SVM classifiers in this study.
2.4 Machine learning methods used in this study and their implementations.
2.5 List of the features that we used in the present study.
2.6 Comparison of the performance of human evaluators and of the MaxEnt 2 algorithm.
2.7 Comparison of human evaluators and a program that mimicked their work.
2.8 ROCs.
3.1 Definition of token classes with differing semantic significance.
3.2 Morphologic feature values with examples.
4.1 Examples of word labels for nested terms.
4.2 Term classification performance.
Acknowledgements

I would like to thank Andrey Rzhetsky for the support and trust he has shown me during my doctorate studies. His ideas inform this thesis and he shares a great deal of the credit for what is written here. Murat Cokol led the work described in Chapters 6 and 7. Murat has been a continuous influence both in terms of advice and insight. Ivan Iossifov collaborated in the studies reported in Chapters 2 and 7 and was helpful in teaching me to navigate the GeneWays system. I also would like to thank other members of the Rzhetsky lab: Igor Feldman, Chani Weinreb, Ilya Mayzus, Lixia Yao, and Pauline Kra.
To my family.
Chapter 1

Overview of text mining of biomedical interactions

The purpose of this introduction is to give a historical overview and background to the project of automatic curation of text-mined data described in Chapter 2. This introduction describes the beginnings of text mining as a discipline and its arrival in the biomedical domain. It also describes the first interaction text-mining projects, which precede and set a path to the development of GeneWays. This introduction is necessary to understand the architectural and structural choices made for the design of GeneWays. Finally, this introduction reviews the different efforts made in evaluating text-mined interaction data before the automatic curation project was developed. The aim is not only to contextualize the state of the art and the decisions made during the project but also to present the basis on which it stood and the challenges it faced.

1.1 Text mining

The field of text mining is a relatively new discipline born of the knowledge discovery in databases (KDD) and data mining (DM) community. As is often the case when
a discipline is born, it borrowed techniques and approaches from similar, more established fields before establishing its own identity.¹

¹ Think of the first cars having the shape of horse carts, or the first films looking like theater plays.

Alessandro Zanasi claims that the first time he heard the term "text mining" was when it was spoken by Charles Huot in 1994, during the IBM-ECAM (European Centre for Applied Mathematics) [1] workshop in Paris. Whether it was used with the same meaning that it has today, in the context of applications such as information extraction [2] or document classification [3], is unclear. In 1995 and 1996, Ronen Feldman and colleagues offered the first contributions to the field that can be called text mining with more certainty [4, 5, 6], originally under the name knowledge discovery in text (KDT). The word "mining" was soon introduced in 1996 in the context of KDT, followed by the coinage of the name "text mining", a variation of the name data mining. By 1997 the expression "text mining" had become an accepted name for the new discipline, which quickly spawned courses, workshops, and books and opened new avenues of research and notable subfields, such as web text mining (1998) and biomedical text mining (1998). Text mining brought together researchers from the KDD and DM communities and from the fields of natural language processing (NLP), automatic knowledge acquisition, information retrieval, and information extraction, to name a few. Text mining became the predominant name for the discipline, widely replacing other names such as KDT, textual data mining, and text data mining. That some of these names are still in use reflects not only a stylistic choice but also, in some cases, differences in understanding of aims, scope, and methods.

Marti Hearst [7] was one of the first to summarize the state of the nascent discipline in 1999, attempting to define its scope with respect to other fields such as data mining, computational linguistics, or information retrieval. Hearst stressed that the defining quality of text mining is that its goal is to discover novel information, unlike
fields such as information retrieval and data mining. In this respect, text mining is indebted to literature-based discovery, a field championed by Don Swanson beginning with his seminal paper in 1986, "Undiscovered public knowledge" [8, 9]. Literature-based discovery was intended to be a systematic search for pieces of knowledge that could be combined to create a novel discovery. Originally, literature-based discovery was largely a non-automated process. Swanson recalled stumbling onto the idea through a serendipitous finding of two unrelated articles that could be combined to answer a question that no other single article answered. His acceptance speech upon receiving the American Society for Information Science and Technology (ASIST) 2000 Award of Merit is worth quoting because it addresses core principles of the text-mining field:

"More than 40 years ago the fragmentation of scientific knowledge was a problem actively discussed but without much visible progress toward a solution; perhaps people then had the consummate wisdom to know that no problem is so big that you can't run away from it. Three aspects of the context and nature of this fragmentation seem notable:

1. The disparity between the total quantity of recorded knowledge, however it might be measured, and the limited human capacity to assimilate it, is not only enormous now but grows unremittingly. Exactly how the limitations of the human intellect and life span affect the growth of knowledge is unknown. Metaphorically, how can the frontiers of science be pushed forward if, someday, it will take a lifetime just to reach them? [...]

2. In response to the information explosion, specialties are somehow spontaneously created, then grow too large and split further into subspecialties without even a declaration of independence. One unintended result is the fragmentation of knowledge owing to inadequate cross-specialty communication. And as knowledge continues to grow, fragmentation will inevitably get worse because it is driven by the human imperative to escape inundation.

3. Of particular interest to me is the possibility that information in one specialty might be of value in another without anyone becoming aware of the fact. Specialized literatures, or other "units" of knowledge, that do not intercommunicate by citing one another may nonetheless have many implicit textual interconnections based on meaning. Indeed the number of
unintended or implicit text-based connections within the literature of science may greatly exceed the number that are explicit, because there are far more possible combinations of units (that potentially could be related) than there are units. The connection explosion may be more portentous than the information explosion."

Hearst's opinion is shared by Ananiadou and McNaught [10] and others [11]: "The primary goal of text mining is to retrieve knowledge that is hidden in text, and to present the distilled knowledge to users in a concise form". However, a more common point of view, first proposed by Ronen Feldman, defines text mining as different from data mining only because it deals with data that by its nature is unstructured, unlike data organized in databases, which are the primary source for data mining [4, 12, 13, 14, 15]. Kao and Poteet [16] go even further, stating that "Text mining is the discovery and extraction of interesting, non-trivial knowledge from free or unstructured text. This encompasses everything from information retrieval (i.e., document or web site retrieval) to text classification and clustering, to (somewhat more recently) entity, relation, and event extraction." In practice, this expansive view of text mining is not shared by many others, especially considering that information retrieval and text classification predate text mining by many years. Kao and Poteet's opinion implies that text mining is an umbrella term covering a laundry list of textual processing methods. A more common view seems to be that the aim of text mining is to find interesting, useful, or valuable patterns—which are not necessarily novel—in text collections. This perspective places text mining closer to knowledge acquisition and information extraction.

Given the fuzzy lines that separate text mining from similar fields, it is not clear whether it can be defined meaningfully beyond a mix of different conceptions held by different researchers. The confusion is compounded further because applications from related fields may be regarded as necessary processing steps for effective text mining. In other words, text-mining projects might require sub-tasks from other fields.
Therefore, text mining in some contexts might be used for the sole purpose of indicating the scientific agenda in which the study should be considered, not for defining the task itself as "text mining". Furthermore, as other fields have built on advances in text mining, text mining also has become an intermediate step in projects of a different nature.

Related disciplines such as semantic analysis, text analysis, information retrieval, information extraction, and knowledge acquisition have a much older pedigree within the computation and information sciences than does text mining. Like text mining, they derive from activities that originally could be handled by human intellect and rudimentary record-keeping but became more complex with the progressive accumulation of knowledge and information. Fielden [17] plotted the evolution of the size of information repositories over the course of human history, showing an exponential growth in the last decades. More comprehensively, Peter Lyman and Hal Varian led a study designed to estimate the quantity of information produced worldwide every year [18, 19]; they estimated a grand total of 5 exabytes,² or 800 megabytes per person per year, of which 92% were in magnetic storage. Printed text represented 33 terabytes, whereas the "surface internet" accounted for 167 terabytes and the "deep internet" (databases and dynamically generated pages) for about 92 petabytes. This unparalleled growth has been accompanied by extraordinary improvements in the devices and methods of the different computation and information sciences. Text mining, a late arrival, has the advantage of drawing from an extensive set of diverse techniques developed not only in the related disciplines, but also in other fields such as machine learning, artificial intelligence, probabilistic analysis, statistics, pattern recognition, data management, and information theory. While other disciplines, like information retrieval, matured before the current pervasive use and availability of electronic text, text mining was born into a seemingly limitless and growing frontier of resources and opportunities. Text miners, in turn, have acted as if they have a hammer and see a nail in everything. Perhaps this is the best explanation for the success of text mining: applications have driven its evolution [1].

² Clearly, not all those bytes are useful. The problem is not confined to sorting large amounts of data but also to seeing through the "information pollution" that clouds data analysis. It may be worth quoting T. S. Eliot again: "Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?"

Given the fragmentary state of the field, it is not surprising that there is currently no journal that specializes in text mining. The door is open for further transformation of the text-mining domain, whether in terms of its buzz or its consolidation in the spectrum of computation and information sciences.

1.2 Biomedical text mining

The first attempts at text mining of the biomedical literature date back to 1998. As explained in Section 1.1, the label "text mining" may have consumed some areas that formerly went by a different name, such as knowledge acquisition and information extraction. Text mining builds on previous informatics and computational work on semantic analysis, dictionary creation, knowledge acquisition, classification, etc. Its application to the biomedical realm is a natural extension given the existing opportunities: the exponential growth of the literature, both in size and in electronic availability; the gradual shift to electronic medical records; the on-going work on annotated resources (e.g., Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), Swissprot); and the increasing need for integration between information sources of disparate origin, also known as integromics [20]. The internet is the main engine that has fuelled this growth. Even though computers and electronic communications long predated the internet, it is the internet that has
crystallized change because it has dramatically lowered the cost of information access and exchange and brought to the social forefront the challenges and opportunities of biomedical electronic information (e.g., the Health Insurance Portability and Accountability Act of 1996; the Open Access movement).

Integromics is proving to be of crucial importance in current developments as more data are becoming available in different formats in electronic and on-line form, including supplementary information tables, genome linkage maps, DNA sequences, taxonomies, ontologies, hospital medical records, and semi-structured forms (e.g., questionnaires). An example is Medline [21], an exponentially growing biomedical bibliography that accounts for upwards of 16 million articles. In many cases, Medline has references to full-text articles that may be retrieved with the appropriate licenses. However, information related to those articles, like on-line repositories or supplementary text and tables, is harder to access. Medline's growth can be considered even more dramatic if we include the "deep Medline" trove of additional resources that are ready for mining.

Biomedical text miners may claim the superiority of text-mined data over other resources, especially over manually curated data. Text mining casts a wide net over the biomedical spectrum, allowing individual researchers to deal with Swanson's three arguments for library knowledge discovery (see Section 1.1). The resulting catch is larger than can typically be gathered manually. As of October 2007, the hand-curated Database of Interacting Proteins (DIP, [22]) held 56,186 interactions. Perhaps the most extensive effort in manual literature-derived extraction of interactions is BioGRID [23], with 70,000 interactions. In comparison, some text-mining interaction repositories hold over half a million interactions (see Section 1.3.2). The NCBI Gene Expression Omnibus repository of microarray expression datasets contains about half a billion data samples [24]. Hence, text-mined biomedical data has a place within the suite of tools available to biomedical informaticians and researchers. For the examples
given, this place lies somewhere between high-throughput methods and manually curated sets, each with its own niche applications. The challenge for biomedical text mining is to assert its usefulness both for acquiring information with quality that approaches (or surpasses) hand-curated data and for reaching the widest coverage for system-wide analysis (e.g., characterizing complex diseases [25]).

Some applications in biomedical text mining have mirrored those of text mining at large, like document classification, data integration, literature-based discovery, and literature analysis (e.g., scientific trends and emerging topics [26]). Others have been more specific to biomedicine, such as biomedical annotation, phenome/phenotype analysis, public health informatics (e.g., news analysis [27], hospital rankings [28]), clinical informatics, and nursing informatics. The most flourishing areas, however, may be loosely defined as those closely linked to systems biology [29, 30] and medical text mining. The former deals with such topics as biomedical interaction extraction, functional analysis, or genome annotation, among others (see a list of main tools and repositories in [31]). The latter deals with the range of narratives found in the textual supports associated with clinical settings, from the ICU bedside to the clinical trials desk. Biomedical text-mining articles are published mostly in journals and conferences in biomedical informatics and computational biology, and sometimes in non-informatics journals like Genome Biology.

1.3 Interactions from text

Systems biology has been a hotbed for developments in biomedical text mining, as mentioned in Section 1.2. One of the focuses has been on interactions between different types of molecules, especially proteins (i.e., PPI, protein-protein interactions). The success of PPI extraction can be seen in its application to integrative studies, the popularity of its tools, and its use as support for public databases like DIP [32],
MINT [33], and BIND [34]. The interactions, taken from a functional genomics point of view, range from physical interactions (e.g., protein binding) to indirect ones (e.g., proteins in the same pathway that do not physically interact) to other phenomena, such as co-expression. The interaction triplet has been an important text-mining interaction model since its inception. This triplet consists of the two elements that are involved in the interaction and the verb or action word that relates them. In the GeneWays ontology [35], the elements of the triplet are called the upstream term, downstream term, and action. Triplets usually are taken from single sentences. For example, in the sentence "Gene A activates gene B.", "gene A" is the upstream term, "gene B" is the downstream term, and "activate" is the action. This model was first introduced in a preliminary study by Sekimizu and colleagues [36], in which they sought verbs that could characterize gene-gene interactions. Rindflesch and colleagues [37] experimented with sentences that included the verb "bind".

An alternative model to triplets is used in co-occurrence studies. Stapley and Benoit [38], for example, used co-occurrence in a study of selected PubMed abstracts, followed later by the larger-scale project PubGene [39]. With this method, interactions are inferred from co-occurrence statistics of two terms in documents. If the terms co-occur in text more often than would be expected by chance, it is argued, there is a basis to suggest that they may be related. Co-occurrence is a statistical method used early in information retrieval. It has been used in different types of analyses but, due to its limitations, it has not become a method of general use in the text mining of interactions. Co-occurrence, for example, is of limited help in distinguishing interactions of very low frequency. Another drawback of this method is that the nature of the interaction is lost.
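The reasoning behind the co-occurrence model can be made concrete with a short sketch. The example below is a minimal illustration only, not the scoring used by Stapley and Benoit or PubGene; the document counts are invented, and pointwise mutual information is just one of several statistics that could serve the same purpose.

```python
import math

def cooccurrence_score(n_docs, n_a, n_b, n_ab):
    """Compare the observed number of documents mentioning both terms (n_ab)
    with the number expected if the two terms appeared independently."""
    expected = n_a * n_b / n_docs
    # Pointwise mutual information: positive values mean the pair co-occurs
    # more often than chance alone would predict.
    pmi = math.log2((n_ab / n_docs) / ((n_a / n_docs) * (n_b / n_docs)))
    return expected, pmi

# Invented counts for two terms in a corpus of 100,000 abstracts.
expected, pmi = cooccurrence_score(n_docs=100_000, n_a=420, n_b=310, n_ab=25)
print(f"expected by chance: {expected:.1f} documents, PMI: {pmi:.1f}")
# About 1.3 co-mentions would be expected by chance, so 25 observed
# co-mentions hint at a relation between the two terms.
```

As noted above, a high score of this kind still says nothing about what the relation is, and the statistic becomes unreliable for rarely mentioned terms.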
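For comparison, the triplet model described earlier in this section can also be sketched with a deliberately naive lexical-pattern matcher, in the spirit of the early rule-based systems discussed next. The vocabularies and the pattern are hypothetical and far simpler than those of any real system such as GeneWays.

```python
import re

# Hypothetical controlled vocabularies; real systems relied on large curated lists.
GENES = ["gene A", "gene B", "Raf", "MEK"]
ACTIONS = ["activates", "inhibits", "binds", "phosphorylates"]

# A naive "term - action - term" lexical pattern applied to a single sentence.
PATTERN = re.compile(
    r"(?P<upstream>" + "|".join(map(re.escape, GENES)) + r")\s+"
    r"(?P<action>" + "|".join(ACTIONS) + r")\s+"
    r"(?P<downstream>" + "|".join(map(re.escape, GENES)) + r")",
    re.IGNORECASE,
)

def extract_triplets(sentence):
    """Return (upstream term, action, downstream term) triplets found in a sentence."""
    return [
        (m.group("upstream"), m.group("action"), m.group("downstream"))
        for m in PATTERN.finditer(sentence)
    ]

print(extract_triplets("Gene A activates gene B."))
# -> [('Gene A', 'activates', 'gene B')]
```

A matcher this simple fails on exactly the problems described in the following paragraphs, which is why later systems turned to shallow and full syntactic parsing.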
Some of the problems that biomedical interaction extraction entails are common to biomedical texts at large, such as an extensive and open-ended vocabulary, erratic abbreviations, word sense ambiguity, and convoluted sentences. Others are more specific. For example, negative particles or words with negative meaning may completely change the meaning of an interaction, e.g., "We could not find any interaction between gene A and gene B" (for a study on a negative interactome, see [40]). Anaphora is another challenging problem that is rarely tackled (although see [41]). Anaphora refers to situations in which the name of an object is elided, generally because a pronoun is used to avoid repetition, e.g., "It activates gene B." The challenge is to identify the object to which "it" refers. More generally speaking, biomedical interaction extraction faces the hurdles of the different pre-processing steps plus the complexity of identifying the interactions themselves.

Blaschke and colleagues [42] proposed an early rule-based model for interaction extraction that tried to capture a simple lexical pattern in sentences: "protein A - action - protein B". The names of the proteins and the action were identified using controlled vocabularies. Thomas and colleagues [43] used syntactic instead of lexical patterns (an example of a syntactic pattern is noun phrase - verb - noun phrase) to find triplet candidates that were then narrowed down through a hand-crafted scoring system. The syntactic analysis performed was of the shallow type, which can be done more quickly than deep or full parsing and only identifies units at the syntagma level of the sentence (e.g., noun phrases and verb phrases). Proux and colleagues [44] built on the approaches used in [43] and [42], first applying syntactic parsing and then lexical patterns to find interactions. Similar approaches were explored by [45] and [46], although they did not report having fully implemented them. Blaschke and colleagues created a generalized pattern approach, calling these patterns "frames" [47, 48]. Frames are flexible patterns that may include additional information to enrich the analysis (e.g., the distance in words between the interaction terms of the sentence). Park and colleagues [49] and Yakushiji and colleagues [50] went further by including full syntactic parsing in their systems. Full parsing allows for categorization of all syntactic dependencies among the words of a sentence. The GENIES parser [51] was born within this context of incipient improvement and tests of new approaches.

1.3.1 GENIES

GENIES [51] evolved from MedLee [52], a medical natural-language processing application in use at the Clinical Information System (CIS) of New York Presbyterian Hospital.