
LANGUAGE IN INDIA
Strength for Today and Bright Hope for Tomorrow
Volume 11 : 9 September 2011
ISSN 1930-2940

Managing Editor: M. S. Thirumalai, Ph.D.
Editors: B. Mallikarjun, Ph.D.
Sam Mohanlal, Ph.D.
B. A. Sharada, Ph.D.
A. R. Fatihi, Ph.D.
Lakhan Gusain, Ph.D.
Jennifer Marie Bayer, Ph.D.
S. M. Ravichandran, Ph.D.
G. Baskaran, Ph.D.
L. Ramamoorthy, Ph.D.

A Hybrid POS Tagger for Indian Languages

M. Mohamed Yoonus, M.Sc., M.Phil., PGDNLP and Samar Sinha, M.A., M.Phil.

Abstract

This paper describes the building of a Part-of-Speech (POS) tagger for 12 Indian languages using a hybrid approach, and presents the performance of the tagger for each language. Unlike most previous POS taggers for Indian languages, which are designed to annotate only a few languages, the present tagger, called 'POS Tagger', is an attempt to facilitate annotation of several Indian languages following a common computational approach. The POS Tagger is trained on a tagged corpus of 80K to 85K words for each language from the LDC-IL corpus. Finally, this paper highlights the performance of the tagger and the need for language-specific resources to obtain optimal results.

1 Introduction

The basic objective of Natural Language Processing (henceforth, NLP) is to facilitate human-machine interaction by means of natural human language. Research on NLP has focused on various intermediate tasks that make partial sense of language structure without requiring complete understanding and which, in turn, contribute to developing a successful system. Part-of-Speech (henceforth, POS) tagging is one such process, in which a grammatical category is assigned to each token in its context from a given set of tags called a POS tagset. It serves a wide range of applications such as speech synthesis and recognition, information extraction, partial parsing, machine translation, lexicography, Word Sense Disambiguation (WSD) and question answering.

Language in India www.languageinindia.com
11 : 9 September 2011
M. Mohamed Yoonus M.Sc., M.Phil., PGDNLP and Samar Sinha, M.A., M.Phil.
A Hybrid POS Tagger for Indian Languages 317
Various automatic POS taggers have been developed worldwide using linguistic rules, stochastic models and hybrid approaches, and each approach has its own merits and demerits. In this context, Indian languages present a further challenge for developing an automatic POS tagger, as the languages are highly inflectional and morphologically rich. Hence, we need to consider text processing prior to POS tagging in order to achieve high performance and greater reliability, and to incorporate most of the Indian languages into a single framework of POS annotation.

2 POS Tagger for Indian Languages: An Overview

In the last few decades of NLP initiatives on Indian languages and in India, different research groups working on various languages have developed POS taggers for a single language or for a group of Indian languages. This section gives a brief overview of these POS taggers, organised by language family.

The Assamese POS tagger [8] has been developed using HMM and provides an average tagging accuracy of 87%. A word-based hybrid model (Dandapat 2004) for Bengali POS tagging uses an HMM in which the probabilities of words are updated using both tagged and untagged corpora. In the case of the untagged corpus, the Expectation Maximization algorithm is used to update the probabilities. Another tagger for Bengali [10] follows an approach suitable for morphologically rich languages in a poor resource scenario. For Gujarati, a machine learning algorithm has been developed [11] following the CRF model, in which the features given to the CRF are chosen with respect to the linguistic aspects of Gujarati. Scutt and Brants (1998) have developed a POS tagger for Hindi based on the HMM. This tagger, however, fails to account for the language-specific features and context needed to address the partially free word order of Hindi. Another POS tagger, developed by Aniket Dalal et al. (2006), is based on the Maximum Entropy Model. In this POS tagger for Hindi, the main POS tagging features are word-based context, one-level suffix and dictionary-based features.

In addition to these taggers, a simple HMM-based POS tagger [12] for Hindi employs a naive (longest suffix matching) stemmer as a pre-processor and achieves a reasonably good accuracy of 91.57%. Unlike the other languages, Punjabi has an online POS tagger developed by AGLSoft [21], but it is not efficient for tagging large corpora. The TnT POS Tagger for Nepali [18] has an accuracy of 56% for unknown words and 97% for known words. In addition, Unitag by Andrew Hardie [19] is designed for POS tagging of Nepali text. Sajjad and Schmid [26] report that the existing Trigram and Tag (TnT) and PENN Treebank for Urdu have accuracies of 93.40% and 93.02%, respectively.

The Malayalam POS tagger [14] is designed to capture finer morphological analysis and, consequently, generates the most suitable POS tag using a statistical approach. It has an accuracy rate of 80% for the sequence generated automatically for the test case. An SVM-based POS tagger for Malayalam has also been developed [15]. Tamil Morpheme Components based POS Tagging [22] has an overall accuracy of 95.92%. Similarly, POS Tagging for Tamil using Linear Programming [24] provides an overall accuracy of 95.63%. Apart from these two approaches, a hybrid POS tagger using HMM and a rule-based system has also been developed for Tamil [23].
For morphologically rich Tibeto-Burman languages like Manipuri, a morphology-driven POS tagger [16] has been developed; however, its accuracy is limited by the lack of morphologically defined rules. On the other hand, the CRF-based and SVM-based POS taggers for Manipuri [17] provide promising results of 72.04% and 74.38%, respectively.

It is important to note that these POS taggers use different POS tagsets, which were developed to cater to the specific needs of individual projects. In other words, the basis of each POS tagset is different. Similarly, different computational approaches are applied to these POS tagsets, yielding different performance results. The lack of a common basis for the POS tagsets, together with the different computational approaches used in developing the POS taggers, makes the basis of comparison heterogeneous. Consequently, there is no level ground on which to assess the performance of a tagset or of a computational approach across Indian languages.

This paper primarily attempts to address this issue for Indian languages. We have designed a tagger for labelling POS, called POS Tagger, primarily for twelve Indian languages, following the POS tagset based on the ILPOSTS Framework. In this paper, we present the performance of the POS Tagger based on the hybrid approach.

3 Training Data Preparation

In preparing the training data, we have used the in-house POS AnnTool v0.3, a manual annotation tool, and the Simple Pattern Matching Tagger (SPMT) Tool. The former is used to annotate a 10K corpus, and the latter is used subsequently to annotate 70K to 75K of data through pattern matching and partial manual annotation. The size of the training corpus, therefore, is 80K to 85K for each of the twelve Indian languages. The stages are described in detail below.

3.1 Stage 1

With the help of the POS AnnTool v0.3, a minimum of two annotators in each language annotated randomly sampled text containing approximately 10K words. During annotation, the annotators were advised not to discuss issues with each other, so that mutual decisions would not influence the assignment of tags. Later, Inter Annotator Agreement, based on the disagreements in the assigned tags, was calculated to examine the variation in the tags assigned by the annotators [7]. The annotated 10K-word corpus is designated as the Gold Standard (GS) Corpus.

3.2 Stage 2

The SPMT Tool is trained on the 10K GS tokens and applied to an untagged corpus of 25K. We observed that approximately 30% of the tokens are tagged and the remaining tokens are left untagged. In this stage too, the untagged tokens are manually tagged and validated.

3.3 Stage 3
After Stage 2, the size of the training data increases to 35K; the tool is trained on this data and applied to an untagged corpus of 50K. The result shows that approximately 60% to 65% of the tokens are tagged and the remaining tokens are left untagged. Once again, the untagged tokens are manually annotated and validated.

In general, this process can be iterated until a training corpus of reasonable size for developing a POS tagger is obtained. The structure of the training data development process is shown in Fig 1.

Fig 1: Training Data Development Structure (components: GS Tagged Corpus, Normalization, Untagged Corpus, Training Module, Database, Pattern Matching Module, Validation, Tagged Corpus)

4 POS Tagging Issues

A POS annotation process encounters several issues, including normalization, ambiguity and unknown words. In our POS Tagger, we have incorporated modules to handle them. Some of these issues are discussed in detail below, with illustrations from Indian languages.

4.1 Normalization

The process of organising the data of a given corpus into tokens is called normalisation. In Indian languages, normalisation plays an important role since a wide variety of scripts and orthographic conventions and practices are followed, which differ from language to language as well as within categories in a language. The tagging algorithm, hence, needs to be designed to handle such cases optimally. For example, in Nepali, भा'थ्यो (bhA'thyO) or भा'-थ्यो is a contracted form of भएको थियो. The contracted form भा is a contraction of the participial भएको (bhaEkO), which is different from the dubitative particle भा [7].

In such cases, the apostrophe is not considered a delimiter during normalization in the languages concerned, so a token-internal single quote marker is retained as it appears in the text. To resolve the issue of normalization in these languages, the single quote marker is normalized only when the apostrophe occurs at a token boundary.

4.2 Ambiguity
In computational linguistics, ambiguity refers to a state where there is a choice of tags for a given token. Interestingly, ambiguity varies from language to language and also from corpus to corpus. Although most of the words in a language are unambiguous and can be tagged straightforwardly, there is also a sizeable number of words that are ambiguous. Consider the following sentences in Tamil and their corresponding POS tags.

1. இன்று\NC என்ன\DWH கிழமை\NC ?\PU
(in'Ru en'n'a kiZamai?)
'What day is today?'

2. அவர்\PPR இன்று\ALC வரலாம்\V
(avar in'Ru varalAm)
'He may come today.'

In (1) and (2), இன்று (in'Ru) is lexically ambiguous between NC (Noun Common) and ALC (Adverb Location), depending on its context. To resolve such ambiguity, the POS Tagger incorporates context-based rules for disambiguation.

4.3 Unknown Words

One of the issues that a POS tagger encounters frequently in tagging a new corpus concerns new tokens that do not exist in the training data. Such tokens are generally known as unknown words. In our POS Tagger, we have tried to resolve the issue by using context-driven rules to tag them.

5 POS Tagger: Procedure and Architecture

A POS tagger basically involves two tasks: the learning (or training) task and the tagging task. The former is further divided into base-level learning and context-rule learning. In developing the POS Tagger, we first fed the validated tagged data into the base-level training module. This module generates a database that provides statistics on the frequency and the ambiguity status of the input data (i.e. each token with its POS tag).

On the basis of the database generated by the base-level module, an n-gram table is created. The learning algorithm then generates context-rules for disambiguation from the n-gram tables. Later, these tables and rules are utilized by the POS Tagger to assign the appropriate tag.

In the tagging module, the input to the tagging algorithm is a token and the output is a POS-tagged token. While assigning the appropriate tag to a token, the tagging process follows these procedural steps:

1. Text normalisation and tokenisation.
2. Non-ambiguous tokens are assigned a POS tag through the pattern matching method.
3. Ambiguous tokens are assigned the most appropriate tag based on the context-rules for disambiguation.
4. A common list containing names of common and important persons, places, months, days, etc. is prepared for the Indian languages, each in its language-specific script. This list is provided to the system to tag the untagged tokens.
5. The remaining untagged words are assigned tags following bi-gram, tri-gram and penta-gram.
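The procedural steps above can be read as a cascade of look-ups. The following Python sketch is only an illustration under stated assumptions and is not the authors' C# implementation: the function name `tag_tokens`, the table layouts, the single context rule and the use of the previous tag as the only n-gram context are all hypothetical. The Tamil tokens and tags are taken from examples (1) and (2) in Section 4.2.

```python
import unicodedata


def is_punctuation(token):
    """True if every character belongs to a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def tag_tokens(tokens, unambiguous, ambiguous, context_rules, name_list, bigram):
    """Assign one tag per token using a cascade of look-ups (illustrative)."""
    tags = []
    for i, tok in enumerate(tokens):
        prev_tag = tags[i - 1] if i > 0 else "<S>"
        if is_punctuation(tok):
            tag = "PU"
        elif tok in unambiguous:
            tag = unambiguous[tok]          # step 2: pattern matching
        elif tok in ambiguous:
            # step 3: a context rule keyed on (previous tag, token);
            # without a matching rule, fall back to the first candidate
            tag = context_rules.get((prev_tag, tok), ambiguous[tok][0])
        elif tok in name_list:
            tag = "NP"                      # step 4: name list
        else:
            # step 5: most likely tag after the previous tag (bi-gram only)
            tag = bigram.get(prev_tag, "UNK")
        tags.append(tag)
    return list(zip(tokens, tags))


# Hypothetical tables built by hand for the two example sentences.
unambiguous = {"என்ன": "DWH", "கிழமை": "NC", "அவர்": "PPR"}
ambiguous = {"இன்று": ["NC", "ALC"]}
context_rules = {("PPR", "இன்று"): "ALC"}
bigram = {"ALC": "V"}

tagged_1 = tag_tokens(["இன்று", "என்ன", "கிழமை", "?"],
                      unambiguous, ambiguous, context_rules, set(), bigram)
tagged_2 = tag_tokens(["அவர்", "இன்று", "வரலாம்"],
                      unambiguous, ambiguous, context_rules, set(), bigram)
```

In this sketch the ambiguous token இன்று receives NC at the start of the first sentence and ALC after a pronoun in the second, and the unseen verb in the second sentence is tagged through the bi-gram fallback.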
In the initial step, the POS Tagger normalizes the untagged corpus, and this information is recorded in the metadata of the normalized file. This is done largely to avoid repeating the normalisation of an input file. In the following step, POS tags are assigned to the known words of the normalised file. Finally, the system assigns an appropriate tag to the unknown tokens. The procedure is illustrated in the flowchart in Fig 2.

Fig 2: Flowchart diagram for POS Tagger (boxes: Start; Read cleaned files; Normalisation and Tokenisation processes; While token[index] is not token[end]; If word is a punctuation mark, assign Punctuation tag; If word exists in the non-ambiguity table, assign the appropriate tag; If word is a numeral, date or foreign word, assign the appropriate tag; Pass the list of untagged words into the ambiguity look-up table, and then assign the appropriate tag; Search and assign tag from Name list and assign the most probable POS tag to the remaining list of words using the N-gram table; Increment index; Temp Hash table; Write Tagged Corpora to files; Stop)
Our POS Tagger is built upon a hybrid approach that combines the stochastic approach and the rule-based approach. The architecture of the POS Tagger consists of three layers: the Data Layer (DL), the Business Layer (BL) and the Presentation Layer (PL). The system architecture is shown schematically in Fig 3.

Fig 3: POS Tagger Architecture (Data Layer: Master, Ambiguity, Punctuation, N-grams, Rules, Name List; Business Layer; Presentation Layer)

The DL is prepared at the time of training. It encapsulates all the data-related information used by the tagging module. The BL contains the logic for retrieving persistent data from the DL and placing it into business objects. The PL provides a graphical user interface (GUI) environment.

6 Experiment

6.1 Set-up

In this section, we describe the POS annotation experiment carried out on twelve Indian languages, written in eight different scripts, using the POS Tagger, which is written in C# using Microsoft Visual Studio 2008.

The LDC-IL tagset is a hierarchical tagset based on the EAGLES Guidelines and is designed to tag the maximum number of morpho-syntactic features of the Indian languages. It contains Categories, Types and their Attributes. For this experiment, we have removed the Attribute level from the tagset in order to test the efficiency of the POS Tagger with respect to the Category and Type levels. This strategy serves the objective of the larger research project of which this experiment is a part.

The size of the tagset for each of the twelve languages is presented in Table 1, and the details are given in Appendix 1.
The LDC-IL team has carried out manual POS tagging with the help of the POS AnnTool v0.3. This manually tagged data is used as our training and test data. The training set consists of a tagged corpus of 70K to 75K, and the test set consists of the 10K GS corpus, for each of the 12 Indian languages.

TABLE 1
LDC-IL TAGSET SIZE
(LDC-IL Tagset - Category and Subcategory)

Language                 Tagset Size
Malayalam, Manipuri      37
Tamil                    38
Assamese                 39
Bengali                  40
Hindi, Punjabi, Urdu     43
Bodo, Oriya              44
Gujarati, Nepali         50

6.2 Result

Our experiment with the hybrid POS Tagger has two sets of data. The training set has approximately 70K to 75K and the test set contains approximately 10K. In this particular experiment, we merged the training and the test data, and the resulting data was divided equally into seven data sets. Six of the seven data sets were used for training the POS Tagger, and the seventh was used to test its performance. The same test was repeated so that each of the seven data sets was used once for performance testing. Finally, the average result was calculated from the results obtained on the seven data sets. We have used the standard Information Retrieval (IR) metrics of Precision, Recall and F-Score to evaluate the system.

The Precision, Recall and F-score evaluation results are shown in Table 2.

TABLE 2
EVALUATION OF THE SYSTEM FOR TWELVE ...
(rows S No 1 to 12; the table body is missing from the source)
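The seven-way evaluation described in Section 6.2 can be sketched as follows. This is a minimal illustration, not the authors' code. Several points are assumptions: the stand-in model (most frequent tag per token) replaces the hybrid tagger, the folds are formed by taking every seventh item, and Precision and Recall are defined over tokens, with Precision counted over tokens that received a tag and Recall over all tokens. The source does not state how the metrics were computed.

```python
from collections import Counter, defaultdict


def train(tagged):
    """Stand-in model: the most frequent tag of each token (not the hybrid tagger)."""
    counts = defaultdict(Counter)
    for tok, tag in tagged:
        counts[tok][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def evaluate(model, test):
    """Return (precision, recall, f_score) for one test fold."""
    assigned = correct = 0
    for tok, gold in test:
        pred = model.get(tok)           # None: the token is left untagged
        if pred is not None:
            assigned += 1
            correct += pred == gold
    p = correct / assigned if assigned else 0.0
    r = correct / len(test) if test else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def cross_validate(data, k=7):
    """Train on k-1 folds, test on the remaining one, and average over k runs."""
    folds = [data[i::k] for i in range(k)]      # k roughly equal parts
    scores = []
    for i in range(k):
        train_set = [pair for j, fold in enumerate(folds) if j != i
                     for pair in fold]
        scores.append(evaluate(train(train_set), folds[i]))
    return tuple(sum(s[m] for s in scores) / k for m in range(3))


# Toy data: two unambiguous tokens repeated so that every fold contains both.
toy = [("a", "X"), ("b", "Y")] * 7
avg_p, avg_r, avg_f = cross_validate(toy)
```

With a tagger that always assigns some tag to every token, Precision and Recall as defined here coincide with plain accuracy; they differ only when tokens are left untagged.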
6.3 Error analysis

Error analysis provides information about the nature of the errors that the system makes. In our experiment, to ascertain the nature of the errors in the POS tags that our hybrid POS Tagger assigns, we have used the confusion matrix method. Table 3 shows the result for the Punjabi data.

TABLE 3
ERROR ANALYSIS RESULT (Punjabi)

Actual:    VM, VM, VM, VM, VA, PP, CSB
Assigned:  V 0.24, VA 1.1, PP 0.48, VM 1.07, CCD 0.28, CCD 0.28

The error figures can be reduced if we can find mechanisms to handle the significant number of unknown words.

On the basis of the confusion matrix, it was found that most of the errors occur in the Noun, Verb and Adjective categories in the twelve Indian languages. In these languages, Common Noun and Proper Noun are often interchanged. Similar mis-assignment of tags is seen between Main Verb and Auxiliary Verb, and between Adjective and Noun.
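A confusion matrix of the kind used in this error analysis can be computed by counting (actual tag, assigned tag) pairs. The sketch below is illustrative only: the function name `confusion_errors` and the toy tag sequences are hypothetical, and expressing each error as a percentage of all tokens is an assumption, since the source does not state the unit of the figures in Table 3.

```python
from collections import Counter


def confusion_errors(gold_tags, assigned_tags):
    """Map each mismatching (actual, assigned) pair to its share of all tokens in percent."""
    if len(gold_tags) != len(assigned_tags):
        raise ValueError("tag sequences must have equal length")
    pairs = Counter(zip(gold_tags, assigned_tags))
    total = len(gold_tags)
    return {
        (actual, assigned): round(100.0 * n / total, 2)
        for (actual, assigned), n in pairs.most_common()
        if actual != assigned            # keep only the off-diagonal cells
    }


# Toy sequences (not taken from the paper's data).
gold = ["VM", "VM", "VA", "NC", "NP", "PP", "VM", "NC", "NC", "NC"]
assigned = ["VM", "VA", "VM", "NP", "NC", "PP", "VM", "NC", "NC", "NC"]
errors = confusion_errors(gold, assigned)
```

Sorting the resulting pairs by value would surface the most frequent confusions, such as Main Verb against Auxiliary Verb in the toy data.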
7 Conclusion

The above experiment on POS tagging of twelve Indian languages, based on the LDC-IL POS tagset v0.3 and using the hybrid POS Tagger, shows that the most frequent errors occur in the Noun, Verb and Adjective categories. On the computational front, it is also observed that unknown tokens increase the time taken to assign a tag, as the system passes through several modules to assign the most appropriate tag. However, its efficiency and accuracy increase as the size of the training data increases.

In our future work, we would like to introduce a finer algorithm that reduces the margin of error in POS category tags. However, developing such an algorithm is not an easy task, as the Indian languages have the property of free word order. In other words, the n-gram statistics based on individual tokens, as used in this experiment, may not be adequate to account for this property. These facts point to the need for an algorithm that accounts for this property of Indian languages with respect to the above-mentioned categories. Similarly, to increase computational speed, developing a Named Entity Recognizer (NER) and a module to identify the category of an unknown token from its morphological/morphosyntactic cues is at the forefront of our endeavour to develop a generic toolkit for all Indian languages.

Acknowledgement

We would like to thank Dr. L. Ramamoorthy, Dr. B. Mallikarjun and Er. M. Venkatesan for their valuable advice and support. We would also like to thank A. Vadivel and the LDC-IL team for their valuable technical and academic suggestions.

References

[1] Eric Brill, "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging", Computational Linguistics, vol. 21, pp. 543-565, 1995.
[2] D. Jurafsky and J.H. Martin, "Chapter 8: Word classes and Part-Of-Speech Tagging", Speech and Language Processing, Prentice Hall, 2000.
[3] http://www.au-kbc.org/research areas/nlp/ projects/postagger.html
[4] Xunlei Rose Hu and Eric Atwell, "A Survey of Machine Learning Approaches to Analysis of Large Corpora", School of Computing, University of Leeds, U.K. LS2 9JT.
[5] A. Voutilainen, The Oxford Handbook of Computational Linguistics, ch. 11: Part-of-Speech Tagging, pp. 219-232, Oxford University Press, 2005.
[6] R. Garside and N. Smith, "A hybrid grammatical tagger: CLAWS4", in Corpus Annotation: Linguistic Information from Computer Text Corpora, pp. 102-121, Longman, 1997.
[7] Mallikarjun B, Mohamed Yoonus M, Samar Sinha and Vadivel A, "Indian Languages and Part-of-Speech Annotation", CIIL Publication No. 598, in press.
[8] Navanath Saharia, Dhrubajyoti Das, Utpal Sharma and Jugal Kalita, "Part of Speech Tagger for Assamese Text", Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33-36, Suntec, Singapore, 4 August 2009.
[9] Sandipan Dandapat, Sudeshna Sarkar and Anupam Basu, "A Hybrid Model for Part-of-Speech tagging and its application to Bengali", Transaction on Engineering, Computing and Technology VI, Dec 2004.
[10] Dandapat S., Sarkar S. and Basu A., "Automatic Part-of-Speech Tagging for Bengali: An approach for Morphologically Rich Languages in a Poor Resource Scenario", In Proceedings of the Association of Computational Linguistics (ACL 2007), Prague, Czech Republic, pp. 221-224, 2007.
[11] Chirag Patel and Karthik Gali, "Part-Of-Speech Tagging for Gujarati Using Conditional Random Fields", Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, 2008.
[12] Manish Shrivastava and Pushpak Bhattacharyya, "Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information without Extensive Linguistic Knowledge", International Conference on Natural language processing, 2008.
[13] Smriti Singh, Kuhoo Gupta, Manish Shrivastava and Pushpak Bhattacharyya, "Morphological richness offsets resource demand - experiences in constructing a pos tagger for hindi", In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 779-786, Sydney, Australia, July. Association for Computational Linguistics, 2006.
[14] Manju K and Soumya S, "Parts Of Speech Tagger for Malayalam", A project report on Master of Technology in Computer and Information Science, Cochin University of Science & Technology, Kochi, May 2009.
[15] Antony P.J, Santhanu P. Mohan and Soman K.P, "SVM Based Part of Speech Tagger for Malayalam", itc, pp. 339-341, International Conference on Recent Trends in Information, Telecommunication and Computing, 2010.
[16] Thoudam Doren Singh and Sivaji Bandyopadhyay, "Morphology Driven Manipuri POS Tagger", In proceeding of IJCNLP NLPLPL 2008, IIIT Hyderabad, pp. 91-97, 2008.
[17] Thoudam, Asif and Bandyopadhyay, "Manipuri POS Tagging using CRF and SVM: A Language Independent Approach", International Conference on Natural language processing, 2008.
[18] Bal Krishna Bal and Madan Puraskar Pustakalaya, "Parts of Speech Tagger for Nepali", Regional Conference on Localized ICT Development and Dissemination across Asia, PAN 7 Localization Project, 12th-16th January, 2009, Novotel Hotel, Vientiane, Laos.
[19] http://www.ling.lancs.ac.uk/staff/hardie
[20] Linda Van Guilder, "Automated Part of Speech Tagging: A Brief Overview", Handout for LING361, Georgetown University, Fall 1995.
[21] http://punjabi.aglsoft.com/?show tagger
[22] S. Lakshmana Pandian and T.V. Geetha, "Morpheme Components based Part of Speech tagging", International Conference on Natural language processing, 2008.
[23] Arulmozhi P and Sobha L, "A hybrid POS Tagger for a Relatively Free Word Order Language", In proceedings of MSPIL-2006, Indian Institute of Technology, Bombay, pp. 79-85, 2006.
[24] Dhanalahsmi V, Anand Kumar, Shivapratap G, Soman KP and Rajendran S, "Tamil POS Tagging using Linear Programming", International Journal of Recent Trends in Engineering, Vol. 1, No. 2, May 2009.
[25] Ahmed Muaz, Aasim Ali and Sarmad Hussain, "Analysis and Development of Urdu POS Tagged Corpus", Proceedings of the 7th Workshop on Asian Language Resources, ACL-IJCNLP 2009, pages 24-31, Suntec, Singapore, 6-7 August 2009.
[26] Sajjad H and Schmid H, "Tagging Urdu Text with Parts Of Speech: A Tagger Comparison", 12th Conference of the European chapter of the association for computational Linguistics, 2009.
[27] Jesse Liberty and Donald Xie, "Programming C# 3.0", O'Reilly, 5th edition, 2007.

APPENDIX (LDC-IL Tagset without ...

Category (tag): Types (tags)

Noun (N): Common (NC), Proper (NP), Verbal (NV), Spatio-temporal (NST)
Pronoun (P): Pronominal (PPR), Reflexive (PRF), Reciprocal (PRC), Relative (PRL), Wh-pronoun (PWH)
Demonstrative (D): Absolutive (DAB), ... (DRL), ... (DWH)
Nominal Modifier (J): Adjective (JJ), Quantifier (JQ), Intensifier (JINT)
Verb (V): Main Verb (VM), Auxiliary Verb (VA)
Adverb (A): Manner (AMN), Location (ALC)
Postposition (PP): Case (PPC), Non-Case (PPNC)
Participle (L): Relative (LRL), Verbal (LV), Nominal (LN), Future (LFU), Perfective (LPFV), Conditional (LC), Perfect (LPRF), Present (LPR), Past (LPS), Imperfective (LIPFV), Infinite (LNF)
Particle (C): Emphatic (CEMP), Topic (CTOP), Delimiting (CDLIM), Honorific (CHON), Co-ordinating (CCD), Negative (CNEG), Exclusive (CEXCL), Terminative (CTERM), Dubitative (CDUB), Subordinating (CSB), Similative (CSIM), Interjection (CIN), Inclusive (CINCL), (Dis)Agreement (CAGR), Comparative (CCOM), Classifier (CCON), Cumulative (CCUM), Evidential (CEVID), Clusive (CCLU), Partitive ..., Others (CX)
Numeral (NUM): Real (NUMR), Serial (NUMS), Calendric (NUMC), Ordinal (NUMO)
Reduplication (RDP)
Residual (RD): Foreign Word (RDF), Symbol (RDS)
Unknown (UNK)
Punctuation (PU)

M. Mohamed Yoonus, M.Sc., M.Phil., PGDNLP
Lecturer cum Resource Person
LDC-IL Project, Central Institute of Indian Languages
Mysore 570006
Karnataka, India
[email protected]

Samar Sinha, M.A., M.Phil.
Senior Lecturer cum Junior Research Officer
LDC-IL Project
Central Institute of Indian Languages
Mysore 570006
Karnataka, India
[email protected]