Vol. 2(7) Oct. 2016, pp. 481-487

PTokenizer: POS Tagger Tokenizer

Saeed Rahmani, Seyyed Mostafa Fakhrahmad and Mohammad Hadi Sadredini
Department of Computer and IT Engineering, Shiraz University, Shiraz, Iran
*Corresponding Author's E-mail: [email protected]

Abstract. With the advent of new information sources and the expansion of text data, natural language processing (NLP) has become a key part of all systems dealing with human-written text, and part of speech (POS) tagging is an inseparable part of NLP tasks. As a result, enhancing the accuracy of POS tagging is of paramount importance. In this paper, applying a language model and statistical information, we introduce a new approach to tokenize sentences and prepare them to be labeled by POS taggers. An evaluation shows that the proposed method yields a precision of 98 percent for tokenizing, and applying it to Maximum Likelihood and TnT POS taggers improves the accuracy of Persian POS tagging.

Keywords: Tokenizer, Part of Speech Tagging, Probabilistic Model, Compound Tokens.

1. Introduction

Part of speech (POS) tagging is the process of annotating each word with its most appropriate syntactic role in a sentence [1]. POS tagging is one of the most important tasks in natural language processing (NLP) and is used in many fields such as speech recognition, text-to-speech, semantic processing, building parse trees, machine translation, and information retrieval systems. By annotating words with POS tags, language structure can be used to extract very useful features for NLP tasks. Basically, the POS tagging process consists of two stages: in the first stage, all possible tags for an input token are found, and in the second stage, the best tag is chosen from the set of possible tags, considering the context in which the word appears. POS taggers are classified into three main classes: rule-based, statistical, and hybrid.
Rule-based methods rely on a large set of grammar rules, while statistical methods build a probabilistic model from an annotated corpus and choose the POS tag with the highest probability for each token. The last class of POS taggers utilizes both grammar rules and probabilistic models to tag words with their appropriate POS tags. One well-studied statistical approach for tagging is the Hidden Markov Model, which chooses a tag sequence for each sentence so as to maximize a score function derived from the combination of tag occurrence probabilities and tag-sequence accuracy [1]. The TnT tagger is another statistical model [2]. This model uses Markov chains to estimate the probabilities of assigning tags to words based on their context. On the other hand, rule-based approaches use rules to resolve tag ambiguity. Rules can be determined manually or by learning methods. In transformation-based methods, an error-driven approach is used [3]. There are other machine learning methods used for tagging, such as memory-based learning, decision trees, transformation-based learning, maximum entropy, and other log-linear models [4, 5, 6]. Comparison of different approaches has shown that, in most cases, statistical approaches yield better results than finite-state, rule-based, or memory-based taggers [7]. However, among the statistical approaches, the Maximum Entropy framework stands out.

Article History:
JKBEI DOI: 649123/101163
Received Date: 13 May 2016
Accepted Date: 16 Sep 2016
Available Online: 03 Oct 2016

TABLE 1. Bijankhan corpus statistical information

Feature                                            Distinct Count    Total Frequency
Number of characters                               149               12,531,035
Number of tokens without considering their tags    56,624            2,756,192
Number of tokens                                   76,967            2,586,708
Number of tags used in the corpus                  40                2,586,708

The research conducted in [8] shows that the combination of a Markov model and a proper smoothing technique yields the best performance. However, the handling of unknown words must be inspected specifically, because these words do not occur in the training set, and there are no rules for them. So, in order to tag unknown words, it is necessary to make use of known or less ambiguous word sequences.

In all proposed POS tagging methods, text preprocessing and extracting correct tokens have a significant effect on final accuracy. For example, in machine translation, correct tokens are needed for translation in a parallel corpus [9].

In a POS tagging system, determining the correct tokens in a sentence depends on the language structure. In some languages, such as Persian, adjacent tokens in a sentence are tagged together, and this is problematic for POS tagger systems, which are affected by compound words, prefixes, and postfixes in the language. This problem becomes even more drastic when multiple written forms exist for some words. In this paper, a probabilistic model is created to determine compound words, and then tokens are annotated with POS tags. Finally, we evaluate the proposed method.

2. Corpus

To create the model and evaluate our method, we employ the Bijankhan corpus [10]. This corpus, created by Dr. Bijankhan at the University of Tehran, is manually tagged. Statistical information about the characters, tokens, and tags of the corpus is presented in Table 1. An inspection of the corpus shows that it has already been normalized; therefore, it is not necessary to normalize the text again.
The difference between the second and third rows of Table 1 suggests that some tokens have been used as part of more than one token (sub-token or part); in other words, they overlap. This corpus is tagged using a hierarchical set of tags. This set can be presented as a tree, and the leaves of this tree are more

TABLE 2. Compound token percentage in Bijankhan corpus

Token Type                     Frequency    Percentage (uniform)    Percentage (weighted)
One-part tokens                2,425,982    93.79                   88.01
Two-part tokens                153,168      5.92                    11.12
Three-part tokens              6,749        0.26                    0.73
More than three-part tokens    —            —                       —

Journal of Knowledge-Based Engineering and Innovation (JKBEI), Universal Scientific Organization, ISSN 2413-6794 (Online), ISSN 2518-0479 (Print)

specific than their parents. In this structure, there are many tags that are unique to different levels of the tree and cannot be used for creating the model. Therefore, in order to work with the annotated data, tagging is usually done after pruning the hierarchy at appropriate levels. Because of the large number and variety of tags, the hierarchical tags are usually pruned and 40 tags are chosen for analysis [10]. Figure 1 shows the distribution of these POS tags in the Bijankhan corpus. According to Figure 1, choosing the 6 most frequent tags covers 82% of all tokens in the corpus.

In the POS tagging process, detecting and tagging compound tokens is very important. In fact, compound tokens are adjacent tokens that are tagged together. As a result, the number of POS tags in the corpus is smaller than the total number of tokens. Thus, before tagging, compound tokens should be determined so that tagging can be done correctly. Table 2 presents the number of compound tokens.

In Table 2, in the uniform mode, a compound word is counted as only one token. In the weighted column, however, the number of parts of a token is used as a coefficient in the computations. The maximum precision a POS tagger can achieve with unigram tokens is therefore 93.8% in the uniform state and 88% in the weighted state.

We implemented a Maximum Likelihood POS tagger and evaluated the approach explained above, considering all tokens in the corpus as unigrams. The evaluation results show 89.8% accuracy in the uniform state and 84.3% accuracy in the weighted state.

Figure 1. Bijankhan corpus tag distribution

In this paper, our intention is to create a POS tagger tokenizer component for this specific kind of preprocessing. The main goal of this tokenizer is detecting bisection and trisection tokens.
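The unigram Maximum Likelihood baseline evaluated above simply assigns each token its most frequent tag in the training corpus. A minimal sketch, assuming the corpus is available as (token, tag) pairs; the function names, the fallback tag, and the toy data are illustrative, not the authors' implementation or actual Bijankhan content:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_corpus):
    """Count (token, tag) pairs and keep each token's most frequent tag."""
    counts = defaultdict(Counter)
    for token, tag_ in tagged_corpus:
        counts[token][tag_] += 1
    # Maximum Likelihood choice: argmax over the tags observed for each token.
    return {tok: tags.most_common(1)[0][0] for tok, tags in counts.items()}

def tag(model, tokens, unknown_tag="N"):
    # Unknown tokens fall back to a default tag (here an assumed common POS).
    return [(tok, model.get(tok, unknown_tag)) for tok in tokens]

# Toy illustration: "ketab" is seen twice as N and once as V, so ML picks N.
model = train_unigram_tagger(
    [("ketab", "N"), ("ketab", "N"), ("khub", "ADJ"), ("ketab", "V")])
tagged = tag(model, ["ketab", "khub", "nou"])
```

Because every token is treated as a one-part unigram, this baseline cannot recover compound tokens, which is exactly the gap the proposed tokenizer addresses.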
It should be mentioned that, based on Table 2, tokens with more than three sections compose a very slight percentage of the corpus; for this reason, they are neglected in order to keep the implementation efficient.

3. POS tokenizer

In POS tagging, multi-part tokens should be detected before the tagging operation. A probabilistic function is defined here to decide whether adjacent tokens form a unit of more than one part:

    f_score(k-token) = TogetherTaggedCount(k-token) / LanguageModelCount(k-token)    (1)
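Equation (1) can be computed directly from two corpus-derived count tables; a minimal sketch, where the two dictionaries are assumed to have been collected from the annotated corpus beforehand:

```python
def f_score(k_token, together_tagged_count, language_model_count):
    """Estimate how likely k adjacent tokens are to be one compound token.

    k_token: tuple of k adjacent surface tokens.
    together_tagged_count: dict mapping k-token -> times tagged as one unit.
    language_model_count: dict mapping k-token -> raw k-gram frequency,
        ignoring tags.
    """
    total = language_model_count.get(k_token, 0)
    if total == 0:
        return 0.0  # unseen k-gram: never treat it as a compound
    return together_tagged_count.get(k_token, 0) / total
```

For example, a bigram tagged together in 3 of its 4 corpus occurrences scores 0.75, which would clear the bisection threshold reported later in the paper.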

In this function, a k-token denotes k adjacent tokens; TogetherTaggedCount is the number of times those k tokens are tagged together in the corpus, and LanguageModelCount is their raw k-gram count, ignoring tags. The function estimates, for every k-token, the probability that it should be tagged as a single unit.

By determining a suitable threshold on f_score for each k, tokens that have more than one part can be recognized accurately at tagging time. Indeed, the tokenizer with the f_score function and tuned thresholds operates as a classifier whose purpose is to assign tokens to k classes. In this paper, using this function, we determine the multi-part tokens that are specific to the POS tagging system.

The value of k is selected as follows:
1. For the current token, a window of size k is considered, containing k consecutive tokens starting from the current one.
2. For the k tokens inside the window, the f_score value is computed.
3. If f_score is lower than the threshold for k, then k is decreased by one and the process returns to step 1; otherwise, the k tokens in the window are identified as a k-section token.
4. After selection of the k-section token, the current position moves k units forward to the first token after the previous window, and the process returns to step 1.

These steps continue until all tokens in the input sentence are processed. Pseudocode for these steps is shown in Figure 2.

To achieve high accuracy, the f_score thresholds should be chosen so as to maximize tokenizer precision, and a distinct threshold should be determined for each k. As mentioned before, the maximum value of k is 3; therefore, we have two thresholds, one for trisection tokens and one for bisection tokens.
One-section tokens do not require any threshold, because a token with only one section inevitably has one part. We evaluated the tokenizer with different thresholds to find optimal values; in the best case, the obtained thresholds for bisection and trisection tokens were 0.67 and 0.54 respectively. It should be noted that the thresholds chosen for different values of k affect each other. Figure 3 shows tokenizer accuracy for different values of the bisection and trisection thresholds.

    function get_postagged_tokens(String input) {
        current_token_index = 1;
        while (current_token_index <= get_length(input)) {
            // clamp the window so it never runs past the end of the input
            k = min(max_size_k, get_length(input) - current_token_index + 1);
            while (k > 1) {
                fscore = get_fscore(sub_string(input, current_token_index, k));
                if (fscore < fscore_threshold(k))
                    k--;       // back off to a smaller window
                else
                    break;     // accept the k-section token
            }
            result_token.add(sub_string(input, current_token_index, k));
            current_token_index = current_token_index + k;
        }
        return result_token;
    }

Figure 2. Pseudocode for selecting the value of k for k-section tokens
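The greedy back-off of Figure 2 can be made concrete as follows; the thresholds are the tuned values reported above, while the function name and the fscore callback interface are assumptions for illustration, not the paper's exact implementation:

```python
def tokenize(tokens, fscore, thresholds={3: 0.54, 2: 0.67}, max_k=3):
    """Greedily group adjacent tokens into k-section compound tokens.

    tokens: list of surface tokens in sentence order.
    fscore: callable mapping a tuple of adjacent tokens to a score in [0, 1].
    thresholds: per-k fscore thresholds (0.67/0.54 are the paper's values).
    """
    result, i = [], 0
    while i < len(tokens):
        k = min(max_k, len(tokens) - i)  # shrink the window near sentence end
        # Try the largest window first; back off while the score is too low.
        while k > 1 and fscore(tuple(tokens[i:i + k])) < thresholds[k]:
            k -= 1
        result.append(tuple(tokens[i:i + k]))
        i += k
    return result
```

With an fscore that returns 0.9 for the pair ("a", "b") and 0 otherwise, `tokenize(["a", "b", "c"], f)` groups the first two tokens and leaves the third alone.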

Figure 3. Tokenizer accuracy for different values of the bisection and trisection thresholds

In Figure 3, the x- and y-axes correspond to the bisection and trisection thresholds, and the z-axis to tokenizer accuracy in the uniform state. According to this figure, the best f_score thresholds for bisection and trisection tokens are 0.67 and 0.54 respectively. After determining the thresholds, the tokenizer was evaluated using 10-fold cross-validation. For a careful evaluation, the one-part, bisection, and trisection tokens were evaluated separately, and the precision of each was assessed. The evaluation results are shown in Table 3.

According to Table 3, the accuracy of the tokenizer is 98.84% in uniform mode and 98.01% in weighted mode. Tokenizer accuracy is 99.66% for one-part tokens, 87.75% for bisection tokens, and 55.98% for trisection tokens. With these results, the POS tagger tokenizer can be used to improve POS tagger performance.

4. Evaluation of the effects of the tokenizer on POS tagger performance

If the tokens of an input sentence are recognized correctly in a POS tagging system, high accuracy becomes achievable. In the POS tagging process, after determining the input tokens in the first stage, the possible tags for each token of the sentence are obtained, and finally the best tag among them is chosen. Many POS tagging methods have been introduced so far, and each method tries to use different features to select the tags with the highest probability of correctness.

In this paper, to determine the role of tokenizing in the accuracy of POS tagging, we use the Maximum Likelihood (ML) and Trigrams'n'Tags (TnT) methods. We implemented POS taggers using three types of tokenizers: the initial tokenizer, the ideal tokenizer, and the proposed tokenizer.
With the initial tokenizer, we assumed all tokens have only one part; with the ideal tokenizer, we assumed all multi-part tokens are detected before POS tagging. After tokenizing the corpus with each method, we used the ML and TnT models to determine POS tags for all tokens. TnT applies a second-order Markov model to part-of-speech tagging, choosing the tag sequence according to Eq. (2) [2]:

    argmax over t_1 .. t_T of [ product for i = 1 .. T of P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) ] P(t_{T+1} | t_T)    (2)

Finally, we evaluated the results to calculate the accuracy of the proposed method. In fact, the main purpose of our approach is to raise POS tagging precision toward the results of the ideal tokenizer.
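Equation (2) scores each candidate tag sequence by trigram transition probabilities and lexical probabilities; a sketch of that score in log space, with the probability tables passed in as callables (TnT itself estimates them from the corpus with smoothing; this is only an illustration of the formula, not of TnT):

```python
import math

def sequence_score(tags, words, p_trans, p_emit):
    """Log of the product in Eq. (2) for one candidate tag sequence.

    tags: [t_1 .. t_T]; words: [w_1 .. w_T]
    p_trans(t, t_prev1, t_prev2): P(t_i | t_{i-1}, t_{i-2})
    p_emit(w, t): P(w_i | t_i)
    Sentence boundaries are represented by None; the final factor
    P(t_{T+1} | t_T) models the end of the sentence.
    """
    padded = [None, None] + list(tags) + [None]  # two start pads, one end pad
    score = 0.0
    for i, w in enumerate(words):
        score += math.log(p_trans(padded[i + 2], padded[i + 1], padded[i]))
        score += math.log(p_emit(w, padded[i + 2]))
    score += math.log(p_trans(None, padded[-2], padded[-3]))  # end factor
    return score
```

A full tagger would maximize this score over all tag sequences, typically with Viterbi search rather than enumeration.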

TABLE 3. Tokenizer cross-validation results for 10 folds

Token                          Correct token frequency    Incorrect token frequency    Accuracy (%)
Single-section tokens          2,417,783                  8,199                        99.66
Bisection tokens               134,432                    18,766                       87.75
Trisection tokens              3,778                      2,971                        55.98
All tokens in uniform mode     2,555,993                  29,936                       98.84
All tokens in weighted mode    2,697,981                  54,644                       98.01

For the evaluation, we use the Bijankhan corpus and the 10-fold cross-validation method introduced in the previous sections. The accuracy of POS tagging is evaluated for the three types of tokenizers: initial, ideal, and proposed. The final results are shown in Table 4.

According to Table 4, the precision obtained with the proposed tokenizer is close to that of the ideal tokenizer. Also, in comparison with the initial tokenizer, the proposed tokenizer improves the ML POS tagger by 3.01 percent in uniform mode and 6.06 percent in weighted mode, and improves the TnT POS tagger by 3.37 percent in uniform mode and 4.54 percent in weighted mode.

Much research has been done on determining multi-part tokens in the Persian language [11, 12, 13, 14]. All of it has used heuristic methods based on prefixes to detect multi-part tokens with high accuracy. Our proposed method, however, assigns a probability to every multi-part token. Then, by tuning proper thresholds, it is possible to distinguish all of the multi-part tokens obtained by the heuristic methods, as well as many other tokens that the previous methods did not cover. The proposed tokenizer improves Maximum Likelihood accuracy and therefore narrows the gap between this method and the others.

TABLE 4. Tokenizer's effect on ML and TnT POS tagger accuracy

5. Conclusion

Part of speech (POS) tagging is one of the most important stages in text processing systems, and improving its accuracy will lead to improvements in a wide range of fields.
In this paper, we proposed a probabilistic method for tokenizing text and preparing it for POS tagging. The proposed method achieved 98% accuracy, and it improves the Maximum Likelihood POS tagger by 6.06% and the TnT POS tagger by 4.54% compared to previous methods. By applying the proposed tokenizer to the Maximum Likelihood POS tagger, we narrowed the gap between its accuracy and those of other methods, such as TnT, the Hidden Markov Model, or memory-based methods.

References

[1] Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[2] Mohammadpour, Mostafa, MohammadKazem Ghorbanian, and Saeed Mozaffari. "AdaBoost Performance Improvement Using PSO Algorithm." Eighth International Conference on Information and Knowledge Technology (IKT), 2016.
[3] Brants, Thorsten. "TnT: A Statistical Part-of-Speech Tagger." Proceedings of the Sixth Conference on Applied Natural Language Processing. Association for Computational Linguistics, 2000.
[4] Brill, Eric. "A Simple Rule-Based Part of Speech Tagger." Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992.
[5] Daelemans, Walter, et al. "MBT: A Memory-Based Part of Speech Tagger-Generator." arXiv preprint cmp-lg/9607012 (1996).
[6] Schmid, Helmut. "Probabilistic Part-of-Speech Tagging Using Decision Trees." Proceedings of the International Conference on New Methods in Language Processing. Vol. 12, 1994.
[7] Ratnaparkhi, Adwait. "A Maximum Entropy Model for Part-of-Speech Tagging." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Vol. 1, 1996.
[8] Halácsy, Péter, András Kornai, and Csaba Oravecz. "HunPos: An Open Source Trigram Tagger." Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2007.
[9] Zavrel, Jakub, and Walter Daelemans. "Evaluatie van part-of-speech taggers voor het Corpus Gesproken Nederlands." Rapport CGN: werkgroep corpusannotatie, Tilburg University (1999).
[10] Chung, Tagyoung, and Daniel Gildea. "Unsupervised Tokenization for Machine Translation." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2. Association for Computational Linguistics, 2009.
[11] Oroumchian, Farhad, et al. "Creating a Feasible Corpus for Persian POS Tagging."
Department of Electrical and Computer Engineering, University of Tehran (2006).
[12] Azimizadeh, Ali, Mohammad Mehdi Arab, and Saeid Rahati Quchani. "Persian Part of Speech Tagger Based on Hidden Markov Model." JADT: 9es Journées internationales d'Analyse statistique des Données Textuelles (2008).
[13] Amiri, Hadi, et al. "A Survey of Part of Speech Tagging in Persian." Database Research Group (2007).
[14] Sagot, Benoît, et al. "A New Morphological Lexicon and a POS Tagger for the Persian Language." International Conference in Iranian Linguistics, 2011.
[15] Hashemi, SMR., M. Hajighorbani, Mohammad Mahdi Deramgozin, and B. Minaei-Bidgoli. "Evaluation of the Algorithms of Face Identification." International Journal of Mechatronics, Electrical and Computer Technology, Vol. 6(19), Jan. 2016.
[16] Shamsfard, Mehrnoush, and Hakimeh Fadaei. "A Hybrid Morphology-Based POS Tagger for Persian." LREC, 2008.
[17] Hashemi, SMR., Mohammad Mahdi Deramgozin, M. Hajighorbani, and Ali Broumandnia. "Methods of Image Re-targeting Algorithm Using Markov Random Field." International Journal of Mechatronics, Electrical and Computer Technology, Vol. 6(20), Apr.-Jul. 2016.
[18] Hashemi, SMR., Mahdi Saadati, Mohsen Haji Ghorbani, and M. Madadpour Inallou. "The Comparison of Face Detection Methods in Angled Mode." Journal of Engineering and Applied Sciences, Vol. 11(4), 2016, pp. 915-919. DOI: 10.3923/jeasci.2016.915.919.
