Transcription

Design and Development of Part-of-Speech-Tagging Resources for Wolof
Cheikh M. Bamba Dione, Jonas Kuhn, Sina Zarrieß
Department of Linguistics, University of Potsdam (Germany)
Institute for Natural Language Processing (IMS), University of Stuttgart (Germany)

1. Introduction: Wolof, a Low Resource Language
2. Starting from Scratch: Tagset Design
3. Fast Gold Standard Annotation
4. Experiments with State-of-the-art PoS Taggers

Wolof
- Spoken in Senegal
- Lingua franca for 80% of Senegal's population (9 million speakers)
- 4 million native speakers
- West-Atlantic language

Wolof Language
- Complex system of inflectional markers/pronouns (almost no verbal inflection)
- Very productive derivation morphology
Ex. Object vs. Subject focus
(1) Maa lekk mburu.
    FOC-Subj.1SG eat bread.
    It was me who ate bread.
(2) Mburu laa lekk.
    Bread FOC-Obj.1SG eat.
    It was bread that I ate.
Ex. Applicative
(3) Togg-al naa xale bi ceeb.
    Cook-APPL 1SG child DET rice.
    I cooked rice for the child.

Wolof Resources
- No NLP tools or resources available for Wolof!
- Linguistically quite well documented (some descriptive grammars, recent work on specific aspects of the grammar)
- Some online resources
  - Wolof Wikipedia: 1065 articles (Problem: inconsistent orthography)
- We used the Wolof Bible
  - Consistent orthography
  - Available as a parallel corpus (e.g. English, French, Arabic ...)

Motivation
Low resource languages are ...
- investigated in theoretical linguistics, but annotated corpora are missing
  - University of Potsdam: research programme on information structure
  - NLP resources support corpus-based, cross-lingual investigations of information structure
- a test-bed for NLP techniques existing for well-resourced languages
  - often simulated by using small sets from well-resourced languages (e.g. in research on bootstrapping, unsupervised learning techniques, ...)

Starting from Scratch: Tagset Design
- No established Part-of-Speech inventory for Wolof (not even on the level of coarse-grained lexical categories)
  - Debate about adjectives in Wolof
  - Inconsistent glosses/categorisations in the theoretical literature
  - Inconsistencies for verb categories
- What is the appropriate level of tagset granularity?
  - Should the tagset capture e.g. nominal classes?

Tagset Design: General Strategy
- General desiderata for a tagset:
  - Capture interesting linguistic categories
  - Be predictable/learnable for automatic taggers
- EAGLES guidelines, Leech and Wilson [1996]
- Interleaving tagset design and annotation experiments
- Distinguishing various granularity levels

Establishing Tagset Granularity
- Started out with a fairly detailed tagset (200 tags)
- Experiments with tagset reductions
- Final “standard tagset” includes theoretically interesting distinctions that can be reasonably made by automatic PoS taggers
Granularity levels (article tags as examples; the row labels survive only as the fragments “Definite ...lass/sent. focus” and “SG/w-class/sent. focus”, and the example tags of the Detailed level are missing):
  Level     Size      Example tags
  Detailed  200 tags  ...
  Standard  80 tags   ARTD   ARTD   ARTF   ARTF
  Medium    44 tags   ATDs   ATDp   ATDSF  ATDSF
  General   14 tags   AT     AT     AT     AT
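A tagset reduction of this kind can be pictured as a tag-to-tag mapping applied to an annotated corpus. The following is a minimal sketch, assuming only the Medium-to-General correspondences visible on this slide (ATDs, ATDp and ATDSF all collapse to AT). The name `reduce_tags`, the placeholder tokens, and the choice to leave unmapped tags unchanged are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical Medium -> General mapping, restricted to the article tags
# shown on the slide; a real mapping would cover the whole tagset.
MEDIUM_TO_GENERAL = {"ATDs": "AT", "ATDp": "AT", "ATDSF": "AT"}


def reduce_tags(tagged_tokens, mapping):
    """Map each tag to its coarser counterpart; unmapped tags stay as they are."""
    return [(word, mapping.get(tag, tag)) for word, tag in tagged_tokens]


# Placeholder tokens ("tok1", "tok2") stand in for real Wolof articles.
example = [("tok1", "ATDs"), ("tok2", "ATDp"), ("kanaara", "NC")]
print(reduce_tags(example, MEDIUM_TO_GENERAL))
```

Keeping the reduction as a separate mapping means the same gold standard can be re-tagged at each granularity level and the taggers re-evaluated without re-annotating anything.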

Interleaving Tagset Design and Annotation: PoS categories for Wolof verbs
Problem:
- Theoretical work on Wolof establishes 3 verb finiteness categories: VVFIN, VVINF, VVNFN (Zribi-Hertz and Diagne [2002])
- Automatic PoS taggers do not learn the distinction
[Table: Ten most frequent errors on tagset with 3 verb finiteness categories. Columns: (incorr.) system tag, wrt. gold tag, error ratio. The cells are lost in the transcription except the fragments “.23%” and “0.23%”.]

Interleaving Tagset Design and Annotation: PoS categories for Wolof verbs
Solution:
- One tag for overtly non-inflected verbs (VV)
- Several fine-grained tags for token-internally inflected verbs (e.g. VN for negated verbs)
[Table: Ten most frequent errors made on tagset with 1 verb category. Columns: (incorr.) system tag, wrt. gold tag, error ratio. The cells are lost in the transcription except the tag run “CVVPERSNCATNCVVAP” and the fragments “.23%” and “0.15%”.]

Capturing Linguistically Interesting Categories: PoS categories for focus markers
- Standard tagset captures different focus types
- It should allow for corpus-based investigations of information structure
- Evaluate focus identification based on automatic tagging
Quality of automatic POS-based focus identification on 100 sentences (precision values are lost in the transcription):
  Focus type      Precision  Abs. freq. in test set  Abs. freq. in corpus (predicted)
  Subject (ISuF)  ...        39                      1119
  Verb (IVF)      ...        11                      759
  Object (ICF)    ...        11                      910
  Sentence (ISF)  ...        16                      635
  Total                                              3423 focus instances
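Precision for tag-based focus identification can be computed per focus tag as the share of predicted instances that agree with the gold annotation. The sketch below assumes a token-level comparison of gold and automatically assigned tags; the slide does not say how the evaluation was actually carried out, and the function name `focus_precision` and the toy data are hypothetical.

```python
from collections import Counter

# Focus tags named on the slide.
FOCUS_TAGS = {"ISuF", "IVF", "ICF", "ISF"}


def focus_precision(gold_tags, predicted_tags):
    """Per-tag precision: correct predictions / all predictions of that tag."""
    predicted = Counter()
    correct = Counter()
    for gold, pred in zip(gold_tags, predicted_tags):
        if pred in FOCUS_TAGS:
            predicted[pred] += 1
            if gold == pred:
                correct[pred] += 1
    # Tags that were never predicted are omitted (precision is undefined).
    return {tag: correct[tag] / predicted[tag] for tag in predicted}


# Toy data, invented for illustration only.
gold = ["ISuF", "VV", "ICF", "ICF"]
pred = ["ISuF", "VV", "ICF", "ISF"]
print(focus_precision(gold, pred))
```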

Creating Gold Standard Data
- Annotated data: ca. 27,000 tokens from the New Testament
- Annotation effort: 1 month for 1 person
- Automatic pre-annotation reduced the effort (by more than 50%)
- Implementation includes:
  - Tokeniser and sentence splitter (based on the GATE environment)
  - Heuristics for stemming and ...

Automatic Pre-Annotation
- Generation of a full form lexicon based on ...
  - closed-class lexemes (1700 entries)
  - suffix-guessing for open-class lexemes (25000 entries)
- Pre-annotated each token with all options found in the full form lexicon
Suffix guessing on entire corpus
(4) ... gis-leen !
    ... look!
    - “-leen” is an imperative suffix
    - indicates a verbal category
    - add “gis” as a verb to the lexicon
Pre-annotation
(5) man de ab kanaara la fi gis.
    “I can only see a turkey here.”
(6) man      PERS DWQ
    de       IJ
    ab       ARTI
    kanaara  NC
    la       PRO ICF ARTD
    fi       AV
    gis      VVBP
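The pre-annotation idea on this slide can be sketched in a few lines: derive open-class entries from telltale suffixes, merge them with a closed-class word list, and attach every lexicon option to each token. This is a simplified illustration, not the authors' tool. The closed-class entries are copied from example (6) on the assumption that they would appear in such a list, the tag `VV` for stems found via “-leen” is an assumption, and all function names are hypothetical.

```python
# Assumed closed-class entries, taken from example (6) on the slide.
CLOSED_CLASS = {
    "man": {"PERS", "DWQ"},
    "de": {"IJ"},
    "ab": {"ARTI"},
    "la": {"PRO", "ICF", "ARTD"},
    "fi": {"AV"},
}

# Suffix -> tag assigned to the stem. Only "-leen" comes from the slide;
# mapping it to the tag "VV" is an assumption made for this sketch.
SUFFIX_RULES = {"leen": "VV"}


def guess_open_class(corpus_tokens, suffix_rules):
    """Collect stems whose suffix signals a category, e.g. gis-leen -> gis."""
    lexicon = {}
    for token in corpus_tokens:
        for suffix, tag in suffix_rules.items():
            if token.endswith(suffix):
                stem = token[: -len(suffix)].rstrip("-")
                if stem:  # skip tokens that consist of the suffix alone
                    lexicon.setdefault(stem, set()).add(tag)
    return lexicon


def build_full_form_lexicon(closed_class, guessed):
    """Merge the closed-class list with the guessed open-class entries."""
    lexicon = {form: set(tags) for form, tags in closed_class.items()}
    for form, tags in guessed.items():
        lexicon.setdefault(form, set()).update(tags)
    return lexicon


def pre_annotate(tokens, lexicon):
    """Attach all lexicon options to each token; unknown tokens get no options."""
    return [(tok, sorted(lexicon.get(tok, set()))) for tok in tokens]


guessed = guess_open_class(["gis-leen"], SUFFIX_RULES)
lexicon = build_full_form_lexicon(CLOSED_CLASS, guessed)
for token, options in pre_annotate("man de ab kanaara la fi gis".split(), lexicon):
    print(token, options)
```

In this sketch “kanaara” receives no options because no rule or list entry covers it. The annotator then only has to pick among the proposed tags or fill in the missing ones, which is where the reported saving in effort would come from.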

Comparing State-of-the-art PoS Taggers
Can our gold standard data be used for training reliable automatic taggers?
1. TnT tagger: Brants [2000]
   - trigram Hidden Markov model
   - 96.7% accuracy on NEGRA
2. TreeTagger: Schmid [1994]
   - decision tree model
   - 96.06% on NEGRA
3. SVMTool: Giménez and Màrquez [2004]
   - support vector machine classifier (very rich, lexical feature model)
   - 97.1% on the Wall Street Journal

Comparing State-of-the-art PoS Taggers
- Results from ten-fold cross-validation
  - 26,846 training tokens
  - 2650 test tokens
  - average number of ambiguities: 5.173 per word (on fine-grained tagset)
[Table: results by tagset size for Baseline, TnT, TreeTagger and SVM. The cells are lost in the transcription except the values “...4.2%”, “94.8%”, “93.6%”, “94.5%” and “95.3%”, whose assignment to rows and columns cannot be recovered.]
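The ten-fold cross-validation set-up can be sketched as follows. Each fold holds out one tenth of the sentences for testing and trains on the rest, and the ten accuracies are averaged. The tagger used here is a simple most-frequent-tag model that stands in for TnT, TreeTagger or SVMTool; it is an assumption made for illustration and is not necessarily the baseline reported on the slide. The input is expected to contain at least `k` sentences.

```python
from collections import Counter, defaultdict


def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs; fold i tests on every k-th sentence from i."""
    for i in range(k):
        test = sentences[i::k]
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def train_most_frequent(train):
    """Stand-in tagger: remember the most frequent tag of each word form."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in train:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    default = all_tags.most_common(1)[0][0]  # fallback for unknown words
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return model, default


def accuracy(model, default, test):
    """Token-level accuracy of the stand-in tagger on held-out sentences."""
    pairs = [(model.get(w, default), t) for s in test for w, t in s]
    return sum(p == t for p, t in pairs) / len(pairs)


def cross_validate(sentences, k=10):
    """Average accuracy over k folds."""
    scores = []
    for train, test in k_fold_splits(sentences, k):
        model, default = train_most_frequent(train)
        scores.append(accuracy(model, default, test))
    return sum(scores) / len(scores)


# Toy corpus, invented for illustration: 20 identical two-token sentences.
toy = [[("man", "PERS"), ("gis", "VV")] for _ in range(20)]
print(cross_validate(toy, k=10))
```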

Comparing State-of-the-art PoS Taggers
- Results are comparable to the state of the art (given the size of the training data)
- Standard tagset seems to be appropriate for automatic tagging
- Even the fine-grained tagset allows for quite accurate automatic analysis
- Open question: do these results scale to other text types?

Conclusion
Issues:
- How to deal with under-studied, theoretically controversial phenomena?
- How to satisfy theoretical and computational requirements on tagset design?
- How to establish appropriate granularity of the tagset?
Experience:
- Even simple word lists are very useful for fast pre-annotation
- Interleaving tagset design and annotation experiments
- Automatic testing on different granularity levels

Experiments with Crosslingual Projection: Towards Systematic Bootstrapping
- There is a lot of NLP research on bootstrapping resources for low resource languages (mostly “simulated”)
- Classic: annotation projection paradigm, Yarowsky and Ngai [2001]
  - Is it useful in a realistic scenario?
  - English-French projection ...

Crosslingual Projection Experiments
Added information from parallel corpus?
- Data seems very noisy for direct PoS projection
- English tagset cannot be directly adopted for Wolof, some manual annotation is required anyway
- “Light projection” scenario: use parallel PoS information as additional features in the training process
[Figure: Wolof-English parallel example with PoS tags on both sides; the example sentences are lost in the transcription.]
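One way to picture the “light projection” scenario is as feature construction: each Wolof token keeps its own features and additionally receives the PoS tag of the English word(s) it is aligned to. The sketch below is an assumption about how such features could be built. The slides do not describe the actual feature encoding, and the function name, the hand-made alignment and the English tags in the example are illustrative only.

```python
def parallel_features(wolof_tokens, english_tags, alignment):
    """Build one feature dict per Wolof token.

    alignment is an iterable of (wolof_index, english_index) pairs, e.g. as
    derived from word alignments; unaligned tokens get the value "NONE".
    """
    aligned = {}
    for w_idx, e_idx in alignment:
        aligned.setdefault(w_idx, []).append(english_tags[e_idx])
    features = []
    for i, token in enumerate(wolof_tokens):
        tags = aligned.get(i)
        features.append({
            "word": token,
            "en_pos": "+".join(sorted(tags)) if tags else "NONE",
        })
    return features


# Illustrative input: Wolof tokens of example (2), English "I ate bread"
# with assumed tags, and a hand-made alignment (not GIZA output).
wolof = ["Mburu", "laa", "lekk"]
english_tags = ["PP", "VVD", "NN"]
alignment = [(0, 2), (1, 0), (2, 1)]
for feats in parallel_features(wolof, english_tags, alignment):
    print(feats)
```

Because the English tag is only one feature among others, the Wolof tagset stays independent of the English one, which matches the slide's point that the English tagset cannot be adopted directly.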

Comparing Taggers with and without Parallel Information
- Results from HMM-Tagging, ten-fold cross-validation
- Parallel info based on GIZA word alignments
- English and French PoS annotation produced with TreeTagger
  Training data size (tokens)  No parallel information  Information from English  Information from English and French
  418                          59.7%                    62.6%                     63.6%
  1249                         68.3%                    70.2%                     70.6%
  4968                         82.7%                    84.0%                     84.1%
- Improvement only significant on smallest training set
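The size of the improvement at each training size can be read off the reported accuracies by simple subtraction. The short sketch below does this arithmetic on the figures from this slide. It computes absolute gains in percentage points only and does not reproduce the significance test, which the slides do not specify; the names `RESULTS` and `gains` are illustrative.

```python
# Accuracies from the slide: training size (tokens) ->
# (no parallel info, info from English, info from English and French).
RESULTS = {
    418: (59.7, 62.6, 63.6),
    1249: (68.3, 70.2, 70.6),
    4968: (82.7, 84.0, 84.1),
}


def gains(results):
    """Absolute gain over the no-parallel-information setting, in points."""
    return {
        size: (round(en - base, 1), round(en_fr - base, 1))
        for size, (base, en, en_fr) in results.items()
    }


for size, (gain_en, gain_en_fr) in gains(RESULTS).items():
    print(size, gain_en, gain_en_fr)
```

The gains shrink as the training set grows (from 2.9 and 3.9 points at 418 tokens to 1.3 and 1.4 points at 4968 tokens), which is in line with the slide's remark about the smallest training set.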

References
Thorsten Brants. TnT – a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA, 2000.
Jesús Giménez and Lluís Màrquez. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th LREC, 2004.
Geoffrey Leech and Andrew Wilson. EAGLES. Recommendations for the Morphosyntactic Annotation of Corpora. Technical report, Expert Advisory Group on Language Engineering Standards, 1996. EAGLES Document EAG-TCWG-MAC/R.
Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing, 1994.
David Yarowsky and Grace Ngai. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In NAACL ’01: Second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies 2001, pages 1–8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.
Anne Zribi-Hertz and Lamine Diagne. Clitic placement after syntax: Evidence from Wolof person and locative markers. Natural Language and Linguistic Theory, 20(4):823–884, 2002.
