APPLICATION OF HYBRID APPROACH FOR WOLAITA LANGUAGE PART OF SPEECH TAGGING
Date
2020-03-24
Publisher
Hawassa University
Abstract
The aim of this research is to develop a part-of-speech (PoS) tagger for the Wolaita language
using a hybrid approach. PoS tagging is a subtask of natural language processing (NLP) that is
vital for other NLP applications such as parsers, machine translators, speech recognizers and
search engines. It is the process of labeling each word with the PoS tag that describes how the
word is used in a sentence. Existing PoS tagging for the Wolaita language is not yet sufficient
to serve as a component of other NLP applications. In this thesis, a PoS tagger for the Wolaita
language is developed using a hybrid approach that combines a hidden Markov model (HMM) with
transformation-based learning (TBL). In general, an HMM needs a large amount of data to perform
well, whereas a TBL model learns rules based on the features of the language. The HMM tags words
by finding the optimal path for a given sequence of words; TBL is a rule-based model that tags
words according to rules, which it learns directly from the training corpus without expert
knowledge. The developed hybrid tagger uses the HMM tagger as the initial annotator and the
rule-based tagger as a corrector, based on fixed threshold values. Python and NLTK were used for
implementation and experiments. For training and testing the model, 14,358 untagged Wolaita
words were collected from three sources: the Bible, social media in the Wolaita language
(Wogetta FM 96.6), and the Wolaita language department. The corpus was annotated manually by
two language experts. For tagging, 26 PoS tags were identified based on the work of Berhanu H.,
the work of Wakasa (2008), and the help of language experts. Of the entire corpus, 90% was used
for training and the remaining 10% for testing. The performance of the three taggers was tested
in different experiments. The HMM, rule-based and hybrid taggers achieved performance of 88.14%,
92.96% and 94.82%, respectively. Overall, the hybrid approach performed best at assigning PoS
tags to Wolaita language sentences.
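
The pipeline described in the abstract (an HMM tagger as initial annotator, followed by learned
transformation rules as a corrector) can be illustrated with a minimal, standard-library-only
sketch. It is not the thesis implementation, which used NLTK. Every name here (train_hmm,
viterbi, learn_rules, hybrid_tag, START) is hypothetical, the rule template "change tag a to b
when the previous tag is c" is just one simple choice, and reading the "fixed threshold" as a
minimum rule score (min_score) is an assumption, since the abstract does not define it. The toy
corpus is English placeholder data, not Wolaita.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train_hmm(tagged_sents):
    """Collect transition and emission counts from (word, tag) sentences."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    tags = sorted(emit)
    vocab_size = len({w for s in tagged_sents for w, _ in s}) + 1  # +1 for unseen words
    return trans, emit, tags, vocab_size


def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log-probability of `key` under `counts`."""
    return math.log((counts[key] + 1) / (sum(counts.values()) + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence (the optimal path) for `words`."""
    trans, emit, tags, vocab_size = model
    if not words:
        return []
    best = {t: (log_prob(trans[START], t, len(tags))
                + log_prob(emit[t], words[0], vocab_size), [t]) for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + log_prob(trans[p], t, len(tags)), best[p][1]) for p in tags),
                key=lambda cand: cand[0])
            step[t] = (score + log_prob(emit[t], word, vocab_size), path + [t])
        best = step
    return max(best.values(), key=lambda cand: cand[0])[1]


def apply_rule(tags, rule):
    """Change tag a to b wherever the previous tag (before this pass) is c."""
    a, b, c = rule
    return [b if i > 0 and t == a and tags[i - 1] == c else t
            for i, t in enumerate(tags)]


def learn_rules(tagged_sents, model, min_score=1, max_rules=10):
    """Greedily learn correction rules from the HMM's errors on the training data."""
    gold = [[t for _, t in s] for s in tagged_sents]
    current = [viterbi([w for w, _ in s], model) for s in tagged_sents]
    rules = []
    while len(rules) < max_rules:
        scores = Counter()
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] != ref[i]:
                    scores[(cur[i], ref[i], cur[i - 1])] += 1  # errors the rule would fix
        for a, b, c in list(scores):
            scores[(a, b, c)] -= sum(  # correct tags the rule would break
                1 for cur, ref in zip(current, gold) for i in range(1, len(cur))
                if cur[i] == a == ref[i] and cur[i - 1] == c)
        if not scores:
            break
        best = max(scores, key=scores.get)
        if scores[best] < min_score:  # assumed meaning of the "threshold"
            break
        rules.append(best)
        current = [apply_rule(tags, best) for tags in current]
    return rules


def hybrid_tag(words, model, rules):
    """HMM output first, then each learned rule applied in order as a corrector."""
    tags = viterbi(words, model)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return list(zip(words, tags))


# Toy placeholder data (English, not Wolaita), only to show the call sequence.
toy_corpus = [
    [("the", "DET"), ("dog", "N"), ("runs", "V")],
    [("the", "DET"), ("run", "N"), ("ends", "V")],
    [("dogs", "N"), ("run", "V")],
    [("the", "DET"), ("dogs", "N"), ("run", "V")],
]
model = train_hmm(toy_corpus)
rules = learn_rules(toy_corpus, model)
print(hybrid_tag(["the", "dog", "run"], model, rules))
```

Because a rule is kept only when it fixes more training errors than it introduces, the corrected
output is never worse than the HMM output on the training data. Accuracy on held-out data, such
as the 10% test split, has to be measured separately.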
Keywords
NLP, HMM, TBL, NLTK and Hybrid
