Developing Koorete Part of Speech (POS) Tagger: an Empirical Evaluation of Neural Word Embedding and N-Gram Based Statistical Approaches
Date
2021-12-12
Publisher
Hawassa University
Abstract
The Koorete language is spoken by the Koore people in Amaro Kele Special Woreda and in
four Kebeles of Burji Special Woreda, in the Southern regional state. Koorete is written with
the Latin alphabet (called 'Diizo Beyta' in Koorete). The Latin alphabet was adapted to the
language by adding combinations of letters for its peculiar sounds, totaling 31 consonants
('Artaxita' in Koorete), 5 vowels ('Arxaxita' in Koorete), and one additional symbol.
Koorete sentence structure follows the order Subject ('Zeere utaade') + Object ('efaxe') +
Verb ('Hanta beyiisaxe').
This study develops a Koorete POS tagger through an empirical evaluation of neural word
embedding and N-gram based statistical POS tagging approaches. Part-of-speech (POS)
tagging is the process of assigning a part-of-speech label/tag from the Koorete POS tagset to
each word. Neural word embeddings are distributed representations of words as vectors,
applied here with a Bi-LSTM RNN model. The N-gram based statistical approach uses
frequency-derived probabilities of word tag sequences from the KPT corpus. Words with
similar meanings receive similar representations, which enables deep learning methods to
generalize; this similarity of representation also reduces the impact of out-of-vocabulary
words by replacing the sparse binary vector of dimension |V| with a low-dimensional dense
vector. In simple terms, word embedding is a language modeling technique that maps words
to vectors using the Word2Vec package, and these vectors are then fed into the RNN.
Word2Vec converts words to arrays of real numbers, and the original corpus word categories
are concatenated to the generated vectors. Word2Vec can capture the context of a word
(semantic and syntactic similarity) in a document in relation to other words.
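The |V|-dimension reduction described above can be illustrated with a minimal sketch. The tiny vocabulary and the embedding values below are made up for illustration only; in the study, Word2Vec learns the dense vectors from the KPT corpus:

```python
# Hypothetical toy vocabulary; a real Koorete vocabulary is far larger.
vocab = ["zeere", "efaxe", "hanta", "beyiisaxe"]
V = len(vocab)

def one_hot(word):
    """Sparse binary representation of dimension |V|: a single 1, the rest 0s."""
    vec = [0] * V
    vec[vocab.index(word)] = 1
    return vec

# A learned embedding instead maps each word to a small dense vector.
# These values are placeholders; Word2Vec would learn them from text.
embedding_dim = 3
embeddings = {w: [0.1 * i + 0.01 * j for j in range(embedding_dim)]
              for i, w in enumerate(vocab)}

print(one_hot("efaxe"))     # |V|-dimensional sparse binary vector
print(embeddings["efaxe"])  # low-dimensional dense vector
```

The dense vector's dimension stays fixed as the vocabulary grows, which is the reduction the paragraph above refers to.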
For sequence labeling and distributed representation, this study uses a Bi-LSTM RNN, which
achieves state-of-the-art POS tagging accuracy, alongside N-gram based statistical
approaches, in contrast to more classic approaches. The Bi-LSTM also incorporates
letter-case features to preserve the original casing information of each word. This study
applies the skip-gram algorithm to encode words into a limited vector space, because the
skip-gram model is an efficient method for learning high-quality vector representations of
words from large amounts of unstructured text data.
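As a sketch of how skip-gram training data is formed: the model learns from (target, context) pairs drawn from a sliding window over each sentence. The pair-extraction step can be written as follows (the example tokens are Koorete words taken from the abstract; the window size is an illustrative choice):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as used to train a skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # every neighbor within the window, excluding the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["zeere", "utaade", "efaxe", "hanta", "beyiisaxe"]
print(skipgram_pairs(sentence, window=1))
```

The skip-gram objective then trains the embedding so that a target word's vector predicts its context words, which is what makes words in similar contexts end up with similar vectors.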
Experiments were conducted with the Bi-LSTM RNN model and the N-gram statistical
tagger. The KPT corpus, comprising about 1,718 sentences (33,220 words), was split into
90% training data and 10% testing data. The Bi-LSTM RNN word embedding POS tagging
approach outperformed the N-gram statistical POS tagging approach, achieving an accuracy
of 98.53%.
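The evaluation setup can be sketched in miniature: a tagged corpus, a 90/10 split, and tagging accuracy measured on the held-out portion. The toy corpus, tags, and the simple unigram-frequency tagger below are illustrative stand-ins, not the actual KPT corpus or the study's N-gram tagger:

```python
from collections import Counter, defaultdict

# Hypothetical toy tagged corpus; the study's KPT corpus has 1,718 sentences.
tagged_sents = [
    [("zeere", "N"), ("efaxe", "N"), ("beyiisaxe", "V")],
    [("zeere", "N"), ("beyiisaxe", "V")],
] * 10

split = int(0.9 * len(tagged_sents))           # 90% train / 10% test
train, test = tagged_sents[:split], tagged_sents[split:]

# Frequency statistics: most common tag observed for each training word.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1
model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Score on held-out data, backing off to a default tag for unseen words.
correct = total = 0
for sent in test:
    for word, gold in sent:
        total += 1
        correct += (model.get(word, "N") == gold)
print(f"accuracy: {correct / total:.2%}")
```

Real N-gram taggers condition on the preceding tag(s) rather than on the word alone, and the Bi-LSTM replaces these count-based probabilities with learned representations; the split-and-score loop, however, is the same shape as the 90/10 evaluation reported above.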
Hence, this study addresses (1) the scarcity of resources for NLP applications in Koorete, (2)
the absence of a Koorete KPT corpus and tagset for NLP applications, and (3) the gap in
tagging accuracy relative to state-of-the-art POS tagging models for related languages.
