Developing Koorete Part of Speech (POS) Tagger: an Empirical Evaluation of Neural Word Embedding and N-Gram Based Statistical Approaches
Date
2021-12-12
Publisher
Hawassa University
Abstract
The Koorete language is spoken by the Koore people in Amaro Kele Special Woreda and in
four Kebeles of Burji Special Woreda, in the Southern regional state. Koorete is written with
the Latin alphabet (called 'Diizo Beyta' in Koorete). The Latin alphabet was adapted to the
language by adding combinations of letters for its peculiar sounds, totaling 31 consonants
('Artaxita' in Koorete), 5 vowels ('Arxaxita' in Koorete), and one additional symbol.
Koorete sentence structure follows the order Subject ('Zeere utaade') + Object ('efaxe') +
Verb ('Hanta beyiisaxe').
This study develops a Koorete POS tagger through an empirical evaluation of neural word
embedding and N-gram based statistical POS tagging approaches. Part-of-speech (POS)
tagging is the process of assigning a part-of-speech label/tag from the Koorete POS tagset to
each word. Neural word embeddings are distributed representations of words as vectors,
applied here with a Bi-LSTM RNN model. The N-gram based statistical approach uses
frequency-derived probabilities of word tag sequences from the KPT corpus. Words with
similar meanings receive similar representations, which enables deep learning methods to
generalize; this similarity of representation also reduces the impact of out-of-vocabulary
words by replacing the sparse binary vector of dimension |V| with a low-dimensional dense
vector. In simple terms, word embedding is a language modeling technique that maps words
to vectors using the Word2Vec package, and these vectors are then fed into the RNN.
Word2Vec converts words to arrays of real numbers, and the original corpus word categories
are concatenated to the generated vectors. Word2Vec can capture the context of a word
(semantic and syntactic similarity) in a document in relation to other words.
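The |V|-dimension reduction described above can be illustrated with a minimal sketch. The tiny vocabulary and the embedding values below are made up for illustration only; in the study, Word2Vec learns the dense vectors from the KPT corpus:

```python
# Hypothetical toy vocabulary; a real Koorete vocabulary is far larger.
vocab = ["zeere", "efaxe", "hanta", "beyiisaxe"]
V = len(vocab)

def one_hot(word):
    """Sparse binary representation of dimension |V|: a single 1, the rest 0s."""
    vec = [0] * V
    vec[vocab.index(word)] = 1
    return vec

# A learned embedding instead maps each word to a small dense vector.
# These values are placeholders; Word2Vec would learn them from text.
embedding_dim = 3
embeddings = {w: [0.1 * i + 0.01 * j for j in range(embedding_dim)]
              for i, w in enumerate(vocab)}

print(one_hot("efaxe"))     # |V|-dimensional sparse binary vector
print(embeddings["efaxe"])  # low-dimensional dense vector
```

The dense vector's dimension stays fixed as the vocabulary grows, which is the reduction the paragraph above refers to.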
For sequence labeling and distributed representation, this study uses a Bi-LSTM RNN, which
achieves state-of-the-art POS tagging accuracy, alongside N-gram based statistical
approaches, in contrast to more classic approaches. The Bi-LSTM also incorporates
letter-case features to preserve the original casing information of each word. This study
applies the skip-gram algorithm to encode words into a limited vector space, because the
skip-gram model is an efficient method for learning high-quality vector representations of
words from large amounts of unstructured text data.
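As a sketch of how skip-gram training data is formed: the model learns from (target, context) pairs drawn from a sliding window over each sentence. The pair-extraction step can be written as follows (the example tokens are Koorete words taken from the abstract; the window size is an illustrative choice):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs as used to train a skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # every neighbor within the window, excluding the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["zeere", "utaade", "efaxe", "hanta", "beyiisaxe"]
print(skipgram_pairs(sentence, window=1))
```

The skip-gram objective then trains the embedding so that a target word's vector predicts its context words, which is what makes words in similar contexts end up with similar vectors.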
Experiments were conducted with the Bi-LSTM RNN model and the N-gram statistical
tagger. The KPT corpus, comprising about 1,718 sentences (33,220 words), was split into
90% training data and 10% testing data. The Bi-LSTM RNN word embedding POS tagging
approach outperformed the N-gram statistical POS tagging approach, achieving an accuracy
of 98.53%.
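The evaluation setup can be sketched in miniature: a tagged corpus, a 90/10 split, and tagging accuracy measured on the held-out portion. The toy corpus, tags, and the simple unigram-frequency tagger below are illustrative stand-ins, not the actual KPT corpus or the study's N-gram tagger:

```python
from collections import Counter, defaultdict

# Hypothetical toy tagged corpus; the study's KPT corpus has 1,718 sentences.
tagged_sents = [
    [("zeere", "N"), ("efaxe", "N"), ("beyiisaxe", "V")],
    [("zeere", "N"), ("beyiisaxe", "V")],
] * 10

split = int(0.9 * len(tagged_sents))           # 90% train / 10% test
train, test = tagged_sents[:split], tagged_sents[split:]

# Frequency statistics: most common tag observed for each training word.
counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        counts[word][tag] += 1
model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Score on held-out data, backing off to a default tag for unseen words.
correct = total = 0
for sent in test:
    for word, gold in sent:
        total += 1
        correct += (model.get(word, "N") == gold)
print(f"accuracy: {correct / total:.2%}")
```

Real N-gram taggers condition on the preceding tag(s) rather than on the word alone, and the Bi-LSTM replaces these count-based probabilities with learned representations; the split-and-score loop, however, is the same shape as the 90/10 evaluation reported above.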
Hence, this study addresses (1) the scarcity of resources for NLP applications in Koorete, (2)
the absence of a Koorete KPT corpus and tagset for NLP applications, and (3) the gap in
tagging accuracy relative to state-of-the-art POS tagging models for related languages.
