FOR SIDAMA LANGUAGE USING THE HIDDEN MARKOV MODEL WITH VITERBI ALGORITHM
Date
2022-04-07
Publisher
Hawassa University
Abstract
The part-of-speech (POS) tagger is an essential low-level tool in many natural language
processing (NLP) applications. POS tagging is the process of assigning to each word the
part-of-speech tag that describes how the word is used in a sentence. There are different
approaches to POS tagging; the most common are rule-based, stochastic, and
hybrid. In this paper, the stochastic approach, specifically the Hidden Markov
Model (HMM) with the Viterbi algorithm, was applied to develop a
part-of-speech tagger for Sidaama. The HMM POS tagger tags words according to the most
probable sequence of tags for the sentence.
For training and testing the model, 9,660 Sidaama sentences containing
130,847 tokens (words, punctuation, and symbols) were collected, and four experts in the
language carried out the POS annotation. Thirty-one (31) POS tags were used in the
annotation. The corpus was drawn from fables, news, reading passages, and some texts from
the Bible. 90% of the corpus was used for training and the remaining 10% for testing.
The POS tagger was implemented using the Python programming language (Python 3.7.0) and
the Natural Language Toolkit (NLTK 3.0.0). The performance of the Sidaama POS tagger was
tested and validated using ten-fold cross-validation. In the performance analysis
experiment, the model achieved an accuracy of 91.25% for the HMM model and 98.46% with the
Viterbi algorithm.
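
To make the approach named in the abstract concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the thesis implementation: it uses only the Python standard library rather than NLTK, add-one smoothing is an assumption made here to handle unseen words and transitions, and the three-sentence English corpus, the tag names, and the function names (`train_hmm`, `log_prob`, `viterbi`) are hypothetical placeholders standing in for the annotated Sidaama corpus and its 31 tags.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag


def train_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, vocab


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability (smoothing is an assumption here)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unknown words

    # best[i][t]: log probability of the best tag path ending in tag t at word i
    # back[i][t]: the previous tag on that best path
    best = [{}]
    back = [{}]
    for t in tags:
        best[0][t] = (log_prob(transitions[START], t, n_tags)
                      + log_prob(emissions[t], words[0], n_words))
        back[0][t] = None

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p]
                       + log_prob(transitions[p], t, n_tags))
            best[i][t] = (best[i - 1][prev]
                          + log_prob(transitions[prev], t, n_tags)
                          + log_prob(emissions[t], words[i], n_words))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy training data (placeholder, not Sidaama).
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(train)
predicted = viterbi(["the", "cat", "barks"], *model)
print(predicted)  # ['DET', 'NOUN', 'VERB']
```

The decoder works in log space so that products of many small probabilities do not underflow on long sentences. Tagging accuracy of the kind reported above would be computed by comparing such predicted tags with the expert annotations on held-out sentences.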
Keywords
Hidden Markov model, Natural Language Processing, Part of Speech Tagger, Rule-based Tagger, Stochastic Tagging, Hybrid POS tagger, Viterbi algorithm
