Computer Science

Permanent URI for this collection: https://etd.hu.edu.et/handle/123456789/76


Search Results

  • Item
    FOR SIDAMA LANGUAGE USING THE HIDDEN MARKOV MODEL WITH VITERBI ALGORITHM
    (Hawassa University, 2022-04-07) BELACHEW KEBEDE ESHETU
    The part-of-speech (POS) tagger is an essential low-level tool in many natural language processing (NLP) applications. POS tagging is the process of assigning to each word a part-of-speech tag that describes how the word is used in a sentence. There are different approaches to POS tagging; the most common are rule-based, stochastic, and hybrid. In this paper, a stochastic approach, specifically the Hidden Markov Model (HMM) with the Viterbi algorithm, was applied to develop a part-of-speech tagger for Sidaama. The HMM POS tagger assigns tags by finding the most probable tag sequence for a sentence. For training and testing the model, 9,660 Sidaama sentences containing 130,847 tokens (words, punctuation, and symbols) were collected, and four language experts undertook the POS annotation. Thirty-one (31) POS tags were used in the annotation. The corpus was drawn from fables, news, reading passages, and some scripts from the Bible. 90% of the corpus was used for training and the remaining 10% for testing. The POS tagger was implemented using the Python programming language (Python 3.7.0) and the Natural Language Toolkit (NLTK 3.0.0). The performance of the Sidaama POS tagger was tested and validated using a ten-fold cross-validation technique. In the performance analysis experiment, the model achieved an accuracy of 91.25% for the HMM model and 98.46% with the Viterbi algorithm.
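    The Viterbi decoding the abstract describes can be sketched in a few lines. This is a minimal illustration over a hypothetical two-tag toy model (the tag set, transition, and emission probabilities below are invented for the example and are not the thesis's Sidaama model):

    ```python
    # Viterbi decoding for an HMM POS tagger: given per-tag start,
    # transition, and emission probabilities, recover the most probable
    # tag sequence for a word sequence via dynamic programming.

    def viterbi(words, tags, start_p, trans_p, emit_p):
        """Return the most probable tag sequence for `words`."""
        # V[t][tag] = probability of the best tag path ending in `tag` at step t
        # (unknown words get a tiny floor probability instead of zero)
        V = [{tag: start_p[tag] * emit_p[tag].get(words[0], 1e-6) for tag in tags}]
        back = [{}]
        for t in range(1, len(words)):
            V.append({})
            back.append({})
            for tag in tags:
                prob, prev = max(
                    (V[t - 1][p] * trans_p[p][tag] * emit_p[tag].get(words[t], 1e-6), p)
                    for p in tags)
                V[t][tag] = prob
                back[t][tag] = prev
        # Trace back from the best final tag
        best = max(V[-1], key=V[-1].get)
        path = [best]
        for t in range(len(words) - 1, 0, -1):
            path.insert(0, back[t][path[0]])
        return path

    # Toy two-tag model (hypothetical numbers, English words for readability)
    tags = ["N", "V"]
    start_p = {"N": 0.6, "V": 0.4}
    trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
    emit_p = {"N": {"dogs": 0.4, "run": 0.1}, "V": {"dogs": 0.05, "run": 0.5}}

    print(viterbi(["dogs", "run"], tags, start_p, trans_p, emit_p))  # → ['N', 'V']
    ```

    A production tagger, like the one in the thesis, would estimate these probability tables from the annotated corpus (NLTK provides this via its HMM tagger module) and work in log space to avoid underflow on long sentences.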
  • Item
    AMHARIC EXTRACTIVE TEXT SUMMARIZATION USING AmRoBERTa–BiLSTM MODEL
    (Hawassa University, 2024-05) EDEN AHMED
    Extractive text summarization is a crucial task in natural language processing, allowing users to quickly grasp the main ideas of lengthy documents. The manual summarization process is often labor-intensive and time-consuming. As the volume of information in the Amharic language continues to grow, the need for effective summarization systems has become essential. While various summarization techniques have been developed for many languages, research specifically focused on Amharic remains limited. Most existing studies rely on traditional methods that lack contextual embeddings, which are crucial for understanding the meaning within the text. Additionally, current approaches often struggle to capture long-range dependencies among sentences, and none of the existing studies have utilized hybrid deep models, which have demonstrated state-of-the-art performance in summarization tasks for other languages. This study addresses the challenge of extractive text summarization for Amharic news articles by proposing a hybrid deep learning model that combines the contextual understanding of AmRoBERTa with the sequential processing capabilities of Bidirectional Long Short-Term Memory (BiLSTM). A dataset of 1,200 Amharic news articles, covering a variety of topics, was collected. Each article was segmented into sentences, and each sentence was labeled by experts to indicate its relevance for summarization. Preprocessing, including normalization and tokenization using AmRoBERTa, was conducted to prepare the data for modeling. The proposed model was trained using various hyperparameter configurations and optimization techniques, and its effectiveness was evaluated using ROUGE metrics. The results demonstrate that the model achieved strong performance, with a ROUGE-1 score of 44.48, a ROUGE-2 score of 34.73, and a ROUGE-L score of 44.47.
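    The ROUGE-1 and ROUGE-2 scores reported above measure n-gram overlap between a system summary and a reference summary. A minimal ROUGE-N F1 sketch over whitespace tokens illustrates the metric (a real Amharic evaluation would need proper tokenization and a standard ROUGE implementation; the example strings here are English placeholders):

    ```python
    # ROUGE-N F1: n-gram overlap between a candidate summary and a
    # reference summary, combined as the harmonic mean of precision
    # (overlap / candidate n-grams) and recall (overlap / reference n-grams).
    from collections import Counter

    def rouge_n(candidate, reference, n=1):
        """ROUGE-N F1 between whitespace-tokenized candidate and reference."""
        def ngrams(tokens, n):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        cand = ngrams(candidate.split(), n)
        ref = ngrams(reference.split(), n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        if overlap == 0:
            return 0.0
        recall = overlap / sum(ref.values())
        precision = overlap / sum(cand.values())
        return 2 * precision * recall / (precision + recall)

    # All 3 candidate unigrams match; reference has 6, so P=1.0, R=0.5, F1≈0.667
    print(round(rouge_n("the cat sat", "the cat sat on the mat", n=1), 3))
    ```

    ROUGE-L, the third metric reported, instead scores the longest common subsequence of the two summaries, rewarding in-order matches that need not be contiguous.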