EXPLORING A BETTER FEATURE EXTRACTION METHOD FOR AMHARIC HATE SPEECH DETECTION
Date
2021-10-08
Publisher
Hawassa University
Abstract
Hate speech is speech that causes people to be attacked, discriminated against, and hated because of
their personal and collective identities. As hate speech spreads, it can lead to death and to the
displacement of people from their homes and properties. Social media can spread hate speech
widely. To address this problem, researchers have studied many ways to detect hate speech on
social media in both international and local languages. Because the problem is so serious, it
needs to be studied carefully and addressed with a variety of solutions.
Previous studies classify a text as hate speech based on the frequency (occurrence) of words in
a given dataset; that is, they do not consider the role of each word in its sentence. The main
purpose of this study is to design a method that generates hate speech features from a given
text by identifying the role of each word in a sentence, so that hate speech can be
distinguished more reliably from other forms of speech. To this end, the literature related to
this study was reviewed.
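The limitation of frequency-based features can be illustrated with a minimal sketch. It is not the method of any reviewed study; the two English sentences and the helper `bag_of_words` are hypothetical and chosen only to show that pure word counts cannot tell apart sentences in which the same words play different roles.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Return the count of each vocabulary word in the sentence (hypothetical helper)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


# Toy sentences (assumed for illustration, not taken from the thesis dataset).
# The same words appear in both, but subject and object are swapped.
sentence_a = "the group attacked the village"
sentence_b = "the village attacked the group"

vocabulary = sorted(set(sentence_a.split()) | set(sentence_b.split()))
vector_a = bag_of_words(sentence_a, vocabulary)
vector_b = bag_of_words(sentence_b, vocabulary)

# Both sentences map to the same count vector, so the role of each word is lost.
identical = vector_a == vector_b
print(vocabulary, vector_a, vector_b, identical)
```

Because the two vectors are equal, a classifier that sees only these counts receives the same input for both sentences.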
This study created a new feature extraction method for Amharic hate speech detection. The
model requires training and testing datasets, so posts and comments from 25 popular Facebook
pages were collected to build the dataset.
Whether a text is hateful should be determined by the law that prohibits hate speech.
Therefore, using different filtering methods, posts and comments that contain religious, ethnic,
and hate words were collected and given to law experts for manual annotation. The law experts
labeled 2590 items into three classes: Religion-hate, Ethnic-hate, and Non-hate. After dataset
preparation, a new feature extraction method that can distinguish hate speech from other
speech was developed.
The new feature extraction method and the feature extraction methods used in related studies
were implemented and evaluated with three machine learning classification algorithms: SVM,
NB, and RF. Results across different evaluation metrics show that the new feature extraction
method performed better with every classification algorithm. Using 80% of the 2590 labeled
items as a training set and the rest as a test set, the combination of SVM with the new feature
extraction method achieved an average accuracy of 96.2%.
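The split sizes implied by these figures can be checked with a short sketch. The functions `split_dataset` and `accuracy`, the fixed seed, and the placeholder items are assumptions made for illustration; the abstract does not say how the split was drawn or how the average was computed.

```python
import random


def split_dataset(items, train_percent=80, seed=0):
    """Shuffle items and split them into train and test parts (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Placeholder items standing in for the 2590 labeled posts and comments.
placeholder_items = list(range(2590))
train_set, test_set = split_dataset(placeholder_items)
print(len(train_set), len(test_set))

# Toy labels (assumed) showing how accuracy is computed on a test set.
toy_true = ["Non-hate", "Ethnic-hate", "Religion-hate", "Non-hate"]
toy_predicted = ["Non-hate", "Ethnic-hate", "Non-hate", "Non-hate"]
toy_accuracy = accuracy(toy_true, toy_predicted)
print(toy_accuracy)
```

With an 80/20 split, 2590 labeled items give 2072 training items and 518 test items.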
Keywords
Amharic hate speech, annotation, new feature extraction method, machine learning
