TY - JOUR
T1 - Enhancing racism classification
T2 - an automatic multilingual data annotation system using self-training and CNN
AU - El Miqdadi, Ikram
AU - Hourri, Soufiane
AU - El Idrysy, Fatima Zahra
AU - Hayati, Assia
AU - Namir, Yassine
AU - Nikolov, Nikola S.
AU - Kharroubi, Jamal
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2024.
PY - 2024/11
Y1 - 2024/11
N2 - Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating a wide range of diverse and representative data as an essential source of information for the system. However, this task proves to be highly demanding in both time and resources, resulting in a significantly costly process. Moreover, racism can appear differently across languages because of the distinct cultural subtleties and vocabularies linked to each language. This necessitates having information resources in native languages to effectively detect racism, which further complicates constructing a database explicitly designed for identifying racism on social media platforms. In this study, an automated data annotation system for racism classification is presented, utilizing self-training and a combination of the Sentence-BERT (SBERT) transformers-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in the creation of a multilingual racism dataset consisting of 26,866 instances gathered from Facebook and Twitter. This is achieved through a self-training process that utilizes a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance, revealing significant enhancements in model effectiveness. Especially for the English dataset, the system achieves a noteworthy accuracy rate of 92.53% and an F-score of 88.26%. The French dataset reaches an accuracy of 93.64% and an F-score of 92.68%. Similarly, for the Arabic dataset, the accuracy reaches 91.03%, accompanied by an F-score value of 92.15%. The implementation of self-training results in a remarkable 8–12% improvement in accuracy and F-score, as demonstrated in this study.
AB - Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating a wide range of diverse and representative data as an essential source of information for the system. However, this task proves to be highly demanding in both time and resources, resulting in a significantly costly process. Moreover, racism can appear differently across languages because of the distinct cultural subtleties and vocabularies linked to each language. This necessitates having information resources in native languages to effectively detect racism, which further complicates constructing a database explicitly designed for identifying racism on social media platforms. In this study, an automated data annotation system for racism classification is presented, utilizing self-training and a combination of the Sentence-BERT (SBERT) transformers-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in the creation of a multilingual racism dataset consisting of 26,866 instances gathered from Facebook and Twitter. This is achieved through a self-training process that utilizes a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance, revealing significant enhancements in model effectiveness. Especially for the English dataset, the system achieves a noteworthy accuracy rate of 92.53% and an F-score of 88.26%. The French dataset reaches an accuracy of 93.64% and an F-score of 92.68%. Similarly, for the Arabic dataset, the accuracy reaches 91.03%, accompanied by an F-score value of 92.15%. The implementation of self-training results in a remarkable 8–12% improvement in accuracy and F-score, as demonstrated in this study.
KW - CNN
KW - Racism classification
KW - SBERT
KW - Self-training
KW - Transformers-based model
UR - http://www.scopus.com/inward/record.url?scp=85198067792&partnerID=8YFLogxK
U2 - 10.1007/s10618-024-01059-2
DO - 10.1007/s10618-024-01059-2
M3 - Article
AN - SCOPUS:85198067792
SN - 1384-5810
VL - 38
SP - 3805
EP - 3830
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 6
ER -