TY - JOUR
T1 - Benchmark Arabic news posts and analyzes Arabic sentiment through RMuBERT and SSL with AMCFFL technique
AU - Mhamed, Mustafa
AU - Sutcliffe, Richard
AU - Feng, Jun
N1 - Publisher Copyright: © 2025
PY - 2025/3
Y1 - 2025/3
AB - Sentiment analysis aims to extract emotions from textual data; sentiment analysis and text recognition are two of the most common tasks in natural language processing. Emergent technologies have been developed and employed in various fields, including marketing, health care, and policy making. However, with the growth of social media platforms and the flow of data, especially in the Arabic language, substantial difficulties have emerged that call for new frameworks. These difficulties include the lack of datasets related to news platforms, the complex morphology of the Arabic language, complications in classification, and system challenges, whether in machine learning (ML), deep learning (DL), or online analysis tools. This paper provides a new framework that helps address the challenges of Arabic sentiment analysis (ASA) and supports various tasks based on the state of the art in ASA. First, it presents a new collection, named ANP5, of Arabic news posts from several Arabic platforms, then uses SSL with the AMCFFL technique to analyze the Arabic sentiment and generate a second dataset, ANPS2. Next, ML classifiers are applied; RF and SVM perform best among them, with an accuracy of 82.00%, although the distribution of the measurements differs across classes (Experiment 1). Following that, the DL models BIGRU, CNN-LSTM, LSTM, and CNN reach accuracies of 88.10%, 89.30%, 89.85%, and 90.10%, respectively (Experiment 2). Experiments 1 and 2 represent the initial benchmark classification and serve as the first baseline. Afterward, a new RMuBERT model is developed and compared with four transformers on the two datasets, with accuracies of 90.87% on ANPS2 and 90.33% on ANP5; RMuBERT performs better than the baselines (Experiment 3). Further testing of RMuBERT on various Arabic corpora with different numbers of classes, text lengths, and sizes, namely ArSarcasm (3C), STD (2C), AJGT (2C), and AAQ (2C), yields accuracies of 77.76%, 91.79%, 94.07%, and 93.48%, respectively; RMuBERT again performs better than the baselines (Experiment 4). Finally, on the largest Arabic sentiment corpus, comprising six million Arabic tweets, performance reaches up to 91.12%, and RMuBERT works efficiently with less training time (Experiment 5).
KW - ANP5
KW - ANPS2
KW - Arabic sentiment analysis
KW - Natural language processing
KW - RMuBERT
KW - SSL
UR - http://www.scopus.com/inward/record.url?scp=85214922526&partnerID=8YFLogxK
U2 - 10.1016/j.eij.2024.100601
DO - 10.1016/j.eij.2024.100601
M3 - Article
AN - SCOPUS:85214922526
SN - 1110-8665
VL - 29
JO - Egyptian Informatics Journal
JF - Egyptian Informatics Journal
M1 - 100601
ER -