TY - JOUR
T1 - A New Amharic Speech Emotion Dataset and Classification Benchmark
AU - Retta, Ephrem Afele
AU - Almekhlafi, Eiad
AU - Sutcliffe, Richard
AU - Mhamed, Mustafa
AU - Ali, Haider
AU - Feng, Jun
N1 - Publisher Copyright: © 2023 Association for Computing Machinery.
PY - 2023/5/26
Y1 - 2023/5/26
N2 - In this article we present the Amharic Speech Emotion Dataset (ASED), which covers four dialects (Gojjam, Wollo, Shewa, and Gonder) and five emotions (neutral, fearful, happy, sad, and angry). We believe it is the first Speech Emotion Recognition (SER) dataset for the Amharic language. Sixty-five volunteer participants, all native speakers of Amharic, recorded 2,474 sound samples, each 2 to 4 seconds long. Eight judges (two for each dialect) assigned emotions to the samples with a high level of agreement (Fleiss' kappa = 0.8). The resulting dataset is freely available for download. Next, we developed a four-layer variant of the well-known VGG model, which we call VGGb. Three experiments were then carried out using VGGb for SER on ASED. First, we investigated which features work best for Amharic: FilterBank, Mel Spectrogram, or Mel-frequency Cepstral Coefficient (MFCC). This was done by training three VGGb SER models on ASED, using FilterBank, Mel Spectrogram, and MFCC features, respectively. Four training schemes were tried: standard cross-validation and three variants based on sentences, dialects, and speaker groups. In these variants, a sentence, dialect, or speaker group used for training was not used for testing. MFCC features were superior under all four training schemes. MFCC was therefore adopted for Experiment 2, where VGGb and three well-known existing models were compared on ASED: ResNet50, AlexNet, and LSTM. VGGb was found to have very good accuracy (90.73%) as well as the fastest training time. In Experiment 3, the performance of VGGb was compared when trained on two existing SER datasets, RAVDESS (English) and EMO-DB (German), as well as on ASED (Amharic). Results are comparable across these languages, with ASED being the highest. This suggests that VGGb can be successfully applied to other languages. We hope that ASED will encourage researchers to explore the Amharic language and to experiment with other models for Amharic SER.
KW - Amharic dataset
KW - classifiers
KW - feature extraction
KW - Speech emotion recognition
UR - https://www.scopus.com/pages/publications/105005065815
U2 - 10.1145/3529759
DO - 10.1145/3529759
M3 - Article
AN - SCOPUS:105005065815
SN - 2375-4699
VL - 22
JO - ACM Transactions on Asian and Low-Resource Language Information Processing
JF - ACM Transactions on Asian and Low-Resource Language Information Processing
IS - 1
ER -