TY - JOUR
T1 - A deep learning approach to integrate convolutional neural networks in speaker recognition
AU - Hourri, Soufiane
AU - Nikolov, Nikola S.
AU - Kharroubi, Jamal
N1 - Publisher Copyright:
© 2020, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2020/9/1
Y1 - 2020/9/1
N2 - We propose a novel use of convolutional neural networks (CNNs) for speaker recognition. Although designed primarily for computer vision problems, CNNs have recently been applied to speaker recognition by using spectrograms as input images. We believe this approach is not optimal, as it may accumulate errors from solving both a computer vision and a speaker recognition problem. In this work, we aim to integrate CNNs into speaker recognition without relying on images. We use Restricted Boltzmann Machines (RBMs) to extract speaker models as matrices and introduce a new way to model target and non-target speakers in order to perform speaker verification. A CNN is then used to discriminate between target and non-target matrices. Experiments were conducted on the THUYG-20 SRE corpus under three noise conditions: clean, 9 dB, and 0 dB. The results demonstrate that our method outperforms state-of-the-art approaches, reducing the error rate by up to 60%.
AB - We propose a novel use of convolutional neural networks (CNNs) for speaker recognition. Although designed primarily for computer vision problems, CNNs have recently been applied to speaker recognition by using spectrograms as input images. We believe this approach is not optimal, as it may accumulate errors from solving both a computer vision and a speaker recognition problem. In this work, we aim to integrate CNNs into speaker recognition without relying on images. We use Restricted Boltzmann Machines (RBMs) to extract speaker models as matrices and introduce a new way to model target and non-target speakers in order to perform speaker verification. A CNN is then used to discriminate between target and non-target matrices. Experiments were conducted on the THUYG-20 SRE corpus under three noise conditions: clean, 9 dB, and 0 dB. The results demonstrate that our method outperforms state-of-the-art approaches, reducing the error rate by up to 60%.
KW - Convolutional neural network
KW - Deep learning
KW - MFCC
KW - Restricted Boltzmann Machine
KW - Speaker recognition
UR - http://www.scopus.com/inward/record.url?scp=85086004480&partnerID=8YFLogxK
U2 - 10.1007/s10772-020-09718-7
DO - 10.1007/s10772-020-09718-7
M3 - Article
AN - SCOPUS:85086004480
SN - 1381-2416
VL - 23
SP - 615
EP - 623
JO - International Journal of Speech Technology
JF - International Journal of Speech Technology
IS - 3
ER -