TY - JOUR
T1 - Convolutional neural network vectors for speaker recognition
AU - Hourri, Soufiane
AU - Nikolov, Nikola S.
AU - Kharroubi, Jamal
N1 - Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC part of Springer Nature.
PY - 2021/6
Y1 - 2021/6
N2 - Deep learning models are now considered state-of-the-art in many areas of pattern recognition. In speaker recognition, several architectures have been studied, such as deep neural networks (DNNs), deep belief networks (DBNs), restricted Boltzmann machines (RBMs), and so on, while convolutional neural networks (CNNs) are the most widely used models in computer vision. The problem is that CNN is limited to the computer vision field due to its structure which is designed for two-dimensional data. To overcome this limitation, we aim at developing a customized CNN for speaker recognition. The goal of this paper is to propose a new approach to extract speaker characteristics by constructing CNN filters linked to the speaker. Besides, we propose new vectors to identify speakers, which we call in this work convVectors. Experiments have been performed with a gender-dependent corpus (THUYG-20 SRE) under three noise conditions : clean, 9db, and 0db. We compared the proposed method with our baseline system and the state-of-the-art methods. Results showed that the convVectors method was the most robust, improving the baseline system by an average of 43%, and recording an equal error rate of 1.05% EER. This is an important finding to understand how deep learning models can be adapted to the problem of speaker recognition.
AB - Deep learning models are now considered state-of-the-art in many areas of pattern recognition. In speaker recognition, several architectures have been studied, such as deep neural networks (DNNs), deep belief networks (DBNs), restricted Boltzmann machines (RBMs), and so on, while convolutional neural networks (CNNs) are the most widely used models in computer vision. The problem is that CNN is limited to the computer vision field due to its structure which is designed for two-dimensional data. To overcome this limitation, we aim at developing a customized CNN for speaker recognition. The goal of this paper is to propose a new approach to extract speaker characteristics by constructing CNN filters linked to the speaker. Besides, we propose new vectors to identify speakers, which we call in this work convVectors. Experiments have been performed with a gender-dependent corpus (THUYG-20 SRE) under three noise conditions : clean, 9db, and 0db. We compared the proposed method with our baseline system and the state-of-the-art methods. Results showed that the convVectors method was the most robust, improving the baseline system by an average of 43%, and recording an equal error rate of 1.05% EER. This is an important finding to understand how deep learning models can be adapted to the problem of speaker recognition.
KW - Convolutional neural network
KW - Deep learning
KW - MFCC
KW - Restricted Boltzmann machine
KW - Speaker recognition
UR - http://www.scopus.com/inward/record.url?scp=85099970830&partnerID=8YFLogxK
U2 - 10.1007/s10772-021-09795-2
DO - 10.1007/s10772-021-09795-2
M3 - Article
AN - SCOPUS:85099970830
SN - 1381-2416
VL - 24
SP - 389
EP - 400
JO - International Journal of Speech Technology
JF - International Journal of Speech Technology
IS - 2
ER -