Funding: National Key R&D Program of China (2017YFB1400704).

Received: 2019-05-05
Revised: 2019-07-06

Angular interval embedding based end-to-end voiceprint recognition model
WANG Kang, DONG Yuanfei. Angular interval embedding based end-to-end voiceprint recognition model[J]. Journal of Computer Applications, 2019, 39(10): 2937-2941.
Authors: WANG Kang, DONG Yuanfei
Affiliation:1. Nanjing Fiber Home World Communication Technology Company Limited, Nanjing Jiangsu 210019, China;2. Wuhan Research Institute of Posts and Telecommunications, Wuhan Hubei 430074, China
Abstract: An end-to-end model based on angular margin embedding features was constructed to address the complicated multi-step pipeline and weak generalization ability of the traditional voiceprint recognition model combining the identity vector (i-vector) with Probabilistic Linear Discriminant Analysis (PLDA). A deep convolutional neural network was specially designed to extract deep speaker embeddings from the acoustic features of speech data, and Angular Softmax (A-Softmax), an angle-based improvement of the softmax loss, was chosen as the loss function so that, in angular space, the features of different classes learned by the model always keep an angular margin while features of the same class cluster more tightly. Tests on the public dataset VoxCeleb2 show that, compared with the method combining i-vector and PLDA, the proposed model improves Top-1 and Top-5 accuracy in speaker identification by 58.9% and 30% respectively, and reduces the minimum detection cost and the equal error rate in speaker verification by 47.9% and 45.3% respectively. The results verify that the proposed end-to-end model is better suited to learning class-discriminative features on multi-channel, large-scale speech datasets.
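The angular-margin idea described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the A-Softmax formulation (target-class logit penalized by the monotone margin function ψ(θ) = (−1)^k cos(mθ) − 2k), not the authors' implementation; the function names, the margin value m = 4, and the tensor shapes are assumptions for illustration.

```python
import numpy as np

def a_softmax_logits(x, W, label, m=4):
    """A-Softmax margin logits for one sample (illustrative sketch).

    x: (d,) speaker embedding; W: (num_classes, d) class weights;
    label: target class index; m: integer angular margin factor.
    Class weights are L2-normalized so each logit is ||x||*cos(theta_j);
    the target logit uses psi(theta) = (-1)^k * cos(m*theta) - 2k,
    with k chosen so theta lies in [k*pi/m, (k+1)*pi/m].
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # normalize class weights
    x_norm = np.linalg.norm(x)
    cos_theta = Wn @ x / x_norm                         # cos(theta_j) per class
    logits = x_norm * cos_theta
    theta = np.arccos(np.clip(cos_theta[label], -1.0, 1.0))
    k = np.floor(m * theta / np.pi)                     # interval index of theta
    psi = (-1.0) ** k * np.cos(m * theta) - 2.0 * k     # monotone margin function
    logits[label] = x_norm * psi                        # penalized target logit
    return logits

def a_softmax_loss(x, W, label, m=4):
    """Cross-entropy over the margin-adjusted logits."""
    z = a_softmax_logits(x, W, label, m)
    z = z - z.max()                                     # numerical stability
    return -z[label] + np.log(np.exp(z).sum())
```

With m = 1, ψ(θ) reduces to cos(θ) and the loss coincides with standard softmax cross-entropy; larger m shrinks the target logit, forcing the learned embeddings of each class into a tighter angular cone, which is the class-separation effect the abstract reports.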
Keywords: voiceprint recognition; end-to-end model; loss function; convolutional neural network; deep speaker embedding
This article is indexed in databases including VIP (Weipu) and Wanfang Data.