
Distant Speech Recognition Based on Knowledge Distillation and Generative Adversarial Network
Cite this article: WU Long, LI Ta, WANG Li, YAN Yong-Hong. Distant Speech Recognition Based on Knowledge Distillation and Generative Adversarial Network [J]. Journal of Software, 2019, 30(S2): 25-34.
Authors: WU Long, LI Ta, WANG Li, YAN Yong-Hong
Affiliations: Key Laboratory of Speech Acoustics and Content Understanding (Institute of Acoustics, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing (Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences), Urumqi 830011, China
Funding: National Natural Science Foundation of China (11590774, 11590770); Major Science and Technology Program of Xinjiang Uygur Autonomous Region (2016A03007-1); Young Talent Program of the Institute of Acoustics, Chinese Academy of Sciences (QNYC201602)
Abstract: To further exploit near-field speech data to improve the performance of distant speech recognition, this paper proposes a distant speech recognition algorithm that combines knowledge distillation with a generative adversarial network. The method adopts a multi-task learning framework that enhances the far-field speech features while building the acoustic model. To strengthen acoustic modeling, an acoustic model trained on near-field speech (the teacher model) guides the training of the far-field acoustic model (the student model); by minimizing the Kullback-Leibler divergence, the posterior distribution of the student model is pushed towards that of the teacher model. To improve feature enhancement, a discriminator network is added for adversarial training, so that the distribution of the enhanced features moves closer to that of the near-field features. Experiments on the AMI corpus show that, in the single-channel case, the average word error rate (WER) is reduced relative to the baseline by 5.6% on non-overlapped speech and by 4.7% on overlapped speech; in the multi-channel case, the relative reductions are 6.2% and 4.1%, respectively. Experiments on the TIMIT corpus show a 7.2% relative reduction in average WER. To better illustrate the effect of the generative adversarial network on speech enhancement, the enhanced features are visualized, which further confirms the effectiveness of the method.
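To make the distillation objective concrete, the following is a minimal PyTorch-style sketch of the teacher-student term described above, i.e. the Kullback-Leibler divergence between the temperature-softened posteriors of the two models. The temperature value, tensor shapes, and the function name are illustrative assumptions, not details taken from the paper.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # student_logits: (batch, num_senones) from the far-field student model
        # teacher_logits: (batch, num_senones) from the near-field teacher model
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        # "batchmean" averages the per-example KL divergence; the T^2 factor keeps
        # gradient magnitudes comparable across different temperatures.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

In the multi-task setup described in the abstract, a term of this kind would be interpolated with the usual senone cross-entropy loss computed on the far-field data.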

Keywords: distant speech recognition; knowledge distillation; generative adversarial network; multi-task learning; speech enhancement
Received: 2019-07-15

Distant Speech Recognition Based on Knowledge Distillation and Generative Adversarial Network
WU Long, LI Ta, WANG Li, YAN Yong-Hong. Distant Speech Recognition Based on Knowledge Distillation and Generative Adversarial Network [J]. Journal of Software, 2019, 30(S2): 25-34.
Authors: WU Long, LI Ta, WANG Li, YAN Yong-Hong
Affiliation: Key Laboratory of Speech Acoustics and Content Understanding (Institute of Acoustics, Chinese Academy of Sciences), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Xinjiang Laboratory of Minority Speech and Language Information Processing (Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences), Urumqi 830011, China
Abstract: In order to further utilize near-field speech data to improve the performance of far-field speech recognition, this paper proposes an approach that integrates knowledge distillation with a generative adversarial network. A multi-task learning structure is first proposed to jointly train the acoustic model with feature mapping. To enhance acoustic modeling, the acoustic model trained with far-field data (the student model) is guided by an acoustic model trained with near-field data (the teacher model): the student model is trained to mimic the behavior of the teacher model by minimizing the Kullback-Leibler divergence between their posterior distributions. To improve speech enhancement, an additional discriminator network is introduced to distinguish the enhanced features from the real near-field ones, and this adversarial multi-task training pushes the distribution of the enhanced features further towards that of the near-field features. Evaluated on AMI single distant microphone data, the method achieves relative word error rate (WER) reductions of 5.6% on non-overlapped speech and 4.7% on overlapped speech over the baseline model. Evaluated on AMI multi-channel distant microphone data, the relative reductions are 6.2% (non-overlapped) and 4.1% (overlapped). Evaluated on the TIMIT data, the method reaches a 7.2% relative WER reduction. To better demonstrate the effect of the generative adversarial network on speech enhancement, the enhanced features are visualized, which further verifies the effectiveness of the method.
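As a companion to the distillation sketch above, the adversarial feature-enhancement branch can be outlined as follows. This is a hypothetical PyTorch example in which an enhancement network maps far-field filterbank features towards their parallel near-field counterparts and a discriminator learns to tell the two apart; the network sizes, feature dimension, loss weights, and optimizer settings are assumptions for illustration and are not taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    feat_dim = 40  # assumed filterbank dimension

    # Enhancement network (generator): maps far-field features towards near-field ones.
    enhancer = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))
    # Discriminator: distinguishes enhanced features from real near-field features.
    discriminator = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    opt_g = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

    def train_step(far_feats, near_feats, adv_weight=0.1):
        # far_feats, near_feats: (batch, feat_dim) parallel far-/near-field features.
        # 1) Discriminator update: real near-field features vs. enhanced (fake) features.
        enhanced = enhancer(far_feats).detach()
        real = torch.ones(near_feats.size(0), 1)
        fake = torch.zeros(far_feats.size(0), 1)
        d_loss = (F.binary_cross_entropy_with_logits(discriminator(near_feats), real)
                  + F.binary_cross_entropy_with_logits(discriminator(enhanced), fake))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # 2) Enhancer update: match the near-field features (feature-mapping term)
        #    and fool the discriminator (adversarial term).
        enhanced = enhancer(far_feats)
        fm_loss = F.mse_loss(enhanced, near_feats)
        adv_loss = F.binary_cross_entropy_with_logits(discriminator(enhanced), real)
        g_loss = fm_loss + adv_weight * adv_loss
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()

In the paper's multi-task setting the enhancement branch is trained jointly with the acoustic model, so the generator-side objective would also include the senone cross-entropy and distillation terms; the sketch isolates the adversarial part for clarity.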
Keywords: distant speech recognition; knowledge distillation; generative adversarial network; multi-task learning; speech enhancement