Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model

Funding:	National Natural Science Foundation of China (61672271, 61732005, 61761026, 61762056, 61866020); National Key Research and Development Program of China (2019QY1801).

Received:	2020-07-13
Revised:	2021-01-27

Cite this article:	JIA Chengxun, LAI Hua, YU Zhengtao, WEN Yonghua, YU Zhiqiang. Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model[J]. Journal of Computer Applications, 2021, 41(6): 1652-1658.

Authors:	JIA Chengxun, LAI Hua, YU Zhengtao, WEN Yonghua, YU Zhiqiang

Affiliation:	1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China; 2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China

Abstract:	Neural machine translation performs well on resource-rich language pairs but poorly on low-resource pairs such as Chinese-Vietnamese, where parallel data are scarce. One of the most effective ways to alleviate this problem is to generate pseudo-parallel data from existing resources. Because monolingual data are readily available, the proposed method builds on back-translation. First, a language model trained on a large amount of monolingual data is fused with the neural machine translation model. Then, during back-translation, the language model contributes language-specific characteristics, so that the generated pseudo-parallel data are better formed and of higher quality. Finally, the generated corpus is added to the original small-scale corpus to train the final translation model. Experimental results on Chinese-Vietnamese translation tasks show that, compared with ordinary back-translation, pseudo-parallel data generated with the fused language model improve the BiLingual Evaluation Understudy (BLEU) score of Chinese-Vietnamese neural machine translation by 1.41 percentage points.

Keywords:	Chinese-Vietnamese neural machine translation; data augmentation; pseudo-parallel data; monolingual data; language model
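
The abstract does not say how the language model is fused with the translation model. As a rough illustration only, the sketch below assumes shallow fusion, in which each back-translation candidate is scored by a weighted sum of the translation model's and the language model's log-probabilities. This is a swapped-in technique, not necessarily the authors' method. All names (`fused_score`, `pick_candidate`, `build_pseudo_parallel`, `toy_back_translate`), the weight `lam`, and the probabilities are hypothetical placeholders.

```python
import math


def fused_score(tm_logprob, lm_logprob, lam):
    """Shallow-fusion score (assumed): log p_TM(y|x) + lam * log p_LM(y)."""
    return tm_logprob + lam * lm_logprob


def pick_candidate(candidates, lam):
    """Return the candidate sentence with the highest fused score.

    candidates: list of (sentence, tm_logprob, lm_logprob) tuples.
    """
    best = max(candidates, key=lambda c: fused_score(c[1], c[2], lam))
    return best[0]


def build_pseudo_parallel(mono_sentences, back_translate, lam):
    """Pair each monolingual sentence with its selected back-translation."""
    return [(pick_candidate(back_translate(s), lam), s) for s in mono_sentences]


def toy_back_translate(sentence):
    """Hypothetical stand-in for a reverse-direction NMT model.

    Returns two candidates with made-up probabilities: the translation
    model slightly prefers the first, the language model clearly
    prefers the second.
    """
    return [
        (f"literal rendering of <{sentence}>", math.log(0.40), math.log(0.01)),
        (f"fluent rendering of <{sentence}>", math.log(0.35), math.log(0.20)),
    ]


mono = ["mono sentence 1"]
# lam = 0.0 reduces to ordinary back-translation (translation model only).
pairs_plain = build_pseudo_parallel(mono, toy_back_translate, lam=0.0)
# lam > 0 lets the language model influence which candidate is kept.
pairs_fused = build_pseudo_parallel(mono, toy_back_translate, lam=0.3)

print(pairs_plain[0][0])
print(pairs_fused[0][0])
```

With these toy numbers, the translation-model-only setting keeps the "literal" candidate, while the fused setting keeps the "fluent" one. In the pipeline the abstract describes, the resulting pairs would then be added to the original small-scale parallel corpus before training the final translation model.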