
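Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model
Cite this article: JIA Chengxun, LAI Hua, YU Zhengtao, WEN Yonghua, YU Zhiqiang. Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model[J]. Journal of Computer Applications, 2021, 41(6): 1652-1658.
Authors: JIA Chengxun  LAI Hua  YU Zhengtao  WEN Yonghua  YU Zhiqiang
Affiliations: 1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504; 2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming 650500
Funding: National Natural Science Foundation of China (61672271, 61732005, 61761026, 61762056, 61866020); National Key Research and Development Program of China (2019QY1801).
Abstract: Neural machine translation (NMT) achieves good results on resource-rich languages but performs poorly on low-resource language pairs such as Chinese-Vietnamese because of data scarcity. One of the most effective ways to alleviate this problem is to generate pseudo-parallel data from existing resources. Considering the availability of monolingual data, and building on the back-translation method, a language model trained on a large amount of monolingual data is first fused with the NMT model. Language characteristics are then incorporated through the language model during back-translation, which yields more well-formed, higher-quality pseudo-parallel data. Finally, the generated corpus is added to the original small-scale corpus to train the final translation model. Experiments on the Chinese-Vietnamese translation task show that, compared with ordinary back-translation, the pseudo-parallel data generated with the fused language model raise the BLEU score of Chinese-Vietnamese NMT by 1.41 percentage points.

Keywords: Chinese-Vietnamese neural machine translation  data augmentation  pseudo-parallel data  monolingual data  language model
Received: 2020-07-13
Revised: 2021-01-27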

Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model
JIA Chengxun, LAI Hua, YU Zhengtao, WEN Yonghua, YU Zhiqiang. Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model[J]. Journal of Computer Applications, 2021, 41(6): 1652-1658.
Authors:JIA Chengxun  LAI Hua  YU Zhengtao  WEN Yonghua  YU Zhiqiang
Affiliation:1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650504, China;2. Yunnan Key Laboratory of Artificial Intelligence(Kunming University of Science and Technology), Kunming Yunnan 650500, China
Abstract: Neural machine translation (NMT) achieves good results on resource-rich languages, but because of data scarcity it performs poorly on low-resource language pairs such as Chinese-Vietnamese. One of the most effective ways to alleviate this problem is to use existing resources to generate pseudo-parallel data. Considering the availability of monolingual data, and building on the back-translation method, a language model trained on a large amount of monolingual data was first fused with the NMT model. Then, language characteristics were incorporated through the language model during back-translation to generate more well-formed, higher-quality pseudo-parallel data. Finally, the generated corpus was added to the original small-scale corpus to train the final translation model. Experimental results on the Chinese-Vietnamese translation task show that, compared with the ordinary back-translation method, the pseudo-parallel data generated with the fused language model improved the BiLingual Evaluation Understudy (BLEU) score of Chinese-Vietnamese NMT by 1.41 percentage points.
Keywords:Chinese-Vietnamese neural machine translation  data augmentation  pseudo-parallel data  monolingual data  language model  
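
The abstract does not say how the language model and the translation model are fused. As a rough illustration only, the sketch below assumes shallow fusion, a common technique in which the two models' next-token log-probabilities are combined at decoding time with a weight. The function names, the `lm_weight` parameter, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math

# Assumption: shallow fusion, score(w) = log p_TM(w) + lm_weight * log p_LM(w).
# The paper's actual fusion mechanism is not described in the abstract.

def fuse_log_probs(tm_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine translation-model and language-model log-probabilities per token."""
    floor = math.log(1e-9)  # fallback for tokens the language model does not score
    return {
        token: tm_lp + lm_weight * lm_log_probs.get(token, floor)
        for token, tm_lp in tm_log_probs.items()
    }

def pick_next_token(tm_log_probs, lm_log_probs, lm_weight=0.3):
    """Greedy choice of the next token under the fused score."""
    fused = fuse_log_probs(tm_log_probs, lm_log_probs, lm_weight)
    return max(fused, key=fused.get)

# Toy next-token distributions (invented numbers, for illustration only).
tm = {"A": math.log(0.5), "B": math.log(0.4), "C": math.log(0.1)}
lm = {"A": math.log(0.05), "B": math.log(0.9), "C": math.log(0.05)}

print(pick_next_token(tm, lm, lm_weight=0.0))  # translation model alone
print(pick_next_token(tm, lm, lm_weight=0.5))  # with the language model fused in
```

With the weight at zero the translation model's top candidate wins. With a positive weight, a candidate that the monolingual language model finds much more fluent can overtake it, which is the intuition behind using a language model to make back-translated text more well-formed.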
This article is indexed in Wanfang Data and other databases.