
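A Chinese Word Segmentation Method Suited to Domain-Specific Machine Translation
Cite this article:SU Chen, ZHANG Yujie, GUO Zhen, XU Jin'an. A Chinese Word Segmentation Method Suited to Domain-Specific Machine Translation[J]. Journal of Chinese Information Processing, 2013, 27(5): 184-191.
Authors:SU Chen  ZHANG Yujie  GUO Zhen  XU Jin'an
Affiliation:School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
Abstract:When developing a domain-specific Chinese-English machine translation system, the large number of new words lowers the accuracy of Chinese word segmentation, and the lack of annotated corpora for the domain makes it hard to improve the performance of supervised learning techniques. This directly causes many errors in the extracted translation knowledge and seriously degrades translation quality. To address this problem, this paper implements a domain-adaptive segmentation model based on raw corpora and a bilingually guided Chinese word segmentation, and proposes a method for combining multiple segmentation results: a lattice is built and a dynamic programming algorithm is used to obtain the best Chinese segmentation. To validate the proposed method, we ran evaluation experiments on the NTCIR-10 Chinese-English dataset. The results show that the proposed method of combining multiple segmentation results improves both the segmentation F-measure and the BLEU score of statistical machine translation.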

Keywords:Chinese word segmentation  domain adaptation  bilingual guidance  Lattice  machine translation

Chinese Word Segmentation Method for Domain-Special Machine Translation
SU Chen, ZHANG Yujie, GUO Zhen, XU Jin'an. Chinese Word Segmentation Method for Domain-Special Machine Translation[J]. Journal of Chinese Information Processing, 2013, 27(5): 184-191.
Authors:SU Chen  ZHANG Yujie  GUO Zhen  XU Jin'an
Affiliation:School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Abstract:In developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on a large-scale training corpus often decreases because of unknown words. The lack of domain-specific annotated corpora makes it difficult for supervised learning approaches to adapt. This problem causes many errors in translation knowledge extraction and therefore seriously affects translation quality. To resolve the domain adaptation problem, we implemented Chinese word segmentation in two ways: by exploiting n-gram statistical features in a raw corpus, and by exploiting bilingually motivated word segmentation information in a parallel corpus. We further propose a lattice-based method that combines multiple segmentation results and uses a dynamic programming algorithm to obtain the best word segmentation result. For evaluation, we conducted experiments on Chinese word segmentation and Chinese-English machine translation using the data of the NTCIR-10 Chinese-English patent task. The experimental results show that the proposed method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
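The abstract does not say how lattice edges are scored, so the following Python sketch only illustrates the general idea of combining several segmentations in a lattice and picking a path by dynamic programming. It assumes (this is not taken from the paper) that each candidate word is an edge between character positions, that an edge scores the number of segmenters proposing it multiplied by its length in characters, and that the best segmentation is the highest-scoring path from the first to the last position. The function names `build_lattice` and `best_path` and the example sentence are hypothetical.

```python
from collections import Counter


def build_lattice(segmentations):
    """Collect word spans from several segmentations of the same sentence.

    Returns the sentence and a Counter mapping (start, end) character
    spans to the number of segmentations that proposed that span.
    """
    sentence = "".join(segmentations[0])
    votes = Counter()
    for words in segmentations:
        if "".join(words) != sentence:
            raise ValueError("all segmentations must cover the same sentence")
        start = 0
        for word in words:
            end = start + len(word)
            votes[(start, end)] += 1
            start = end
    return sentence, votes


def best_path(sentence, votes):
    """Find the highest-scoring path through the lattice.

    Assumed edge score (illustrative only): votes * span length.
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best score of any path reaching each position
    back = [None] * (n + 1)           # start position of the last edge on that path
    best[0] = 0.0
    # Visiting edges by increasing end position guarantees that best[start]
    # is final before it is used, because start < end for every edge.
    for (start, end), count in sorted(votes.items(), key=lambda kv: (kv[0][1], kv[0][0])):
        score = best[start] + count * (end - start)
        if score > best[end]:
            best[end] = score
            back[end] = start
    # Follow the back-pointers from the end of the sentence.
    words = []
    pos = n
    while pos > 0:
        start = back[pos]
        words.append(sentence[start:pos])
        pos = start
    return list(reversed(words))


# Three hypothetical segmenter outputs for the same string.
candidates = [
    ["光", "纤", "传感器"],
    ["光纤", "传感", "器"],
    ["光纤", "传感器"],
]
sentence, votes = build_lattice(candidates)
combined = best_path(sentence, votes)
print(combined)
```

With these inputs the spans 光纤 and 传感器 each receive two votes, so the path through them outscores the alternatives. A real system would replace the vote-based score with model-derived weights.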
Keywords:Chinese word segmentation  domain adaptation  bilingual motivation  Lattice  machine translation
This article is indexed in Wanfang Data and other databases.