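Comparison of Methods for Integrating Lexicons into Chinese Word Segmentation Models
Citation: Feng Xue (冯雪). Comparison of methods for integrating lexicons into Chinese word segmentation models[J]. Application Research of Computers (计算机应用研究), 2019, 36(1).
Author: Feng Xue (冯雪)
Affiliation: School of Computer, Beijing Information Science and Technology University, Beijing 100192
Funding: General Project of the Science and Technology Program of the Beijing Municipal Education Commission (KM201411232012)
Abstract: The currently popular approaches to Chinese word segmentation are machine learning methods based on statistical models. Such methods are usually trained on manually annotated sentence-level corpora, but they tend to overlook the manually compiled lexicons that have been accumulated over many years. This lexicon information is especially valuable in cross-domain settings, where sentence-level annotated resources for the target domain are scarce. How to exploit lexicon information fully and effectively in statistical models is therefore a question that deserves attention. Recent work has begun to study it, and the existing approaches fall roughly into two categories according to how the lexicon information is integrated: one integrates lexicon features into character-based sequence labeling models, and the other integrates them into word-based beam-search models. This paper compares the two categories of methods and then combines them. Experiments show that the combination makes fuller use of lexicon information and yields better performance in both in-domain and cross-domain tests.

Keywords: Chinese word segmentation  conditional random field  beam search  domain adaptation
Received: 2017/5/18
Revised: 2018/11/28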

Comparison of methods for integrating lexicon information in Chinese word segmentation
Feng Xue. Comparison of methods for integrating lexicon information in Chinese word segmentation[J]. Application Research of Computers, 2019, 36(1).
Authors: Feng Xue
Affiliation: School of Computer, Beijing Information Science and Technology University
Abstract: Chinese word segmentation is a fundamental task in Chinese natural language processing. The mainstream methods currently rely on statistical machine learning models. These methods usually require manually annotated segmented sentences as training data, yet they neglect the large-scale annotated lexicon resources that have already been built. Such resources can be highly valuable in cross-domain evaluation, where gold-standard sentence-level annotations are rare. Recently, the integration of lexicon information into word segmentation models has gained increasing interest. Broadly, the integration methods fall into two categories: one builds on character-based models that cast word segmentation as sequence labeling, and the other builds on word-based models that decode with beam search. In this paper, we compare these two kinds of models and combine them. Experimental results on benchmark data sets show that lexicon information is exploited more fully after the combination, and that the combined model achieves better performance in both in-domain and cross-domain settings.
Keywords: Chinese word segmentation  conditional random field  beam search  domain adaptation
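
The two ways of integrating a lexicon that the abstract distinguishes can be illustrated with a small sketch. It is not the paper's implementation: the feature definition (longest lexicon match starting or ending at each character), the hand-set scores in `word_score`, and all function names are assumptions chosen for illustration. In a real system the character features would be fed to a sequence labeler such as a conditional random field, and the word scores would come from a trained model rather than fixed constants.

```python
from typing import List, Set, Tuple


def lexicon_char_features(sentence: str, lexicon: Set[str],
                          max_len: int = 4) -> List[Tuple[int, int]]:
    """Character-based view: for each character return a pair
    (length of the longest lexicon word starting here,
     length of the longest lexicon word ending here); 0 means no match."""
    n = len(sentence)
    begin = [0] * n
    end = [0] * n
    for i in range(n):
        for length in range(1, max_len + 1):
            j = i + length
            if j > n:
                break
            if sentence[i:j] in lexicon:
                begin[i] = max(begin[i], length)
                end[j - 1] = max(end[j - 1], length)
    return list(zip(begin, end))


def word_score(word: str, lexicon: Set[str]) -> float:
    """Hypothetical hand-set score standing in for a learned word-level model."""
    if word in lexicon:
        return float(len(word))   # reward words found in the lexicon
    if len(word) == 1:
        return -1.0               # mild penalty for an unknown single character
    return -2.0 * len(word)       # stronger penalty for unknown multi-character words


def beam_segment(sentence: str, lexicon: Set[str],
                 beam_size: int = 4, max_len: int = 4) -> List[str]:
    """Word-based view: keep the beam_size best partial segmentations
    ending at each character position and extend them word by word."""
    n = len(sentence)
    beams: List[List[Tuple[float, List[str]]]] = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, [])]
    for j in range(1, n + 1):
        candidates = []
        for i in range(max(0, j - max_len), j):
            word = sentence[i:j]
            for score, words in beams[i]:
                candidates.append((score + word_score(word, lexicon), words + [word]))
        # Sort by score only, then prune to the beam width.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[j] = candidates[:beam_size]
    return beams[n][0][1]
```

With the toy lexicon `{"中文", "分词", "模型"}`, `beam_segment("中文分词模型", lexicon)` returns `["中文", "分词", "模型"]`. The character features expose the same lexicon to a labeler without committing to any segmentation, which is one reason the two views can be complementary.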
This article is indexed in Wanfang Data (万方数据) and other databases.