Comparison of methods for integrating lexicon information in Chinese word segmentation

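Cite this article:	Feng Xue. Comparison of methods for integrating lexicon information in Chinese word segmentation[J]. Application Research of Computers, 2019, 36(1).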

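Author:	Feng Xue

Affiliation:	School of Computer, Beijing Information Science and Technology University, Beijing 100192

Funding:	General Program of the Science and Technology Plan of the Beijing Municipal Education Commission (KM201411232012)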

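Abstract:	The currently popular approaches to Chinese word segmentation are machine learning methods based on statistical models. Statistical methods are generally trained on manually annotated sentence-level corpora, but they tend to overlook the manually annotated lexicon information that has been accumulated over many years. This information is especially valuable in cross-domain settings, where sentence-level annotated resources for the target domain are scarce. How to use lexicon information fully and effectively in statistical models is therefore a question well worth attention. Some recent work has studied it. By the way the lexicon information is integrated, the methods fall roughly into two categories: one integrates lexicon features into character-based sequence labeling models, and the other integrates such features into word-based beam-search models. This paper compares the two categories of methods and further combines them. Experiments show that, after the two are combined, the lexicon information is exploited more fully, and the combined model achieves better performance in both in-domain and cross-domain tests.
Keywords:	Chinese word segmentation; conditional random field; beam search; domain adaptation
Received:	2017/5/18
Revised:	2018/11/28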
Comparison of methods for integrating lexicon information in Chinese word segmentation

Feng Xue. Comparison of methods for integrating lexicon information in Chinese word segmentation[J]. Application Research of Computers, 2019, 36(1).

Authors:	Feng Xue

Affiliation:	School of Computer, Beijing Information Science and Technology University

Abstract:	Chinese word segmentation is a fundamental task in Chinese natural language processing. The current mainstream methods for Chinese word segmentation use statistical machine learning models. These methods usually require manually annotated, segmented sentences as the training corpus, yet they neglect the large-scale annotated lexicon resources that have already been built. Such resources can be highly valuable in cross-domain evaluation, where gold-standard sentence-level annotations are rare. Recently, the integration of lexicon information into word segmentation models has gained increasing interest. The integration methods can be classified into two categories: one builds on character-based models that cast word segmentation as sequence labeling, and the other builds on word-based models that decode with beam search. In this paper, we compare these two kinds of models and combine them. Experimental results on benchmark data sets show that lexicon information is exploited more fully after combination, and that the combined model achieves better performance in both in-domain and cross-domain settings.

Keywords:	Chinese word segmentation; conditional random field; beam search; domain adaptation
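
To make the first category concrete, the following is a minimal sketch of how lexicon information might be turned into per-character features for a character-based sequence labeling model such as a conditional random field. It is an illustration under stated assumptions, not the feature set used in the paper: the function name `lexicon_features`, the three feature names, and the `max_len` cutoff are all hypothetical. For each character it records the length of the longest lexicon word that begins at, ends at, or strictly contains that character.

```python
from typing import Dict, List, Set


def lexicon_features(
    sentence: str, lexicon: Set[str], max_len: int = 4
) -> List[Dict[str, int]]:
    """Return one feature dict per character of `sentence`.

    Hypothetical feature template (an assumption, not the paper's):
      begin  - length of the longest lexicon word starting at this character
      middle - length of the longest lexicon word strictly containing it
      end    - length of the longest lexicon word ending at this character
    A value of 0 means no matching lexicon word was found.
    """
    n = len(sentence)
    feats = [{"begin": 0, "middle": 0, "end": 0} for _ in range(n)]
    for start in range(n):
        # Only multi-character words are matched, up to max_len characters.
        for length in range(2, max_len + 1):
            end = start + length  # exclusive end index
            if end > n:
                break
            if sentence[start:end] in lexicon:
                feats[start]["begin"] = max(feats[start]["begin"], length)
                feats[end - 1]["end"] = max(feats[end - 1]["end"], length)
                for mid in range(start + 1, end - 1):
                    feats[mid]["middle"] = max(feats[mid]["middle"], length)
    return feats


if __name__ == "__main__":
    # Toy lexicon and sentence, chosen only for illustration.
    toy_lexicon = {"北京", "北京大学", "大学生"}
    for char, feat in zip("北京大学生", lexicon_features("北京大学生", toy_lexicon)):
        print(char, feat)
```

Features of this kind would typically be concatenated with the usual character n-gram features before training the labeler. Because they depend only on the lexicon, swapping in a target-domain lexicon changes the features without requiring new sentence-level annotations, which is the cross-domain motivation stated in the abstract.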
This article is indexed in Wanfang Data and other databases.