Segmentation Algorithm for Chinese Based on Extraction of Context Information
Cite this article: ZENG Hua-lin, LI Tang-qiu, SHI Xiao-dong. Segmentation algorithm for Chinese based on extraction of context information [J]. Journal of Computer Applications, 2005, 25(9): 2025-2027.
Authors: ZENG Hua-lin, LI Tang-qiu, SHI Xiao-dong
Affiliation: Department of Computer Science, Xiamen University, Xiamen 361005, Fujian, China
Funding: National 863 Program of China (2002AA117010)
Abstract: Chinese word segmentation is a special and important part of Chinese text processing. Traditional dictionary-based segmentation algorithms have a serious shortcoming: they cannot handle words absent from the dictionary (unknown words) well. Purely probabilistic algorithms consider only the probability model of the training corpus, so their performance on texts from other domains is unsatisfactory. This paper proposes a probabilistic segmentation algorithm based on extracting context information, which incorporates the context information of the text being segmented into the segmentation probability model in order to guide the segmentation. Combining the classical n-gram model with the EM algorithm, the method achieves good results in both closed and open tests.

Keywords: Chinese word segmentation  n-gram model  context information
Article number: 1001-9081(2005)09-2025-03
Received: 2005-03-18
Revised: 2005-05-28

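The core of the approach described in the abstract, maximum-probability segmentation under a word-based probability model, can be sketched as a dynamic program over all candidate word boundaries. The sketch below is an illustrative reconstruction, not the authors' implementation: it uses a unigram model with a toy dictionary and invented probabilities, where the paper would use an n-gram model trained on a corpus, with probabilities further re-estimated by EM using context information from the text being segmented.

```python
import math

# Toy unigram log-probabilities (assumed for illustration; the paper trains
# its model on a real corpus and adjusts it with context information).
WORD_LOGP = {
    "中国": math.log(0.02),
    "中": math.log(0.005),
    "国": math.log(0.004),
    "人民": math.log(0.01),
    "人": math.log(0.006),
    "民": math.log(0.003),
}
OOV_LOGP = math.log(1e-8)  # fallback score for an unknown single character
MAX_WORD_LEN = 4

def segment(text):
    """Maximum-probability segmentation by dynamic programming over the
    word lattice. best[i] = best log-probability of segmenting text[:i]."""
    n = len(text)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            w = text[j:i]
            # Unknown multi-character strings are not candidate words;
            # unknown single characters get a small fallback probability.
            logp = WORD_LOGP.get(w, OOV_LOGP if len(w) == 1 else None)
            if logp is None:
                continue
            score = best[j] + logp
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the best word sequence by following back-pointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("中国人民"))  # -> ['中国', '人民']
```

In the paper's setting, the EM step would iterate this segmentation against the target text, re-estimating the word probabilities from the resulting segmentations so that the model adapts to the context of the domain being processed.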
This article is indexed in CNKI, Weipu (VIP), Wanfang Data, and other databases.
