Segmentation Algorithm for Chinese Based on Extraction of Context Information
Cite this article: ZENG Hua-lin, LI Tang-qiu, SHI Xiao-dong. Segmentation algorithm for Chinese based on extraction of context information [J]. Journal of Computer Applications, 2005, 25(9): 2025-2027.
Authors: ZENG Hua-lin, LI Tang-qiu, SHI Xiao-dong
Affiliation: Department of Computer Science, Xiamen University, Xiamen 361005, Fujian, China
Funding: National 863 Program of China (2002AA117010)
Abstract: Chinese word segmentation is a special and important part of Chinese text processing. Traditional dictionary-based segmentation algorithms have a serious shortcoming: they cannot handle words absent from the dictionary (unknown words) well. Purely probabilistic algorithms consider only the probability model of the training corpus, so their performance on texts from other domains is unsatisfactory. This paper proposes a probabilistic segmentation algorithm based on extracting context information, which incorporates the context information of the text being segmented into the segmentation probability model in order to guide the segmentation. Combining the classical n-gram model with the EM algorithm, the method achieves good results in both closed and open tests.

Keywords: Chinese word segmentation  n-gram model  context information
Article number: 1001-9081(2005)09-2025-03
Received: 2005-03-18
Revised: 2005-05-28

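The core of the approach described in the abstract, maximum-probability segmentation under a word-based probability model, can be sketched as a dynamic program over all candidate word boundaries. The sketch below is an illustrative reconstruction, not the authors' implementation: it uses a unigram model with a toy dictionary and invented probabilities, where the paper would use an n-gram model trained on a corpus, with probabilities further re-estimated by EM using context information from the text being segmented.

```python
import math

# Toy unigram log-probabilities (assumed for illustration; the paper trains
# its model on a real corpus and adjusts it with context information).
WORD_LOGP = {
    "中国": math.log(0.02),
    "中": math.log(0.005),
    "国": math.log(0.004),
    "人民": math.log(0.01),
    "人": math.log(0.006),
    "民": math.log(0.003),
}
OOV_LOGP = math.log(1e-8)  # fallback score for an unknown single character
MAX_WORD_LEN = 4

def segment(text):
    """Maximum-probability segmentation by dynamic programming over the
    word lattice. best[i] = best log-probability of segmenting text[:i]."""
    n = len(text)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            w = text[j:i]
            # Unknown multi-character strings are not candidate words;
            # unknown single characters get a small fallback probability.
            logp = WORD_LOGP.get(w, OOV_LOGP if len(w) == 1 else None)
            if logp is None:
                continue
            score = best[j] + logp
            if score > best[i]:
                best[i], back[i] = score, j
    # Recover the best word sequence by following back-pointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("中国人民"))  # -> ['中国', '人民']
```

In the paper's setting, the EM step would iterate this segmentation against the target text, re-estimating the word probabilities from the resulting segmentations so that the model adapts to the context of the domain being processed.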
This article is indexed in CNKI, Weipu (VIP), Wanfang Data, and other databases.
