
Chinese Word Segmentation Research Based on Word Frequency Statistics
Cite this article: Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao. Chinese word segmentation research based on word frequency statistics [J]. Computer Engineering and Applications, 2005, 41(7): 67-68, 100.
Authors: Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao
Affiliation: School of Information Science and Engineering, Central South University, Changsha 410075, China
Funding: Supported by the National Natural Science Foundation of China (No. 60173041) and the Natural Science Foundation of Hunan Province (No. 02JJY2094)
Abstract: This paper describes the design and implementation of a Chinese word segmentation system based on word frequency statistics. The system takes a continuous string of Chinese characters as input and outputs the segmented word sequence, typically consisting of two-character words, while also building a dictionary. The dictionary stores, without duplication, every word obtained during processing together with its frequency of occurrence. The system applies three statistical measures separately: mutual information, the N-gram statistical model, and the t-test. The paper also compares the results produced by the three measures, analyzing the statistical characteristics of each and the application scenarios to which each is suited.
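All three measures score adjacent character pairs by how strongly the characters attract each other; mutual information, for instance, compares a pair's joint frequency against the product of its characters' individual frequencies, so a high score suggests the pair may form a word. The paper does not include code, so the following is only a minimal sketch of the mutual-information idea; the function name, corpus, and threshold are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def pmi_bigrams(text, min_count=1):
    """Score adjacent character pairs by pointwise mutual information.

    PMI(x, y) = log2( p(xy) / (p(x) * p(y)) ).  A high score means the
    two characters co-occur far more often than chance, which is the
    cue the mutual-information measure uses to propose two-character
    words.  (Illustrative sketch only, not the paper's system.)
    """
    chars = [c for c in text if not c.isspace()]
    unigrams = Counter(chars)                     # single-character counts
    bigrams = Counter(zip(chars, chars[1:]))      # adjacent-pair counts
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:                         # skip rare pairs
            continue
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores

# Tiny corpus: "我们" and "他们" recur as units, "们我" is a chance pair,
# so the real words score higher.
scores = pmi_bigrams("我们我们他们")
```

The N-gram and t-test measures in the paper rank the same candidate pairs with different statistics; only the scoring function changes, not the counting step above.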

Keywords: Chinese word segmentation; word frequency statistics; mutual information; N-gram statistical model; t-test
Article ID: 1002-8331(2005)07-0067-02

Chinese Word Segmentation Research Based on Word Frequency Statistics
Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao. Chinese word segmentation research based on word frequency statistics [J]. Computer Engineering and Applications, 2005, 41(7): 67-68, 100.
Authors: Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao
Abstract: This paper introduces the design and implementation of a Chinese word segmentation system based on word frequency statistics. The system segments a continuous input string of Chinese characters and outputs the resulting word sequence, typically composed of two-character words, while also producing a dictionary. The dictionary stores each word together with the frequency with which it appears in the processed texts. The system applies three statistical measures separately: mutual information, the N-gram model, and the t-test. The paper further compares the results of the three measures, analyzes the differences in their statistical characteristics, and identifies the situations to which each is suited.
Keywords: Chinese word segmentation; word frequency statistics; mutual information; N-gram; t-test
This article is indexed by CNKI, VIP (Weipu), Wanfang Data, and other databases.
