
Chinese Word Segmentation Research Based on Word Frequency Statistics
Cite this article: Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao. Chinese word segmentation research based on word frequency statistics [J]. Computer Engineering and Applications, 2005, 41(7): 67-68, 100.
Authors: Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao
Affiliation: School of Information Science and Engineering, Central South University, Changsha 410075, China
Funding: Supported by the National Natural Science Foundation of China (No. 60173041) and the Natural Science Foundation of Hunan Province (No. 02JJY2094)
Abstract: This paper describes the design and implementation of a Chinese word segmentation system based on word frequency statistics. The system takes a continuous string of Chinese characters as input and outputs the segmented word sequence, typically consisting of two-character words, while also building a dictionary. The dictionary stores, without duplication, every word obtained during processing together with its frequency of occurrence. The system applies three statistical measures separately: mutual information, the N-gram statistical model, and the t-test. The paper also compares the results produced by the three measures, analyzing the statistical characteristics of each and the application scenarios to which each is suited.
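All three measures score adjacent character pairs by how strongly the characters attract each other; mutual information, for instance, compares a pair's joint frequency against the product of its characters' individual frequencies, so a high score suggests the pair may form a word. The paper does not include code, so the following is only a minimal sketch of the mutual-information idea; the function name, corpus, and threshold are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def pmi_bigrams(text, min_count=1):
    """Score adjacent character pairs by pointwise mutual information.

    PMI(x, y) = log2( p(xy) / (p(x) * p(y)) ).  A high score means the
    two characters co-occur far more often than chance, which is the
    cue the mutual-information measure uses to propose two-character
    words.  (Illustrative sketch only, not the paper's system.)
    """
    chars = [c for c in text if not c.isspace()]
    unigrams = Counter(chars)                     # single-character counts
    bigrams = Counter(zip(chars, chars[1:]))      # adjacent-pair counts
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), c in bigrams.items():
        if c < min_count:                         # skip rare pairs
            continue
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[x + y] = math.log2(p_xy / (p_x * p_y))
    return scores

# Tiny corpus: "我们" and "他们" recur as units, "们我" is a chance pair,
# so the real words score higher.
scores = pmi_bigrams("我们我们他们")
```

The N-gram and t-test measures in the paper rank the same candidate pairs with different statistics; only the scoring function changes, not the counting step above.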

Keywords: Chinese word segmentation; word frequency statistics; mutual information; N-gram statistical model; t-test
Article ID: 1002-8331(2005)07-0067-02

Chinese Word Segmentation Research Based on Word Frequency Statistics
Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao. Chinese word segmentation research based on word frequency statistics [J]. Computer Engineering and Applications, 2005, 41(7): 67-68, 100.
Authors: Fei Hongxiao, Kang Songlin, Zhu Xiaojuan, Xie Wenbiao
Abstract: This paper introduces the design and implementation of a Chinese word segmentation system based on word frequency statistics. The system segments a continuous input string of Chinese characters and outputs the resulting word sequence, typically composed of two-character words, while also producing a dictionary. The dictionary stores each word together with the frequency with which it appears in the processed texts. The system applies three statistical measures separately: mutual information, the N-gram model, and the t-test. The paper further compares the results of the three measures, analyzes the differences in their statistical characteristics, and identifies the situations to which each is suited.
Keywords: Chinese word segmentation; word frequency statistics; mutual information; N-gram; t-test
This article is indexed by CNKI, VIP (Weipu), Wanfang Data, and other databases.
