Applications of Statistical Models in Chinese Text Mining
Citation: WANG Jian, ZHANG Jun-ni. Applications of Statistical Models in Chinese Text Mining[J]. Application of Statistics and Management (数理统计与管理), 2017(4): 609-619.
Authors: WANG Jian, ZHANG Jun-ni
Affiliation: Guanghua School of Management, Peking University, Beijing 100871
Abstract: This paper discusses three problems in Chinese text mining: word segmentation, keyword extraction, and text classification. For word segmentation, we introduce the ICTCLAS method, which is based on a hierarchical hidden Markov model, and the WDM method, which treats the boundaries between words as missing data and solves for them with the EM algorithm. For keyword extraction, we propose a Bayes factor method and introduce the CCS method, which uses sparse regression. For text classification, we introduce one method that builds classifiers on keyword frequencies and another that first fits a topic model and then builds classifiers on the resulting topic probabilities. We compare these methods on two text datasets and offer suggestions for their practical use.

Keywords: Chinese word segmentation; keyword extraction; text classification; Bayes factor; L1-norm penalty; topic model
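
The abstract does not spell out how the Bayes factor method scores candidate keywords, so the following is only a generic illustration of the idea and not necessarily the authors' formulation. It assumes a beta-binomial setup: for each word, H0 says the word occurs at the same rate in two document classes, H1 says the rates differ, and both hypotheses use a Beta(a, b) prior. The function names, the prior, and the toy counts are all hypothetical.

```python
from math import lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor(k1, n1, k2, n2, a=1.0, b=1.0):
    """Log Bayes factor of H1 (class-specific rates) against H0 (one shared rate).

    k1, k2: occurrences of the word in class 1 and class 2.
    n1, n2: total token counts in class 1 and class 2.
    a, b:   Beta prior parameters (an assumption of this sketch).
    The binomial coefficients are identical under both hypotheses and cancel.
    """
    log_m1 = (log_beta(k1 + a, n1 - k1 + b) - log_beta(a, b)
              + log_beta(k2 + a, n2 - k2 + b) - log_beta(a, b))
    log_m0 = log_beta(k1 + k2 + a, n1 + n2 - k1 - k2 + b) - log_beta(a, b)
    return log_m1 - log_m0


# Hypothetical toy counts: word -> (count in class 1, count in class 2).
counts = {
    "股票": (50, 2),   # "stock": concentrated in class 1
    "球队": (1, 40),   # "team": concentrated in class 2
    "的": (60, 58),    # function word: about equally common in both classes
}
n1 = n2 = 1000  # hypothetical total tokens per class

# Rank candidate keywords from most to least class-discriminating.
ranked = sorted(
    ((word, log_bayes_factor(k1, n1, k2, n2)) for word, (k1, k2) in counts.items()),
    key=lambda item: item[1],
    reverse=True,
)
for word, score in ranked:
    print(word, round(score, 2))
```

A positive log Bayes factor favours class-specific rates, so words with large positive scores are keyword candidates, while a function word that is equally common in both classes scores below zero.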
This article is indexed in Wanfang Data and other databases.