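Micro-blog Topic Detection Based on BTM and K-means
Cite this article: LI Wei-jiang, WANG Zhen-zhen, YU Zheng-tao. Micro-blog Topic Detection Based on BTM and K-means[J]. Computer Science, 2017, 44(2): 257-261, 274.
Authors: LI Wei-jiang  WANG Zhen-zhen  YU Zheng-tao
Affiliation: Department of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500 (all three authors)
Funding: This work was supported by the Regional Science Fund project "Research on Query Expansion Based on Statistical Machine Translation and Automatic Summarization" (61363045), the Key Project of the Yunnan Provincial Natural Science Foundation (2013FA130), and the Ministry of Science and Technology Young and Middle-aged Science and Technology Innovation Leading Talent Program (2014HE001).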
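Abstract: In recent years, the growth of social networks such as micro-blogs has made communication more convenient. Because each micro-blog post is limited to 140 characters, a large volume of short text is produced. Discovering topics in short texts has increasingly become an important problem. Traditional topic models, such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), face severe data sparsity when handling short texts. In addition, when the dataset is relatively concentrated and the differences between topic documents are pronounced, the K-means clustering algorithm can produce well-separated topics. The BTM topic model is introduced to handle short texts such as micro-blog data and to alleviate the data sparsity problem. K-means clustering is then integrated to cluster the topics discovered by the BTM model. Experiments on a Sina Weibo short-text collection demonstrate the effectiveness of this method for topic discovery.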

Keywords: Short text  Topic model  Topic discovery  K-means clustering
Received: 2015/11/25
Revised: 2016/3/29

Micro-blog Topic Detection Method Integrating BTM Topic Model and K-means Clustering
LI Wei-jiang, WANG Zhen-zhen, YU Zheng-tao. Micro-blog Topic Detection Method Integrating BTM Topic Model and K-means Clustering[J]. Computer Science, 2017, 44(2): 257-261, 274.
Authors: LI Wei-jiang  WANG Zhen-zhen  YU Zheng-tao
Affiliation: Department of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China (all three authors)
Abstract: In recent years, the growth of micro-blogs has made communication more convenient. Because each micro-blog post is limited to 140 characters, a large volume of short text is produced, and discovering topics in short texts has become a genuinely difficult problem. Traditional topic models such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA) suffer from severe data sparsity when applied to short texts. Moreover, the K-means clustering algorithm can produce discriminative topics when the dataset is concentrated and the differences between topic documents are distinct. In this paper, the bi-term topic model (BTM) is employed to process short texts such as micro-blog data and to alleviate the sparsity problem. K-means clustering is then integrated with BTM to further cluster the discovered topics. Experimental results on Sina micro-blog short-text collections demonstrate that the method discovers topics effectively.
Keywords: Short text  Topic model  Topic discovery  K-means clustering
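
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It shows only (a) how biterms, the unordered word pairs that BTM models in place of per-document word counts, are extracted from short posts, and (b) a plain Lloyd's K-means pass. The abstract does not say how the K-means step consumes BTM's output, so clustering per-post topic proportions is an assumption made here. The names `extract_biterms`, `kmeans`, `posts`, and `doc_topic` are hypothetical, the numbers are toy values rather than results from the paper, and BTM inference itself is omitted.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short document."""
    # Sorting each pair makes ("a", "b") and ("b", "a") the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy posts, already tokenized (hypothetical data, not from the paper).
posts = [
    ["world", "cup", "final"],
    ["cup", "final", "goal"],
    ["phone", "launch", "price"],
]
# Pooling biterms over the whole corpus is what eases sparsity in short texts.
biterms = [b for post in posts for b in extract_biterms(post)]

# Hypothetical per-post topic proportions standing in for BTM inference output.
doc_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
labels, centroids = kmeans(doc_topic, k=2)
print(labels)  # [0, 0, 1]: the two football posts share a cluster
```

Seeding the centroids with the first `k` points keeps the sketch deterministic; a practical run would more likely use random or k-means++ initialization and a convergence check instead of a fixed iteration count.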