Text Clustering Approach Based on Content Characteristics

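Citation:	LI Xiaoguang, SONG Baoyan, YU Ge, WANG Daling. Text Clustering Approach Based on Content Characteristics [J]. Computer Engineering, 2007, 33(14): 24-26, 32.

Authors:	LI Xiaoguang, SONG Baoyan, YU Ge, WANG Daling

Affiliations:	[1] School of Information Science and Technology, Liaoning University, Shenyang 110036; [2] School of Information Science and Engineering, Northeastern University, Shenyang 110004

Funding:	Liaoning Province Doctoral Research Project; National Natural Science Foundation of China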

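Abstract:	In probabilistic-model-based clustering, how well the cluster model fits the data distribution directly affects clustering quality. Because the content-based distribution of text data is complex, a single-factor cluster model cannot fit it accurately. This paper argues that the content-based distribution of a document is shaped mainly by its topic content and by general writing style, and proposes a hybrid cluster model that combines a topic model with a general model, together with a text clustering method based on that cluster model. Experiments show that the method fits the data better than single-factor cluster models and yields higher clustering quality.
Keywords:	clustering; probabilistic-model-based clustering; mixture model; EM algorithm
Article ID:	1000-3428(2007)14-0024-03
Revised:	2006-08-04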
Text Clustering Approach Based on Content Characteristics

LI Xiaoguang, SONG Baoyan, YU Ge, WANG Daling. Text Clustering Approach Based on Content Characteristics [J]. Computer Engineering, 2007, 33(14): 24-26, 32.

Authors:	LI Xiaoguang, SONG Baoyan, YU Ge, WANG Daling

Affiliation:	1. School of Information Science and Technology, Liaoning University, Shenyang 110036; 2. School of Information Science and Engineering, Northeastern University, Shenyang 110004

Abstract:	How well the cluster model fits the data distribution is critical to probabilistic-model-based clustering. Because the content-based distribution of documents is complex, a single-component model fails to capture the distribution of document data completely. This paper argues that the characteristics of a document are influenced mainly by two components, topic and general writing style, proposes a content-based cluster model that mixes a topic model with a general model, and gives the corresponding document clustering algorithm. Experimental results indicate that the content-based cluster model fits the data better than a single-component model and yields better clustering quality.

Keywords:	clustering; probabilistic-model-based clustering; mixture model; EM algorithm
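
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea of mixing a topic component with a general component and fitting it with EM. It assumes a simple two-component unigram mixture, p(w) = lam * general(w) + (1 - lam) * topic(w), with the general model and the weight `lam` held fixed; this is a common textbook formulation and not necessarily the model used in the paper. All names (`fit_topic_model`, `lam`, the toy documents) are hypothetical.

```python
from collections import Counter

def fit_topic_model(cluster_counts, general_model, lam=0.5, iters=200):
    """EM for the assumed mixture p(w) = lam*general(w) + (1-lam)*topic(w).

    Only the topic distribution is estimated; the general (background)
    distribution and the mixing weight lam are treated as fixed inputs.
    """
    vocab = list(cluster_counts)
    topic = {w: 1.0 / len(vocab) for w in vocab}  # uniform starting point
    for _ in range(iters):
        # E-step: expected number of occurrences of w generated by the topic
        expected = {}
        for w in vocab:
            from_topic = (1 - lam) * topic[w]
            from_general = lam * general_model[w]
            expected[w] = cluster_counts[w] * from_topic / (from_topic + from_general)
        # M-step: renormalise the expected counts into a distribution
        total = sum(expected.values())
        topic = {w: expected[w] / total for w in vocab}
    return topic

# Toy collection (made up for illustration): two documents about cats, two others.
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog ran in the park",
    "the stock market fell on the news",
]
collection = Counter(w for d in docs for w in d.split())
n_words = sum(collection.values())
general = {w: c / n_words for w, c in collection.items()}  # general model

cluster = Counter(w for d in docs[:2] for w in d.split())  # the "cat" cluster
topic = fit_topic_model(cluster, general)
```

In this toy setup the general component absorbs part of the probability mass of words that are frequent everywhere, such as "the", so the fitted topic distribution gives them less weight than their raw relative frequency in the cluster and gives topical words such as "cat" more.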
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.