
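Pairwise Constrained Semi-supervised Text Clustering Algorithm
Cite this article: WANG Zong-hu, LIU Su. Pairwise Constrained Semi-supervised Text Clustering Algorithm[J]. Computer Science, 2016, 43(12): 183-188.
Authors: WANG Zong-hu, LIU Su
Affiliations: School of Statistics, Renmin University of China, Beijing 100872; Computer Information Center, Petrochina Planning and Engineering Institute, Beijing 102206, Computer Information Center, Petrochina Planning and Engineering Institute, Beijing 102206
Abstract: Semi-supervised clustering can use a small amount of labeled data to improve clustering performance, but most text clustering algorithms cannot directly exploit prior information such as pairwise constraints. Targeting the high-dimensional, sparse nature of text data, this paper proposes a semi-supervised text clustering algorithm. The pairwise constraints are first expanded and embedded into the document similarity matrix. On this basis, using statistics on the partitioned and unpartitioned documents, the algorithm progressively finds K initial cluster-center sets in the remaining unpartitioned text collection that are dense and have low similarity to the cluster-center sets already found. The remaining documents, which are relatively hard to distinguish, are then assigned to the K initial cluster-center sets subject to the pairwise constraints. Finally, the clustering result is further optimized with a convergence criterion function that incorporates a penalty for violated pairwise constraints. Because the algorithm determines the initial cluster-center sets automatically during clustering, it avoids the K-means algorithm's sensitivity to the choice of initial cluster centers. Experimental results on several Chinese and English datasets show that the proposed algorithm can effectively use a small amount of pairwise-constraint prior information to improve clustering quality.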

Keywords: Clustering  Semi-supervised  Vector space model  Pairwise constraints  Text
Received: 2016/1/20
Revised: 2016/3/26
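
The first step named in the abstract, expanding the pairwise constraints and embedding them into the document similarity matrix, can be illustrated with a minimal sketch. The abstract does not give the paper's exact expansion rules or embedded values, so the following are assumptions: must-link constraints are closed transitively, a cannot-link between two documents is extended to their whole must-link groups, and constrained pairs are overwritten with similarity 1.0 or 0.0. The function names `cosine`, `expand_constraints`, and `embed_constraints` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_constraints(n, must_link, cannot_link):
    """Assumed expansion rule (not taken from the paper): close must-link
    transitively, then extend each cannot-link to whole must-link groups."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    ml = {(i, j) for g in groups.values() for i in g for j in g if i < j}
    cl = set()
    for a, b in cannot_link:
        ra, rb = find(a), find(b)
        if ra == rb:
            raise ValueError("cannot-link contradicts the must-link closure")
        cl.update((min(i, j), max(i, j)) for i in groups[ra] for j in groups[rb])
    return ml, cl


def embed_constraints(docs, must_link, cannot_link):
    """Build the cosine similarity matrix, then overwrite constrained pairs
    (1.0 for must-link, 0.0 for cannot-link; both values are assumptions)."""
    n = len(docs)
    S = [[cosine(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    ml, cl = expand_constraints(n, must_link, cannot_link)
    for i, j in ml:
        S[i][j] = S[j][i] = 1.0
    for i, j in cl:
        S[i][j] = S[j][i] = 0.0
    return S


# Toy documents as term-weight dicts (illustrative data only).
docs = [{"oil": 1.0}, {"gas": 1.0}, {"oil": 1.0, "gas": 1.0}, {"oil": 1.0}]
S = embed_constraints(docs, must_link=[(0, 1), (1, 2)], cannot_link=[(2, 3)])
```

In this toy input, documents 0 and 3 have identical text, yet the expanded cannot-link separates them, while documents 0 and 1 share no terms but are pulled together by the must-link. This is the kind of prior information that a purely content-based similarity matrix cannot express.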

Pairwise Constrained Semi-supervised Text Clustering Algorithm
WANG Zong-hu and LIU Su. Pairwise Constrained Semi-supervised Text Clustering Algorithm[J]. Computer Science, 2016, 43(12): 183-188.
Authors: WANG Zong-hu and LIU Su
Affiliation: School of Statistics, Renmin University of China, Beijing 100872, China; Computer Information Center, Petrochina Planning and Engineering Institute, Beijing 102206, China and Computer Information Center, Petrochina Planning and Engineering Institute, Beijing 102206, China
Abstract: Semi-supervised clustering can use a small amount of labeled data to improve clustering performance, but most text clustering algorithms cannot directly apply prior information such as pairwise constraints. Because text data are high-dimensional and sparse, we propose a semi-supervised document clustering algorithm. First, the pairwise constraints are expanded and embedded in the document similarity matrix. Then K dense regions that have low similarity to the already partitioned text collection are gradually searched for in the remaining unpartitioned text collection and used as initial centroids. The remaining unpartitioned texts, which are relatively difficult to distinguish, are assigned to the K initial centroids according to the constraints. Finally, the clustering result is optimized with a convergence criterion function that incorporates a penalty for violations of the pairwise constraints. During clustering, the algorithm determines the initial centroids automatically, avoiding the sensitivity of the K-means algorithm to the initial centroids. Experimental results show that the proposed algorithm can effectively use a small number of pairwise constraints to improve clustering performance on Chinese and English text datasets.
Keywords:Clustering  Semi-supervised  VSM  Pairwise constraints  Text
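
The final optimization step relies on a criterion function that penalizes violated pairwise constraints. The abstract does not state its form, so the sketch below is only one plausible shape, in the spirit of penalty-based constrained clustering objectives: within-cluster similarity minus a fixed cost per violated constraint. The function name `penalized_criterion` and the weights `w_ml` and `w_cl` are hypothetical, not the paper's notation.

```python
def penalized_criterion(S, labels, must_link, cannot_link, w_ml=1.0, w_cl=1.0):
    """Score a partition: pairwise within-cluster similarity minus penalties.

    S           -- symmetric similarity matrix as a list of lists
    labels      -- labels[i] is the cluster index of document i
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    w_ml, w_cl  -- assumed per-violation penalty weights
    """
    n = len(labels)
    # Cohesion: sum of similarities over all pairs placed in the same cluster.
    cohesion = sum(
        S[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] == labels[j]
    )
    # A must-link is violated when its two documents are split apart.
    ml_violations = sum(1 for i, j in must_link if labels[i] != labels[j])
    # A cannot-link is violated when its two documents share a cluster.
    cl_violations = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return cohesion - w_ml * ml_violations - w_cl * cl_violations


# Toy similarity matrix for three documents (illustrative data only).
S = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
must_link = [(0, 1)]
cannot_link = [(0, 2)]

respects = penalized_criterion(S, [0, 0, 1], must_link, cannot_link)
violates = penalized_criterion(S, [0, 1, 0], must_link, cannot_link)
```

Under these assumptions the partition that honors both constraints scores higher than the one that breaks them, so an iterative reassignment loop that maximizes this score is steered toward constraint-consistent clusterings while still allowing a violation when the similarity gain outweighs the penalty.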