首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
一种结合主动学习的半监督文档聚类算法   总被引:1,自引:0,他引:1  
半监督文档聚类,即利用少量具有监督信息的数据来辅助无监督文档聚类,近几年来逐渐成为机器学习和数据挖掘领域研究的热点问题.由于获取大量监督信息费时费力,因此,国内外学者考虑如何获得少量但对聚类性能提高显著的监督信息.提出一种结合主动学习的半监督文档聚类算法,通过引入成对约束信息指导DBSCAN的聚类过程来提高聚类性能,得到一种半监督文档聚类算法Cons-DBSCAN.通过对约束集中所含信息量的衡量和对DBSCAN算法本身的分析,提出了一种启发式的主动学习算法,能够选取含信息量大的成对约束集,从而能够更高效地辅助半监督文档聚类.实验结果表明,所提出的算法能够高效地进行文档聚类.通过主动学习算法获得的成对约束集,能够显著地提高聚类性能.并且,算法的性能优于两个代表性的结合主动学习的半监督聚类算法.  相似文献   

2.
针对当前多文档聚合推导引起的敏感信息泄露问题存在风险大、隐蔽性高的特点,提出了一种基于半监督聚类的文档敏感信息推导方法。首先,为确保在较小的时间开销下获得高质量的约束信息,设计了一种新颖的二阶约束主动学习算法,它通过选择不确定性最大的样本点来生成信息量最大的约束闭包;然后,在引入约束信息的基础上结合DBSCAN提出一种新的半监督聚类算法,它能够有效解决DBSCAN算法存在的边界模糊问题,提高文档聚类准确性;最后,在半监督聚类结果的基础上,对相似文档进行敏感信息可能性测度。实验表明,半监督聚类算法准确率提升明显,推导方法能够有效推导出敏感信息。  相似文献   

3.
聚类问题是近几年来机器学习和数据挖掘领域研究的热点问题,由于获取大量监督信息费时费力,目前国内外研究的重点是如何获得少量但对聚类性能提高显著的监督信息,再加上实际问题中存在的动态模糊性,故本文提出一种结合主动学习的动态模糊聚类算法DF-DBSCAN,通过引入动态模糊等价关系、动态模糊信任测度和动态模糊似然测度这3个约束信息来指导DBSCAN的聚类过程,以提高聚类的性能。实验结果表明,DF-DBSCAN算法不仅解决了实际问题中存在的动态模糊性数据的描述和表示问题,而且能够高效地进行数据聚类,显著地提高聚类性能。   相似文献   

4.
一种基于限制的PAM算法   总被引:3,自引:1,他引:2  
利用数据对象间的关联限制可以改善聚类算法的效果,但对于关联限制与K中心点算法的结合策略则少有研究。由此研究了关联限制与PAM算法的结合方法,提出了算法CPAM。首先基于限制找到一个合适的初始分隔;在接下来反复地调整中心点的过程中,也考虑到了所给限制。实验结果显示:CPAM可以有效地利用关联限制来提高一些实际数据集的准确率。  相似文献   

5.
Data clustering is an important and frequently used unsupervised learning method. Recent research has demonstrated that incorporating instance-level background information to traditional clustering algorithms can increase the clustering performance. In this paper, we extend traditional clustering by introducing additional prior knowledge such as the size of each cluster. We propose a heuristic algorithm to transform size constrained clustering problems into integer linear programming problems. Experiments on both synthetic and UCI datasets demonstrate that our proposed approach can utilize cluster size constraints and lead to the improvement of clustering accuracy.  相似文献   

6.
Clustering problem is an unsupervised learning problem. It is a procedure that partition data objects into matching clusters. The data objects in the same cluster are quite similar to each other and dissimilar in the other clusters. Density-based clustering algorithms find clusters based on density of data points in a region. DBSCAN algorithm is one of the density-based clustering algorithms. It can discover clusters with arbitrary shapes and only requires two input parameters. DBSCAN has been proved to be very effective for analyzing large and complex spatial databases. However, DBSCAN needs large volume of memory support and often has difficulties with high-dimensional data and clusters of very different densities. So, partitioning-based DBSCAN algorithm (PDBSCAN) was proposed to solve these problems. But PDBSCAN will get poor result when the density of data is non-uniform. Meanwhile, to some extent, DBSCAN and PDBSCAN are both sensitive to the initial parameters. In this paper, we propose a new hybrid algorithm based on PDBSCAN. We use modified ant clustering algorithm (ACA) and design a new partitioning algorithm based on ‘point density’ (PD) in data preprocessing phase. We name the new hybrid algorithm PACA-DBSCAN. The performance of PACA-DBSCAN is compared with DBSCAN and PDBSCAN on five data sets. Experimental results indicate the superiority of PACA-DBSCAN algorithm.  相似文献   

7.
K-Hub聚类算法是一种有效的高维数据聚类算法,但是它对初始聚类中心的选择非常敏感,并且对于靠近类边界的实例往往不能正确聚类.为了解决这些问题,提出一种结合主动学习和半监督聚类的K-Hub聚类算法.运用主动学习策略学习部分实例的关联限制,然后利用这些关联限制指导K-Hub的聚类过程.实验结果表明,基于主动学习的K-Hub聚类算法能有效提升K-Hub的聚类准确率.  相似文献   

8.
成对约束的属性加权半监督模糊核聚类算法   总被引:1,自引:0,他引:1  
在机器学习和数据挖掘中,带约束的半监督聚类是一个活跃的研究领域。为了利用约束条件获得表现更优异的聚类效果,提出了一种成对约束的属性加权半监督聚类算法,该方法充分考虑了属性间的不平衡性,在传统模糊聚类算法中融合半监督学习机制并通过Mercer核把原始的观察空间映射到高维特征空间。实验结果表明,该算法优于相似的成对约束的竞争群算法(PCCA)。  相似文献   

9.
将监督信息引入到聚类算法中去,在先前提出的鲁棒联机聚类算法(ROC)的基础上,通过引入以样本类标号形式给出的监督信息,提出了一种半监督的鲁棒联机聚类算法(Semi-ROC).在算法的聚类精度和鲁棒性能上,算法Semi-ROC比ROC和AddC有着更好的性能,在人工数据集和UCI标准数据集上的实验结果表明,Semi-ROC能有效地利用少量的监督信息来提高算法的聚类性能,得到较优的结果.另外,在添加噪声的情况下,算法Semi-ROC比原始的联机聚类算法AddC和ROC都更加鲁棒.  相似文献   

10.
The sensitivity of the constrained K-means clustering algorithm (Cop-Kmeans) to the assignment order of instances is studied, and a novel assignment order learning method for Cop-Kmeans, termed as clustering Uncertainty-based Assignment order Learning Algorithm (UALA), is proposed in this paper. The main idea of UALA is to rank all instances in the data set according to their clustering uncertainties calculated by using the ensembles of multiple clustering algorithms. Experimental results on several real data sets with artificial instance-level constraints demonstrate that UALA can identify a good assignment order of instances for Cop-Kmeans. In addition, the effects of ensemble sizes on the performance of UALA are analyzed, and the generalization property of Cop-Kmeans is also studied.   相似文献   

11.
王纵虎  刘速 《计算机科学》2016,43(12):183-188
半监督聚类能利用少量标记数据来提高聚类算法性能,但大部分文本聚类算法无法直接应用成对约束等先验信息。针对文本数据高维稀疏的特点,提出了一种半监督文本聚类算法。将成对约束信息扩展后嵌入文档相似度矩阵,在此基础上根据已划分与未划分文档之间的统计信息逐步找出剩余未划分文本集合中密集的且与已划分聚类中心集合相似度较小的K个初始聚类中心集合,然后将剩余的相对较难区分的文档结合成对约束限制信息划分到K个初始聚类中心集合,最后通过融合成对约束违反惩罚的收敛准则函数对聚类结果进行进一步优化。算法在聚类过程中自动确定初始聚类中心集合,避免了K均值算法对初始聚类中心选择的敏感性。在几个中英文数据集上的实验结果表明,所提算法能有效地利用少量的成对约束先验信息提高聚类效果。  相似文献   

12.
Semi-supervised dimensionality reduction has attracted an increasing amount of attention in this big-data era. Many algorithms have been developed with a small number of pairwise constraints to achieve performances comparable to those of fully supervised methods. However, one challenging problem with semi-supervised approaches is the appropriate choice of the constraint set, including the cardinality and the composition of the constraint set which, to a large extent, affects the performance of the resulting algorithm. In this work, we address the problem by incorporating ensemble subspaces and active learning into dimensionality reduction and propose a new global and local scatter based semi-supervised dimensionality reduction method with active constraints selection. Unlike traditional methods that select the supervised information in one subspace, we pick up pairwise constraints in ensemble subspaces, where a novel active learning algorithm is designed with both exploration and filtering to generate informative pairwise constraints. The automatic constraint selection approach proposed in this paper can be generalized to be used with all constraint-based semi-supervised learning algorithms. Comparative experiments are conducted on four face database and the results validate the effectiveness of the proposed method.  相似文献   

13.
Traditional semi‐supervised clustering uses only limited user supervision in the form of instance seeds for clusters and pairwise instance constraints to aid unsupervised clustering. However, user supervision can also be provided in alternative forms for document clustering, such as labeling a feature by indicating whether it discriminates among clusters. This article thus fills this void by enhancing traditional semi‐supervised clustering with feature supervision, which asks the user to label discriminating features during defining (labeling) the instance seeds or pairwise instance constraints. Various types of semi‐supervised clustering algorithms were explored with feature supervision. Our experimental results on several real‐world data sets demonstrate that augmenting the instance‐level supervision with feature‐level supervision can significantly improve document clustering performance.  相似文献   

14.
Active learning algorithms allow neural networks to dynamically take part in the selection of the most informative training patterns. This paper introduces a new approach to active learning, which combines an unsupervised clustering of training data with a pattern selection approach based on sensitivity analysis. Training data is clustered into groups of similar patterns based on Euclidean distance, and the most informative pattern from each cluster is selected for training using the sensitivity analysis incremental learning algorithm in (Engelbrecht and Cloete, 1999). Experimental results show that the clustering approach improves on standard active learning as presented in (Engelbrecht and Cloete, 1999).  相似文献   

15.
提出一种基于非负矩阵分解(NMF)的双重约束文本聚类算法。在正交三重NMF模型中,加入文本空间的成对约束信息和词空间的类别约束信息,将不同的特征词项进行分类。利用迭代规则对原始的词-文档矩阵进行分解,获得文本聚类结果。与多种传统半监督文本聚类算法的对比结果表明,该算法具有较高的聚类精度,能提供更准确和有效的聚类结果。  相似文献   

16.
现有的半监督聚类集成方法能利用先验信息,使集成的准确性、鲁棒性和稳定性得到提高,但在集成阶段加入成对约束信息时,只考虑了给定的约束信息而忽视了约束点与被约束点的邻域点之间的关系.针对此问题,提出了一种基于数据相关性的半监督模糊聚类集成方法.该方法首先利用半监督模糊聚类算法建立集成信息矩阵,并将其转换为相似性矩阵;然后,利用已知的约束信息及约束点与被约束点的邻域点之间的关系来修改相似性矩阵;最后,利用图划分算法得到最终的聚类结果.真实数据上的实验结果表明,提出的方法可以有效提高聚类质量.  相似文献   

17.
基于事件项语义图聚类的多文档摘要方法   总被引:2,自引:2,他引:0  
基于事件的抽取式摘要方法一般首先抽取那些描述重要事件的句子,然后把它们重组并生成摘要。该文将事件定义为事件项以及与其关联的命名实体,并聚焦从外部语义资源获取的事件项语义关系。首先基于事件项语义关系创建事件项语义关系图并使用改进的DBSCAN算法对事件项进行聚类,接着为每类选择一个代表事件项或者选择一类事件项来表示文档集的主题,最后从文档抽取那些包含代表项并且最重要的句子生成摘要。该文的实验结果证明在多文档自动摘要中考虑事件项语义关系是必要的和可行的。  相似文献   

18.
何海江  龙跃进 《计算机应用》2011,31(11):3108-3111
针对标记训练集不足的问题,提出了一种协同训练的多样本排序学习算法,从无标签数据挖掘隐含的排序信息。算法使用了两类多样本排序学习机,从当前已有的标记数据集分别构造两个不同的排序函数。相应地,每一个无标签查询都有两个不同的文档排列,由似然损失来计算这两个排列的相似性,为那些文档排列相似度低的查询贴上标签,使两个多样本排序学习机新增了训练数据。在排序学习公开数据集LETOR上的实验结果证实,协同训练的排序算法很有效。另外,还讨论了标注比例对算法的影响。  相似文献   

19.
带约束条件的聚类算法研究   总被引:7,自引:0,他引:7  
该文描述了带约束条件的聚类和约束条件的分类。在介绍CLIQUE算法的基础上,通过对CLIQUE算法的改进,提出了一种能够在高维空间中处理实例对约束条件的聚类算法CON-CLIQUE。通过实验验证了该算法的正确性和效率。  相似文献   

20.
为了在只有少量已知标记的数据集中获得较好的聚类效果,提出了一种基于图收缩的半监督聚类算法。首先将整个样本空间中的数据表达为一个带权图,再根据给出的must-link约束,对图进行边收缩的修改,进而增强must-link约束。在此基础上引入图拉普拉斯算子,结合cannot-link约束将样本空间投影到一个特征子空间。最后在子空间上进行聚类分析。实验结果表明,该方法不仅提高了对复杂数据的聚类结果,而且在约束对数量较少时也能获得较好的结果。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号