Similar literature
Found 20 similar documents (search time: 31 ms)
1.
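Spectral clustering recasts data clustering as a graph-partitioning problem and clusters the data points by finding the optimal subgraphs. The key is to construct a suitable similarity matrix that faithfully describes the intrinsic structure of the data set. Traditional spectral clustering algorithms that build the similarity matrix with a Gaussian kernel are very sensitive to the choice of the scale parameter, and because the clustering stage must pick initial cluster centers at random, their clustering performance is unstable. To address these problems, this paper proposes a spectral clustering algorithm based on message passing. The algorithm uses a density-adaptive similarity measure, which better describes the relationships between data points, and then uses the "message passing" mechanism of affinity propagation (AP) clustering to obtain high-quality cluster centers, improving the performance of spectral clustering. Experiments show that the new algorithm handles multi-scale data sets effectively, that its clustering performance is very stable, and that its clustering quality is better than that of traditional spectral clustering and k-means.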
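
The scale sensitivity mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard Gaussian kernel exp(-d^2 / (2 sigma^2)); the function name `gaussian_affinity` and the sample points are hypothetical, and the paper's density-adaptive measure is not reproduced here because the abstract does not define it.

```python
import math

def gaussian_affinity(p, q, sigma):
    """Standard Gaussian-kernel similarity between two points."""
    return math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2))

# Hypothetical pairs: one from a dense cluster, one from a sparse cluster.
near_pair = ((0.0, 0.0), (0.1, 0.0))
far_pair = ((0.0, 0.0), (2.0, 0.0))

# A small sigma drives the sparse pair's affinity to almost zero, while a
# large sigma makes both pairs look nearly identical.
for sigma in (0.1, 1.0, 10.0):
    print(sigma,
          gaussian_affinity(*near_pair, sigma),
          gaussian_affinity(*far_pair, sigma))
```

No single value of sigma separates both scales well in this toy case, which is the kind of problem a density-adaptive measure is meant to avoid.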

2.
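Unsupervised clustering algorithms from machine learning are widely used in object recognition tasks. The clustering by fast search of density peaks algorithm (DPC) determines cluster centers and the number of clusters quickly and effectively, but on data with complex distribution shapes and on high-dimensional image data it still has difficulty identifying cluster centers and tends to find too few clusters. To make it more robust on complex high-dimensional data, this paper proposes a density peak fast search clustering algorithm based on a learned feature representation (AE-MDPC). The algorithm uses an unsupervised autoencoder (AutoEncoder) to learn the optimal feature representation of the data and combines it with a manifold similarity that captures the global consistency of the data. This increases the compactness of data within a class and the separation between classes, so that the density of potential class centers becomes a local maximum. AE-MDPC is compared with the classical K-means, DBSCAN, and DPC algorithms and with DPC combined with PCA on 4 synthetic data sets and 4 real image data sets. The results show that AE-MDPC outperforms the compared algorithms on the external evaluation metric clustering accuracy and on the internal evaluation metrics adjusted mutual information and adjusted Rand index, and that it also gives better visualization. In short, AE-MDPC, which is based on feature representation learning combined with manifold distance, handles complex manifold data and high-dimensional image data effectively.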

3.
To cluster web documents, all of which have the same name entities, we attempted to use existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms turned out not to be effective at clustering web documents. After intensive investigation, we found that clustering such web pages is more complicated because (1) the number of clusters (known as ground truth) is larger than the two or three clusters found in general clustering problems and (2) the clusters in the data set have extremely skewed size distributions. To overcome these problems, we propose an effective clustering algorithm that boosts the accuracy of the K-means and spectral clustering algorithms. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to cluster web documents correctly. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.

4.
We present a modified find density peaks (MFDP) clustering algorithm. In MFDP, a critical parameter, dc, is defined automatically by minimizing the entropy of all points. By considering both the point density, ρ, and the distance from points with higher densities, δ, the high-dimensional points are transformed into a 2D space. The halo points of the original FDP clustering algorithm are redefined, and a definition of boundary points is introduced to describe the intersection region between clusters. To demonstrate its clustering ability, MFDP is compared with distance-based K-means clustering and with the density-based algorithms DBSCAN and original FDP. Four criteria are introduced to evaluate the clustering algorithms quantitatively. In most cases, MFDP gives a better clustering result than both the typical clustering algorithms and FDP on 20 commonly used benchmark data sets, particularly in clearly depicting the intersection region between clusters. Finally, we evaluate the performance of MFDP in the cluster analysis of conformations in molecular dynamics (MD). In the MD clustering process, eight typical cluster center conformations are selected in six collective variable spaces, and the result is in strong agreement with the experimental results. The clustering results demonstrate the potential for generalized application of the modified algorithm to similar problems.
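
The two quantities ρ and δ that map each point into the 2D decision space can be sketched as follows. This is a minimal illustration of the original density peaks definitions with a cutoff kernel, not the MFDP implementation: the name `density_peak_scores` is hypothetical, and `dc` is passed in by hand rather than chosen by entropy minimization.

```python
import math

def density_peak_scores(points, dc):
    """Return (rho, delta) lists for the given points and cutoff distance dc."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points closer to point i than dc.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances from point i to every point of strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # By convention, a point with no denser point gets its largest
        # distance to any other point.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical toy data: a group of three points and a group of two.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
rho, delta = density_peak_scores(points, dc=1.5)
print(rho, delta)
```

Points with both a high ρ and a high δ are the candidates for cluster centers when the pairs (ρ, δ) are plotted.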

5.
Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To address this issue, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, which is an extension of the k-modes clustering algorithm. In the proposed algorithm, a novel weighting technique for categorical data calculates two weights for each attribute (or dimension) in each cluster and uses the weight values to identify the subsets of important attributes that categorize different clusters. The convergence of the algorithm under an optimization framework is proved. The performance and scalability of the algorithm are evaluated experimentally on both synthetic and real data sets. The experimental studies show that the proposed algorithm is effective in clustering categorical data sets and also scales to large data sets, owing to its linear time complexity with respect to the number of data objects, attributes, or clusters.

6.
This paper presents a new k-means type algorithm for clustering high-dimensional objects in subspaces. In high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. For example, in text clustering, clusters of documents on different topics are categorized by different subsets of terms or keywords. The keywords for one cluster may not occur in the documents of other clusters. This is a data sparsity problem faced in clustering high-dimensional data. In the new algorithm, we extend the k-means clustering process to calculate a weight for each dimension in each cluster and use the weight values to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to automatically compute the weights of all dimensions in each cluster. Experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. The new algorithm also scales to large data sets.
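
The additional weighting step can be sketched under an assumption the abstract does not spell out: that each cluster minimizes the weighted dispersion sum_j w_j * D_j plus an entropy term gamma * sum_j w_j * log(w_j), with the weights summing to 1. Under that assumption the minimizer has the closed form w_j proportional to exp(-D_j / gamma). The names `cluster_dispersion`, `entropy_weights`, and `gamma` are hypothetical.

```python
import math

def cluster_dispersion(members, center):
    """Per-dimension sum of squared deviations of the members from the center."""
    return [sum((x[j] - center[j]) ** 2 for x in members)
            for j in range(len(center))]

def entropy_weights(dispersion, gamma):
    """Weights proportional to exp(-D_j / gamma), normalized to sum to 1."""
    # Subtracting the minimum avoids underflow and leaves the result unchanged.
    shift = min(dispersion)
    scores = [math.exp(-(d - shift) / gamma) for d in dispersion]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical cluster: tight in dimension 0, spread out in dimension 1.
members = [(1.0, 0.0), (1.1, 5.0), (0.9, -5.0)]
center = (1.0, 0.0)
weights = entropy_weights(cluster_dispersion(members, center), gamma=1.0)
print(weights)
```

The dimension in which the cluster is compact receives almost all of the weight. A larger `gamma` spreads the weight more evenly across dimensions.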

7.
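An adaptive subspace clustering algorithm for high-dimensional data streams   Cited by: 1 (self-citations: 0, other citations: 1)
Clustering high-dimensional data streams is an active research topic in data mining. Because data streams are large, fast-changing, and high-dimensional, many clustering algorithms cannot achieve good clustering quality on them. This paper proposes SAStream, an adaptive subspace clustering algorithm for high-dimensional data streams. The algorithm improves the micro-cluster structure of HPStream and defines candidate clusters. It computes the distance from each newly arrived data point to the candidate cluster centroids only within the corresponding subspace, which reduces the number of micro-clusters checked during clustering. The resulting micro-clusters are stored in a pyramidal time frame, and a time decay function is used to delete expired micro-clusters. When the data stream volume is high, the algorithm automatically adjusts the boundary radius and the cluster selection factor according to monitored system resource usage, thereby tuning the clustering granularity. Experimental results show that the algorithm achieves good clustering quality and fast data processing.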

8.
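A semi-supervised K-means clustering algorithm for multi-relational data   Cited by: 1 (self-citations: 0, other citations: 1)
高滢  刘大有  齐红  刘赫 《软件学报》 (Journal of Software) 2008,19(11):2814-2821
This paper proposes a semi-supervised K-means clustering algorithm for multi-relational data. Building on the K-means clustering algorithm, it extends the method for selecting initial clusters and the object similarity measure so that they can be used for semi-supervised learning on multi-relational data. To achieve high performance, the algorithm makes full use of labeled data, object attributes, and the various kinds of relational information during clustering. Experimental results on the multi-relational database Movie confirm the effectiveness of the algorithm.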

9.
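To improve clustering quality and efficiency on financial business data sets, two concepts are introduced: cluster diameter and inter-cluster similarity. A center-distance-order dimensionality reduction method, based on distance-scale reduction, reduces the multidimensional data to one dimension, where the adaptive sorting clustering algorithm ASC performs the clustering. In comparisons with the traditional Cobweb and K-means algorithms, experiments show that the method improves inter-cluster similarity, by up to 200%.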

10.
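Objective: The high dimensionality and nonlinear structure of hyperspectral images confront clustering with the "curse of dimensionality" and with linear inseparability, and previous work separates feature extraction from clustering, which makes the two hard to optimize together. To solve these problems, a new embedded deep neural network fuzzy C-means clustering method (EDFCC) is proposed. Method: To extract more effective deep features, EDFCC jointly optimizes feature extraction and clustering for hyperspectral images by embedding the fuzzy C-means clustering algorithm in a deep autoencoder network. This retains the advantages of joint optimization while exploiting the deep autoencoder's capacity for dimensionality reduction and for approximating arbitrary nonlinear functions, progressively mapping the raw data into a latent feature space and extracting deep features. The method uses fuzzy C-means to constrain feature extraction, learns deep features of hyperspectral data that are suited to clustering, and dynamically adjusts the cluster indicator matrix. Results: Experiments show that EDFCC reaches clustering accuracies of 42.95% and 60.59% on the Indian Pines and Pavia University hyperspectral data sets, respectively, which is 3% and 4% higher than the popular low-rank subspace clustering algorithm (LRSC) and 2% and 3% higher than the autoencoder-based data clustering algorithm (AEKM). Conclusion: EDFCC extracts more effective deep features from the high-dimensional spectral information of hyperspectral images and improves clustering accuracy; because it needs no additional training process, it also greatly improves clustering efficiency.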

11.
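孙林  秦小营  徐久成  薛占熬 《软件学报》 (Journal of Software) 2022,33(4):1390-1411
Density peak clustering (DPC) is a simple and effective cluster analysis method. In practice, however, for data sets with large density differences between clusters or with multiple density peaks within a cluster, DPC has difficulty selecting the correct cluster centers; in addition, DPC's point assignment suffers from a domino effect. To address these problems, a density peak clustering algorithm based on K-nearest neighbors (KNN) and an optimized assignment strategy is proposed. First, candidate cluster centers are determined from KNN, the local density of points, and boundary points; a path distance is defined to reflect the similarity between candidate centers, and, based on it, a density factor and a distance factor are proposed to quantify how likely each candidate is to be a cluster center, from which the cluster centers are determined. Then, to improve the accuracy of point assignment, a similarity is constructed from shared neighbors, the nearest neighbor of higher density, density differences, and the distances between KNN, and the concepts of neighborhood, similar set, and similar domain are introduced to assist assignment; the initial clustering result is determined from similar domains and boundary points, and an intermediate clustering result is obtained based on the cluster centers. Finally, based on the intermediate result and the similar sets, each cluster is divided into multiple layers from its center to its boundary, and a point assignment strategy is designed for each layer; for the points in a given layer, a positive value is proposed, based on similar domains and positive domains, to determine the assignment order, and each point is assigned to the dominant cluster in its positive domain, giving the final clustering result. Simulations were run on 11 synthetic data sets and 27 real data sets...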

12.
This paper proposes a new method to weight subspaces in feature groups and individual features for clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups based on their natural characteristics. Two types of weights are introduced into the clustering process to simultaneously identify the importance of feature groups and individual features in each cluster. A new optimization model is given to define the optimization process, and a new clustering algorithm, FG-k-means, is proposed to solve it. The new algorithm extends k-means with two additional steps that automatically calculate the two types of subspace weights. A new data generation method is presented to generate high-dimensional data with clusters in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data show that the FG-k-means algorithm significantly outperformed four k-means type algorithms (k-means, W-k-means, LAC, and EWKM) in almost all experiments. The new algorithm is robust to the noise and missing values that commonly exist in high-dimensional data.

13.
A hybrid clustering procedure for concentric and chain-like clusters   Cited by: 1 (self-citations: 0, other citations: 1)
The K-means algorithm is a well-known nonhierarchical method for clustering data. Its most important limitations are that (1) its final clusters depend on the cluster centroids or seed points chosen initially, and (2) it is appropriate only for data sets with fairly isotropic clusters. Its advantage is low computation and storage requirements. The hierarchical agglomerative clustering algorithm, on the other hand, can cluster nonisotropic (chain-like and concentric) clusters but has high storage and computation requirements. This paper suggests a new method for selecting the initial seed points, so that the K-means algorithm gives the same results for any input data order. This paper also describes a hybrid clustering algorithm, based on the concepts of multilevel theory, which is nonhierarchical at the first level and hierarchical from the second level onwards, to cluster data sets having (i) chain-like clusters and (ii) concentric clusters. This hybrid clustering algorithm is observed to give the same results as the hierarchical clustering algorithm, with lower computation and storage requirements.

14.
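Porting isosurface generation methods from a client/server (C/S) to a browser/server (B/S) architecture suffers from low efficiency and poor interactivity. To address this, a fast isosurface generation method based on ArcGIS Server is proposed. A model is created with ModelBuilder in ArcGIS Server and published as a geoprocessing service; the Web service provided by the server is then called over SOAP to generate the isosurface, which is rendered on the client. Its application in the Dongguan "three defenses" (flood, drought, and wind control) decision support subsystem shows that the method meets user needs well in efficiency, appearance, and interactivity, reduces network traffic, and improves GIS analysis performance.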

15.
Clustering has long been an important data processing task in different applications. Typically, it attempts to partition the available data into groups according to their underlying distributions, and each cluster is represented by a center or an exemplar. In this paper, a new clustering algorithm called gravitational-force-based affinity propagation (GAP) is proposed, based on Newton's well-known law of universal gravitation. It views the available data points as nodes of a network (or planets of a universe), and the clusters and their corresponding exemplars are obtained by transmitting affinity messages based on the gravitational forces between data points in the network. While GAP is inspired by the recently proposed affinity propagation (AP) clustering approach, it provides a new definition of the similarity between data points, which makes the AP process more convincing and at the same time makes it easier to differentiate the importance of data points. The experimental results show that the GAP clustering algorithm, with comparable clustering accuracy, is even more efficient than the original AP clustering approach.

16.
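宋艳  殷俊 《计算机应用》 (Journal of Computer Applications) 2020,40(11):3211-3216
In spectral clustering, the constructed similarity matrix often fails to make data points within the same cluster highly similar. To solve this problem, a multi-view spectral clustering algorithm based on shared nearest neighbors (MV-SNN) is presented. First, the algorithm raises the similarity of two data points that share many nearest neighbors, so that data in the same cluster are more similar to each other. Then, the improved similarity matrices of the individual views are added together to obtain a global similarity matrix. Finally, because general spectral clustering algorithms still need k-means in a later stage to partition the data points, a rank constraint on the Laplacian matrix is introduced so that the final cluster structure is obtained directly from the global similarity matrix. Experimental results show that, compared with several other multi-view spectral clustering algorithms, MV-SNN improves performance on three clustering metrics (accuracy, purity, and normalized mutual information) by 1% to 20% and reduces clustering time by about 50%, so MV-SNN clusters better and takes less time.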
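
The first two steps (boosting similarity by shared nearest neighbors, then summing the views) can be sketched as follows. The abstract does not give the exact formula, so the boost factor `1 + shared / k`, the Gaussian base similarity, and the names `knn_sets`, `snn_boosted_similarity`, and `fuse_views` are all assumptions made for illustration; the Laplacian rank constraint is not covered.

```python
import math

def knn_sets(points, k):
    """For each point, the set of indices of its k nearest other points."""
    n = len(points)
    sets = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        sets.append(set(order[:k]))
    return sets

def snn_boosted_similarity(points, k, sigma=1.0):
    """Gaussian similarity scaled up by the number of shared nearest neighbors."""
    n = len(points)
    neighbors = knn_sets(points, k)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            base = math.exp(-math.dist(points[i], points[j]) ** 2
                            / (2 * sigma ** 2))
            shared = len(neighbors[i] & neighbors[j])
            # Assumed boost: the more neighbors two points share, the larger it is.
            sim[i][j] = base * (1 + shared / k)
    return sim

def fuse_views(similarities):
    """Element-wise sum of the per-view similarity matrices."""
    n = len(similarities[0])
    return [[sum(s[i][j] for s in similarities) for j in range(n)]
            for i in range(n)]

# Hypothetical data: two views of the same six objects.
view_a = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
view_b = [(0, 0), (1, 1), (1, 0), (7, 7), (7, 8), (8, 7)]
fused = fuse_views([snn_boosted_similarity(v, k=2) for v in (view_a, view_b)])
print(fused[0][1], fused[0][3])
```

In this toy example the fused similarity between objects 0 and 1, which share a neighbor in both views, is far larger than between objects 0 and 3, which lie in different groups.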

17.
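宋艳  殷俊 《计算机应用》 (Journal of Computer Applications) 2020,40(11):3211-3216
In spectral clustering, the constructed similarity matrix often fails to make data points within the same cluster highly similar. To solve this problem, a multi-view spectral clustering algorithm based on shared nearest neighbors (MV-SNN) is presented. First, the algorithm raises the similarity of two data points that share many nearest neighbors, so that data in the same cluster are more similar to each other. Then, the improved similarity matrices of the individual views are added together to obtain a global similarity matrix. Finally, because general spectral clustering algorithms still need k-means in a later stage to partition the data points, a rank constraint on the Laplacian matrix is introduced so that the final cluster structure is obtained directly from the global similarity matrix. Experimental results show that, compared with several other multi-view spectral clustering algorithms, MV-SNN improves performance on three clustering metrics (accuracy, purity, and normalized mutual information) by 1% to 20% and reduces clustering time by about 50%, so MV-SNN clusters better and takes less time.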

18.
Clustering is an important technique in data mining. The algorithm proposed in this paper obtains clusters by first identifying boundary points, in contrast to existing methods that compute core cluster points before expanding to the boundary points. To achieve this, an affine space-based boundary detection algorithm is used to divide data points into cluster boundary points and internal points. A connection matrix is then formed by establishing neighbor relationships between internal and boundary points, and clustering is performed on it. Our clustering algorithm with affine space-based boundary detection accurately detected clusters in data sets with different densities, shapes, and sizes. The algorithm excelled at dealing with high-dimensional data sets.

19.
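Using a density-sensitive distance as the similarity measure and combining the affinity propagation clustering algorithm with spectral clustering, a density-sensitive hierarchical clustering algorithm is proposed. With the density-sensitive distance as similarity, the algorithm applies affinity propagation several times to select a set of "possible class representative points" from the data set. It then re-clusters these points with spectral clustering to obtain the "final class representative points". Each data point takes its class label from the label of its class representative point. Experimental results show that the algorithm outperforms the traditional affinity propagation and spectral clustering algorithms in processing time, memory usage, and clustering error rate.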

20.
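刘奕志  程汝峰  梁永全 《计算机科学》 (Computer Science) 2018,45(2):125-129, 146
The density peak discovery algorithm based on weighted K-nearest neighbors (FKNN-DPC) is a simple and efficient clustering algorithm. It finds cluster centers automatically and uses weighted K-nearest neighbors to assign non-center samples quickly and accurately, producing high-quality clustering results on data sets of any size, dimensionality, and shape. However, the weights in its sample assignment strategy consider only the Euclidean distance between samples. This paper proposes a similarity measure based on shared nearest neighbors and uses it to improve the sample assignment strategy, so that sample assignment better matches the true cluster membership and clustering quality improves. Experiments on real UCI data sets compare the proposed algorithm with K-means, DBSCAN, AP, DPC, and FKNN-DPC, and confirm its effectiveness.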
