Similar Articles
1.
An Adaptive Soft Subspace Clustering Algorithm
陈黎飞  郭躬德  姜青山 《软件学报》2010,21(10):2513-2523
Soft subspace clustering is an important technique for analyzing high-dimensional data. Existing algorithms usually require the user to set several global key parameters in advance and do not optimize the subspaces themselves. This paper proposes a new objective function for soft subspace clustering that minimizes the intra-cluster compactness of the subspace clusters while maximizing the projected subspace in which each cluster resides. From this objective a new local feature-weighting scheme is derived, and on this basis an adaptive k-means-type soft subspace clustering algorithm is proposed. During clustering, the algorithm dynamically computes the optimal parameter values from the data set and its current partition. Experiments on real-world and synthetic data sets show that the algorithm substantially improves clustering accuracy and the stability of the clustering results.
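The abstract does not reproduce the derived weighting formula, so the following is only a minimal Python sketch of a k-means-type soft subspace iteration with per-cluster feature weights; the exponential weighting rule is an entropy-style stand-in for the paper's derivation, and all names are hypothetical.

```python
import numpy as np

def soft_subspace_kmeans(X, k, gamma=1.0, n_iter=50, seed=0):
    """Minimal sketch: k-means-type clustering with per-cluster
    feature weights (soft subspaces). Features with low within-
    cluster dispersion get larger weights; gamma controls how
    strongly the weights concentrate on few features."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, k, replace=False)]
    weights = np.full((k, d), 1.0 / d)            # start with uniform weights
    for _ in range(n_iter):
        # weighted squared distance of every point to every center: (n, k)
        dist = (((X[:, None, :] - centers[None, :, :]) ** 2)
                * weights[None, :, :]).sum(-1)
        labels = dist.argmin(1)
        for c in range(k):
            members = X[labels == c]
            if len(members) == 0:
                continue
            centers[c] = members.mean(0)
            disp = ((members - centers[c]) ** 2).sum(0)   # per-feature dispersion
            w = np.exp(-(disp - disp.min()) / gamma)      # low dispersion -> high weight
            weights[c] = w / w.sum()
    return labels, centers, weights
```

The per-cluster weight vectors returned at the end indicate which features span each cluster's soft subspace.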

2.
Projective clustering by histograms
Recent research suggests that clustering high-dimensional data should involve searching for "hidden" subspaces of lower dimensionality, in which patterns emerge when the data objects are projected onto those subspaces. Discovering such inter-attribute correlations and the location of the corresponding clusters is known as the projective clustering problem. We propose an efficient projective clustering technique by histogram construction (EPCH). The histograms help to generate "signatures", where a signature corresponds to some region in some subspace, and signatures with a large number of data objects are identified as the regions of subspace clusters. Hence, projected clusters and their corresponding subspaces can be uncovered. Compared with the best previously known methods, this approach is more flexible in that it requires less prior knowledge of the data set, and it is also much more efficient. Our experiments compare the behavior and performance of this approach and other projective clustering algorithms on data with different characteristics. The results show that our technique scales to very large databases and returns accurate clustering results.
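As a hedged illustration of the signature idea (a deliberate simplification, not the published EPCH procedure), the sketch below builds one histogram per attribute, marks unusually dense bins, and groups objects by the set of dense bins they fall into; the threshold rule and all parameters are assumptions.

```python
import numpy as np
from collections import defaultdict

def dense_bin_signatures(X, n_bins=10, density_factor=2.0, min_support=2):
    """Illustrative sketch: per-attribute histograms mark dense bins;
    an object's signature is the set of (attribute, dense bin) pairs
    it falls into, and well-supported signatures point to candidate
    subspace clusters."""
    n, d = X.shape
    dense = {}
    for j in range(d):
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        dense[j] = (set(np.flatnonzero(counts > density_factor * counts.mean())),
                    edges)
    buckets = defaultdict(list)
    for i in range(n):
        sig = []
        for j in range(d):
            bins, edges = dense[j]
            b = min(int(np.searchsorted(edges, X[i, j], side="right")) - 1,
                    n_bins - 1)
            if b in bins:
                sig.append((j, b))
        buckets[tuple(sig)].append(i)
    # signatures supported by several objects suggest subspace clusters
    return {s: ids for s, ids in buckets.items()
            if s and len(ids) >= min_support}
```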

3.
Density Conscious Subspace Clustering for High-Dimensional Data
Instead of finding clusters in the full feature space, subspace clustering is an emergent task that aims at detecting clusters embedded in subspaces. Most previous works in the literature are density-based approaches, where a cluster is regarded as a high-density region in a subspace. However, the identification of dense regions in previous works fails to consider a critical problem, called "the density divergence problem" in this paper, which refers to the phenomenon that region densities vary across subspace cardinalities. Without considering this problem, previous works use a single density threshold to discover the dense regions in all subspaces, which incurs a serious loss of clustering accuracy (in either recall or precision of the resulting clusters) across subspace cardinalities. To tackle the density divergence problem, we devise a novel subspace clustering model that discovers clusters based on the relative region densities in the subspaces, where clusters are regarded as regions whose densities are relatively high compared with the other region densities in the same subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace cardinalities. Because previous techniques cannot be applied in this novel clustering model, we also devise an innovative algorithm, referred to as DENCOS (DENsity COnscious Subspace clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities. As validated by our extensive experiments on various data sets, DENCOS discovers clusters of high quality in all subspaces and outperforms previous works in efficiency.
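To make the density-divergence point concrete, here is a small illustrative sketch (not DENCOS itself) in which a grid cell counts as dense only relative to the average occupied-cell density within its own subspace cardinality; the equi-width grid and the factor alpha are assumptions.

```python
import numpy as np
from itertools import combinations

def relative_dense_cells(X, n_bins=8, alpha=1.5, max_card=2):
    """Illustrative sketch: a cell is dense if its count exceeds
    alpha times the mean occupied-cell count *of its own subspace
    cardinality*, so the threshold adapts as cardinality grows."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # equi-width discretization of every attribute into n_bins bins
    grid = np.empty((n, d), dtype=int)
    for j in range(d):
        span = np.ptp(X[:, j]) + 1e-12
        grid[:, j] = np.minimum((X[:, j] - X[:, j].min()) / span * n_bins,
                                n_bins - 1).astype(int)
    dense = {}
    for card in range(1, max_card + 1):
        for attrs in combinations(range(d), card):
            cells, counts = np.unique(grid[:, list(attrs)], axis=0,
                                      return_counts=True)
            threshold = alpha * counts.mean()      # adaptive per cardinality
            for cell, c in zip(cells, counts):
                if c > threshold:
                    dense[(attrs, tuple(cell))] = int(c)
    return dense
```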

4.
Subspace clustering is a data-mining task that groups similar data objects and at the same time searches for the subspaces where the similarities appear. For this reason, subspace clustering is recognized as more general and complicated than standard clustering. In this article, we present ChameleoClust+, a bioinspired evolutionary subspace clustering algorithm that takes advantage of an evolvable genome structure to detect various numbers of clusters located in different subspaces. ChameleoClust+ incorporates several biolike features such as a variable genome length, both functional and nonfunctional elements, and mutation operators including large rearrangements. It was assessed and compared with state-of-the-art methods on a reference benchmark using both real-world and synthetic data sets. Whereas other algorithms may require complex parameter settings, ChameleoClust+ needs only one ad hoc, intuitive subspace clustering parameter to be set: the maximal number of clusters. The remaining parameters of ChameleoClust+ concern the evolution strategy (e.g., population size, mutation rate), and a single setting for all of them turned out to be effective on all the benchmark data sets. A sensitivity analysis was also carried out to study the impact of each parameter on subspace clustering quality.

5.
Almost all subspace clustering algorithms proposed so far are designed for numeric datasets. In this paper, we present a k-means-type clustering algorithm that finds clusters in data subspaces of mixed numeric and categorical datasets. In this method, we compute the contribution of each attribute to the different clusters, and we propose a new cost function for a k-means-type algorithm. One advantage of this algorithm is its complexity, which is linear in the number of data points. The algorithm is also useful for describing cluster formation in terms of the attributes' contributions to the different clusters. The algorithm is tested on various synthetic and real datasets to show its effectiveness. The clustering results are explained using the attribute weights in the clusters and are also compared with published results.
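The abstract does not reproduce the cost function itself; below is a plausible minimal form, offered only as a sketch, in which per-cluster attribute weights scale a squared Euclidean term for numeric attributes and a simple mismatch term for categorical ones. The k-prototypes-style split and the trade-off parameter gamma are assumptions, not the paper's notation.

```python
import numpy as np

def weighted_mixed_cost(X_num, X_cat, centers, modes,
                        w_num, w_cat, labels, gamma=1.0):
    """Sketch of a k-means-type cost for mixed data: per-cluster
    attribute weights (w_num[c], w_cat[c]) express each attribute's
    contribution to cluster c; gamma trades off the two parts."""
    cost = 0.0
    for i, c in enumerate(labels):
        num_term = (w_num[c] * (X_num[i] - centers[c]) ** 2).sum()
        cat_term = (w_cat[c] * (X_cat[i] != modes[c])).sum()
        cost += num_term + gamma * cat_term
    return cost
```

Minimizing such a cost would alternate the usual center/mode updates with a weight update derived from per-attribute dispersion within each cluster.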

6.
Traditional clustering models based on distance similarity are not always effective in capturing correlation among data objects, whereas pattern-based clustering does well at identifying correlation hidden among data objects. However, state-of-the-art pattern-based clustering methods are inefficient and provide no metric for measuring clustering quality. This paper presents a new pattern-based subspace clustering method that tackles both problems. Observing the analogy between mining frequent itemsets and discovering subspace clusters, we apply the pattern tree, a structure used in frequent itemset mining, to determine the target subspaces by scanning the database once, which can be done efficiently on large datasets. Furthermore, we introduce a general clustering quality evaluation model to guide the identification of meaningful clusters. The proposed method enables users to flexibly set proper quality-control parameters to meet different needs. Experimental results on synthetic and real datasets show that our method outperforms existing methods in both efficiency and effectiveness.

7.
Robust projected clustering
Projected clustering partitions a data set into several disjoint clusters, plus outliers, so that each cluster exists in a subspace. Subspace clustering enumerates clusters of objects in all subspaces of a data set, and it tends to produce many overlapping clusters. Such algorithms have been extensively studied for numerical data, but only a few have been proposed for categorical data. Typical drawbacks of existing projected and subspace clustering algorithms for numerical or categorical data are that they rely on parameters whose appropriate values are difficult to determine, or that they are unable to identify projected clusters with few relevant attributes. We present P3C, a robust algorithm for projected clustering that can effectively discover projected clusters in the data while minimizing the number of required parameters. P3C does not need the number of projected clusters as input and can discover, under very general conditions, the true number of projected clusters. P3C is effective in detecting very low-dimensional projected clusters embedded in high-dimensional spaces. P3C positions itself between projected and subspace clustering in that it can compute both disjoint and overlapping clusters. P3C is the first projected clustering algorithm for both numerical and categorical data.

8.
As one of the most fundamental yet important methods of data clustering, the center-based partitioning approach clusters a dataset into k subsets, each represented by a centroid or medoid. In this paper, we propose a new medoid-based k-partitioning approach called Clustering Around Weighted Prototypes (CAWP), which works with a similarity matrix. In CAWP, each cluster is characterized by multiple objects with different representative weights. With this new cluster representation scheme, CAWP aims to simultaneously produce clusters of improved quality and a set of ranked representative objects for each cluster. An efficient algorithm is derived to alternately update the clusters and the representative weights of objects with respect to each cluster. An annealing-like optimization procedure is incorporated to alleviate the local-optimum problem for better clustering results and, at the same time, to make the algorithm less sensitive to parameter settings. Experimental results on benchmark document datasets show that CAWP achieves favorable effectiveness and efficiency in clustering, and also provides useful information for cluster-specific analysis.

9.
To address the low clustering accuracy of existing subspace clustering methods when clusters overlap, this paper proposes an overlapping subspace clustering algorithm based on a probabilistic model. First, a mixed-norm subspace representation is used to segment the high-dimensional data into several subspaces. Then a probabilistic model whose distributions belong to the exponential family identifies the overlapping portions of the data within the subspaces and assigns each data point to the correct subspace, yielding the clustering result; during parameter estimation, an alternating maximization method is used to find the optimal solution. Experiments on synthetic data sets and UCI data sets show that the proposed algorithm achieves good clustering performance and is suitable for fairly large data sets.

10.
In high-dimensional data, many dimensions are irrelevant to one another and clusters are usually hidden under noise. As an important extension of traditional clustering, subspace clustering can simultaneously cluster high-dimensional data into several subspaces and associate the low-dimensional subspaces with the corresponding points. In subspace clustering, a crucial step is constructing an affinity matrix with block-diagonal form, in which the blocks correspond to different clusters. Distance-based methods and representation-based methods are the two major types of approaches for building an informative affinity matrix. In general, it is the difference between the density inside and outside the blocks that determines the efficiency and accuracy of the clustering. In this work, we introduce a well-known approach from statistical physics, namely link prediction, to enhance subspace clustering by reinforcing the affinity matrix. More importantly, we introduce the idea of combining complex network theory with machine learning. By revealing the hidden links inside each block, we maximize the density of each block along the diagonal while keeping the remaining off-block entries of the affinity matrix as sparse as possible. Our method shows remarkably improved clustering accuracy compared with existing methods on well-known datasets.
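A minimal sketch of such a reinforcement step, assuming the common-neighbours index as the link predictor (one of the simplest choices; the paper may use a different score), is given below; spectral clustering would then run on the reinforced matrix instead of the original affinities.

```python
import numpy as np

def reinforce_affinity(A, beta=0.5):
    """Sketch: score hidden links with the common-neighbours index
    A @ A, which is large for pairs sharing many strong neighbours,
    then blend it with the original affinities. This densifies the
    diagonal blocks while leaving unrelated pairs near zero."""
    A = (A + A.T) / 2.0                  # symmetrize the affinity matrix
    np.fill_diagonal(A, 0.0)
    S = A @ A                            # common-neighbour link scores
    S = S / (S.max() + 1e-12)            # rescale to [0, 1]
    R = (1 - beta) * A + beta * S        # beta: assumed blending weight
    np.fill_diagonal(R, 0.0)
    return R
```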

11.
Clustering algorithms generally accept a parameter k from the user, which determines the number of clusters sought. However, in many application domains, such as document categorization, social network clustering, and frequent pattern summarization, the proper value of k is difficult to guess. An alternative clustering formulation that does not require k is to impose a lower bound on the similarity between an object and its corresponding cluster representative. Such a formulation chooses exactly one representative for every cluster and minimizes the representative count. It has many additional benefits. For instance, it supports overlapping clusters in a natural way. Moreover, for every cluster it selects a representative object, which can be used effectively in summarization or semi-supervised classification tasks. In this work, we propose SimClus, an algorithm for clustering with a lower bound on similarity. It achieves an O(log n) approximation bound on the number of clusters, whereas for the best previous algorithm the bound can be as poor as O(n). Experiments on real and synthetic data sets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be used effectively in an online setting.
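The O(log n) bound suggests the classic greedy set-cover scheme; the sketch below illustrates the formulation (not necessarily SimClus's exact procedure): repeatedly pick the object that covers the most objects not yet within similarity tau of any chosen representative.

```python
import numpy as np

def greedy_representatives(S, tau):
    """Sketch: greedy cover of all objects by representatives, where
    representative i covers object j whenever S[i, j] >= tau. The
    greedy choice gives the familiar O(log n) set-cover guarantee
    on the number of representatives."""
    n = S.shape[0]
    cover = S >= tau                     # cover[i, j]: would i cover j?
    np.fill_diagonal(cover, True)        # an object always covers itself
    covered = np.zeros(n, dtype=bool)
    reps = []
    while not covered.all():
        gains = (cover & ~covered).sum(1)    # newly covered per candidate
        best = int(gains.argmax())
        reps.append(best)
        covered |= cover[best]
    return reps   # clusters may overlap: j joins every rep with S[rep, j] >= tau
```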

12.
This paper proposes a new method to weight subspaces of both feature groups and individual features for clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups based on their natural characteristics. Two types of weights are introduced into the clustering process to simultaneously identify the importance of feature groups and of individual features in each cluster. A new optimization model defines the optimization process, and a new clustering algorithm, FG-k-means, is proposed to solve it. The new algorithm extends k-means by adding two steps that automatically calculate the two types of subspace weights. A new data generation method is presented to generate high-dimensional data with clusters in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data show that FG-k-means significantly outperformed four k-means-type algorithms, i.e., k-means, W-k-means, LAC, and EWKM, in almost all experiments. The new algorithm is robust to the noise and missing values that commonly occur in high-dimensional data.
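A hedged sketch of the two-level weighting: each squared per-feature difference is scaled both by the weight of the feature's group and by the weight of the feature itself within its cluster. The helper names and the assumption that weights are pre-normalized are ours, not the paper's notation.

```python
import numpy as np

def two_level_weighted_distance(x, center, group_of, group_w, feat_w):
    """Sketch of an FG-k-means-style distance for one cluster:
    group_of[j] gives the feature group of feature j; group_w holds
    that cluster's group weights and feat_w its individual feature
    weights (both assumed already normalized)."""
    diff2 = (np.asarray(x, float) - np.asarray(center, float)) ** 2
    return float((group_w[group_of] * feat_w * diff2).sum())

# Hypothetical usage: 4 features in 2 groups (features 0-1, features 2-3)
group_of = np.array([0, 0, 1, 1])
d = two_level_weighted_distance(np.array([1.0, 2.0, 3.0, 4.0]),
                                np.array([1.5, 2.5, 2.0, 4.5]),
                                group_of,
                                group_w=np.array([0.7, 0.3]),
                                feat_w=np.array([0.5, 0.5, 0.9, 0.1]))
```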

13.
Subspace clustering methods partition data that lie in or close to a union of subspaces in accordance with the subspace structure. Such methods with a sparsity prior, such as sparse subspace clustering (SSC) (Elhamifar and Vidal in IEEE Trans Pattern Anal Mach Intell 35(11):2765–2781, 2013) with the sparsity induced by the \(\ell ^{1}\)-norm, have been demonstrated to be effective in subspace clustering. Most of those methods require certain assumptions on the subspaces, e.g., independence or disjointness. However, these assumptions are not guaranteed to hold in practice, and they limit the application of existing sparse subspace clustering methods. In this paper, we propose \(\ell ^{0}\)-induced sparse subspace clustering (\(\ell ^{0}\)-SSC). In contrast to the independence or disjointness assumptions required by most existing sparse subspace clustering methods, we prove that \(\ell ^{0}\)-SSC guarantees the subspace-sparse representation, a key element in subspace clustering, for arbitrary distinct underlying subspaces almost surely under a mild i.i.d. assumption on the data generation. We also present a "no free lunch" theorem showing that obtaining the subspace representation under our general assumptions cannot be much cheaper computationally than solving the corresponding \(\ell ^{0}\) sparse representation problem of \(\ell ^{0}\)-SSC. A novel approximate algorithm named Approximate \(\ell ^{0}\)-SSC (A\(\ell ^{0}\)-SSC) is developed, which employs proximal gradient descent to obtain a sub-optimal solution to the optimization problem of \(\ell ^{0}\)-SSC with a theoretical guarantee. The sub-optimal solution is used to build a sparse similarity matrix on which spectral clustering is performed to obtain the final clustering results. Extensive experimental results on various data sets demonstrate the superiority of A\(\ell ^{0}\)-SSC over other competing clustering methods. Furthermore, we extend \(\ell ^{0}\)-SSC to semi-supervised learning by performing label propagation on the sparse similarity matrix learnt by A\(\ell ^{0}\)-SSC, and we demonstrate the effectiveness of the resulting semi-supervised learning method, termed \(\ell ^{0}\)-sparse subspace label propagation (\(\ell ^{0}\)-SSLP).
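A minimal sketch of the proximal-gradient idea behind such an approximation: the proximal operator of the \(\ell ^{0}\) penalty is hard thresholding, so each step is a gradient step on the least-squares term followed by thresholding. This is a simplified fixed-step illustration, not the authors' implementation.

```python
import numpy as np

def l0_sparse_codes(X, lam=0.1, n_iter=100):
    """Sketch: for each point x_i, approximately solve
        min_c 0.5 * ||x_i - X^T c||^2 + lam * ||c||_0,  with c_i = 0,
    by proximal gradient descent; the prox of the l0 penalty is
    hard thresholding. |C| + |C|^T can then feed spectral clustering."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = X.T                                     # dictionary: columns are points
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)    # 1 / Lipschitz constant
    thresh = np.sqrt(2.0 * lam * step)          # hard-threshold level
    C = np.zeros((n, n))
    for i in range(n):
        c = np.zeros(n)
        for _ in range(n_iter):
            c = c - step * (D.T @ (D @ c - X[i]))   # gradient step
            c[np.abs(c) < thresh] = 0.0             # hard thresholding (l0 prox)
            c[i] = 0.0                              # no self-representation
        C[i] = c
    return C
```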

14.
贾洪杰  丁世飞  史忠植 《软件学报》2015,26(11):2836-2846
Spectral clustering transforms the clustering problem into a graph partitioning problem and is a clustering method grounded in algebraic graph theory. To optimize the graph-partitioning objective, one typically exploits the properties of the Rayleigh quotient: the eigenvectors of the Laplacian matrix map the original data points into a low-dimensional feature space, where they are then clustered. However, storing the similarity matrix requires O(n2) space, and the eigendecomposition of the Laplacian generally takes O(n3) time, which is unacceptable when processing large-scale data. It can be proven that both Normalized Cut graph clustering and weighted kernel k-means are equivalent to a trace-maximization problem. The weighted kernel k-means algorithm can therefore be used to optimize the Normalized Cut objective, avoiding the eigendecomposition of the Laplacian matrix. However, weighted kernel k-means must compute the kernel matrix, whose space complexity is still O(n2). To meet this challenge, an approximate weighted kernel k-means algorithm is proposed that uses only part of the kernel matrix to solve the spectral clustering problem on big data. Theoretical analysis and experimental comparison show that approximate weighted kernel k-means achieves clustering performance similar to that of weighted kernel k-means while greatly reducing the time and space complexity.
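For reference, a minimal sketch of the weighted kernel k-means routine that the trace equivalence builds on is shown below; with point weights set to node degrees and a suitably shifted kernel, its objective matches Normalized Cut. The initialization and stopping rule here are illustrative choices, not the paper's approximate variant.

```python
import numpy as np

def weighted_kernel_kmeans(K, w, k, n_iter=50, seed=0):
    """Sketch: weighted kernel k-means. The distance of point i to
    cluster c's (implicit) weighted mean is expanded purely in
    kernel terms, so no explicit feature map or eigendecomposition
    is needed. K: n x n kernel matrix, w: positive point weights."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, n)
    diag = np.diag(K).copy()
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            m = labels == c
            if not m.any():
                continue
            wc = w[m]
            sw = wc.sum()
            # ||phi(x_i) - mean_c||^2 expanded with the kernel trick
            cross = (K[:, m] * wc).sum(1) / sw
            inner = (wc[:, None] * wc[None, :] * K[np.ix_(m, m)]).sum() / sw ** 2
            dist[:, c] = diag - 2.0 * cross + inner
        new_labels = dist.argmin(1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```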

15.
Scalable Clustering Algorithms with Balancing Constraints
Clustering methods for data-mining problems must be extremely scalable. In addition, several data-mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. In this paper, we propose a general framework for scalable, balanced clustering. The data clustering process is broken down into three steps: sampling a small representative subset of the points, clustering the sampled data, and populating the initial clusters with the remaining data, followed by refinements. First, we show that simple uniform sampling from the original data is sufficient to obtain a representative subset with high probability. While the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. We then present algorithms to populate and refine the clusters. The algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. The complexity of the overall method is O(kN log N) for obtaining k balanced clusters from N data points, which compares favorably with other existing techniques for balanced clustering. In addition to providing balancing guarantees, the clustering performance obtained using the proposed framework is comparable to, and often better than, the corresponding unconstrained solution. Experimental results on several datasets, including high-dimensional ones (>20,000 dimensions), demonstrate the efficacy of the proposed framework.
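A skeletal sketch of the three-step pipeline (uniform sampling, clustering the sample, populating with the rest) follows; the stable-marriage-based balanced populating and the refinement step are omitted, and cluster_fn is a hypothetical pluggable base clusterer.

```python
import numpy as np

def sample_cluster_populate(X, k, cluster_fn, sample_frac=0.1, seed=0):
    """Sketch: (1) uniformly sample a small subset, (2) cluster it
    with any base algorithm returning k centers, (3) assign every
    remaining point to its nearest center. Balancing refinements
    (stable-marriage-style populating) are left out."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = min(n, max(k, int(sample_frac * n)))
    sample_idx = rng.choice(n, m, replace=False)
    centers = cluster_fn(X[sample_idx], k)          # e.g. plain k-means
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1), centers
```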

16.
To discover clusters effectively, especially clusters of arbitrary shape, many density-based clustering algorithms such as DBSCAN, OPTICS, DENCLUE, and CLIQUE have been proposed in recent years. This paper proposes CODU (clustering by ordering dense unit), a new density-based clustering algorithm. The basic idea is to sort the unit subspaces by density; for each unit, if its density is greater than that of its surrounding neighbors, it forms a new cluster. Since the number of unit subspaces is far smaller than the number of data objects, the algorithm is efficient. In addition, a new data visualization method is proposed that maps each data object, viewed as a stimulus spectrum, into three-dimensional space, so that the clustering result is displayed clearly.

17.
The diameter k-clustering problem is the problem of partitioning a finite subset of \(\mathbb{R}^{d}\) into k subsets, called clusters, such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem (for all values of k) is the agglomerative clustering algorithm with the complete linkage strategy. For decades, this algorithm has been widely used by practitioners; however, it is not well studied theoretically. In this paper, we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension d is a constant, we show that for any k the solution computed by this algorithm is an O(log k)-approximation to the diameter k-clustering problem. Our analysis holds not only for the Euclidean distance but for any metric based on a norm. Furthermore, we analyze the closely related k-center and discrete k-center problems. For the corresponding agglomerative algorithms, we deduce an approximation factor of O(log k) as well.
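For reference, the analyzed algorithm itself is standard and available off the shelf; a minimal example builds the complete-linkage hierarchy with SciPy and cuts it at k clusters (per the paper, the maximum diameter of this cut is an O(log k)-approximation when d is constant).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def complete_linkage_k_clusters(X, k):
    """Agglomerative clustering with the complete linkage strategy,
    cut to (at most) k clusters; returns a label per point."""
    Z = linkage(pdist(np.asarray(X, dtype=float)), method="complete")
    return fcluster(Z, t=k, criterion="maxclust")
```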

18.
In this paper, we first study an important but unsolved dilemma in the subspace clustering literature, which we refer to as the "information overlapping-data coverage" challenge. Current subspace clustering solutions usually invoke a grid-based Apriori-like procedure to identify dense regions and construct subspace clusters afterward. Due to the monotonicity property of Apriori-like procedures, if a region is identified as dense, all its projected regions are also identified as dense, so overlapping and redundant clustering information is inevitably reported to users when clusters are generated from such highly correlated regions. However, naive filtering of redundant clusters incurs a challenging problem on the other side of the dilemma, the "data coverage" issue. Note that two clusters may have highly correlated dense regions while their data members differ greatly; arbitrarily removing one of them may lose coverage of the data with clustering information, likely yielding an incomplete and biased clustering result. In this paper, we therefore propose an innovative algorithm, called NOnRedundant Subspace Cluster mining (NORSC), to efficiently discover a succinct collection of subspace clusters while maintaining the required degree of data coverage. NORSC not only avoids generating redundant clusters whose contained data are mostly covered by higher-dimensional clusters, resolving the information overlapping problem, but also limits the information loss, coping with the data coverage problem. As shown by our experimental results, NORSC is very effective in identifying a concise and small set of subspace clusters, with time complexity orders of magnitude better than that of previous works.

19.
Clustering has been widely used to identify possible structures in data and to help users understand data in an unsupervised manner. Traditional clustering methods often provide a single partitioning of the data that groups similar data objects in one group while separating dissimilar ones into different groups. However, it is well recognized that assuming only a single clustering for a data set can be too strict and cannot capture the diversity of applications. Multiple clustering structures can be hidden in the data, each representing a unique perspective on it. Various multi-clustering methods, which aim to discover multiple independent structures hidden in data, have been proposed in the past decade. Although multi-clustering methods may provide more information for users, it remains challenging for users to understand each clustering structure efficiently and effectively. Subspace multi-clustering methods address this challenge by providing each clustering with a feature subspace. Moreover, most subspace multi-clustering methods are especially scalable for high-dimensional data, which has become more and more common in real applications thanks to advances in big-data technologies. In this paper, we focus on subspace multi-clustering, a subject that has not been reviewed by any previous survey. We formulate the subspace multi-clustering problem and categorize the methodologies from different perspectives (e.g., de-coupled methods and coupled methods). We compare different methods on a series of specific properties (e.g., input parameters and different kinds of subspaces) and analyze their advantages and disadvantages. We also discuss several interesting and meaningful future directions.

20.
Clustering in Dynamic Spatial Databases
Efficient clustering in dynamic spatial databases is currently an open problem with many potential applications. Most traditional spatial clustering algorithms are inadequate because they lack efficient support for incremental clustering. In this paper, we propose DClust, a novel clustering technique for dynamic spatial databases. DClust provides a multi-resolution view of the clusters, generates arbitrarily shaped clusters in the presence of noise, generates clusters that are insensitive to the ordering of the input data, and supports incremental clustering efficiently. DClust uses a density criterion that captures arbitrary cluster shapes and sizes to select a number of representative points, and builds the Minimum Spanning Tree (MST) of these representative points, called the R-MST. After the initial clustering, a summary of the cluster structure is built. This summary enables quick localization of the effect of data updates on the current set of clusters. Our experimental results show that DClust outperforms existing spatial clustering methods such as DBSCAN, C2P, DENCLUE, Incremental DBSCAN, and BIRCH in terms of clustering time and the accuracy of the clusters found.
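As a rough reconstruction of the R-MST step from the abstract (not the full DClust pipeline, which also maintains an incremental summary), one can build the MST over the representative points and split it by deleting its longest edges.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_clusters(reps, n_clusters):
    """Sketch: build the minimum spanning tree over representative
    points, delete its n_clusters - 1 longest edges, and read the
    clusters off as the remaining connected components."""
    D = squareform(pdist(np.asarray(reps, dtype=float)))
    mst = minimum_spanning_tree(D).toarray()
    edges = np.argwhere(mst > 0)                   # the tree's edges
    order = np.argsort(mst[mst > 0])[::-1]         # longest first (same C order)
    for i, j in edges[order][: n_clusters - 1]:
        mst[i, j] = 0.0                            # cut the longest edges
    _, labels = connected_components(mst, directed=False)
    return labels
```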
