首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Partitional clustering is a common approach to cluster analysis. Although many algorithms have been proposed, partitional clustering remains a challenging problem with respect to the reliability and efficiency of recovering high quality solutions in terms of its criterion functions. In this paper, we propose a niching genetic k-means algorithm (NGKA) for partitional clustering, which aims at reliably and efficiently identifying high quality solutions in terms of the sum of squared errors criterion. Within the NGKA, we design a niching method, which encourages mating among similar clustering solutions while allowing for some competitions among dissimilar solutions, and integrate it into a genetic algorithm to prevent premature convergence during the evolutionary clustering search. Further, we incorporate one step of k-means operation into the regeneration steps of the resulted niching genetic algorithm to improve its computational efficiency. The proposed algorithm was applied to cluster both simulated data and gene expression data and compared with previous work. Experimental results clear show that the NGKA is an effective clustering algorithm and outperforms two other genetic algorithm based clustering methods implemented for comparison.  相似文献   

2.
Clustering techniques have received attention in many fields of study such as engineering, medicine, biology and data mining. The aim of clustering is to collect data points. The K-means algorithm is one of the most common techniques used for clustering. However, the results of K-means depend on the initial state and converge to local optima. In order to overcome local optima obstacles, a lot of studies have been done in clustering. This paper presents an efficient hybrid evolutionary optimization algorithm based on combining Modify Imperialist Competitive Algorithm (MICA) and K-means (K), which is called K-MICA, for optimum clustering N objects into K clusters. The new Hybrid K-ICA algorithm is tested on several data sets and its performance is compared with those of MICA, ACO, PSO, Simulated Annealing (SA), Genetic Algorithm (GA), Tabu Search (TS), Honey Bee Mating Optimization (HBMO) and K-means. The simulation results show that the proposed evolutionary optimization algorithm is robust and suitable for handling data clustering.  相似文献   

3.
Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported performing well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345–366], to document clustering. If two documents are similar enough, they are considered as neighbors of each other. And the link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link involve the global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that our proposed methods can significantly improve the performance of document clustering in terms of accuracy without increasing the execution time much.  相似文献   

4.
Adapting k-means for supervised clustering   总被引:2,自引:1,他引:1  
k-means is traditionally viewed as an algorithm for the unsupervised clustering of a heterogeneous population into a number of more homogeneous groups of objects. However, it is not necessarily guaranteed to group the same types (classes) of objects together. In such cases, some supervision is needed to partition objects which have the same label into one cluster. This paper demonstrates how the popular k-means clustering algorithm can be profitably modified to be used as a classifier algorithm. The output field itself cannot be used in the clustering but it is used in developing a suitable metric defined on other fields. The proposed algorithm combines Simulated Annealing with the modified k-means algorithm. We apply the proposed algorithm to real data sets, and compare the output of the resultant classifier to that of C4.5.  相似文献   

5.
Ant colony optimization (ACO) and particle swarm optimization (PSO) are two popular algorithms in swarm intelligence. Recently, a continuous ACO named ACOR was developed to solve the continuous optimization problems. This study incorporated ACOR with PSO to improve the search ability, investigating four types of hybridization as follows: (1) sequence approach, (2) parallel approach, (3) sequence approach with an enlarged pheromone-particle table, and (4) global best exchange. These hybrid systems were applied to data clustering. The experimental results utilizing public UCI datasets show that the performances of the proposed hybrid systems are superior compared to those of the K-mean, standalone PSO, and standalone ACOR. Among the four strategies of hybridization, the sequence approach with the enlarged pheromone table is superior to the other approaches because the enlarged pheromone table diversifies the generation of new solutions of ACOR and PSO, which prevents traps into the local optimum.  相似文献   

6.
Fast and exact out-of-core and distributed k-means clustering   总被引:1,自引:2,他引:1  
Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset.In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes on the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means.This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down loading all data and running sequential k-means or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance. Ruoming Jin is currently an assistant professor in the Computer Science Department at Kent State University. He received a BE and a ME degree in computer engineering from Beihang University (BUAA), China in 1996 and 1999, respectively. He earned his MS degree in computer science from University of Delaware in 2001, and his Ph.D. degree in computer science from the Ohio State University in 2005. His research interests include data mining, databases, processing of streaming data, bioinformatics, and high performance computing. He has published more than 30 papers in these areas. He is a member of ACM and SIGKDD. Anjan Goswami studied robotics at the Indian Institute of Technology at Kanpur. While working with IBM, he was interested in studying computer science. He then obtained a masters degree from the University of South Florida, where he worked on computer vision problems. He then transferred to the PhD program in computer science at OSU, where he did a Masters thesis on efficient clustering algorithms for massive, distributed and streaming data. On successful completion of this, he decided to join a web-service-provider company to do research in designing and developing high-performance search solutions for very large structured data. Anjan' favourite recreations are studying and predicting technology trends, nature photography, hiking, literature and soccer. Gagan Agrawal is an Associate Professor of Computer Science and Engineering at the Ohio State University. He received his B.Tech degree from Indian Institute of Technology, Kanpur, in 1991, and M.S. and Ph.D degrees from University of Maryland, College Park, in 1994 and 1996, respectively. His research interests include parallel and distributed computing, compilers, data mining, grid computing, and data integration. He has published more than 110 refereed papers in these areas. He is a member of ACM and IEEE Computer Society. He received a National Science Foundation CAREER award in 1998.  相似文献   

7.
The problem of optimal non-hierarchical clustering is addressed. A new algorithm combining differential evolution and k-means is proposed and tested on eight well-known real-world data sets. Two criteria (clustering validity indexes), namely TRW and VCR, were used in the optimization of classification. The classification of objects to be optimized is encoded by the cluster centers in differential evolution (DE) algorithm. It induced the problem of rearrangement of centers in the population to ensure an efficient search via application of evolutionary operators. A new efficient heuristic for this rearrangement was also proposed. The plain DE variants with and without the rearrangement were compared with corresponding hybrid k-means variants. The experimental results showed that hybrid variants with k-means algorithm are essentially more efficient than the non-hybrid ones. Compared to a standard k-means algorithm with restart, the new hybrid algorithm was found more reliable and more efficient, especially in difficult tasks. The results for TRW and VCR criterion were compared. Both criteria provided the same optimal partitions and no significant differences were found in efficiency of the algorithms using these criteria.  相似文献   

8.
The rapid development of earth observation technology has produced large quantities of remote-sensing data. Unsupervised classification (i.e. clustering) of remote-sensing images, an important means to acquire land-use/cover information, has become increasingly in demand due to its simplicity and ease of application. Traditional methods, such as k-means, struggle to solve this NP-hard (Non-deterministic Polynomial hard) image classification problem. Particle swarm optimization (PSO), always achieving better result than k-means, has recently been applied to unsupervised image classification. However, PSO was also found to be easily trapped on local optima. This article proposes a novel unsupervised Levy flight particle swarm optimization (ULPSO) method for image classification with balanced exploitation and exploration capabilities. It benefits from a new searching strategy: the worst particle in the swarm is targeted and its position is updated with Levy flight at each iteration. The effectiveness of the proposed method was tested with three types of remote-sensing imagery (Landsat Thematic Mapper (TM), Flightline C1 (FLC), and QuickBird) that are distinct in terms of spatial and spectral resolution and landscape. Our results showed that ULPSO is able to achieve significantly better and more stable classification results than k-means and the other two intelligent methods based on genetic algorithm (GA) and particle swarm optimization (PSO) over all of the experiments. ULPSO is, therefore, recommended as an effective alternative for unsupervised remote-sensing image classification.  相似文献   

9.
Cooperative coevolution (CC) was used to improve the performance of evolutionary algorithms (EAs) on complex optimization problems in a divide-and-conquer way. In this paper, we show that the CC framework can be very helpful to improve the performance of particle swarm optimization (PSO) on clustering high-dimensional datasets. Based on CC framework, the original partitional clustering problem is first decomposed to several subproblems, each of which is then evolved by an optimizer independently. We employ a very simple but efficient optimization algorithm, namely bare-bone particle swarm optimization (BPSO), as the optimizer to solve each subproblem cooperatively. In addition, we design a new centroid-based encoding schema for each particle and apply the Chernoff bounds to decide a proper population size. The experimental results on synthetic and real-life datasets illustrate the effectiveness and efficiency of the BPSO and CC framework. The comparisons show the proposed algorithm significantly outperforms five EA-based clustering algorithms, i.e., PSO, SRPSO, ACO, ABC and DE, and K-means on most of the datasets.  相似文献   

10.
贾洪杰  丁世飞  史忠植 《软件学报》2015,26(11):2836-2846
谱聚类将聚类问题转化成图划分问题,是一种基于代数图论的聚类方法.在求解图划分目标函数时,一般利用Rayleigh熵的性质,通过计算Laplacian矩阵的特征向量将原始数据点映射到一个低维的特征空间中,再进行聚类.然而在谱聚类过程中,存储相似矩阵的空间复杂度是O(n2),对Laplacian矩阵特征分解的时间复杂度一般为O(n3),这样的复杂度在处理大规模数据时是无法接受的.理论证明,Normalized Cut图聚类与加权核k-means都等价于矩阵迹的最大化问题.因此,可以用加权核k-means算法来优化Normalized Cut的目标函数,这就避免了对Laplacian矩阵特征分解.不过,加权核k-means算法需要计算核矩阵,其空间复杂度依然是O(n2).为了应对这一挑战,提出近似加权核k-means算法,仅使用核矩阵的一部分来求解大数据的谱聚类问题.理论分析和实验对比表明,近似加权核k-means的聚类表现与加权核k-means算法是相似的,但是极大地减小了时间和空间复杂性.  相似文献   

11.
This study proposed a novel PSO–SVM model that hybridized the particle swarm optimization (PSO) and support vector machines (SVM) to improve the classification accuracy with a small and appropriate feature subset. This optimization mechanism combined the discrete PSO with the continuous-valued PSO to simultaneously optimize the input feature subset selection and the SVM kernel parameter setting. The hybrid PSO–SVM data mining system was implemented via a distributed architecture using the web service technology to reduce the computational time. In a heterogeneous computing environment, the PSO optimization was performed on the application server and the SVM model was trained on the client (agent) computer. The experimental results showed the proposed approach can correctly select the discriminating input features and also achieve high classification accuracy.  相似文献   

12.
Clustering is an important and popular technique in data mining. It partitions a set of objects in such a manner that objects in the same clusters are more similar to each another than objects in the different cluster according to certain predefined criteria. K-means is simple yet an efficient method used in data clustering. However, K-means has a tendency to converge to local optima and depends on initial value of cluster centers. In the past, many heuristic algorithms have been introduced to overcome this local optima problem. Nevertheless, these algorithms too suffer several short-comings. In this paper, we present an efficient hybrid evolutionary data clustering algorithm referred to as K-MCI, whereby, we combine K-means with modified cohort intelligence. Our proposed algorithm is tested on several standard data sets from UCI Machine Learning Repository and its performance is compared with other well-known algorithms such as K-means, K-means++, cohort intelligence (CI), modified cohort intelligence (MCI), genetic algorithm (GA), simulated annealing (SA), tabu search (TS), ant colony optimization (ACO), honey bee mating optimization (HBMO) and particle swarm optimization (PSO). The simulation results are very promising in the terms of quality of solution and convergence speed of algorithm.  相似文献   

13.
This paper presents selective regeneration particle swarm optimization (SRPSO), a novel algorithm developed based on particle swarm optimization (PSO). It contains two new features, unbalanced parameter setting and particle regeneration operation. The unbalanced parameter setting enables fast convergence of the algorithm and the particle regeneration operation allows the search to escape from local optima and explore for better solutions. This algorithm is applied to data clustering problems for performance evaluation and a hybrid algorithm (KSRPSO) of K-means clustering method and SRPSO is developed. In the conducted numerical experiments, SRPSO and KSRPSO are compared to the original PSO algorithm, K-means, as well as, other methods proposed by other studies. The results demonstrate that SRPSO and KSRPSO are efficient, accurate, and robust methods for data clustering problems.  相似文献   

14.
膜计算(也称为P系统或膜系统)是一种新颖的分布式、并行计算模型.为了处理数据聚类问题,提出了一种采用混合进化机制的膜聚类算法.它使用了一个由3个细胞组成的组织P系统,为一个待聚类的数据集发现最优的簇中心.其对象表示候选的簇中心,并且这3个细胞分别使用了3种不同的进化机制:遗传算子、速度-位移模型和差分进化机制.然而,所使用的速度-位移模型和差分进化机制是结合了这个特殊膜结构和转运机制所提出的改进版本.这种混合进化机制能够增强系统中对象的多样性和改善收敛性能.在混合进化机制和转运机制控制下,这种膜聚类算法能够确定一个数据集的良好划分.所提出的膜聚类算法在3个人工数据集和5个真实数据集上被评估,并与k-means和几种进化聚类算法进行比较.统计显著性测试建立了所提出的膜聚类算法的优势.  相似文献   

15.
In this paper, we present a fast global k-means clustering algorithm by making use of the cluster membership and geometrical information of a data point. This algorithm is referred to as MFGKM. The algorithm uses a set of inequalities developed in this paper to determine a starting point for the jth cluster center of global k-means clustering. Adopting multiple cluster center selection (MCS) for MFGKM, we also develop another clustering algorithm called MFGKM+MCS. MCS determines more than one starting point for each step of cluster split; while the available fast and modified global k-means clustering algorithms select one starting point for each cluster split. Our proposed method MFGKM can obtain the least distortion; while MFGKM+MCS may give the least computing time. Compared to the modified global k-means clustering algorithm, our method MFGKM can reduce the computing time and number of distance calculations by a factor of 3.78-5.55 and 21.13-31.41, respectively, with the average distortion reduction of 5,487 for the Statlog data set. Compared to the fast global k-means clustering algorithm, our method MFGKM+MCS can reduce the computing time by a factor of 5.78-8.70 with the average reduction of distortion of 30,564 using the same data set. The performances of our proposed methods are more remarkable when a data set with higher dimension is divided into more clusters.  相似文献   

16.
This paper proposes a new method to weight subspaces in feature groups and individual features for clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups, based on their natural characteristics. Two types of weights are introduced to the clustering process to simultaneously identify the importance of feature groups and individual features in each cluster. A new optimization model is given to define the optimization process and a new clustering algorithm FG-k-means is proposed to optimize the optimization model. The new algorithm is an extension to k-means by adding two additional steps to automatically calculate the two types of subspace weights. A new data generation method is presented to generate high-dimensional data with clusters in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data have shown that the FG-k-means algorithm significantly outperformed four k-means type algorithms, i.e., k-means, W-k-means, LAC and EWKM in almost all experiments. The new algorithm is robust to noise and missing values which commonly exist in high-dimensional data.  相似文献   

17.
Harmony K-means algorithm for document clustering   总被引:2,自引:0,他引:2  
Fast and high quality document clustering is a crucial task in organizing information, search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can generate a local optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) that deals with document clustering based on Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that the HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied HKA algorithms on some standard datasets. We also compare the HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that the HKA algorithm converges to the best known optimum faster than other methods and the quality of clusters are comparable.  相似文献   

18.
In this paper, we propose a new parallel clustering algorithm, named Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes, and according to our experiments, it produces clusters with smaller entropy (i.e., purer clusters) than k-means does. Our PBKP algorithm fully exploits the data-parallelism of the bisecting k-means algorithm, and adopts a prediction step to balance the workloads of multiple processors to achieve a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear with the number of processors and the number of data points. Moreover, PBKP scales up better than the parallel k-means with respect to the dimension and the desired number of clusters. This research was supported in part by AFRL/Wright Brothers Institute (WBI).  相似文献   

19.
数值型和分类型混合数据的模糊K-Prototypes聚类算法   总被引:15,自引:0,他引:15  
陈宁  陈安  周龙骧 《软件学报》2001,12(8):1107-1119
由于数据库经常同时包含数值型和分类型的属性,因此研究能够处理混合型数据的聚类算法无疑是很重要的.讨论了混合型数据的聚类问题,提出了一种模糊K-prototypes算法.该算法融合了K-means和K-modes对数值型和分类型数据的处理方法,能够处理混合类型的数据.模糊技术体现聚类的边界特征,更适合处理含有噪声和缺失数据的数据库.实验结果显示,模糊算法比相应的确定算法得到的结果准确度高.  相似文献   

20.
In big data era, more and more data are collected from multiple views, each of which reflect distinct perspectives of the data. Many multi-view data are accompanied by incompatible views and high dimension, both of which bring challenges for multi-view clustering. This paper proposes a strategy of simultaneous weighting on view and feature to discriminate their importance. Each feature of multi-view data is given bi-level weights to express its importance in feature level and view level, respectively. Furthermore, we implements the proposed weighting method in the classical k-means algorithm to conduct multi-view clustering task. An efficient gradient-based optimization algorithm is embedded into k-means algorithm to compute the bi-level weights automatically. Also, the convergence of the proposed weight updating method is proved by theoretical analysis. In experimental evaluation, synthetic datasets with varied noise and missing-value are created to investigate the robustness of the proposed approach. Then, the proposed approach is also compared with five state-of-the-art algorithms on three real-world datasets. The experiments show that the proposed method compares very favourably against the other methods.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号