Similar Documents
20 similar documents found (search time: 171 ms)
1.
The present paper considers the problem of partitioning a dataset into a known number of clusters using the sum of squared errors (SSE) criterion. A new clustering method, called DE-KM, which combines the differential evolution (DE) algorithm with the well-known K-means procedure, is described. In the method, the K-means algorithm is used to fine-tune each candidate solution obtained by the mutation and crossover operators of DE. Additionally, a reordering procedure is proposed that allows the evolutionary algorithm to tackle the redundant-representation problem. The performance of the DE-KM clustering method is compared to that of differential evolution, the global K-means method, the genetic K-means algorithm and two variants of the K-means algorithm. The experimental results show that if the number of clusters K is sufficiently large, DE-KM obtains solutions with lower SSE values than the other five algorithms.
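A minimal sketch of the DE-plus-K-means idea described above, assuming each individual encodes the K centers as one flat vector; the DE hyperparameters (F, CR, population size) and all function names are illustrative, and the paper's reordering procedure for the redundant representation is omitted:

```python
import numpy as np
from sklearn.cluster import KMeans

def sse(X, centers):
    # Sum of squared errors: each point charged to its nearest center.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

def de_km(X, K, pop_size=10, F=0.5, CR=0.9, gens=50, seed=0):
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    # Each individual encodes K centers as one flat vector.
    pop = X[rng.choice(n, size=(pop_size, K))].reshape(pop_size, K * dim)
    fit = np.array([sse(X, p.reshape(K, dim)) for p in pop])
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = a + F * (b - c)                    # DE mutation
            mask = rng.random(K * dim) < CR             # binomial crossover
            trial = np.where(mask, mutant, pop[i])
            # Fine-tune the trial solution with a short K-means run.
            km = KMeans(n_clusters=K, init=trial.reshape(K, dim),
                        n_init=1, max_iter=10).fit(X)
            if km.inertia_ < fit[i]:                    # greedy selection
                pop[i] = km.cluster_centers_.reshape(-1)
                fit[i] = km.inertia_
    return pop[fit.argmin()].reshape(K, dim), fit.min()
```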

2.
The k-means algorithm and its variations are known to be fast clustering algorithms. However, they are sensitive to the choice of starting points and are inefficient for solving clustering problems in large datasets. Recently, incremental approaches have been developed to resolve difficulties with the choice of starting points. The global k-means and the modified global k-means algorithms are based on such an approach. They iteratively add one cluster center at a time. Numerical experiments show that these algorithms considerably improve the k-means algorithm. However, they require storing the whole affinity matrix or computing this matrix at each iteration. This makes both algorithms time-consuming and memory-demanding for clustering even moderately large datasets. In this paper, a new version of the modified global k-means algorithm is proposed. We introduce an auxiliary cluster function to generate a set of starting points lying in different parts of the dataset. We exploit information gathered in previous iterations of the incremental algorithm to eliminate the need to compute or store the whole affinity matrix, and thereby to reduce computational effort and memory usage. Results of numerical experiments on six standard datasets demonstrate that the new algorithm is more efficient than the global and the modified global k-means algorithms.
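For context, a compact sketch of the incremental scheme these algorithms share: grow the solution one center at a time, scoring every data point as a candidate new center by the guaranteed SSE reduction it offers. Note that the candidate scoring below materializes the full pairwise-distance (affinity) matrix — exactly the cost the paper's auxiliary cluster function is designed to avoid — so this is the baseline behavior, with illustrative names:

```python
import numpy as np
from sklearn.cluster import KMeans

def incremental_kmeans(X, K):
    """Grow a solution one center at a time (global-k-means style)."""
    centers = X.mean(axis=0, keepdims=True)            # 1-cluster solution
    for k in range(2, K + 1):
        # d2[i] = squared distance of x_i to its closest existing center.
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1).min(1)
        # Score every point as a candidate new center by the guaranteed
        # SSE reduction it would yield (the classic fast-global bound).
        pair = ((X[:, None, :] - X[None]) ** 2).sum(-1)  # full affinity matrix
        gain = np.maximum(d2[None, :] - pair, 0).sum(1)
        init = np.vstack([centers, X[gain.argmax()]])
        centers = KMeans(n_clusters=k, init=init, n_init=1).fit(X).cluster_centers_
    return centers
```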

3.
Major problems exist in both crisp and fuzzy clustering algorithms. The fuzzy c-means family of algorithms uses weights determined by a power m of inverse distances that remains fixed over all iterations and over all clusters, even though smaller clusters should have a larger m. Our method uses a different “distance” for each cluster that changes over the early iterations to fit the clusters. Comparisons show improved results. We also address other perplexing problems in clustering: (i) finding the optimal number K of clusters; (ii) assessing the validity of a given clustering; (iii) preventing the selection of seed vectors as initial prototypes from affecting the clustering; (iv) preventing the order of merging from affecting the clustering; and (v) permitting the clusters to form more natural shapes rather than forcing them into normed balls of the distance function. We employ a relatively large number K of uniformly randomly distributed seeds and then thin them to leave fewer uniformly distributed seeds. Next, the main loop iterates by assigning the feature vectors and computing new fuzzy prototypes. Our fuzzy merging then merges any clusters that are too close to each other. We use a modified Xie-Beni validity measure as the goodness-of-clustering measure for multiple values of K in a user-interaction approach where the user selects two parameters (for eliminating clusters and merging clusters after viewing the results thus far). The algorithm is compared with fuzzy c-means on the iris data and on the Wisconsin breast cancer data.
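The validity measure referenced above builds on the standard Xie-Beni index, which can be computed as follows (a sketch of the standard index, not the paper's modified version):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Standard Xie-Beni validity index for a fuzzy partition.

    X: (n, d) data, U: (K, n) fuzzy memberships, V: (K, d) prototypes.
    Lower values indicate compact, well-separated clusters.
    """
    n = X.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)   # (K, n) distances
    compactness = ((U ** m) * d2).sum()
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)  # (K, K) prototype gaps
    np.fill_diagonal(sep, np.inf)                         # ignore self-distance
    return compactness / (n * sep.min())
```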

4.
In this paper, we present a fast global k-means clustering algorithm that makes use of the cluster membership and geometrical information of a data point. This algorithm is referred to as MFGKM. The algorithm uses a set of inequalities developed in this paper to determine a starting point for the jth cluster center of global k-means clustering. Adopting multiple cluster center selection (MCS) for MFGKM, we also develop another clustering algorithm called MFGKM+MCS. MCS determines more than one starting point for each step of cluster split, while the available fast and modified global k-means clustering algorithms select one starting point for each cluster split. Our proposed method MFGKM obtains the least distortion, while MFGKM+MCS may give the least computing time. Compared to the modified global k-means clustering algorithm, our method MFGKM can reduce the computing time and the number of distance calculations by factors of 3.78-5.55 and 21.13-31.41, respectively, with an average distortion reduction of 5,487 for the Statlog data set. Compared to the fast global k-means clustering algorithm, our method MFGKM+MCS can reduce the computing time by a factor of 5.78-8.70 with an average distortion reduction of 30,564 on the same data set. The performance of our proposed methods is more remarkable when a data set with higher dimension is divided into more clusters.

5.
Motivated by the poor performance (linear complexity) of the EM algorithm in clustering large data sets, and inspired by the successful accelerated versions of related algorithms like k-means, we derive an accelerated variant of the EM algorithm for Gaussian mixtures that: (1) offers speedups that are at least linear in the number of data points, (2) ensures convergence by strictly increasing a lower bound on the data log-likelihood in each learning step, and (3) allows ample freedom in the design of other accelerated variants. We also derive a similar accelerated algorithm for greedy mixture learning, where very satisfactory results are obtained. The core idea is to define a lower bound on the data log-likelihood based on a grouping of data points. The bound is maximized by computing in turn (i) optimal assignments of groups of data points to the mixture components, and (ii) optimal re-estimation of the model parameters based on average sufficient statistics computed over groups of data points. The proposed method naturally generalizes to mixtures of other members of the exponential family. Experimental results show the potential of the proposed method over other state-of-the-art acceleration techniques.
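A compact sketch of the grouping idea for spherical Gaussians, assuming a k-means pre-clustering to form the groups (the paper allows other groupings, e.g., kd-tree cells); each E-step hard-assigns whole groups using only per-group sufficient statistics (size, sum, sum of squared norms), so its cost scales with the number of groups rather than the number of points. All names are illustrative, and the sketch assumes n_groups < n and that every component retains at least one group:

```python
import numpy as np
from sklearn.cluster import KMeans

def chunked_em(X, K, n_groups=200, iters=30, seed=0):
    """EM on group-level sufficient statistics (spherical Gaussians)."""
    n, D = X.shape
    labels = KMeans(n_clusters=n_groups, n_init=1, random_state=seed).fit_predict(X)
    ng = np.bincount(labels, minlength=n_groups).astype(float)    # group sizes
    sg = np.zeros((n_groups, D))                                  # group sums
    np.add.at(sg, labels, X)
    qg = np.bincount(labels, weights=(X ** 2).sum(1), minlength=n_groups)
    # Initialize components from random groups.
    idx = np.random.default_rng(seed).choice(n_groups, K, replace=False)
    mu, pi, var = sg[idx] / ng[idx, None], np.full(K, 1 / K), np.full(K, X.var())
    for _ in range(iters):
        # E-step: bound score F(g, k) of assigning group g to component k,
        # computable from (ng, sg, qg) alone.
        quad = qg[:, None] - 2 * sg @ mu.T + ng[:, None] * (mu ** 2).sum(1)
        F = ng[:, None] * (np.log(pi) - 0.5 * D * np.log(2 * np.pi * var)) \
            - quad / (2 * var)
        z = F.argmax(1)                                  # hard group assignment
        # M-step: re-estimate parameters from grouped statistics.
        for k in range(K):
            gk = z == k                                  # assumes non-empty
            Nk = ng[gk].sum()
            pi[k] = Nk / n
            mu[k] = sg[gk].sum(0) / Nk
            var[k] = (qg[gk].sum() - Nk * (mu[k] ** 2).sum()) / (D * Nk)
    return pi, mu, var
```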

6.
By using a kernel function, data that are not easily separable in the original space can be clustered into homogeneous groups in the implicitly transformed high-dimensional feature space. Kernel k-means algorithms have recently been shown to perform better than conventional k-means algorithms in unsupervised classification. However, few reports have examined the benefits of using a kernel function or the relative merits of the various kernel clustering algorithms with regard to the data distribution. In this study, we reformulated four representative clustering algorithms based on a kernel function and evaluated their performance on various data sets. The results indicate that each kernel clustering algorithm gives markedly better performance than its conventional counterpart for almost all data sets. Of the kernel clustering algorithms studied in the present work, the kernel average-linkage algorithm gives the most accurate clustering results.
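The kernel trick underlying such reformulations replaces Euclidean distances to centroids with distances computed purely from a Gram matrix. A sketch of kernel k-means on a precomputed kernel matrix (illustrative, not the paper's exact formulation):

```python
import numpy as np

def kernel_kmeans(Kmat, k, iters=100, seed=0):
    """Kernel k-means on a precomputed Gram matrix Kmat (n x n).

    Feature-space distances follow from the kernel trick:
    ||phi(x_i) - m_c||^2 = K_ii - 2*mean_j K_ij + mean_jl K_jl  (j, l in c).
    """
    n = Kmat.shape[0]
    labels = np.random.default_rng(seed).integers(k, size=n)
    diag = np.diag(Kmat)
    for _ in range(iters):
        dist = np.empty((n, k))
        for c in range(k):
            mask = labels == c
            nc = mask.sum()
            if nc == 0:                      # empty cluster: never chosen
                dist[:, c] = np.inf
                continue
            dist[:, c] = (diag
                          - 2 * Kmat[:, mask].sum(1) / nc
                          + Kmat[np.ix_(mask, mask)].sum() / nc ** 2)
        new = dist.argmin(1)
        if (new == labels).all():            # converged
            break
        labels = new
    return labels
```

For example, an RBF Gram matrix can be built as `Kmat = np.exp(-gamma * ((X[:, None] - X[None]) ** 2).sum(-1))`.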

7.
Almost all subspace clustering algorithms proposed so far are designed for numeric datasets. In this paper, we present a k-means type clustering algorithm that finds clusters in data subspaces of mixed numeric and categorical datasets. In this method, we compute the contribution of each attribute to the different clusters. We propose a new cost function for a k-means type algorithm. One of the advantages of this algorithm is its complexity, which is linear in the number of data points. The algorithm is also useful in describing cluster formation in terms of the contribution of attributes to different clusters. The algorithm is tested on various synthetic and real datasets to show its effectiveness. The clustering results are explained by using the attribute weights in the clusters. The clustering results are also compared with published results.

8.
Clustering is a powerful machine learning technique that groups “similar” data points based on their characteristics. Many clustering algorithms work by approximating the minimization of an objective function, namely the sum of within-cluster distances between points. The straightforward approach involves examining all possible assignments of points to each of the clusters. This approach guarantees the solution will be a global minimum; however, the number of possible assignments scales quickly with the number of data points and becomes computationally intractable even for very small datasets. In order to circumvent this issue, cost function minima are found using popular local-search-based heuristic approaches such as k-means and hierarchical clustering. Due to their greedy nature, such techniques do not guarantee that a global minimum will be found and can lead to sub-optimal clustering assignments. Other classes of global-search-based techniques, such as simulated annealing, tabu search, and genetic algorithms, may offer better quality results but can be too time-consuming to implement. In this work, we describe how quantum annealing can be used to carry out clustering. We map the clustering objective to a quadratic binary optimization problem and discuss two clustering algorithms, which are then implemented on commercially available quantum annealing hardware, as well as on a purely classical solver, “qbsolv.” The first algorithm assigns N data points to K clusters, and the second one can be used to perform binary clustering in a hierarchical manner. We present our results in the form of benchmarks against well-known k-means clustering and discuss the advantages and disadvantages of the proposed techniques.
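A sketch of how such a clustering objective can be mapped to a QUBO, assuming one-hot binary variables x[i*K + k] ("point i is in cluster k") and a quadratic penalty enforcing exactly one cluster per point; the penalty weight is an assumption that must dominate the distance scale:

```python
import numpy as np

def clustering_qubo(D, K, penalty):
    """Build a QUBO for assigning n points to K clusters.

    D: (n, n) pairwise distance matrix.
    Objective: sum of within-cluster pairwise distances, plus a
    penalty term from expanding (sum_k x_ik - 1)^2 for each point.
    """
    n = D.shape[0]
    Q = np.zeros((n * K, n * K))
    for k in range(K):
        for i in range(n):
            for j in range(i + 1, n):
                Q[i * K + k, j * K + k] += D[i, j]   # within-cluster cost
    for i in range(n):
        for k in range(K):
            Q[i * K + k, i * K + k] -= penalty       # linear part (x^2 = x)
            for l in range(k + 1, K):
                Q[i * K + k, i * K + l] += 2 * penalty  # cross terms
    return Q
```

The resulting upper-triangular matrix Q defines an energy xᵀQx (up to a constant), which a quantum annealer or a classical QUBO solver such as qbsolv can then minimize.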

9.
10.
To cluster web documents, all of which contain the same named entities, we attempted to use existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, it turned out that these algorithms are not effective for clustering web documents. According to our intensive investigation, clustering such web pages is more complicated because (1) the number of clusters (the ground truth) is larger than the two or three clusters of typical clustering problems, and (2) the clusters in the data set have extremely skewed distributions of cluster sizes. To overcome this problem, in this paper we propose an effective clustering algorithm that boosts the accuracy of the K-means and spectral clustering algorithms. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to correctly cluster web documents. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
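The core bisection step based on normalized cuts can be sketched as a spectral split of the similarity graph; the paper's full algorithm applies such splits and merges repeatedly, so this standalone version is only illustrative:

```python
import numpy as np

def ncut_bisect(W):
    """Split a symmetric similarity graph W (n x n) into two parts by
    thresholding the second eigenvector of the normalized Laplacian
    (the standard relaxation of the normalized cut)."""
    d = W.sum(1)
    d_isqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_isqrt[:, None] * W * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    fiedler = vecs[:, 1] * d_isqrt          # map back from the D^{1/2} scaling
    return fiedler > np.median(fiedler)     # boolean split into two clusters
```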

11.
Clustering is one of the widely used knowledge discovery techniques for revealing structures in a dataset that can be extremely useful to the analyst. In iterative clustering algorithms, the procedure adopted for choosing initial cluster centers is extremely important, as it has a direct impact on the formation of the final clusters. Since clusters are separated groups in a feature space, it is desirable to select initial centers which are well separated. In this paper, we propose an algorithm to compute initial cluster centers for the k-means algorithm. The algorithm is applied to several datasets of different dimensions for illustrative purposes. It is observed that the newly proposed algorithm performs well in obtaining initial cluster centers for the k-means algorithm.
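The paper's own seeding rule differs in detail, but the classic maximin heuristic below illustrates the "well-separated initial centers" idea it builds on:

```python
import numpy as np

def maximin_centers(X, k, seed=0):
    """Pick k well-separated initial centers: start from a random point,
    then repeatedly take the point farthest from all chosen centers."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min(((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1), 1)
        centers.append(X[d2.argmax()])
    return np.array(centers)
```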

12.
To address the poor edge-preservation of the standard median filter, a K-means median filter and a recursive K-means median filter are designed. The K-means method splits the data sequence in the neighborhood of the center pixel into two classes, and the median of the class to which the center pixel belongs is taken as the output. To speed up the algorithm, two thresholds, TL and TH, are introduced, which minimize processing time without affecting the accuracy of the result. Since the traditional K-means step is itself time-consuming, another improvement in this paper is an optimized K-means procedure, which greatly shortens the processing time. Experiments show that the improved algorithms have good noise-suppression and edge-preservation capabilities.
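A sketch of the (non-recursive) K-means median filter described above: split each 3x3 neighborhood into two intensity classes with a tiny 1-D 2-means, then output the median of the class containing the center pixel. The speed-up thresholds TL and TH and the optimized K-means step are omitted:

```python
import numpy as np

def kmeans_median_filter(img):
    """Edge-preserving median filter on a 2-D grayscale image."""
    out = img.astype(float)
    H, W = img.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            win = img[y - 1:y + 2, x - 1:x + 2].astype(float).ravel()
            c = np.array([win.min(), win.max()])        # initial 2 centers
            for _ in range(10):                         # tiny 1-D 2-means
                lab = np.abs(win[:, None] - c[None]).argmin(1)
                new = np.array([win[lab == j].mean() if (lab == j).any()
                                else c[j] for j in range(2)])
                if np.allclose(new, c):
                    break
                c = new
            center_lab = lab[4]                         # center pixel's class
            out[y, x] = np.median(win[lab == center_lab])
    return out
```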

13.
This paper proposes a novel intuitionistic fuzzy c-least squares support vector regression (IFC-LSSVR) with a Sammon mapping clustering algorithm. Sammon mapping effectively reduces the complexity of raw data, intuitionistic fuzzy sets (IFSs) can effectively tune the membership of data points, and LSSVR improves the conventional fuzzy c-regression model. The proposed clustering algorithm combines the advantages of IFSs, LSSVR and Sammon mapping for solving practical clustering problems. Moreover, IFC-LSSVR with Sammon mapping adopts particle swarm optimization to obtain optimal parameters. Experiments conducted on a web-based adaptive learning environment and a dataset of wheat varieties demonstrate that the proposed algorithm is more efficient than conventional algorithms, such as the k-means (KM) and fuzzy c-means (FCM) clustering algorithms, on standard measurement indexes. This study thus demonstrates that the proposed model is a credible fuzzy clustering algorithm. The novel method contributes not only to the theoretical aspects of fuzzy clustering, but is also widely applicable in data mining, image systems, rule-based expert systems and prediction problems.

14.
During the last decade, artificial immune systems have drawn much of researchers' attention. The work done so far has produced many interesting algorithms that are useful for solving engineering problems such as data mining and analysis, anomaly detection and many others. Although constantly developed and improved, algorithms based on immune metaphors still have some limitations. In this paper we elaborate on the concept of a novel artificial immune algorithm by considering the possibility of combining the clonal selection principle with the well-known K-means algorithm. This novel approach, together with a new way of performing suppression in clonal selection (based on the usefulness of the evolving lymphocytes), results in a very effective and stable immune algorithm for both unsupervised and supervised learning. Further improvements to cluster analysis by means of the proposed algorithm, immune K-means, are introduced. Different methods for cluster construction are compared, together with a multi-point cluster validity index and a novel strategy based on the minimum spanning tree (MST) and an analysis of the midpoints of its edges. Interesting and useful improvements of the proposed approach by means of negative selection algorithms are also proposed and discussed.

15.
In this paper, we present a modified filtering algorithm (MFA) that makes use of center variations to speed up the clustering process. Our method first divides clusters into static and active groups. We use the information of cluster displacements to reject unlikely cluster centers for all nodes in the kd-tree. We reduce the computational complexity of the filtering algorithm (FA) by finding candidates for each node mainly from the set of active cluster centers. Two conditions for determining the set of candidate cluster centers for each node from active clusters are developed. Our approach differs from the major available algorithm, which passes no information from one stage of iteration to the next. Theoretical analysis shows that our method can reduce the computational complexity, in terms of the number of distance calculations, of FA at each stage of iteration by a factor of FC/AC, where FC and AC are the numbers of total clusters and active clusters, respectively. Compared with the FA, our algorithm can effectively reduce the computing time and number of distance calculations. It is noted that our proposed algorithm generates the same clusters as those produced by hard k-means clustering. The superiority of our method is more remarkable when a larger data set with higher dimension is used.

16.
A Semi-Supervised Improved K-Means Clustering Algorithm
The K-means clustering algorithm requires the number of clusters in advance, and randomly selecting the initial cluster centers makes the clustering results unstable and prone to terminating at a local optimum. This paper proposes an improved K-means clustering algorithm based on semi-supervised learning theory: a small amount of labeled data is used to build a minimum spanning tree of a graph, which is iteratively split to obtain the number of clusters and the initial cluster centers that K-means requires. Experiments on the IRIS dataset show that although the spanning trees constructed from different random samples differ, and the resulting initial centers differ as well, the clustering is consistent and stable and converges in few iterations, validating the effectiveness of the proposed algorithm.
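A sketch of the MST-based seeding idea, assuming SciPy and scikit-learn; the cut rule used here (dropping edges much longer than the mean edge) is an illustrative simplification of the paper's iterative splitting, and cut_factor is a hypothetical parameter:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def mst_seeded_kmeans(X, X_labeled, cut_factor=2.0):
    """Estimate K and initial centers from a small labeled sample:
    build its minimum spanning tree, cut edges much longer than the
    average edge, and use the component means as k-means seeds."""
    mst = minimum_spanning_tree(squareform(pdist(X_labeled)))
    w = mst.data
    mst.data[w > cut_factor * w.mean()] = 0      # split at long edges
    mst.eliminate_zeros()
    K, comp = connected_components(mst, directed=False)
    seeds = np.array([X_labeled[comp == c].mean(0) for c in range(K)])
    return KMeans(n_clusters=K, init=seeds, n_init=1).fit(X)
```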

17.
Speed-density relationships are used by mesoscopic traffic simulators to represent traffic dynamics. While classical speed-density relationships provide useful insights into the traffic dynamics problem, they may be restrictive for such applications. This paper addresses the problem of calibrating the parameters of speed-density relationships using data mining techniques, and proposes a novel hierarchical clustering algorithm based on K-means clustering. By combining K-means with agglomerative hierarchical clustering, the proposed algorithm reduces the early-stage errors inherent in agglomerative hierarchical clustering, resulting in improved clustering performance. Moreover, in order to improve the precision of the parametric calibration, densities and flows are used as variables. The proposed approach is tested against sensor data captured from the 3rd Ring Road of Beijing. The testing results show that the performance of our algorithm is better than that of existing solutions.
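A minimal sketch of the k-means/agglomerative hybrid: over-segment first with k-means, then merge the centroids hierarchically, so the error-prone early merges of pure agglomeration operate on smoothed centroids instead of raw points. For the calibration task above, X would hold (density, flow) observations; the group counts below are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def kmeans_then_agglomerative(X, n_pre=50, n_final=5):
    """Two-stage hybrid clustering."""
    km = KMeans(n_clusters=n_pre, n_init=10).fit(X)      # over-segment
    agg = AgglomerativeClustering(n_clusters=n_final).fit(km.cluster_centers_)
    # Map each point to the final cluster of its k-means centroid.
    return agg.labels_[km.labels_]
```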

18.
Clustering a large volume of data in a distributed environment is a challenging issue. Data stored across multiple machines are huge in size, and the solution space is large. Genetic algorithms deal effectively with large solution spaces and provide better solutions. In this paper, we propose a novel clustering algorithm for distributed datasets that combines a genetic algorithm (GA) using the Mahalanobis distance with the k-means clustering algorithm. The proposed algorithm has two phases: in phase 1, the GA is applied in parallel on data chunks located across different machines. The Mahalanobis distance is used as the fitness value in the GA; it takes the covariance between the data points into account and thus provides a better representation of the initial data. K-means with k-means++ initialization is applied in phase 2 on the intermediate output to get the final result. The proposed algorithm is implemented on the Hadoop framework, which is inherently designed to deal with distributed datasets in a fault-tolerant manner. Extensive experiments were conducted on multiple real-life and synthetic datasets to measure the performance of our proposed algorithm. The results were compared with MapReduce-based algorithms: mrk-means, parallel k-means and scaling GA.

19.
In practical cluster analysis tasks, an efficient clustering algorithm should be less sensitive to parameter configurations and should tolerate the existence of outliers. Based on the neural gas (NG) network framework, we propose an efficient prototype-based clustering (PBC) algorithm called the enhanced neural gas (ENG) network. Several problems associated with traditional PBC algorithms and the original NG algorithm, such as sensitivity to initialization, sensitivity to input sequence ordering and the adverse influence of outliers, can be effectively tackled in our new scheme. In addition, our new algorithm can establish the topology relationships among the prototypes, and all topology-wise badly located prototypes can be relocated to represent more meaningful regions. Experimental results on synthetic and UCI datasets show that our algorithm possesses superior performance in comparison to several PBC algorithms and their improved variants, such as hard c-means, fuzzy c-means, NG, fuzzy possibilistic c-means, credibilistic fuzzy c-means, hard/fuzzy robust clustering and alternative hard/fuzzy c-means, in static data clustering tasks with a fixed number of prototypes.
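For reference, the baseline neural gas update that ENG builds on: every prototype moves toward each sample with a strength that decays exponentially in its distance rank, while the step size and neighborhood range anneal over time (the topology links and prototype relocation of ENG are not shown; the schedules are illustrative):

```python
import numpy as np

def neural_gas(X, n_proto=10, epochs=20, lr=(0.5, 0.01), lam=(10.0, 0.1), seed=0):
    """Classic neural gas prototype adaptation."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_proto, replace=False)].astype(float)
    T, t = epochs * len(X), 0
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            frac = t / T
            eps = lr[0] * (lr[1] / lr[0]) ** frac        # annealed step size
            lam_t = lam[0] * (lam[1] / lam[0]) ** frac   # annealed range
            # Rank prototypes by distance to the sample (0 = closest).
            ranks = np.argsort(np.argsort(((W - x) ** 2).sum(1)))
            W += eps * np.exp(-ranks / lam_t)[:, None] * (x - W)
            t += 1
    return W
```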

20.
This paper proposes a new method to weight subspaces in feature groups and individual features for clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups based on their natural characteristics. Two types of weights are introduced into the clustering process to simultaneously identify the importance of feature groups and individual features in each cluster. A new optimization model is given to define the optimization process, and a new clustering algorithm, FG-k-means, is proposed to solve the optimization model. The new algorithm extends k-means by adding two additional steps that automatically calculate the two types of subspace weights. A new data generation method is presented to generate high-dimensional data with clusters in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data show that the FG-k-means algorithm significantly outperformed four k-means type algorithms, i.e., k-means, W-k-means, LAC and EWKM, in almost all experiments. The new algorithm is robust to noise and missing values, which commonly exist in high-dimensional data.
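A sketch of the two-level entropy-style weighting such an algorithm can use for one cluster: features with low within-cluster dispersion earn high weights inside their group, and groups with low total dispersion earn high group weights. The exact FG-k-means update differs in detail; lam_g and lam_v are illustrative regularization parameters:

```python
import numpy as np

def fg_weights(X, member, center, groups, lam_g=1.0, lam_v=1.0):
    """Two-level subspace weights for a single cluster (sketch).

    member: boolean mask of the cluster's points; center: (d,) centroid;
    groups: list of index arrays partitioning the feature set.
    """
    pts = X[member]                                  # members of this cluster
    disp = ((pts - center) ** 2).sum(0)              # per-feature dispersion
    v = np.empty_like(disp)
    gd = np.empty(len(groups))
    for gi, idx in enumerate(groups):
        e = np.exp(-disp[idx] / lam_v)
        v[idx] = e / e.sum()                         # feature weights in group
        gd[gi] = disp[idx].sum()                     # group-level dispersion
    eg = np.exp(-gd / lam_g)
    return eg / eg.sum(), v                          # group weights, feature weights
```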
