Similar Documents
20 similar documents found (search time: 509 ms)
1.
The clustering criterion used by the K-means algorithm is obtained by directly summing the squared errors of all clusters in the data set, so it cannot effectively handle data sets whose clusters differ greatly in density and size. To address this, the clustering criterion of K-means is redefined as a weighted sum of within-cluster standard deviations, where the weight of each cluster is the proportion of its data objects in the total. The rule for reassigning data objects to clusters is adjusted accordingly: a weighted distance from a data object to a center replaces the plain distance of traditional K-means, and each object is assigned to the cluster whose center minimizes this weighted distance. Experimental results show that, on synthetic data sets, the improved K-means markedly reduces the chance that objects of a large, sparse cluster are wrongly assigned to an adjacent small, dense cluster, improving clustering quality; on UCI data sets, the improved algorithm produces more compact clusters, confirming its effectiveness.
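A minimal sketch of the idea described above. The abstract does not give the exact form of the weighted distance, so this sketch normalizes the Euclidean distance by each cluster's dispersion as one plausible instantiation; it is not the paper's code.

```python
import numpy as np

def weighted_kmeans(X, k, iters=100, seed=0):
    """Dispersion-aware K-means assignment sketch: dividing the distance by the
    cluster's standard deviation keeps fringe points of a large sparse cluster
    from being captured by a small dense neighbor."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        # per-cluster dispersion (std of member distances to the center)
        sigma = np.array([d[labels == j, j].std() + 1e-9
                          if np.any(labels == j) else 1.0 for j in range(k)])
        new = np.argmin(d / sigma, axis=1)          # dispersion-weighted distance
        if np.array_equal(new, labels):
            break
        labels = new
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```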

2.
樊仲欣  王兴  苗春生 《计算机应用》2019,39(4):1027-1031
To address three problems of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm — its clustering result depends on the insertion order of data objects, it handles non-spherical clusters poorly, and the cluster-diameter threshold forces every cluster to contain a similar number of objects — an improved BIRCH algorithm is proposed. It replaces the cluster-diameter threshold with connectivity-distance and connectivity-strength thresholds that describe the connectivity between individual data objects, and it folds a cluster-merging step into the construction of the clustering-feature tree. Experiments on a custom data set and on the iris, wine and pendigits data sets show higher clustering accuracy than existing improvements such as multi-threshold BIRCH and density-improved BIRCH; on large data sets in particular, accuracy is 6 percentage points higher than density-improved BIRCH while runtime drops by 61%. The algorithm is thus suitable for online real-time incremental data, can recognize non-spherical clusters and clusters of uneven size, has a denoising capability, and markedly reduces time and space complexity.
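A toy illustration of the connectivity criterion only (the paper's CF-tree machinery is not reproduced here, and `conn_dist`/`strength` are assumed parameter names): points closer than the connectivity distance are linked, points with at least `strength` links propagate cluster membership, and connected components become clusters.

```python
import numpy as np

def connectivity_clustering(X, conn_dist, strength):
    """Stand-in for the connectivity thresholds: link close points, keep only
    well-connected ones, and return connected components; points with too few
    links are treated as noise (label -1)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    adj = (d < conn_dist) & ~np.eye(n, dtype=bool)
    degree = adj.sum(axis=1)                 # "connectivity strength" per point
    parent = list(range(n))

    def find(a):                             # union-find root lookup
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j] and degree[i] >= strength and degree[j] >= strength:
                parent[find(i)] = find(j)    # merge components

    labels, roots = np.full(n, -1), {}
    for i in range(n):
        if degree[i] >= strength:
            labels[i] = roots.setdefault(find(i), len(roots))
    return labels
```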

3.
In this paper, we extend the conventional vector quantization by incorporating a vigilance parameter, which steers the tradeoff between plasticity and stability during incremental online learning. This is motivated in the adaptive resonance theory (ART) network approach and is exploited in our paper for forming a one-pass incremental and evolving variant of vector quantization. This variant can be applied for online clustering, classification and approximation tasks with an unknown number of clusters. Additionally, two novel extensions are described: one concerns the incorporation of the sphere of influence of clusters in the vector quantization learning process by selecting the ‘winning cluster’ based on the distances of a data point to the surface of all clusters. Another one introduces a deletion of cluster satellites and an online split-and-merge strategy: clusters are dynamically split and merged after each incremental learning step. Both strategies prevent the algorithm to generate a wrong cluster partition due to a bad a priori setting of the most essential parameter(s). The extensions will be applied to clustering of two- and high-dimensional data, within an image classification framework and for model-based fault detection based on data-driven evolving fuzzy models.  相似文献   
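A compact one-pass sketch of vector quantization with an ART-style vigilance test, assuming Euclidean distance to the prototype (the paper's surface-distance and split-and-merge extensions are omitted; the parameter name `vigilance` is illustrative):

```python
import numpy as np

def evolving_vq(stream, vigilance):
    """One-pass evolving vector quantization: each sample either adapts the
    nearest prototype (if within the vigilance radius) or spawns a new cluster,
    so the number of clusters need not be fixed in advance."""
    centers, counts = [], []
    for x in stream:
        if centers:
            d = [np.linalg.norm(x - c) for c in centers]
            w = int(np.argmin(d))
            if d[w] <= vigilance:                        # resonance: adapt winner
                counts[w] += 1
                centers[w] += (x - centers[w]) / counts[w]  # decaying step size
                continue
        centers.append(np.array(x, dtype=float))          # novelty: new cluster
        counts.append(1)
    return np.array(centers), counts
```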

4.
A density-based online clustering algorithm for spatial data streams
于彦伟  王沁  邝俊  何杰 《自动化学报》2012,38(6):1051-1059
To cluster arbitrarily shaped clusters in spatial data streams, an on-line density-based clustering algorithm for spatial data streams (OLDStream) is proposed. The algorithm clusters incremental spatial data on top of previous clustering results, performing a local clustering update only for each newly arrived spatial point and the neighborhood data that satisfy the core-point condition; this lowers the time complexity of updates and achieves online clustering of spatial data streams. OLDStream processes large-scale spatial data streams quickly, retrieves global clusters of arbitrary shape in real time, is insensitive to the input order of the stream, and can detect outliers. Comprehensive experiments on real and synthetic data verify the clustering quality, high efficiency and good scalability of the algorithm; statistical analysis of the results further shows that only 4% of the spatial points incur the worst-case running time, and the average clustering time per spatial point is about 0.033 ms.
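A simplified sketch of the local-update idea with assumed parameter names (the published OLDStream contains considerably more machinery): each arriving point examines only its eps-neighborhood, and clusters are merged when a new core point bridges them.

```python
import numpy as np

class IncrementalDensityClustering:
    """Sketch of incremental density-based clustering: arriving points are
    clustered by a local neighborhood test instead of a global re-scan."""
    def __init__(self, eps, min_pts):
        self.eps, self.min_pts = eps, min_pts
        self.points, self.labels, self.next_label = [], [], 0

    def add(self, x):
        x = np.asarray(x, dtype=float)
        nbrs = [i for i, p in enumerate(self.points)
                if np.linalg.norm(p - x) <= self.eps]
        self.points.append(x)
        if len(nbrs) + 1 < self.min_pts:      # not a core point (yet)
            self.labels.append(-1)            # provisional outlier
            return
        nbr_labels = {self.labels[i] for i in nbrs} - {-1}
        label = min(nbr_labels) if nbr_labels else self.next_label
        if not nbr_labels:
            self.next_label += 1
        # merge bridged clusters; a production version would use union-find
        # to keep this step local rather than scanning all labels
        for i in range(len(self.labels)):
            if self.labels[i] in nbr_labels:
                self.labels[i] = label
        for i in nbrs:
            if self.labels[i] == -1:
                self.labels[i] = label
        self.labels.append(label)
```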

5.
In this paper we propose a system for localization of cephalometric landmarks. The process of localization is carried out in two steps: deriving a smaller expectation window for each landmark using a trained neuro-fuzzy system (NFS), then applying a template-matching algorithm to pinpoint the exact location of the landmark. Four points are located on each image using edge detection. The four points are used to extract more features such as distances, shifts and rotation angles of the skull. A limited number of representative groups to be used for training are selected based on k-means clustering. The most effective features are selected based on a Fisher discriminant for each feature set. Using fuzzy linguistic if-then rules, a membership degree is assigned to each of the selected features and fed to the NFS. The NFS is trained, utilizing gradient descent, to learn the relation between the sizes, rotations and translations of landmarks and their locations. The data for training is obtained manually from one image from each cluster. Images whose features are located closer to the center of their cluster are used for extracting data for the training set. The expected locations on target images can then be predicted using the trained NFS. For each landmark a parametric template space is constructed from a set of templates extracted from several images based on the clarity of the landmark in that image. The template is matched to the search windows to find the exact location of the landmark. Decomposition of landmark shapes is used to desensitize the algorithm to size differences. The system is trained to locate 20 landmarks on a database of 565 images. Preliminary results show a recognition rate of more than 90%.

6.
李曙光  周彤 《计算机科学》2011,38(11):241-244
The bounded clustering problem originates from System S, a distributed stream processing system developed at IBM Research. The input is an undirected graph with both vertex and edge weights, together with several designated vertices called terminals. A subset of the vertex set is called a cluster; the cost of a cluster is the total weight of its vertices plus the total weight of the edges on its boundary. The bounded clustering problem asks for a clustering of all vertices such that the cost of each cluster does not exceed a given budget B and each cluster contains at most one terminal, minimizing the total cost of all clusters. For the bounded clustering problem on graphs of bounded treewidth, an exact pseudo-polynomial-time algorithm is given. By modifying this algorithm with a rounding technique, a (1+ε)-approximate solution can be obtained in polynomial time in which the cost of each cluster is at most (1+ε)B, where ε is an arbitrarily small positive number. If, in addition, each cluster is required to contain exactly one terminal, the algorithm yields in polynomial time a (1+ε)-approximate solution in which the cost of each cluster is at most (2+ε)B.
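In symbols, the budget constraint that each cluster must satisfy is (reconstructed from the description above, with δ(C) denoting the set of boundary edges of C):

\[
  \mathrm{cost}(C) \;=\; \sum_{v \in C} w(v) \;+\; \sum_{e \in \delta(C)} w(e) \;\le\; B ,
\]

and the objective is to minimize the sum of cost(C) over all clusters, each containing at most one terminal.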

7.
Because spatial databases usually contain massive amounts of data, an ordinary spatial query is likely to suffer from the many-query-results problem. To address this, an automatic classification method for spatial query results is proposed. In the offline phase, the coupling relationship between spatial objects is estimated from their location proximity and semantic relevance, and on this basis the objects are clustered with a probability-density estimation method, each cluster representing one type of user need. In the online query-processing phase, for a given spatial query, a query-result classification tree is generated dynamically over the result set with an improved C4.5 decision tree algorithm; by inspecting the labels on the branches of this tree, the user can progressively narrow down to the spatial objects of interest. Experimental results show that the proposed clustering method effectively captures the semantic and locational proximity of spatial objects, and that the classification method achieves good classification quality at a low search cost.

8.
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered as a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality, well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques.
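A minimal spkmeans sketch in NumPy, assuming the rows of X are (or are normalized to be) unit vectors; the frequency-sensitive variants in the paper additionally penalize large clusters in the assignment step, which is not shown here.

```python
import numpy as np

def spkmeans(X, k, iters=50, seed=0):
    """Spherical k-means sketch: inputs and centers live on the unit sphere,
    and assignment maximizes cosine similarity instead of minimizing
    Euclidean distance (the standard algorithm, not the paper's code)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)            # cosine similarity
        for j in range(k):
            if np.any(labels == j):
                s = X[labels == j].sum(axis=0)
                C[j] = s / (np.linalg.norm(s) + 1e-12)  # re-normalize center
    return labels, C
```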

9.
We consider a strongly NP-hard problem of partitioning a finite sequence of vectors in a Euclidean space into two clusters of given size with the criterion of minimizing the total sum of squared distances from cluster elements to their centers. The center of the first cluster is subject to optimization, defined by the mean value of all vectors in this cluster. The center of the second cluster is fixed at the origin. The partition is subject to the following condition: the difference between indices of two subsequent vectors included in the first cluster is bounded from above and below by given constants. We propose an exact pseudopolynomial algorithm for the case of a problem where the dimension of the space is fixed, and components of input vectors are integer-valued.
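The criterion being minimized, written out from the description above (the centroid of the first cluster is optimized; the center of the second is fixed at the origin):

\[
  F(\mathcal{C}_1, \mathcal{C}_2) \;=\;
  \sum_{x \in \mathcal{C}_1} \bigl\lVert x - \bar{y}(\mathcal{C}_1) \bigr\rVert^2
  \;+\; \sum_{x \in \mathcal{C}_2} \lVert x \rVert^2 ,
  \qquad
  \bar{y}(\mathcal{C}_1) = \frac{1}{|\mathcal{C}_1|} \sum_{x \in \mathcal{C}_1} x ,
\]

subject to the stated bounds on the index gap between consecutive vectors of the first cluster.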

10.
Clustering, as an unsupervised learning method, usually requires the number of clusters to be supplied by the user. When prior knowledge is lacking, specifying clustering parameters by hand is impractical. Cluster validity indices, studied in recent years, estimate the number of clusters and the quality of a clustering. This paper proposes a new clustering algorithm based on a validity index that requires no clustering parameters. At each step the algorithm merges the two clusters whose merge increases the index value the most (or decreases it the least). A gravity model is used to measure similarity, and potential outliers are handled with a uniformization treatment. Experiments show that the algorithm correctly discovers the number of clusters in the test data and, compared with other clustering methods, achieves a lower error rate while being more efficient than typical validity-index-based clustering algorithms.
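A sketch of the merge loop, using scikit-learn's silhouette score as an assumed stand-in for the paper's validity index (the gravity-model similarity and outlier treatment are not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def merge_by_validity(X, init_k=10, seed=0):
    """Start from many small clusters and repeatedly merge the pair whose
    merge yields the best validity score; stop when no merge improves it.
    The stopping rule is an assumption of this sketch."""
    labels = KMeans(n_clusters=init_k, n_init=10, random_state=seed).fit_predict(X)
    best = silhouette_score(X, labels)
    while len(set(labels)) > 2:
        ids, cand = sorted(set(labels)), None
        for a in ids:
            for b in ids:
                if a < b:
                    trial = np.where(labels == b, a, labels)  # merge b into a
                    s = silhouette_score(X, trial)
                    if cand is None or s > cand[0]:
                        cand = (s, trial)
        if cand[0] < best:        # no merge improves the index: stop
            break
        best, labels = cand
    return labels
```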

11.
Research on boundary detection techniques with a clustering capability
To detect the boundary points of clusters quickly and effectively, a new triangulation-based cluster-boundary detection algorithm, DTBOUND, is proposed. The algorithm computes the coefficient of variation of every data point in the triangulation graph to divide the data set into internal and external points; it then performs a depth-first traversal from each unclassified internal point, taking the connected internal points, together with the external points connected to them, as one cluster; finally, the boundary points are extracted from the resulting clusters. The algorithm has only one parameter (the coefficient-of-variation threshold β). Experimental results show that it quickly and effectively identifies clusters of arbitrary shape, size and density, along with their boundary points.

12.
The Clustered Vehicle Routing Problem (CluVRP) is a variant of the Capacitated Vehicle Routing Problem in which customers are grouped into clusters. Each cluster has to be visited once, and a vehicle entering a cluster cannot leave it until all customers have been visited. This paper presents two alternative hybrid metaheuristic algorithms for the CluVRP. The first algorithm is based on an Iterated Local Search algorithm, in which only feasible solutions are explored and problem-specific local search moves are utilized. The second algorithm is a hybrid genetic search, in which the shortest Hamiltonian path between each pair of vertices within each cluster is precomputed. Using this information, a sequence of clusters can be used as a solution representation and large neighborhoods can be efficiently explored, by means of bi-directional dynamic programming, sequence concatenation, and appropriate data structures. Extensive computational experiments are performed on benchmark instances from the literature, as well as new large scale instances. Recommendations on the choice of algorithm are provided, based on average cluster size.

13.
To identify residents' home and work locations accurately from mobile-terminal location data, an identification method based on spatiotemporally constrained density clustering is proposed. First, a K-means-based DBSCAN (density-based spatial clustering of applications with noise) spatiotemporal stay-point clustering step divides a resident's multi-day raw trajectory points into spatiotemporal stay-point clusters. Second, a speed-threshold-based step distinguishes each spatiotemporal stay-point cluster as either a stay cluster or a move cluster. Third, a K-nearest-distance-based DBSCAN step groups the stay points into important stay-point clusters. Finally, a KD-tree-optimized KNN (K-nearest neighbor) identification step labels each important stay point as a workplace, a residence, a combined work-residence area, or a place of interest. Experimental results show that every step of the method is reasonable and effective, and that the final identification outperforms the time-threshold, accumulated-time and information-entropy methods.
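A condensed sketch of the first two steps, assuming trajectory rows of (timestamp_s, x, y) in meters sorted by time; scikit-learn's plain DBSCAN is a stand-in for the customized variants in the paper, and the parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def staypoint_clusters(traj, eps_m=50, min_pts=5, speed_mps=1.0):
    """Step 1: density clustering over (x, y) groups raw fixes into candidate
    stay-point clusters. Step 2: each cluster is classified as 'stay' or
    'move' by its average speed (path length / time span)."""
    t, xy = traj[:, 0], traj[:, 1:3]
    labels = DBSCAN(eps=eps_m, min_samples=min_pts).fit_predict(xy)
    kinds = {}
    for c in set(labels) - {-1}:                 # -1 marks DBSCAN noise
        idx = np.flatnonzero(labels == c)
        span = t[idx].max() - t[idx].min() + 1e-9
        path = np.linalg.norm(np.diff(xy[idx], axis=0), axis=1).sum()
        kinds[c] = "stay" if path / span < speed_mps else "move"
    return labels, kinds
```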

14.
Visibility determination is one of the oldest problems in computer graphics. The visibility, in terms of back-to-front polygon visibility ordering, is determined by updating a priority list as the viewpoint moves. A new list-priority algorithm, utilizing a property of Voronoi diagrams, is proposed in this paper. The operation is in two phases. First, in a pre-processing phase the scene is divided into Voronoi cells. A sub-list associated with each cell contains references to those polygons that intersect with it. The polygons are assigned a fixed set of view-independent priority orders within the cluster. Second, an interactive phase sorts the clusters according to the depth value of each Voronoi site. The most time-consuming work is performed during the pre-processing phase, which only has to be executed once for the scene. Since all the polygons in a cell are pre-computed to obtain the fixed priority order within the cluster, a relatively simple task is left in the interactive phase: the clusters merely have to be re-sorted whenever the viewpoint changes. This method offers performance benefits over previous BSP-based methods.

15.
Continually advancing technology has made it feasible to capture data online for onward transmission as a steady flow of newly generated data points, termed a data stream. The continuity and unboundedness of data streams make storage of data and multiple scans of data an impractical proposition for the purpose of knowledge discovery. The need to learn structures from data in a streaming environment has been a driving force for making clustering a popular technique for knowledge discovery from data streams. The continuous nature of streaming data makes it infeasible to look for point membership among the clusters discovered so far, necessitating a synopsis structure to consolidate incoming data points. This synopsis is exploited for building a clustering scheme to meet subsequent user demands. The proposed Exclusive and Complete Clustering (ExCC) algorithm captures non-overlapping clusters in data streams with mixed attributes, such that each point either belongs to some cluster or is an outlier/noise. It deploys a fixed-granularity grid structure as the synopsis and performs clustering by coalescing dense regions in the grid. Speed-based pruning is applied to the synopsis prior to clustering to ensure currency of the discovered clusters. Extensive experimentation demonstrates that the algorithm is robust, identifies succinct outliers on-the-fly and is adaptive to changes in the data distribution. The ExCC algorithm is further evaluated for performance and compared with other contemporary algorithms.
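A small sketch of the grid-synopsis idea with assumed parameter names (2-D for brevity): points are consolidated into fixed-granularity cells, and clustering coalesces dense neighboring cells.

```python
import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, cell=1.0, density=5):
    """ExCC-style sketch: hash points into a fixed grid (the synopsis), mark
    cells holding >= density points as dense, and flood-fill adjacent dense
    cells into clusters; points in sparse cells are reported as outliers."""
    counts = defaultdict(int)
    for x in X:
        counts[tuple((x // cell).astype(int))] += 1
    dense = {c for c, n in counts.items() if n >= density}
    label, labels = 0, {}
    for c in dense:
        if c in labels:
            continue
        q = deque([c]); labels[c] = label
        while q:                                   # coalesce adjacent dense cells
            cur = q.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cur[0] + dx, cur[1] + dy)
                    if nb in dense and nb not in labels:
                        labels[nb] = label; q.append(nb)
        label += 1
    return np.array([labels.get(tuple((x // cell).astype(int)), -1) for x in X])
```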

16.
Clustering Large Graphs via the Singular Value Decomposition
We consider the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem is NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to get a 2-approximation algorithm for the original problem. We then argue that in fact the relaxation provides a generalized clustering which is useful in its own right. Finally, we show that the SVD of a random submatrix—chosen according to a suitable probability distribution—of a given matrix provides an approximation to the SVD of the whole matrix, thus yielding a very fast randomized algorithm. We expect this algorithm to be the main contribution of this paper, since it can be applied to problems of very large size which typically arise in modern applications.
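The continuous relaxation in a few lines of NumPy: the best rank-k subspace comes from the top k right singular vectors, and clustering the projected points is the step the 2-approximation argument builds on (a sketch, not the paper's randomized algorithm):

```python
import numpy as np

def svd_relaxed_kmeans(A, k, iters=50, seed=0):
    """Project the rows of A onto their best rank-k subspace (truncated SVD),
    then run plain Lloyd iterations in that subspace."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    P = A @ Vt[:k].T                               # coordinates in the subspace
    rng = np.random.default_rng(seed)
    C = P[rng.choice(len(P), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((P[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = P[labels == j].mean(axis=0)
    return labels
```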

17.
We consider the strongly NP-hard problem of partitioning a finite set of Euclidean points into two clusters so as to minimize the sum (over both clusters) of the weighted sums of the squared intra-cluster distances from the elements of the cluster to its center. The weights of the sums are equal to the cardinalities of the clusters. The center of one of the clusters is given as input, while the center of the other cluster is unknown and is determined as the mean value over all points in this cluster, i.e., as the geometric center (centroid). The version of the problem with constrained cardinalities of the clusters is analyzed. We construct an approximation algorithm for the problem and show that it is a fully polynomial-time approximation scheme (FPTAS) if the space dimension is bounded by a constant.
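The weighted criterion in symbols, reconstructed from the description above (b is the given center; the other center is the centroid of its cluster, and the weights are the cluster cardinalities):

\[
  F(\mathcal{C}_1, \mathcal{C}_2) \;=\;
  |\mathcal{C}_1| \sum_{x \in \mathcal{C}_1} \bigl\lVert x - \bar{y}(\mathcal{C}_1) \bigr\rVert^2
  \;+\; |\mathcal{C}_2| \sum_{x \in \mathcal{C}_2} \lVert x - b \rVert^2 .
\]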

18.
Fuzzy clustering is a widely applied method for extracting the underlying models within data. It has been applied successfully in many real-world applications. Fuzzy c-means is one of the most popular fuzzy clustering methods because it produces reasonable results and its implementation is straightforward. One problem with all fuzzy clustering algorithms such as fuzzy c-means is that some data points which are assigned to some clusters have low membership values. It is possible that many samples may be assigned to a cluster with low confidence. In this paper, an efficient and noise-aware implementation of support vector machines, namely relaxed constraints support vector machines, is used to solve the mentioned problem and improve the performance of the fuzzy c-means algorithm. First, fuzzy c-means partitions data into appropriate clusters. Then, the samples with high membership values in each cluster are selected for training a multi-class relaxed constraints support vector machine classifier. Finally, the class labels of the remaining data points are predicted by the latter classifier. The performance of the proposed clustering method is evaluated by quantitative measures such as cluster entropy and Minkowski scores. Experimental results on real-life data sets show the superiority of the proposed method.
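A runnable sketch of the pipeline with a compact NumPy fuzzy c-means; scikit-learn's SVC stands in for the relaxed-constraints SVM, which is not a packaged library component.

```python
import numpy as np
from sklearn.svm import SVC

def fcm(X, k, m=2.0, iters=100, seed=0):
    """Plain fuzzy c-means returning the membership matrix U (n x k)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))
    for _ in range(iters):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]                  # cluster centers
        d = np.linalg.norm(X[:, None] - C[None], axis=2) + 1e-9
        p = 2 / (m - 1)
        U = (d ** -p) / (d ** -p).sum(axis=1, keepdims=True)    # FCM update
    return U

def fcm_then_svm(X, k, confident=0.8):
    """Train a classifier only on points FCM is sure about, then let it label
    the low-membership points (assumes each cluster has confident points)."""
    U = fcm(X, k)
    y, conf = U.argmax(1), U.max(1)
    sure = conf >= confident
    clf = SVC(kernel="rbf", gamma="scale").fit(X[sure], y[sure])
    labels = y.copy()
    labels[~sure] = clf.predict(X[~sure])
    return labels
```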

19.
A Possibilistic Fuzzy c-Means Clustering Algorithm
In 1997, we proposed the fuzzy-possibilistic c-means (FPCM) model and algorithm that generated both membership and typicality values when clustering unlabeled data. FPCM constrains the typicality values so that the sum over all data points of typicalities to a cluster is one. The row sum constraint produces unrealistic typicality values for large data sets. In this paper, we propose a new model called the possibilistic-fuzzy c-means (PFCM) model. PFCM produces memberships and possibilities simultaneously, along with the usual point prototypes or cluster centers for each cluster. PFCM is a hybridization of possibilistic c-means (PCM) and fuzzy c-means (FCM) that often avoids various problems of PCM, FCM and FPCM. PFCM solves the noise sensitivity defect of FCM, overcomes the coincident clusters problem of PCM and eliminates the row sum constraints of FPCM. We derive the first-order necessary conditions for extrema of the PFCM objective function, and use them as the basis for a standard alternating optimization approach to finding local minima of the PFCM objective function. Several numerical examples are given that compare FCM and PCM to PFCM. Our examples show that PFCM compares favorably to both of the previous models. Since PFCM prototypes are less sensitive to outliers and can avoid coincident clusters, PFCM is a strong candidate for fuzzy rule-based system identification.
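For reference, the PFCM objective whose first-order necessary conditions drive the alternating optimization can be written as (reconstructed from the published model; a and b trade off membership against typicality):

\[
  J(U, T, V) \;=\;
  \sum_{i=1}^{c} \sum_{k=1}^{n} \bigl( a\, u_{ik}^{m} + b\, t_{ik}^{\eta} \bigr)
  \lVert x_k - v_i \rVert^2
  \;+\; \sum_{i=1}^{c} \gamma_i \sum_{k=1}^{n} \bigl( 1 - t_{ik} \bigr)^{\eta} ,
  \qquad \sum_{i=1}^{c} u_{ik} = 1 \;\; \forall k ,
\]

where the memberships u are constrained as in FCM while the typicalities t are free of the FPCM row sum constraint.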

20.
In this study, a method based on the fuzzy c-means clustering algorithm is proposed for solving a capacitated multi-facility location problem in which known demand points are served from capacitated supply centres. It involves the integrated use of fuzzy c-means and convex programming. In fuzzy c-means, data points are allowed to belong to several clusters with different degrees of membership. This feature is used here to split demands between supply centers. The number of clusters is determined by an incremental method that starts with two clusters and stops when the capacity of each cluster is sufficient for its demand. Finally, each cluster given by fuzzy c-means is solved as a single-facility location problem by convex programming, which minimizes transportation cost and fine-tunes the facility location. The proposed method is applied to several facility location problems from the OR library (Osman & Christofides, 1994) and compared with centre-of-gravity and particle swarm optimization based algorithms. Numerical results for an asphalt producer's real-world data in Turkey are reported. Numerical results show that the proposed approach performs better than using the original fuzzy c-means or the integrated use of fuzzy c-means and center-of-gravity methods in terms of transportation costs.
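A sketch of the demand-splitting step only: FCM memberships apportion each demand point's load across the supply centres. This is a simplified stand-in for the full method, which grows the cluster count until capacities suffice and alternates with convex programming; function and parameter names are illustrative.

```python
import numpy as np

def split_demands(demand_xy, demand_qty, k, m=2.0, iters=100, seed=0):
    """Run fuzzy c-means on demand locations and use the membership matrix to
    split each point's demand across the k candidate supply centres."""
    X = np.asarray(demand_xy, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))
    for _ in range(iters):
        W = U ** m
        C = (W.T @ X) / W.sum(axis=0)[:, None]               # candidate centres
        d = np.linalg.norm(X[:, None] - C[None], axis=2) + 1e-9
        p = 2 / (m - 1)
        U = (d ** -p) / (d ** -p).sum(axis=1, keepdims=True)
    allocation = U * np.asarray(demand_qty)[:, None]          # demand per centre
    return C, allocation
```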
