首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Assume that a dissimilarity measure between elements and subsets of the set being clustered is given. We define the transformation of the set of subsets under which each subset is transformed into the set of all elements whose dissimilarity to it is not greater than a given threshold. Then a cluster is defined as a fixed point of this transformation. Three well-known clustering strategies are considered from this point of view: hierarchical clustering, graph-theoretic methods, and conceptual clustering. For hierarchical clustering generalizations are obtained that allow for overlapping clusters and/or clusters not forming a cover. Three properties of dissimilarity are introduced which guarantee the existence of fixed points for each threshold. We develop the relation to the theory of quasi-concave set functions, to help give an additional interpretation of clusters.  相似文献   

2.
Classification and spatial methods can be used in conjunction to represent the individual information of similar preferences by means of groups. In the context of latent class models and using Simulated Annealing, the cluster-unfolding model for two-way two-mode preference rating data has been shown to be superior to a two-step approach of first deriving the clusters and then unfolding the classes. However, the high computational cost makes the procedure only suitable for small or medium-sized data sets, and the hypothesis of independent and normally distributed preference data may also be too restrictive in many practical situations. Therefore, an alternating least squares procedure is proposed, in which the individuals and the objects are partitioned into clusters, while at the same time the cluster centers are represented by unfolding. An enhanced Simulated Annealing algorithm in the least squares framework is also proposed in order to address the local optimum problem. Real and artificial data sets are analyzed to illustrate the performance of the model.  相似文献   

3.
Clustering the rows and columns of a contingency table   总被引:3,自引:2,他引:1  
A number of ways of investigating heterogeneity in a two-way contingency table are reviewed. In particular, we consider chi-square decompositions of the Pearson chi-square statistic with respect to the nodes of a hierarchical clustering of the rows and/or the columns of the table. A cut-off point which indicates significant clustering may be defined on the binary trees associated with the respective row and column cluster analyses. This approach provides a simple graphical procedure which is useful in interpreting a significant chi-square statistic of a contingency table.The author gratefully acknowledges the constructive comments of the referees and the editor.  相似文献   

4.
k consisting of k clusters, with k > 2. Bottom-up agglomerative approaches are also commonly used to construct partitions, and we discuss these in terms of worst-case performance for metric data sets. Our main contribution derives from a new restricted partition formulation that requires each cluster to be an interval of a given ordering of the objects being clustered. Dynamic programming can optimally split such an ordering into a partition Pk for a large class of objectives that includes min-diameter. We explore a variety of ordering heuristics and show that our algorithm, when combined with an appropriate ordering heuristic, outperforms traditional algorithms on both random and non-random data sets.  相似文献   

5.
A Binary Integer Program to Maximize the Agreement Between Partitions   总被引:1,自引:1,他引:0  
This research note focuses on a problem where the cluster sizes for two partitions of the same object set are assumed known; however, the actual assignments of objects to clusters are unknown for one or both partitions. The objective is to find a contingency table that produces maximum possible agreement between the two partitions, subject to constraints that the row and column marginal frequencies for the table correspond exactly to the cluster sizes for the partitions. This problem was described by H. Messatfa (Journal of Classification, 1992, pp. 5–15), who provided a heuristic procedure based on the linear transportation problem. We present an exact solution procedure using binary integer programming. We demonstrate that our proposed method efficiently obtains optimal solutions for problems of practical size. We would like to thank the Editor, Willem Heiser, and an anonymous reviewer for helpful comments that resulted in improvements of this article.  相似文献   

6.
We propose a non-negative real-valued model of hierarchical classes (HICLAS) for two-way two-mode data. Like the other members of the HICLAS family, the non-negative real-valued model (NNRV-HICLAS) implies simultaneous hierarchically organized classifications of all modes involved in the data. A distinctive feature of the novel model is that it yields continuous, non-negative real-valued reconstructed data, which considerably expands the application range of the HICLAS family. The expansion implies a major algorithmic challenge as it involves a move from the typical discrete optimization problems in HICLAS to a mixed discrete-continuous one. To solve this mixed discrete-continuous optimization problem, a two-stage algorithm combining a simulated annealing and an alternating local descent stage is proposed. Subsequently it is evaluated in a simulation study. Finally, the NNRVHICLAS model is applied to an empirical data set on anger.  相似文献   

7.
A permutation-based algorithm for block clustering   总被引:2,自引:1,他引:1  
Hartigan (1972) discusses the direct clustering of a matrix of data into homogeneous blocks. He introduces a stepwise divisive method for block clustering within a certain class of block structures which induce clustering trees for both row and column margins. While this class of structures is appealing, the stopping criterion for his method, which is based on asymptotic theory and the assumption that the individual elements of the data matrix are normally distributed, is quite restrictive. In this paper we propose a permutation-based algorithm for block clustering within the same class of block structures. By using permutation arguments to decide where to split and when to stop, our algorithm becomes applicable in a wide variety of cases, including matrices of categorical data and matrices of small-to-moderate size. In addition, our algorithm offers considerable flexibility in how block homogeneity is defined. The algorithm is studied in a series of simulation experiments on matrices of known structure, and illustrated in examples drawn from the fields of taxonomy, political science, and data architecture.  相似文献   

8.
This paper proposes a measure of spatial homogeneity for sets of d-dimensional points based on nearest neighbor distances. Tests for spatial uniformity are examined which assess the tendency of the entire data set to aggregate and evaluate the character of individual clusters. The sizes and powers of three statistical tests of uniformity against aggregation, regularity, and unimodality are studied to determine robustness. The paper also studies the effects of normalization and incorrect prior information. A percentile frame sampling procedure is proposed that does not require a sampling window but is superior to a toroidal frame and to buffer zone sampling in particular situations. Examples test two data sets for homogeneity and search the results of a hierarchical clustering for homogeneous clusters.This work was partially supported by NSF Grant ECS-8300204.  相似文献   

9.
Single linkage clusters on a set of points are the maximal connected sets in a graph constructed by connecting all points closer than a given threshold distance. The complete set of single linkage clusters is obtained from all the graphs constructed using different threshold distances. The set of clusters forms a hierarchical tree, in which each non-singleton cluster divides into two or more subclusters; the runt size for each single linkage cluster is the number of points in its smallest subcluster. The maximum runt size over all single linkage clusters is our proposed test statistic for assessing multimodality. We give significance levels of the test for two null hypotheses, and consider its power against some bimodal alternatives. Research partially supported by NSF Grant No. DMS-8617919.  相似文献   

10.
The problem of measuring the impact of individual data points in a cluster analysis is examined. The purpose is to identify those data points that have an influence on the resulting cluster partitions. Influence of a single data point is considered present when different cluster partitions result from the removal of the element from the data set. The Hubert and Arabie (1985) corrected Rand index was used to provide numerical measures of influence of a data point. Simulated data sets consisting of a variety of cluster structures and error conditions were generated to validate the influence measures. The results showed that the measure of internal influence was 100% accurate in identifying those data elements exhibiting an influential effect. The nature of the influence, whether beneficial or detrimental to the clustering, can be evaluated with the use of the gamma and point-biserial statistics.  相似文献   

11.
A validation study of a variable weighting algorithm for cluster analysis   总被引:1,自引:0,他引:1  
De Soete (1986, 1988) proposed a variable weighting procedure when Euclidean distance is used as the dissimilarity measure with an ultrametric hierarchical clustering method. The algorithm produces weighted distances which approximate ultrametric distances as closely as possible in a least squares sense. The present simulation study examined the effectiveness of the De Soete procedure for an applications problem for which it was not originally intended. That is, to determine whether or not the algorithm can be used to reduce the influence of variables which are irrelevant to the clustering present in the data. The simulation study examined the ability of the procedure to recover a variety of known underlying cluster structures. The results indicate that the algorithm is effective in identifying extraneous variables which do not contribute information about the true cluster structure. Weights near 0.0 were typically assigned to such extraneous variables. Furthermore, the variable weighting procedure was not adversely effected by the presence of other forms of error in the data. In general, it is recommended that the variable weighting procedure be used for applied analyses when Euclidean distance is employed with ultrametric hierarchical clustering methods.  相似文献   

12.
To reveal the structure underlying two-way two-mode object by variable data, Mirkin (1987) has proposed an additive overlapping clustering model. This model implies an overlapping clustering of the objects and a reconstruction of the data, with the reconstructed variable profile of an object being a summation of the variable profiles of the clusters it belongs to. Grasping the additive (overlapping) clustering structure of object by variable data may, however, be seriously hampered in case the data include a very large number of variables. To deal with this problem, we propose a new model that simultaneously clusters the objects in overlapping clusters and reduces the variable space; as such, the model implies that the cluster profiles and, hence, the reconstructed data profiles are constrained to lie in a lowdimensional space. An alternating least squares (ALS) algorithm to fit the new model to a given data set will be presented, along with a simulation study and an illustrative example that makes use of empirical data.  相似文献   

13.
In agglomerative hierarchical clustering, pair-group methods suffer from a problem of non-uniqueness when two or more distances between different clusters coincide during the amalgamation process. The traditional approach for solving this drawback has been to take any arbitrary criterion in order to break ties between distances, which results in different hierarchical classifications depending on the criterion followed. In this article we propose a variable-group algorithm that consists in grouping more than two clusters at the same time when ties occur. We give a tree representation for the results of the algorithm, which we call a multidendrogram, as well as a generalization of the Lance andWilliams’ formula which enables the implementation of the algorithm in a recursive way. The authors thank A. Arenas for discussion and helpful comments. This work was partially supported by DGES of the Spanish Government Project No. FIS2006–13321–C02–02 and by a grant of Universitat Rovira i Virgili.  相似文献   

14.
Optimization Strategies for Two-Mode Partitioning   总被引:2,自引:2,他引:0  
Two-mode partitioning is a relatively new form of clustering that clusters both rows and columns of a data matrix. In this paper, we consider deterministic two-mode partitioning methods in which a criterion similar to k-means is optimized. A variety of optimization methods have been proposed for this type of problem. However, it is still unclear which method should be used, as various methods may lead to non-global optima. This paper reviews and compares several optimization methods for two-mode partitioning. Several known methods are discussed, and a new fuzzy steps method is introduced. The fuzzy steps method is based on the fuzzy c-means algorithm of Bezdek (1981) and the fuzzy steps approach of Heiser and Groenen (1997) and Groenen and Jajuga (2001). The performances of all methods are compared in a large simulation study. In our simulations, a two-mode k-means optimization method most often gives the best results. Finally, an empirical data set is used to give a practical example of two-mode partitioning. We would like to thank two anonymous referees whose comments have improved the quality of this paper. We are also grateful to Peter Verhoef for providing the data set used in this paper.  相似文献   

15.
A preliminary study of optimal variable weighting in k-means clustering   总被引:2,自引:0,他引:2  
Recently, algorithms for optimally weighting variables in non-hierarchical and hierarchical clustering methods have been proposed. Preliminary Monte Carlo research has shown that at least one of these algorithms cross-validates extremely well.The present study applies a k-means, optimal weighting procedure to two empirical data sets and contrasts its cross-validation performance with that of unit (i.e., equal) weighting of the variables. We find that the optimal weighting procedure cross-validates better in one of the two data sets. In the second data set its comparative performance strongly depends on the approach used to find seed values for the initial k-means partitioning.The authors would like to acknowledge the support of the Citibank Fellowship from the Sol C. Snider Entrepreneurial Center at the Wharton School. The authors would like to express their appreciation to J. Douglas Carroll and Abba M. Kreiger for comments on an earlier version of the paper.  相似文献   

16.
17.
In this paper, we consider several generalizations of the popular Ward’s method for agglomerative hierarchical clustering. Our work was motivated by clustering software, such as the R function hclust, which accepts a distance matrix as input and applies Ward’s definition of inter-cluster distance to produce a clustering. The standard version of Ward’s method uses squared Euclidean distance to form the distance matrix. We explore the effect on the clustering of using other definitions of distance, such as the Minkowski distance.  相似文献   

18.
The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. In this work we evaluate empirically this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorithm properties; (ii) generalized ultrametrics, in terms of definition; and (iii) fast clustering through k-means partitioning, in terms of quality of results. For the latter, we carry out an in depth astronomical study. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we use clusterwise regression for this.  相似文献   

19.
The primary method for validating cluster analysis techniques is throughMonte Carlo simulations that rely on generating data with known cluster structure (e.g., Milligan 1996). This paper defines two kinds of data generation mechanisms with cluster overlap, marginal and joint; current cluster generation methods are framed within these definitions. An algorithm generating overlapping clusters based on shared densities from several different multivariate distributions is proposed and shown to lead to an easily understandable notion of cluster overlap. Besides outlining the advantages of generating clusters within this framework, a discussion is given of how the proposed data generation technique can be used to augment research into current classification techniques such as finite mixture modeling, classification algorithm robustness, and latent profile analysis.  相似文献   

20.
The distribution of lengths of phylogenetic trees under the taxonomic principle of parsimony is compared with the distribution obtained by randomizing the characters of the sequence data. This comparison allows us to define a measure of the extent to which sequence data contain significant hierarchical information. We show how to calculate this measure exactly for up to 10 taxa, and provide a good approximation for larger sets of taxa. The measure is applied to test sequences on 10 and 15 taxa.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号