Similar Documents
20 similar documents found.
1.
Short text clustering by finding core terms
A new clustering strategy, TermCut, is presented to cluster short text snippets by finding core terms in the corpus. We model the collection of short text snippets as a graph in which each vertex represents a piece of short text snippet and each weighted edge between two vertices measures the relationship between the two vertices. TermCut is then applied to recursively select a core term and bisect the graph such that the short text snippets in one part of the graph contain the term, whereas those snippets in the other part do not. We apply the proposed method on different types of short text snippets, including questions and search results. Experimental results show that the proposed method outperforms state-of-the-art clustering algorithms for clustering short text snippets.
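As an illustration of the core-term bisection idea described in this abstract, the following is a minimal Python sketch: it recursively picks a term and splits the collection into snippets that contain it and snippets that do not. The selection criterion used here (document frequency closest to an even split) is a readability stand-in, not the graph-based scoring that TermCut actually uses.

from collections import Counter

def term_cut(snippets, min_size=2):
    """Recursively bisect token lists on a 'core term' (illustrative sketch)."""
    if len(snippets) <= min_size:
        return [snippets]
    df = Counter(t for s in snippets for t in set(s))
    # Stand-in criterion: the term whose document frequency splits the
    # collection most evenly; TermCut scores terms on the snippet graph instead.
    core = min(df, key=lambda t: abs(df[t] - len(snippets) / 2))
    with_term = [s for s in snippets if core in s]
    without_term = [s for s in snippets if core not in s]
    if not with_term or not without_term:
        return [snippets]
    return term_cut(with_term, min_size) + term_cut(without_term, min_size)

docs = [["graph", "clustering"], ["graph", "cut"], ["text", "snippet"], ["short", "text"]]
print(term_cut(docs))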

2.
Many classical graph clustering algorithms suffer from drawbacks such as input parameters that are hard to determine, high time complexity, and low clustering accuracy. To address these issues, this paper proposes a parameter-free graph clustering algorithm based on core vertices (NGCC). The algorithm first assigns similar vertices to the same cluster, then uses the PageRank algorithm to discover core vertices and form initial clusters; the remaining unlabeled vertices are then assigned to produce the final cluster structure. Experimental results show that, without requiring any parameters, NGCC achieves clustering quality comparable to or better than the classical graph clustering algorithms it is compared against on data sets of different sizes, and is applicable to a wider range of scenarios.
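A rough Python sketch of the core-vertex idea described above, under the assumption that PageRank scores pick seed vertices and that every remaining vertex joins the cluster it shares the most neighbours with; this is an illustration of the abstract, not the NGCC algorithm itself (in particular, n_cores is a parameter here, whereas NGCC is parameter-free).

import networkx as nx

def core_vertex_clustering(G, n_cores=2):
    """Seed clusters at high-PageRank vertices, then attach the rest (sketch)."""
    ranks = nx.pagerank(G)
    cores = sorted(ranks, key=ranks.get, reverse=True)[:n_cores]
    clusters = {c: {c} for c in cores}
    for v in G:
        if v in cores:
            continue
        # Attach v to the core cluster it shares the most neighbours with.
        best = max(cores, key=lambda c: len(set(G[v]) & clusters[c]))
        clusters[best].add(v)
    return clusters

G = nx.karate_club_graph()
print({core: len(members) for core, members in core_vertex_clustering(G).items()})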

3.
孙琛琛  申德荣  寇月  聂铁铮  于戈 《软件学报》2016,27(9):2303-2319
Entity resolution is an important aspect of data quality and is indispensable for big data processing. Existing research on entity resolution focuses on data-object similarity algorithms, blocking techniques, and supervised entity resolution, while the match-decision problem in unsupervised entity resolution has rarely been addressed. This paper proposes a clustering algorithm for entity resolution to fill this gap. A weighted data-object similarity graph is built from the data objects and their similarities. During clustering, random walk with restart on the similarity graph is used to dynamically compute the similarity between clusters and nodes; the basic logic of the clustering is that a cluster iteratively absorbs the node closest to it. A data-object ordering method is proposed to optimize the clustering order and improve accuracy, and an optimized method for computing the stationary distribution of the random walk is proposed to reduce the algorithm's overhead. Comparative experiments on real and synthetic data sets verify the effectiveness of the algorithm.
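A minimal sketch of the restart random walk used as the cluster-to-node similarity above, assuming W is a row-normalised similarity matrix and the walk restarts at a single seed node; the paper's optimised computation of the stationary distribution is not reproduced here.

import numpy as np

def random_walk_with_restart(W, seed, restart=0.15, tol=1e-8):
    """Stationary distribution of a walk on W that restarts at `seed` (sketch)."""
    n = W.shape[0]
    e = np.zeros(n)
    e[seed] = 1.0
    p = e.copy()
    while True:
        p_next = (1 - restart) * W.T @ p + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

W = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
print(random_walk_with_restart(W, seed=0).round(3))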

4.
Research on network community detection algorithms based on node similarity
Community structure is one of the statistical properties of many complex networks, and mining the community structure in networks has attracted increasing attention. Community detection in networks is essentially similar to cluster analysis in traditional machine learning; the key issue is how to define the similarity between nodes in the network. This paper first proposes a node-splitting algorithm based on node similarity, SUN; compared with the traditional betweenness-based splitting algorithm GN, SGN shows clear improvements in both speed and accuracy. Then, after node similarities are obtained with various similarity measures, several classical clustering algorithms are applied to partition the network into communities. Experiments on simulated and real data show that the signal and regular methods, which are based on network topology information, outperform the Jaccard method, which is based on local node information. Moreover, for community detection in complex networks, if a good node-similarity construction method is chosen, existing similarity-matrix-based clustering algorithms can partition network communities quickly and effectively.
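To illustrate the pipeline described above (a node-similarity matrix fed to an off-the-shelf clustering algorithm), here is a small Python sketch using the Jaccard neighbourhood similarity mentioned in the abstract; the signal and regular similarity measures are not reproduced.

import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def jaccard_similarity_matrix(G):
    """Jaccard similarity between closed node neighbourhoods."""
    nodes = list(G)
    S = np.zeros((len(nodes), len(nodes)))
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            Nu, Nv = set(G[u]) | {u}, set(G[v]) | {v}
            S[i, j] = len(Nu & Nv) / len(Nu | Nv)
    return nodes, S

G = nx.karate_club_graph()
nodes, S = jaccard_similarity_matrix(G)
# Turn the similarity into a distance and hand it to hierarchical clustering.
D = squareform(1.0 - S, checks=False)
labels = fcluster(linkage(D, method="average"), t=2, criterion="maxclust")
print(dict(zip(nodes, labels)))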

5.
杜航原  张晶  王文剑   《智能系统学报》2020,15(6):1113-1120
To address the design of the consensus function in clustering ensembles, this paper proposes a deep self-supervised clustering ensemble algorithm. The algorithm first uses the weighted connected-triple algorithm to compute a similarity matrix between samples from the base clustering results; using this similarity matrix to express adjacency, it transforms the base clusterings from a data representation in feature space into a graph representation. On this basis, the consensus problem for the base clusterings is converted into a graph clustering problem on this graph representation. To this end, a self-supervised clustering ensemble model is built with graph neural networks: on the one hand, a graph autoencoder learns a low-dimensional embedding of the graph, and the target distribution of the clustering ensemble is estimated from the likelihood distribution of this embedding; on the other hand, the clustering ensemble objective guides the embedding process, ensuring that the learned graph embedding and the clustering ensemble result are consistently optimal. Simulation experiments on a large number of data sets show that, compared with algorithms such as HGPA, CSPA and MCLA, the proposed algorithm further improves the accuracy of the clustering ensemble results.

6.
Spectral clustering is a clustering method based on algebraic graph theory. It has attracted extensive attention from academia in recent years due to its solid theoretical foundation and good clustering performance. This paper introduces the basic concepts of graph theory, reviews the main matrix representations of a graph, then compares the objective functions of typical graph-cut methods and explores the nature of the spectral clustering algorithm. We also summarize the latest research achievements in spectral clustering and discuss several key issues, such as how to construct the similarity matrix and the Laplacian matrix, how to select eigenvectors, how to determine the number of clusters, and the applications of spectral clustering. Finally, we propose several valuable research directions in light of the deficiencies of current spectral clustering algorithms.
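The review above walks through the standard pipeline; as a reference point, here is a textbook normalised spectral clustering sketch in Python (similarity matrix to Laplacian to eigenvectors to k-means), assuming a connected similarity graph and not tied to any particular variant surveyed in the paper.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Normalised spectral clustering on a similarity matrix W (sketch)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalised Laplacian
    vals, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]                                        # k smallest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)       # row normalisation (NJW step)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)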

7.
ASK-GraphView: A large scale graph visualization system
We describe ASK-GraphView, a node-link-based graph visualization system that allows clustering and interactive navigation of large graphs, ranging in size up to 16 million edges. The system uses a scalable architecture and a series of increasingly sophisticated clustering algorithms to construct a hierarchy on an arbitrary, weighted undirected input graph. By lowering the interactivity requirements we can scale to substantially bigger graphs. The user is allowed to navigate this hierarchy in a top down manner by interactively expanding individual clusters. ASK-GraphView also provides facilities for filtering and coloring, annotation and cluster labeling.

8.
In this paper, we consider the problem of clustering and re-ranking web image search results so as to improve diversity at high ranks. We propose a novel ranking framework, namely cluster-constrained conditional Markov random walk (CCCMRW), which has two key steps: first, cluster images into topics, and then perform a Markov random walk on an image graph conditioned on the image cluster information. To cluster the retrieval results of web images, a novel graph clustering model is proposed. We explore the surrounding text to mine the correlations between words and images, and these correlations are used to improve clustering results. Two kinds of correlations are mainly considered: word-to-image and word-to-word correlations. As a standard text-processing technique, the tf-idf method cannot measure word-to-image correlation directly; we therefore propose to combine tf-idf with a novel word feature, namely visibility, to infer the word-to-image correlation. Using a latent Dirichlet allocation model, we define a topic relevance function to compute the weights of word-to-word correlations. Taking word-to-image correlations as heterogeneous links and word-to-word correlations as homogeneous links, graph clustering algorithms such as complex graph clustering and spectral co-clustering are used to cluster images into topics. To perform CCCMRW, a two-layer image graph is constructed by adding image cluster nodes as an upper layer on top of a base image graph. Conditioned on the cluster information from the upper layer, the Markov random walk is biased to walk across different image clusters, so as to give high rank scores to images from different topics and thereby gain diversity. Encouraging clustering and re-ranking results on Google image search results are reported.

9.
Attributed graph clustering, also known as community detection on attributed graphs, has attracted much interest recently due to the ubiquity of attributed graphs in real life. Many algorithms have been proposed for this problem, which are either distance based or model based. However, model selection in attributed graph clustering has not been well addressed; that is, most existing algorithms assume the cluster number to be known a priori. In this paper, we propose two efficient approaches for attributed graph clustering with automatic model selection. The first approach is a popular Bayesian nonparametric method, while the second is an asymptotic method based on a recently proposed model selection criterion, the factorized information criterion. Experimental results on both synthetic and real datasets demonstrate that our approaches for attributed graph clustering with automatic model selection significantly outperform the state-of-the-art algorithm.

10.
李金泽  徐喜荣  潘子琦  李晓杰 《计算机科学》2017,44(Z6):424-427, 450
Clustering algorithms have become a new research hotspot in machine learning in recent years. To enable clustering on sample spaces of arbitrary shape, researchers have proposed excellent algorithms such as spectral clustering and graph-theoretic clustering. This paper first introduces the basic ideas of the classical NJW spectral clustering algorithm and the NeiMu graph-theoretic clustering algorithm, and then proposes an improved adaptive NJW spectral clustering algorithm. The advantage of the adaptive NJW algorithm is that it automatically determines the number of clusters without parameter tuning, overcoming the drawbacks of the classical NJW algorithm, which requires the number of clusters to be set in advance and the parameter δ to be tuned repeatedly before a clustering result can be obtained. The adaptive NJW algorithm is compared with the classical NJW algorithm and with the NeiMu graph-theoretic clustering algorithm on UCI benchmark data sets and on measured data sets. Experimental results show that the adaptive NJW algorithm is convenient, fast, and practical.
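The abstract does not spell out how the adaptive NJW algorithm determines the number of clusters; one common automatic criterion is the eigengap of the normalised Laplacian, sketched below in Python as an illustration (the paper's actual criterion may differ).

import numpy as np

def estimate_k_by_eigengap(W, k_max=10):
    """Pick k at the largest gap in the Laplacian spectrum (generic heuristic)."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals = np.linalg.eigvalsh(L_sym)[:k_max]   # smallest k_max eigenvalues, ascending
    gaps = np.diff(vals)
    return int(np.argmax(gaps)) + 1            # k = position of the largest gap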

11.
Taking objects as the vertex set and using intuitionistic fuzzy numbers to characterize the relatedness and unrelatedness between objects as intuitionistic fuzzy edges, a semi-intuitionistic fuzzy graph model is established. Concepts such as generated subgraphs, degree, paths, relatedness cut graphs, order relations, and maximum spanning trees of semi-intuitionistic fuzzy graphs are defined. A clustering algorithm based on semi-intuitionistic fuzzy graphs is given and its complexity is analyzed. Cluster analysis based on semi-intuitionistic fuzzy graphs is carried out on a classical example; the results show that the algorithm has lower complexity than general intuitionistic fuzzy clustering algorithms, and is efficient, practical, and highly automated.

12.
Clustering problems are applicable to several areas of science. Approaches and algorithms are as varied as the applications. From a graph theory perspective, clustering can be generated by partitioning an input graph into a vertex-disjoint union of cliques (clusters) through the addition and deletion of edges. Finding the minimum number of edge additions and deletions required to cluster data that can be represented as graphs is a well-known problem in combinatorial optimization, often referred to as the cluster editing problem. However, many real-world clustering applications are characterized by overlapping clusters, that is, clusters that are non-disjoint; in these situations cluster editing cannot be applied. Literature concerning a relaxation of cluster editing in which clusters can overlap is scarce. In this work, we propose the overlapping cluster editing problem, a variation of cluster editing where the goal is to partition a graph, also by editing edges, into maximal cliques that are not necessarily disjoint. In addition, we present three slightly different versions of a hybrid heuristic to solve this problem. Each hybrid heuristic couples two metaheuristics that together generate a set of clusters with one of three mixed-integer linear programming models, also introduced in this paper, that uses these clusters as input. The purpose of the metaheuristics is to limit the solution space explored when solving the models, thereby reducing computational time. Test results show that all proposed hybrid heuristic versions are able to generate good-quality overlapping cluster editing solutions. In particular, one version of the hybrid heuristic achieved, at a low computational cost, the best results in 51 of 112 randomly-generated graphs. Although the other two hybrid heuristic versions have harder-to-solve models, they obtained reasonable results on medium-sized randomly-generated graphs. In addition, the hybrid heuristic achieved good results identifying labeled overlapping clusters in a supervised data set experiment. Furthermore, we also show that, with our new problem definition, clustering a vertex in more than one cluster can reduce the edge-editing cost.

13.
Subspace clustering finds sets of objects that are homogeneous in subspaces of high-dimensional datasets, and has been successfully applied in many domains. In recent years, a new breed of subspace clustering algorithms, which we denote as enhanced subspace clustering algorithms, has been proposed to (1) handle the increasing abundance and complexity of data and (2) improve the clustering results. In this survey, we present these enhanced approaches to subspace clustering by discussing the problems they solve, their cluster definitions, and their algorithms. Besides enhanced subspace clustering, we also present basic subspace clustering and related work in high-dimensional clustering.

14.
Distance and dissimilarity measures are basic concepts in cluster analysis and are at the core of many clustering algorithms. In classical cluster analysis, dissimilarity is measured by a simple function of distance. For data sets with mixed attributes, this paper proposes two distance definitions and generalizes the dissimilarity measure into a multivariate function of factors such as distance and cluster size, so that clustering algorithms that originally applied only to numeric or categorical data can be used on mixed-attribute data. Experimental results show that the new distance definitions and dissimilarity measures improve clustering quality.
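The two distance definitions themselves are not given in this abstract; as a baseline for what a mixed-attribute distance looks like, here is a simple Gower-style Python sketch that combines range-normalised numeric differences with categorical mismatches. The paper's measures additionally fold in factors such as cluster size, which this sketch omits.

def mixed_distance(x, y, numeric_idx, categorical_idx, ranges):
    """Average of normalised numeric differences and categorical mismatches (sketch)."""
    d = 0.0
    for i in numeric_idx:
        d += abs(x[i] - y[i]) / ranges[i]      # numeric contribution scaled to [0, 1]
    for i in categorical_idx:
        d += 0.0 if x[i] == y[i] else 1.0      # simple mismatch for categorical values
    return d / (len(numeric_idx) + len(categorical_idx))

x = [1.70, 65.0, "A"]
y = [1.80, 80.0, "B"]
print(mixed_distance(x, y, numeric_idx=[0, 1], categorical_idx=[2], ranges={0: 0.5, 1: 60.0}))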

15.
Tian  Juan  Yong-dong  Jin-tao 《Neurocomputing》2009,72(13-15):3203
Spectral clustering consists of two distinct stages: (a) construct an affinity graph from the dataset and (b) cluster the data points by finding an optimal partition of the affinity graph. The focus of this paper is the first step. Existing spectral clustering algorithms adopt the Gaussian function to define the affinity graph since it is easy to implement. However, the Gaussian function can hardly capture the intrinsic structure of the data, and it requires a scaling parameter whose selection is still an open issue in spectral clustering. Therefore, we propose a new definition of the affinity graph for spectral clustering from the graph partition perspective. In particular, we propose two consistencies that the affinity graph should hold, smooth consistency and constraint consistency, and then define the affinity graph respecting these consistencies in a regularization framework of ranking on manifolds. Moreover, the proposed definition of the affinity graph is applicable to both unsupervised and semi-supervised spectral clustering. Encouraging experimental results on synthetic and real-world data demonstrate the effectiveness of the proposed approach.
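For context, the Gaussian affinity construction that this abstract takes as its starting point is sketched below in Python; the scaling parameter sigma must be chosen by hand, which is precisely the issue the proposed consistency-based definition aims to avoid. The consistency-based construction itself is not reproduced here.

import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """W_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) on a data matrix X (sketch)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # no self-loops in the affinity graph
    return W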

16.
Large graphs are scale-free and ubiquitous, with irregular relationships. Clustering is used to find existing similar patterns in graphs and thus helps in gaining useful insights. In the real world, nodes may belong to more than one cluster; thus, it is essential to analyze the fuzzy cluster membership of nodes. Traditional centralized fuzzy clustering algorithms incur high communication cost and produce poor-quality clusters when used for large graphs. Thus, scalable solutions are needed to handle huge amounts of data in less computational time with minimum disk access. In this paper, we propose a parallel fuzzy clustering algorithm named PGFC for handling scalable graph data. From the viewpoint of expert systems, it is advantageous to develop a clustering algorithm that can ensure scalability along with better cluster quality for large graphs. The algorithm is parallelized using the bulk synchronous parallel (BSP)-based Pregel model. The cluster centers are initialized using a degree centrality measure, resulting in fewer iterations. The performance of PGFC is compared with other state-of-the-art clustering algorithms using synthetic graphs and real-world networks. The experimental results reveal that the proposed PGFC scales up linearly to handle large graphs and produces better-quality clusters than other graph clustering counterparts.

17.
To cluster web documents, all of which contain the same named entities, we attempted to use existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, it turned out that these algorithms are not effective for clustering such web documents. According to our intensive investigation, we found that clustering such web pages is more complicated because (1) the number of clusters (known as the ground truth) is larger than the two or three clusters typical of general clustering problems and (2) the clusters in the data set have extremely skewed distributions of cluster sizes. To overcome these problems, in this paper, we propose an effective clustering algorithm to boost the accuracy of K-means and spectral clustering algorithms. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to correctly cluster web documents. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
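As an illustration of the quantity driving the bisection and merge steps described above, the following Python sketch computes the normalised cut of a bipartition of an unweighted similarity graph; the full bisection-and-merge procedure is not reproduced.

import networkx as nx

def normalized_cut(G, A, B):
    """Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B) for a bipartition (sketch)."""
    cut = sum(1 for u, v in G.edges() if (u in A) != (v in A))
    vol_A = sum(deg for _, deg in G.degree(A))
    vol_B = sum(deg for _, deg in G.degree(B))
    return cut / vol_A + cut / vol_B

G = nx.karate_club_graph()
A = {n for n in G if n < 17}
B = set(G) - A
print(round(normalized_cut(G, A, B), 3))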

18.
We show how the quantum paradigm can be used to speed up unsupervised learning algorithms. More precisely, we explain how it is possible to accelerate learning algorithms by quantizing some of their subroutines. Quantization refers to the process that partially or totally converts a classical algorithm to its quantum counterpart in order to improve performance. In particular, we give quantized versions of clustering via minimum spanning tree, divisive clustering and k-medians that are faster than their classical analogues. We also describe a distributed version of k-medians that allows the participants to save on the global communication cost of the protocol compared to the classical version. Finally, we design quantum algorithms for the construction of a neighbourhood graph, outlier detection as well as smart initialization of the cluster centres.

19.
In ad hoc networks, performance degrades significantly as the size of the network grows. Network clustering, in which nodes are hierarchically organized on the basis of proximity, relieves this performance degradation. Finding a weakly connected dominating set (WCDS) is a promising approach to clustering wireless ad hoc networks. Finding the minimum WCDS in a unit disk graph is an NP-hard problem, and a host of approximation algorithms has been proposed. In this article, we first propose a centralized approximation algorithm called DLA-CC, based on distributed learning automata (DLA), for finding a near-optimal solution to the minimum WCDS problem. We then propose a DLA-based clustering algorithm called DLA-DC for clustering wireless ad hoc networks. The proposed cluster formation algorithm is a distributed implementation of DLA-CC, in which the dominator nodes and their closed neighbors assume the roles of cluster heads and cluster members, respectively. We compute the worst-case running time and message complexity of the clustering algorithm for finding a near-optimal cluster-head set, and argue that, by a proper choice of the learning rate, a trade-off can be made between the running time and message complexity of the algorithm and the cluster-head set size (clustering optimality). The simulation results show the superiority of the proposed algorithms over existing methods.

20.
Identification of the correct number of clusters and the appropriate partitioning technique are important considerations in clustering, for which several cluster validity indices, primarily utilizing the Euclidean distance, have been proposed in the literature. In this paper, a new measure of connectivity is incorporated into the definitions of seven cluster validity indices, namely the DB-index, Dunn-index, Generalized Dunn-index, PS-index, I-index, XB-index and SV-index, thereby yielding seven new cluster validity indices that are able to automatically detect clusters of any shape, size or convexity as long as they are well separated. Here connectivity is measured using a novel approach following the concept of the relative neighborhood graph. It is empirically established that incorporating the property of connectivity significantly improves the capability of these indices to identify the appropriate number of clusters. The well-known single-linkage and K-means clustering techniques are used as the underlying partitioning algorithms. Results on eight artificially generated and three real-life data sets show that the connectivity-based Dunn-index performs best compared with the other six indices. Comparisons are also made with the original versions of these seven cluster validity indices.
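Connectivity in the modified indices is measured on a relative neighborhood graph; a brute-force Python sketch of that construction is shown below (two points are joined iff no third point is closer to both of them than they are to each other). How the indices then use this graph is not reproduced here.

import numpy as np

def relative_neighborhood_graph(X):
    """Edges (p, q) with no r such that max(d(p, r), d(q, r)) < d(p, q) (sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    edges = []
    for p in range(n):
        for q in range(p + 1, n):
            if not any(max(D[p, r], D[q, r]) < D[p, q]
                       for r in range(n) if r not in (p, q)):
                edges.append((p, q))
    return edges

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(relative_neighborhood_graph(X))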
