Similar literature
1.
Efficiently searching for the top-k representative vertices is crucial for understanding the structure of large dynamic graphs. Recent studies show that communities formed by a vertex with a high local clustering coefficient and its neighbours can achieve faster information propagation as well as faster disease transmission. However, the local clustering coefficient, which measures the cliquishness of a vertex in its local neighbourhood, favours vertices with small degrees. To remedy this issue, in this paper we propose a new ranking measure, the weighted clustering coefficient (WCC) of vertices, which integrates both the local clustering coefficient and the degree. WCC not only inherits the properties of the local clustering coefficient but also approximately measures the density (i.e., average degree) of a vertex's neighbourhood subgraph. Thus, vertices with higher WCC are more likely to be representative. We study how to efficiently compute and monitor the top-k representative vertices based on WCC over large dynamic graphs. To reduce the search space, we propose a series of heuristic upper bounds for WCC that prune a large portion of disqualified vertices. We also develop an approximation algorithm that uses the Flajolet-Martin sketch to trade acceptable accuracy for enhanced efficiency. An efficient incremental algorithm for handling frequent updates in dynamic graphs is explored as well. Extensive experimental results on a variety of real-life graph datasets demonstrate the efficiency and effectiveness of our approaches.
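The ranking idea in this abstract can be illustrated with a minimal sketch. The abstract does not state the exact WCC formula, so the score below (degree multiplied by local clustering coefficient) is only an assumption chosen for illustration, and the names build_adjacency, weighted_score and top_k are hypothetical. The sketch shows why the plain local clustering coefficient cannot separate a vertex in a small triangle from a vertex in a larger clique, while a degree-weighted score can.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are themselves adjacent."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (d * (d - 1))

def weighted_score(adj, v):
    # ASSUMPTION: degree * local clustering coefficient stands in for WCC;
    # the paper's actual definition is not given in the abstract.
    return len(adj[v]) * local_clustering(adj, v)

def top_k(adj, k):
    """Return the k vertices with the highest score (ties broken by id)."""
    return sorted(adj, key=lambda v: (-weighted_score(adj, v), v))[:k]

# A 4-clique {1,2,3,4} joined by the edge (4,5) to a triangle {5,6,7}.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
adj = build_adjacency(edges)
print(top_k(adj, 3))  # [1, 2, 3]
```

Vertices 1 and 6 both have a local clustering coefficient of 1.0, but only the clique vertices reach the top of the degree-weighted ranking.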

2.
A novel graph-theoretic approach to data clustering is presented and its application to the image segmentation problem is demonstrated. The data to be clustered are represented by an undirected adjacency graph 𝒢 with arc capacities assigned to reflect the similarity between the linked vertices. Clustering is achieved by removing arcs of 𝒢 to form mutually exclusive subgraphs such that the largest inter-subgraph maximum flow is minimized. For graphs of moderate size (~2000 vertices), the optimal solution is obtained by partitioning a flow and cut equivalent tree of 𝒢, which can be efficiently constructed using the Gomory-Hu algorithm (1961). However, for larger graphs this approach is impractical. New theorems for subgraph condensation are derived and are then used to develop a fast algorithm which hierarchically constructs and partitions a partially equivalent tree of much reduced size. This algorithm results in an optimal solution equivalent to that obtained by partitioning the complete equivalent tree and is able to handle very large graphs with several hundred thousand vertices. The new clustering algorithm is applied to the image segmentation problem. The segmentation is achieved by effectively searching for closed contours of edge elements (equivalent to minimum cuts in 𝒢), which consist mostly of strong edges, while rejecting contours containing isolated strong edges. This method is able to accurately locate region boundaries and at the same time guarantees the formation of closed edge contours.

3.
An abstraction resilient to common malware obfuscation techniques is the call-graph. A call-graph is the representation of an executable file as a directed graph with labeled vertices, where the vertices correspond to functions and the edges to function calls. Unfortunately, most of the interesting graph comparison problems, including full-graph comparison and computing the largest common subgraph, are NP-hard. This makes the study and use of graphs in large-scale systems difficult. Existing work has focused only on offline clustering and has not addressed the issue of clustering streams of graphs. In this paper we present Classy, a scalable distributed system that clusters streams of large call-graphs for purposes including automated malware classification and supporting malware analysts. Since algorithms aimed at clustering sets are not suitable for clustering streams of objects, we propose the use of a clustering algorithm that relies on the notion of candidate clusters and reference samples therein. We demonstrate via thorough experimentation that this approach yields results very close to the offline optimal. Graph similarity is determined by computing a graph edit distance (GED) of pairs of graphs using an adapted version of simulated annealing. Furthermore, we present a novel lower bound for the GED. We also study the problem of approximating statistics of clusters of graphs when the distances of only a fraction of all possible pairs have been computed. Finally, we present results and statistics from a real production-side system that has clustered and contains more than 0.8 million graphs.

4.
Graph clustering has been successfully applied in various applications for finding similar patterns. Recently, deep learning-based autoencoders have been used efficiently for detecting disjoint clusters. However, in real-world graphs, vertices may belong to multiple clusters. Thus, it is necessary to analyze the membership of vertices in clusters. Furthermore, existing approaches are centralized and are inefficient in handling large graphs. In this paper, a deep learning-based model, ‘DFuzzy’, is proposed for finding fuzzy clusters in large graphs in a distributed environment. It performs clustering in three phases. In the first phase, pre-training is performed by initializing the candidate cluster centers. Then, fine-tuning is performed to learn the latent representations by mining the local information and capturing the structure using PageRank. Further, modularity is used to redefine clusters. In the last phase, reconstruction error is minimized and the final cluster centers are updated. Experiments are performed over real-life graph data, and the performance of DFuzzy is compared with four state-of-the-art clustering algorithms. Results show that DFuzzy scales up linearly to handle large graphs and produces better-quality clusters than state-of-the-art clustering algorithms. It is also observed that deep structures can help in obtaining better graph representations and provide improved clustering performance.

5.
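Multiscale segmentation of SAR images based on spectral clustering and mixture models   Total citations: 2, self-citations: 2, citations by others: 0
When spectral clustering is applied to synthetic aperture radar (SAR) image segmentation, the eigenvalues and eigenvectors of the Laplacian matrix are difficult to compute. To address this problem, a multiscale SAR image segmentation method is presented that combines the statistical information of the SAR image at multiple scales and comprises three steps: vertex coarsening, initial segmentation, and segmentation refinement. First, a sequence of coarsened graphs with a steadily decreasing number of vertices is used to approximate the graph obtained from the SAR image. Spectral clustering is then applied to the coarsened graph at the coarsest scale to obtain an initial segmentation. Finally, based on the statistical properties of the SAR image, the initial segmentation is refined scale by scale using class posterior probabilities estimated with a mixture model, yielding the final segmentation of the SAR image. Experimental results demonstrate the effectiveness of the method.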

6.
Networks with billions of vertices pose new challenges for performing graph analysis in a reasonable time. The clustering coefficient is an important analytical measure of networks such as social networks and biological networks. When computing the clustering coefficient in big graphs, existing distributed algorithms suffer from low efficiency: they may fail because they demand large amounts of memory, and even when they complete successfully, their execution time is not acceptable for real-world applications. We present a distributed MapReduce-based algorithm, called CCFinder, to efficiently compute the clustering coefficient in very big graphs. CCFinder is executed on Apache Spark, a scalable data processing platform. It efficiently detects existing triangles using our proposed data structure, called FONL, which is cached in the distributed memory provided by Spark and reused multiple times. As data items in the FONL are fine-grained and contain the minimum required information, CCFinder requires less storage space and has better parallelism than its competitors. To find the clustering coefficient, our solution to triangle counting is extended to carry degree information of the vertices in the appropriate places. We performed several experiments on a Spark cluster with 60 processors. The results show that CCFinder achieves acceptable scalability and outperforms six existing competitor methods. Four competitors are methods based on graph processing systems, i.e., the GraphX, NScale, NScaleSpark, and Pregel frameworks, and the other two are Cohen’s method and NodeIterator++, which are based on MapReduce.
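How triangle counting leads to clustering coefficients can be sketched on a single machine. The code below is not CCFinder and does not reproduce the FONL structure, whose layout the abstract does not describe; it only assumes the common trick of orienting each edge from the lower-ranked to the higher-ranked endpoint (rank by degree, then id) so that every triangle is found exactly once. The function names are hypothetical.

```python
from collections import defaultdict

def triangles_per_vertex(edges):
    """Count, for every vertex, the triangles it takes part in."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    def rank(v):
        # Order vertices by degree, breaking ties by vertex id.
        return (len(adj[v]), v)

    # Keep only neighbours that rank higher, so each triangle has one "lowest" corner.
    forward = {v: {w for w in adj[v] if rank(w) > rank(v)} for v in adj}

    tri = {v: 0 for v in adj}
    for u in adj:
        for v in forward[u]:
            for w in forward[u] & forward[v]:
                tri[u] += 1
                tri[v] += 1
                tri[w] += 1
    return adj, tri

def clustering_coefficients(edges):
    """Local clustering coefficient of every vertex from its triangle count and degree."""
    adj, tri = triangles_per_vertex(edges)
    cc = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        cc[v] = 2.0 * tri[v] / (d * (d - 1)) if d > 1 else 0.0
    return cc

# A 4-clique {1,2,3,4} joined by the edge (4,5) to a triangle {5,6,7}.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
adj, tri = triangles_per_vertex(edges)
cc = clustering_coefficients(edges)
print(sum(tri.values()) // 3, cc[4])  # 5 0.5
```

The degree information needed for the coefficient is simply the size of each adjacency set here; a distributed version would have to ship it alongside the triangle counts, which is the extension the abstract mentions.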

7.
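To address the low efficiency and low accuracy of graph data mining, a graph mining model is proposed that applies Markov clustering based on ring-type network motifs. First, a random graph is generated from the vertex set of the input graph using the Erdős-Rényi model. During subgraph mining on the input graph and the random graph, ring-type subgraphs are identified using the additive property of vectors, and the statistical features of the network motifs are computed to decide whether a subgraph is a network motif. Then, the association matrix of the absolute contribution values of the edges in the graph is computed, a threshold is obtained with a dynamic threshold method, and the matrix is binarized. Finally, expansion and inflation operations are applied to the sparsified graph until it reaches convergence. Experiments show that the model effectively reduces running time and improves graph mining efficiency while maintaining clustering quality.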
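The final stage named in this abstract, alternating expansion and inflation until convergence, is the core loop of Markov clustering. The sketch below covers only that loop on a small dense matrix; the motif detection and edge sparsification steps are not reproduced, and the self-loops, inflation exponent of 2, tolerance and cluster read-out are assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    """Expansion: square the matrix, i.e. take two steps of the random walk."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: raise entries to the power r and renormalize the columns."""
    return normalize_columns([[x ** r for x in row] for row in m])

def clusters_from_limit(m, eps=1e-6):
    """Group node j with every row i that still holds mass in column j."""
    n = len(m)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for j in range(n):
        for i in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(adjacency)
    # ASSUMPTION: add self-loops before normalizing, a common MCL convention.
    m = normalize_columns([[adjacency[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    return clusters_from_limit(m)

# Two triangles {0,1,2} and {3,4,5} connected by the single edge (2,3).
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(two_triangles))
```

Expansion spreads flow along longer walks, while inflation strengthens strong flows and starves weak ones, so the flow across the bridging edge dies out and the two triangles separate.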

8.
This paper describes a method for recognizing partially occluded objects in bin-picking tasks using eigenspace analysis. The method, referred to as the “eigen window” method, stores multiple partial appearances of an object in an eigenspace. Such partial appearances require a large amount of memory. Three measures on windows (detectability, uniqueness, and reliability) are developed to eliminate redundant windows and thereby reduce memory requirements. Using a pose clustering technique, the method determines the pose of an object and the object type itself. We have implemented the method and verified its validity.

9.
The visualization of dynamic graphs demands visually encoding at least three major data dimensions: vertices, edges, and time steps. Many of the state-of-the-art techniques can show an overview of vertices and edges but lack a data-scalable visual representation of the time aspect. In this paper, we address the problem of displaying dynamic graphs with a thousand or more time steps. Our proposed interleaved parallel edge splatting technique uses a time-to-space mapping and shows the complete dynamic graph in a static visualization. It provides an overview of all data dimensions, allowing for visually detecting time-varying data patterns; hence, it serves as a starting point for further data exploration. By applying clustering and ordering techniques on the vertices, edge splatting on the links, and a dense time-to-space mapping, our approach becomes visually scalable in all three dynamic graph data dimensions. We illustrate the usefulness of our technique by applying it to call graphs and US domestic flight data with several hundred vertices, several thousand edges, and more than a thousand time steps.

10.
In this paper we propose an encoding scheme and ad hoc operators for a genetic approach to hierarchical graph clustering. Given a connected graph whose vertices correspond to points within a Euclidean space and a fitness function, a hierarchy of graphs is generated in which each vertex corresponds to a connected subgraph of the graph below. Neither the number of clustering levels nor the number of clusters on each level is fixed a priori; both are subject to optimization.

11.
This paper contributes a method for combining sparse parallel graph algorithms with dense parallel linear algebra algorithms in order to understand dynamic graphs, including the temporal behavior of vertices. Our method is the first to cluster vertices in a dynamic graph based on arbitrary temporal behaviors. To implement this method, we develop a feature-based pipeline for dynamic graphs and apply Nonnegative Matrix Factorization (NMF) to these features. We demonstrate these steps with a sample of the Twitter mentions graph as well as a CAIDA network traffic graph. We contribute and analyze a parallel NMF algorithm, presenting both theoretical and empirical studies of performance. This work can be leveraged by graph/network analysts to understand the temporal behavior cluster structure and segmentation structure of dynamic graphs.

12.
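Li Shuguang, Zhou Tong. 《计算机科学》 (Computer Science), 2011, 38(11): 241-244
The bounded clustering problem arises from System S, a distributed stream processing system developed at IBM Research. The input is an undirected graph with vertex weights and edge weights, in which a number of vertices are designated as terminals. A subset of the vertex set is called a cluster. The cost of a cluster is the total weight of its vertices plus the total weight of the edges on its boundary. The bounded clustering problem asks for a clustering of all vertices such that the cost of each cluster does not exceed a given budget B, each cluster contains at most one terminal, and the total cost of all clusters is minimized. For the bounded clustering problem on graphs of bounded treewidth, a pseudo-polynomial-time exact algorithm is given. Modifying this algorithm with a rounding technique yields, in polynomial time, a (1+ε)-approximate solution in which the cost of each cluster is at most (1+ε)B, where ε is an arbitrarily small positive number. If each cluster is further required to contain exactly one terminal, the algorithm yields, in polynomial time, a (1+ε)-approximate solution in which the cost of each cluster is at most (2+ε)B.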
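The cost definition and the two constraints stated in this abstract can be checked with a few lines of code. The sketch below only evaluates a given clustering; it is not the treewidth-based algorithm of the paper, and the data layout (a dict of vertex weights, a dict of edge weights keyed by vertex pairs) and function names are assumptions made for illustration.

```python
def cluster_cost(cluster, vertex_weight, edge_weight):
    """Vertex weights inside the cluster plus weights of edges crossing its boundary."""
    members = set(cluster)
    inside = sum(vertex_weight[v] for v in members)
    boundary = sum(w for (u, v), w in edge_weight.items()
                   if (u in members) != (v in members))
    return inside + boundary

def total_cost(clustering, vertex_weight, edge_weight):
    """Objective value: the sum of the costs of all clusters."""
    return sum(cluster_cost(c, vertex_weight, edge_weight) for c in clustering)

def is_feasible(clustering, vertex_weight, edge_weight, terminals, budget):
    """Check that the clusters partition the vertices and respect both constraints."""
    covered = [v for c in clustering for v in c]
    if len(covered) != len(set(covered)) or set(covered) != set(vertex_weight):
        return False
    return all(cluster_cost(c, vertex_weight, edge_weight) <= budget
               and len(set(c) & set(terminals)) <= 1
               for c in clustering)

# A path a - b - c - d with unit vertex weights; a and d are terminals.
vertex_weight = {"a": 1, "b": 1, "c": 1, "d": 1}
edge_weight = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 2}
terminals = {"a", "d"}

split = [{"a", "b"}, {"c", "d"}]
whole = [{"a", "b", "c", "d"}]
print(total_cost(split, vertex_weight, edge_weight))                 # 6
print(is_feasible(split, vertex_weight, edge_weight, terminals, 3))  # True
print(is_feasible(whole, vertex_weight, edge_weight, terminals, 10)) # False
```

Note that a boundary edge is charged to both clusters it separates, which is why the split costs 6 in total while the single cluster would cost only 4 but violates the one-terminal rule.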

13.
This paper shows how to construct a generative model for graph structure through the embedding of the nodes of the graph in a vector space. We commence from a sample of graphs where the correspondences between nodes are unknown ab initio. We also work with graphs where there may be structural differences present, i.e. variations in the number of nodes in each graph and their edge structure. We characterise the graphs using the heat-kernel, which is obtained by exponentiating the Laplacian eigensystem with time. The idea underpinning the method is to embed the nodes of the graphs into a vector space by performing a Young-Householder decomposition of the heat-kernel into an inner product of node co-ordinate matrices. The co-ordinates of the nodes are determined by the eigenvalues and eigenvectors of the Laplacian matrix, together with a time-parameter which can be used to scale the embedding. Node correspondences are located by applying the Scott and Longuet-Higgins algorithm to the embedded nodes. We capture variations in graph structure using the covariance matrix for corresponding embedded point positions. We construct a point-distribution model for the embedded node positions using the eigenvalues and eigenvectors of the covariance matrix. We show how to use this model both to project individual graphs into the eigenspace of the point position covariance matrix and to fit the model to potentially noisy graphs to reconstruct the Laplacian matrix. We illustrate the utility of the resulting method for shape analysis using data from the Caltech–Oxford and COIL databases.

14.
We consider how continuous-time quantum walks can be used for graph matching. We focus in detail on both exact and inexact graph matching, and consider in depth the problem of measuring graph similarity. We commence by constructing an auxiliary graph, in which the two graphs to be matched are co-joined by a layer of indicator vertices (one for each potential correspondence between a pair of vertices). We simulate a continuous-time quantum walk in parallel on the two graphs. The layer of connecting indicator vertices in the auxiliary graph allows quantum interference to take place between the two walks. The interference amplitudes on the indicator vertices are determined by differences in the two walks, and can be used to calculate probabilities for matches between pairs of vertices from the graphs. By applying the Hungarian (Kuhn-Munkres) algorithm to these probabilities, we recover a correspondence mapping between the graphs. To calculate graph similarity, we combine these probabilities with edge-consistency information to give a consistency measure. Based on the consistency measure, we define two graph similarity measures, one of which requires correspondence matches while the second does not. We analyse our approach experimentally using synthetic and real-world graphs. This reveals that our method gives results that are intermediate between the most sophisticated iterative techniques available and simpler, less complex ones.

15.
Graph clustering     
In this survey we overview the definitions and methods for graph clustering, that is, finding sets of “related” vertices in graphs. We review the many definitions for what is a cluster in a graph and measures of cluster quality. Then we present global algorithms for producing a clustering for the entire vertex set of an input graph, after which we discuss the task of identifying a cluster for a specific seed vertex by local computation. Some ideas on the application areas of graph clustering algorithms are given. We also address the problematics of evaluating clusterings and benchmarking cluster algorithms.

16.
Coloring large graphs based on independent set extraction   Total citations: 1, self-citations: 0, citations by others: 1
This paper presents an effective approach (EXTRACOL) to coloring large graphs. The proposed approach uses a preprocessing method to extract large independent sets from the graph and a memetic algorithm to color the residual graph. Each preprocessing application identifies, with a dedicated tabu search algorithm, a number of pairwise disjoint independent sets of a given size in order to maximize the vertices removed from the graph. We evaluate EXTRACOL on the 11 largest graphs (with 1000 to 4000 vertices) of the DIMACS challenge benchmarks and show improved results for four very difficult graphs (DSJC1000.9, C2000.5, C2000.9, C4000.5). The behavior of the proposed algorithm is also analyzed.

17.
Large graphs are scale-free and ubiquitous, with irregular relationships. Clustering is used to find similar patterns in graphs and thus helps in gaining useful insights. In the real world, nodes may belong to more than one cluster; thus, it is essential to analyze the fuzzy cluster membership of nodes. Traditional centralized fuzzy clustering algorithms incur high communication cost and produce poor-quality clusters when used for large graphs. Thus, scalable solutions are needed to handle huge amounts of data in less computational time with minimum disk access. In this paper, we propose a parallel fuzzy clustering algorithm named ‘PGFC’ for handling scalable graph data. From the viewpoint of expert systems, it is advantageous to develop a clustering algorithm that can assure scalability along with better cluster quality for handling large graphs. The algorithm is parallelized using the bulk synchronous parallel (BSP) based Pregel model. The cluster centers are initialized using the degree centrality measure, resulting in fewer iterations. The performance of PGFC is compared with other state-of-the-art clustering algorithms using synthetic graphs and real-world networks. The experimental results reveal that the proposed PGFC scales up linearly to handle large graphs and produces better-quality clusters than other graph clustering counterparts.

18.
The task of discovering natural groupings of input patterns, or clustering, is an important aspect of machine learning and pattern analysis. In this paper, we study the widely used spectral clustering algorithm, which clusters data using eigenvectors of a similarity/affinity matrix derived from a data set. In particular, we aim to solve two critical issues in spectral clustering: (1) how to automatically determine the number of clusters, and (2) how to perform effective clustering given noisy and sparse data. An analysis of the characteristics of the eigenspace is carried out which shows that (a) not every eigenvector of a data affinity matrix is informative and relevant for clustering; (b) eigenvector selection is critical because using uninformative/irrelevant eigenvectors could lead to poor clustering results; and (c) the corresponding eigenvalues cannot be used for relevant eigenvector selection given a realistic data set. Motivated by the analysis, a novel spectral clustering algorithm is proposed which differs from previous approaches in that only informative/relevant eigenvectors are employed for determining the number of clusters and performing clustering. The key element of the proposed algorithm is a simple but effective relevance learning method which measures the relevance of an eigenvector according to how well it can separate the data set into different clusters. Our algorithm was evaluated using synthetic data sets as well as real-world data sets generated from two challenging visual learning problems. The results demonstrated that our algorithm is able to estimate the cluster number correctly and reveal the natural grouping of the input data/patterns even given sparse and noisy data.

19.
20.
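Traditional link-based methods for computing object similarity consider only the nodes within a single graph. Blondel et al. extended the problem to nodes in different graphs and proposed the Blondel algorithm, but its time and space complexity are too high for computing node similarity between large-scale graphs. How to compute the similarity between two graphs efficiently remains an open question. The B3 (block-based Blondel) algorithm is proposed: the graphs are first partitioned into blocks; each block is then treated as an independent unit, and the original Blondel algorithm is applied to compute the node similarity within blocks and the similarity between blocks; finally, the global similarity between arbitrary nodes is computed. The algorithm converges and greatly reduces time and space complexity. Experiments also confirm the effectiveness of the algorithm.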
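For orientation, the original iteration of Blondel et al. that B3 builds on can be sketched in a few lines. The update used below, S ← (B S Aᵀ + Bᵀ S A) / ‖·‖_F started from an all-ones matrix and read off after an even number of steps, is the standard form of that method; the block partitioning of B3 itself is not described in enough detail in the abstract to reproduce, so it is left out. The function names and the rounds parameter are assumptions for illustration.

```python
from math import sqrt

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain dense matrix product of two lists of rows."""
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]

def blondel_similarity(a, b, rounds=10):
    """Similarity matrix S, where S[i][j] scores node i of graph B against node j of graph A.

    a and b are adjacency matrices of directed graphs. Each round performs two
    update steps, so the result is always an even iterate.
    """
    n_a, n_b = len(a), len(b)
    at, bt = transpose(a), transpose(b)
    s = [[1.0] * n_a for _ in range(n_b)]
    for _ in range(2 * rounds):
        fwd = matmul(matmul(b, s), at)   # agreement of out-neighbours
        bwd = matmul(matmul(bt, s), a)   # agreement of in-neighbours
        s = [[f + g for f, g in zip(rf, rg)] for rf, rg in zip(fwd, bwd)]
        norm = sqrt(sum(x * x for row in s for x in row))
        if norm == 0.0:
            break  # graphs without edges give no similarity signal
        s = [[x / norm for x in row] for row in s]
    return s

# Directed path 0 -> 1 -> 2 compared with itself.
path = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
s = blondel_similarity(path, path)
for row in s:
    print([round(x, 3) for x in row])
```

Each step multiplies full n_B × n_A matrices, which is the time and space cost that makes the plain iteration impractical for large graphs and motivates working on blocks instead.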
