首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 531 毫秒
1.
We study the problem of answering k -hop reachability queries in a directed graph, i.e., whether there exists a directed path of length $k$ , from a source query vertex to a target query vertex in the input graph. The problem of $k$ -hop reachability is a general problem of the classic reachability (where $k=\infty $ ). Existing indexes for processing classic reachability queries, as well as for processing shortest path distance queries, are not applicable or not efficient for processing $k$ -hop reachability queries. We propose an efficient index for processing $k$ -hop reachability queries. Our experimental results on a wide range of real datasets show that our method is efficient and scalable in terms of both index construction and query processing.  相似文献   

2.
3.
Keyword search over XML data has attracted a lot of research efforts in the last decade, where one of the fundamental research problems is how to efficiently answer a given keyword query w.r.t. a certain query semantics. We found that the key factor resulting in the inefficiency for existing methods is that they all heavily suffer from the common-ancestor-repetition problem. In this paper, we propose a novel form of inverted list, namely the IDList; the IDList for keyword $k$ consists of ordered nodes that directly or indirectly contain $k$ . We then show that finding keyword query results based on the smallest lowest common ancestor and exclusive lowest common ancestor semantics can be reduced to ordered set intersection problem, which has been heavily optimized due to its application in areas such as information retrieval and database systems. We propose several algorithms that exploit set intersection in different directions and with or without using additional indexes. We further propose several algorithms that are based on hash search to simplify the operation of finding common nodes from all involved IDLists. We have conducted an extensive set of experiments using many state-of-the-art algorithms and several large-scale datasets. The results demonstrate that our proposed methods outperform existing methods by up to two orders of magnitude in many cases.  相似文献   

4.
Nowadays, clustering of massive datasets is a crucial part of many data-analytic tasks. Most of the available clustering algorithms have two shortcomings when used on big data: (1) a large group of clustering algorithms, e.g. \(k\) -means, has to keep the data in memory and iterate over the data many times which is very costly for big datasets, (2) clustering algorithms that run on limited memory sizes, especially the family of stream-clustering algorithms, do not have a parallel implementation to utilize modern multi-core processors and also they lack decent quality of results. In this paper, we propose an algorithm that combines parallel clustering with single-pass, stream-clustering algorithms. The aim is to make a clustering algorithm that utilizes maximum capabilities of a regular multi-core PC to cluster the dataset as fast as possible while resulting in acceptable quality of clusters. Our idea is to split the data into chunks and cluster each chunk in a separate thread. Then, the clusters extracted from chunks are aggregated at the final stage using re-clustering. Parameters of the algorithm can be adjusted according to hardware limitations. Experimental results on a 12-core computer show that the proposed method is much faster than its batch-processing equivalents (e.g. \(k\) -means++) and stream-based algorithms. Also, the quality of solution is often equal to \(k\) -means++, while it significantly dominates stream-clustering algorithms. Our solution also scales well with extra available cores and hence provides an effective and fast solution to clustering large datasets on multi-core and multi-processor systems.  相似文献   

5.
Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase curse of dimensionality. Data analysis often involves measuring the similarity between different examples. This sometimes becomes a problem, as many widely used metrics tend to concentrate in high-dimensional feature spaces. The reduced contrast makes it more difficult to distinguish between close and distant points, which renders many traditional distance-based learning methods ineffective. Secondary distances based on shared neighbor similarities have recently been proposed as one possible solution to this problem. However, these initial metrics failed to take hubness into account. Hubness is a recently described aspect of the dimensionality curse, and it affects all sorts of $k$ -nearest neighbor learning methods in severely negative ways. This paper is the first to discuss the impact of hubs on forming the shared neighbor similarity scores. We propose a novel, hubness-aware secondary similarity measure $simhub_s$ and an extensive experimental evaluation shows it to be much more appropriate for high-dimensional data classification than the standard $simcos_s$ measure. The proposed similarity changes the underlying $k$ NN graph in such a way that it reduces the overall frequency of label mismatches in $k$ -neighbor sets and increases the purity of occurrence profiles, which improves classifier performance. It is a hybrid measure, which takes into account both the supervised and the unsupervised hubness information. The analysis shows that both components are useful in their own ways and that the measure is therefore properly defined. This new similarity does not increase the overall computational cost, and the improvement is essentially ‘free’.  相似文献   

6.
We examine bounds on the locality of routing. A local routing algorithm makes a sequence of distributed forwarding decisions, each of which is made using only local information. Specifically, in addition to knowing the node for which a message is destined, an intermediate node might also know (1) its local neighbourhood (the subgraph corresponding to all network nodes within $k$ hops of itself, for some fixed $k$ ), (2) the node from which the message originated, and (3) the incoming port (which of its neighbours last forwarded the message). Our objective is to determine, as $k$ varies, which of these parameters are necessary and/or sufficient to permit local routing on a network modelled by a connected undirected graph. In particular, we establish tight bounds on $k$ for the feasibility of deterministic $k$ -local routing for various combinations of these parameters, as well as corresponding bounds on dilation (the worst-case ratio of actual route length to shortest path length).  相似文献   

7.
Although the \(k\)-NN classifier is a popular classification method, it suffers from the high computational cost and storage requirements it involves. This paper proposes two effective cluster-based data reduction algorithms for efficient \(k\)-NN classification. Both have low preprocessing cost and can achieve high data reduction rates while maintaining \(k\)-NN classification accuracy at high levels. The first proposed algorithm is called reduction through homogeneous clusters (RHC) and is based on a fast preprocessing clustering procedure that creates homogeneous clusters. The centroids of these clusters constitute the reduced training set. The second proposed algorithm is a dynamic version of RHC that retains all its properties and, in addition, it can manage datasets that cannot fit in main memory and is appropriate for dynamic environments where new training data are gradually available. Experimental results, based on fourteen datasets, illustrate that both algorithms are faster and achieve higher reduction rates than four known methods, while maintaining high classification accuracy.  相似文献   

8.
In order to overcome the premature convergence in particle swarm optimization (PSO), we introduce dynamical crossover, a crossover operator with variable lengths and positions, to PSO, which is briefly denoted as CPSO. To get rid of the drawbacks of only finding the convex clusters and being sensitive to the initial points in $k$ -means algorithm, a hybrid clustering algorithm based on CPSO is proposed. The difference between the work and the existing ones lies in that CPSO is firstly introduced into $k$ -means. Experimental results performing on several data sets illustrate that the proposed clustering algorithm can get completely rid of the shortcomings of $k$ -means algorithms, and acquire correct clustering results. The application in image segmentation illustrates that the proposed algorithm gains good performance.  相似文献   

9.
Finding cohesive subgroups is an important issue in studying social networks. Many models exist for defining cohesive subgraphs in social networks, such as clique, $k$ -clique, and $k$ -clan. The concept of $k$ -club is one of them. A $k$ -club of a graph is a maximal subset of the vertex set which induces a subgraph of diameter $k$ . It is a relaxation of a clique, which induces a subgraph of diameter $1$ . We conducted algorithmic studies on finding a $k$ -club of size as large as possible. In this paper, we show that one can find a $k$ -club of maximum size in $O^{*}(1.62^n)$ time where $n$ is the number of vertices of the input graph. We implemented a combinatorial branch-and-bound algorithm that finds a $k$ -club of maximum size and a new heuristic algorithm called IDROP given in this paper. To speed up the programs, we introduce a dynamic data structure called $k$ -DN which, under deletion of vertices from a graph, maintains for a given vertex $v$ the set of vertices at distances at most $k$ . From the experimental results that we obtained, we concluded that a $k$ -club of maximum size can be easily found in sparse graphs and dense graphs. Our heuristic algorithm finds, within reasonable time, $k$ -clubs of maximum size in most of experimental instances. The gap between the size of a $k$ -club of maximum size and a $k$ -club found by IDROP is a constant for the number of vertices that we are able to test.  相似文献   

10.
Wireless sensor networks (WSN) is a key enabling technique for achieving the vision of the Internet of Things. In many applications of WSN such as environmental monitoring and vehicle tracking, they may require to launch spatial queries for collecting and gathering sensory data for achieving certain goals. One such query is the \(K\) nearest neighbor (KNN) query, which aims to collect sensory data from \(k\) sensor nodes nearest to a certain query location. Techniques, namely the itinerary-based KNN query algorithms, are recently developed for facilitating KNN queries. Generally, these techniques propagate queries and collect data along a predetermined itinerary. However, query accuracy and boundary expansion are two challenges that are not well addressed. To mitigate these issues, in this paper, we propose a novel KNN query algorithm based on grid division routing in the setting of skewness distribution, where the itinerary is formed based on the connectivity of adjacent grid cells centers. This technique can achieve better query accuracy and cause less energy consumption by executing the query concurrently in subregions. Besides, the void region problem is well addressed based on the proximity of neighbor grid cells. Experiment result shows that our technique performs better in several aspects including query accuracy, data redundancy, and energy efficiency.  相似文献   

11.
Scaling skyline queries over high-dimensional datasets remains to be challenging due to the fact that most existing algorithms assume dimensional independence when establishing the worst-case complexity by discarding correlation distribution. In this paper, we present HashSkyline, a systematic and correlation-aware approach for scaling skyline queries over high-dimensional datasets with three novel features: First, it offers a fast hash-based method to prune non-skyline points by utilizing data correlation characteristics and speed up the overall skyline evaluation for correlated datasets. Second, we develop \(HashSkyline_{GPU}\), which can dramatically reduce the response time for anti-correlated and independent datasets by capitalizing on the parallel processing power of GPUs. Third, the HashSkyline approach uses the pivot cell-based mechanism combined with the correlation threshold to determine the correlation distribution characteristics for a given dataset, enabling adaptive configuration of HashSkyline for skyline query evaluation by auto-switching of \(HashSkyline_{CPU}\) and \(HashSkyline_{GPU}\). We evaluate the validity of HashSkyline using both synthetic datasets and real datasets. Our experiments show that HashSkyline consumes significantly less pre-processing cost and achieves significantly higher overall query performance, compared to existing state-of-the-art algorithms.  相似文献   

12.
In multi-task learning, there are roughly two approaches to discovering representations. The first is to discover task relevant representations, i.e., those that compactly represent solutions to particular tasks. The second is to discover domain relevant representations, i.e., those that compactly represent knowledge that remains invariant across many tasks. In this article, we propose a new approach to multi-task learning that captures domain-relevant knowledge by learning potential-based shaping functions, which augment a task’s reward function with artificial rewards. We address two key issues that arise when deriving potential functions. The first is what kind of target function the potential function should approximate; we propose three such targets and show empirically that which one is best depends critically on the domain and learning parameters. The second issue is the representation for the potential function. This article introduces the notion of $k$ -relevance, the expected relevance of a representation on a sample sequence of $k$ tasks, and argues that this is a unifying definition of relevance of which both task and domain relevance are special cases. We prove formally that, under certain assumptions, $k$ -relevance converges monotonically to a fixed point as $k$ increases, and use this property to derive Feature Selection Through Extrapolation of k-relevance (FS-TEK), a novel feature-selection algorithm. We demonstrate empirically the benefit of FS-TEK on artificial domains.  相似文献   

13.
Mining frequent patterns in a single network (graph) poses a number of challenges. Already only to match one path pattern to a network under subgraph isomorphism is NP-complete. Classical matching algorithms become intractable even for reasonably small patterns, on networks which are large or have a high average degree. Based on recent advances in parameterized complexity theory, we propose a novel miner for rooted trees in networks. The miner, for a fixed parameter $k$ k (maximal pattern size), can mine all rooted trees with delay linear in the size of the network and only mildly exponential in the fixed parameter $k$ k . This allows us to mine tractably, rooted trees, in large networks such as the WWW or social networks. We establish the practical applicability of our miner, by presenting an experimental evaluation on both synthetic and real-world data.  相似文献   

14.
Recently, a large amount of work has been devoted to the study of spectral clustering—a simple yet powerful method for finding structure in a data set using spectral properties of an associated pairwise similarity matrix. Most of the existing spectral clustering algorithms estimate only one cluster number or estimate non-unique cluster numbers based on eigengap criterion. However, the number of clusters not always exists one, and eigengap criterion lacks theoretical justification. In this paper, we propose non-unique cluster numbers determination methods based on stability in spectral clustering (NCNDBS). We first utilize the multiway normalized cut spectral clustering algorithm to cluster data set for a candidate cluster number $k$ . Then the ratio value of the multiway normalized cut criterion of the obtained clusters and the sum of the leading eigenvalues (descending sort) of the stochastic transition matrix is chosen as a standard to decide whether the $k$ is a reasonable cluster number. At last, by varying the scaling parameter in the Gaussian function, we judge whether the reasonable cluster number $k$ is also a stability one. By three stages, we can determine non-unique cluster numbers of a data set. The Lumpability theorem concluded by Meil $\breve{a}$ and Xu provides a theoretical base for our methods. NCNDBS can estimate non-unique cluster numbers of the data set successfully by illustrative experiments.  相似文献   

15.
Efficient processing of high-dimensional similarity joins plays an important role for a wide variety of data-driven applications. In this paper, we consider $\varepsilon $ -join variant of the problem. Given two $d$ -dimensional datasets and parameter $\varepsilon $ , the task is to find all pairs of points, one from each dataset that are within $\varepsilon $ distance from each other. We propose a new $\varepsilon $ -join algorithm, called Super-EGO, which belongs the EGO family of join algorithms. The new algorithm gains its advantage by using novel data-driven dimensionality re-ordering technique, developing a new EGO-strategy that more aggressively avoids unnecessary computation, as well as by developing a parallel version of the algorithm. We study the newly proposed Super-EGO algorithm on large real and synthetic datasets. The empirical study demonstrates significant advantage of the proposed solution over the existing state of the art techniques.  相似文献   

16.
The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. We study how the Voronoi diagram can be used for uncertain spatial data, which are inherent in scientific and business applications. Specifically, we propose the Uncertain-Voronoi diagram (or UV-diagram), which divides the data space into disjoint “UV-partitions”. Each UV-partition $P$ is associated with a set $S$ of objects, such that any point $q$ located in $P$ has the set $S$ as its nearest neighbor with nonzero probabilities. The UV-diagram enables queries that return objects with nonzero chances of being the nearest neighbor (NN) of a given point $q$ . It supports “continuous nearest-neighbor search”, which refreshes the set of NN objects of $q$ , as the position of $q$ changes. It also allows the analysis of nearest-neighbor information, for example, to find out the number of objects that are the nearest neighbors of any point in a given area. A UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation of a UV-diagram, by using a set of UV-cells. A UV-cell of an object $o$ is the extent $e$ for which $o$ can be the nearest neighbor of any point $q \in e$ . We study how to speed up the derivation of UV-cells by considering its nearby objects. We also use the UV-cells to design the UV-index, which supports different queries, and can be constructed in polynomial time. We have performed extensive experiments on both real and synthetic data to validate the efficiency of our approaches.  相似文献   

17.
Query result clustering has attracted considerable attention as a means of providing users with a concise overview of results. However, little research effort has been devoted to organizing the query results for entities which refer to real-world concepts, e.g., people, products, and locations. Entity-level result clustering is more challenging because diverse similarity notions between entities need to be supported in heterogeneous domains, e.g., image resolution is an important feature for cameras, but not for fruits. To address this challenge, we propose a hybrid relationship clustering algorithm, called Hydra, using co-occurrence and numeric features. Algorithm Hydra captures diverse user perceptions from co-occurrence and disambiguates different senses using feature-based similarity. In addition, we extend Hydra into ${\mathsf{Hydra }_\mathsf{gData }}$ Hydra gData with different sources, i.e., entity types and crowdsourcing. Experimental results show that the proposed algorithms achieve effectiveness and efficiency in real-life and synthetic datasets.  相似文献   

18.
Time-series classification (TSC) problems present a specific challenge for classification algorithms: how to measure similarity between series. A shapelet is a time-series subsequence that allows for TSC based on local, phase-independent similarity in shape. Shapelet-based classification uses the similarity between a shapelet and a series as a discriminatory feature. One benefit of the shapelet approach is that shapelets are comprehensible, and can offer insight into the problem domain. The original shapelet-based classifier embeds the shapelet-discovery algorithm in a decision tree, and uses information gain to assess the quality of candidates, finding a new shapelet at each node of the tree through an enumerative search. Subsequent research has focused mainly on techniques to speed up the search. We examine how best to use the shapelet primitive to construct classifiers. We propose a single-scan shapelet algorithm that finds the best $k$ shapelets, which are used to produce a transformed dataset, where each of the $k$ features represent the distance between a time series and a shapelet. The primary advantages over the embedded approach are that the transformed data can be used in conjunction with any classifier, and that there is no recursive search for shapelets. We demonstrate that the transformed data, in conjunction with more complex classifiers, gives greater accuracy than the embedded shapelet tree. We also evaluate three similarity measures that produce equivalent results to information gain in less time. Finally, we show that by conducting post-transform clustering of shapelets, we can enhance the interpretability of the transformed data. We conduct our experiments on 29 datasets: 17 from the UCR repository, and 12 we provide ourselves.  相似文献   

19.
We investigate data parallel techniques for belief propagation in acyclic factor graphs on multi-core systems. Belief propagation is a key inference algorithm in factor graph, a probabilistic graphical model that has found applications in many domains. In this paper, we explore data parallelism for basic operations over the potential tables in belief propagation. Data parallel techniques for these table operations are developed for shared memory platforms. We then propose a complete belief propagation algorithm using these table operations to perform exact inference in factor graphs. The proposed algorithms are implemented on state-of-the-art multi-socket multi-core systems with additional NUMA-aware optimizations. Our proposed algorithms exhibit good scalability using a representative set of factor graphs. On a four-socket Intel Westmere-EX system with 40 cores, we achieve 39.5 $\times $ speedup for the table operations and 39 $\times $ speedup for the complete algorithm using factor graphs with large potential tables.  相似文献   

20.
Two identical (anonymous) mobile agents start from arbitrary nodes of an unknown tree and have to meet at some node. Agents move in synchronous rounds: in each round an agent can either stay at the current node or move to one of its neighbors. We consider deterministic algorithms for this rendezvous task. The main result of this paper is a tight trade-off between the optimal time of completing rendezvous and the size of memory of the agents. For agents with $k$ memory bits, we show that optimal rendezvous time is $\Theta (n+n^2/k)$ in $n$ -node trees. More precisely, if $k \ge c\log n$ , for some constant $c$ , we design agents accomplishing rendezvous in arbitrary trees of size $n$ (unknown to the agents) in time $O(n+n^2/k)$ , starting with arbitrary delay. We also show that no pair of agents can accomplish rendezvous in time $o(n+n^2/k)$ , even in the class of lines of known length and even with simultaneous start. Finally, we prove that at least logarithmic memory is necessary for rendezvous, even for agents starting simultaneously in a $n$ -node line.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号