Found 20 similar articles (search time: 0 ms)
Recently, due to the imprecise nature of the data generated from a variety of streaming applications, such as sensor networks, query processing on uncertain data streams has become an important problem. However, all the existing works on uncertain data streams study unbounded streams. In this paper, we take the first step towards the important and challenging problem of answering sliding-window queries on uncertain data streams, with a focus on one of the most important types of queries: top-k queries. It is nontrivial to find an efficient solution for answering sliding-window top-k queries on uncertain data streams, because challenges not only stem from the strict space and time requirements of processing both arriving and expiring tuples in high-speed streams, but also arise from the exponential blowup in the number of possible worlds induced by the uncertain data model. In this paper, we design a unified framework for processing sliding-window top-k queries on uncertain streams. We show that all the existing top-k definitions in the literature can be plugged into our framework, resulting in several succinct synopses that use space much smaller than the window size while remaining highly efficient in terms of processing time. We also extend our framework to answering multiple top-k queries. In addition to the theoretical space and time bounds that we prove for these synopses, we present a thorough experimental report to verify their practical efficiency on both synthetic and real data.

Evaluating refined queries in top-k retrieval systems (total citations: 2; self-citations: 0; citations by others: 2)
In many applications, users specify target values for certain attributes/features without requiring exact matches to these values in return. Instead, the result is typically a ranked list of "top k" objects that best match the specified feature values. User subjectivity is an important aspect of such queries, i.e., which objects are relevant to the user and which are not depends on the perception of the user. Due to the subjective nature of top-k queries, the answers returned by the system to a user query often do not satisfy the user's need right away, either because the weights and the distance functions associated with the features do not accurately capture the user's perception, or because the specified target values do not fully capture her information need, or both. In such cases, the user would like to refine the query and resubmit it in order to get back a better set of answers. While there has been a lot of research on query refinement models, there is no work that we are aware of on supporting refinement of top-k queries efficiently in a database system. Done naively, each "refined" query can be treated as a "starting" query and evaluated from scratch. We explore alternative approaches that significantly reduce the cost of evaluating refined queries by exploiting the observation that the refined queries are not modified drastically from one iteration to another. Our experiments over a real-life multimedia data set show that the proposed techniques save more than 80 percent of the execution cost of refined queries over the naive approach and are more than an order of magnitude faster than a simple sequential scan.

Optimizing top-k selection queries over multimedia repositories (total citations: 2; self-citations: 0; citations by others: 2)
Repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. A query on these attributes will typically request not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, which indicates how well the object matches the selection condition (ranking). Furthermore, unlike in the relational model, users may just want the k top-ranked objects for their selection queries for a relatively small k. In addition to the differences in the query model, another peculiarity of multimedia repositories is that they may allow access to the attributes of each object only through indexes. We investigate how to optimize the processing of top-k selection queries over multimedia repositories. The access characteristics of the repositories and the above query model lead to novel issues in query optimization. In particular, the choice of the indexes used to search the repository strongly influences the cost of processing the filtering condition. We define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. Although the general problem of picking an optimal plan in the search-minimal execution space is NP-hard, we present an efficient algorithm that solves the problem optimally with respect to our cost model and execution space when the predicates in the query are independent. We also show that the problem of optimizing top-k selection queries can be viewed, in many cases, as that of evaluating more traditional selection conditions. Thus, both problems can be viewed together as an extended filtering problem to which techniques of query processing and optimization may be adapted.

In this paper, we propose parallel processing of continuous queries over data streams to relieve the bottleneck of single-processor DSMSs. Queries are executed in parallel over the logical machines in a multiprocessing environment. Parallel execution of operators is scheduled by finding the shortest path in a weighted graph called the Query Mega Graph (QMG), which is a logical view of K machines. Over time, the number of tuples waiting in the queues of different operators may differ greatly. When a queue becomes full, re-scheduling is done by updating the edge weights of the QMG. In the newly computed path, machines with a heavier workload are used less. The proposed system is formally presented and its correctness is proved. It is also modeled in Petri nets, and its performance is evaluated and compared with serial query processing as well as the Min-Latency scheduling algorithm. The presented system is shown to outperform them w.r.t. tuple latency (response time), memory usage, throughput, and tuple loss, which are critical parameters in any data stream management system.

One of the primary issues confronting XML message brokers is the difficulty associated with processing a large set of continuous XPath queries over incoming XML streams. This paper proposes a novel system designed to present an effective solution to this problem. The proposed system transforms multiple XPath queries before run-time into a new data structure, called an XP-table, by sharing their common constraints. An XP-table is matched with a stream relation (SR) transformed from a target XML stream by a SAX parser. This arrangement is intended to minimize the run-time workload of continuous query processing. In addition, an early-query-termination strategy is proposed as an improved alternative to the basic approach. It optimizes query processing by adaptively arranging the evaluation sequence of the member-lists (m-lists) of an XP-table and offers increased efficiency, especially in cases of low selectivity. System performance is estimated and verified through a variety of experiments, including comparisons with previous approaches such as YFilter and LazyDFA. The proposed system scales practically linearly and is stable when evaluating a set of XPath queries in a continuous and timely fashion.

Quantile computation has many applications including data mining and financial data analysis. It has been shown that an $\epsilon$-approximate summary can be maintained so that, given a quantile query $(\phi, \epsilon)$, the data item at rank $\lceil \phi N \rceil$ may be approximately obtained within the rank error precision $\epsilon N$ over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different $\phi$ and $\epsilon$ poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
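
To make the query semantics in the abstract above concrete, here is a minimal Python sketch of what it means for an answer to satisfy a quantile query with parameters phi and epsilon. It sorts the full data, which is exactly what a streaming summary avoids, so it only illustrates the definition and is not the summary structure or clustering technique the abstract describes. The function names `target_rank` and `is_eps_approximate` are hypothetical.

```python
import math

def target_rank(phi, n):
    """1-indexed rank ceil(phi * n) targeted by a phi-quantile query."""
    return max(1, math.ceil(phi * n))

def is_eps_approximate(items, answer, phi, eps):
    """Return True if `answer` has a rank within eps * n of the target rank."""
    data = sorted(items)
    n = len(data)
    r = target_rank(phi, n)
    # With duplicates, one value occupies a whole range of ranks [lo, hi].
    lo = sum(1 for x in data if x < answer) + 1
    hi = sum(1 for x in data if x <= answer)
    if hi < lo:
        # The candidate does not occur in the data at all.
        return False
    return lo - eps * n <= r <= hi + eps * n
```

For example, over the values 1 to 100 with phi = 0.5 and epsilon = 0.05, the target rank is 50, so 53 is an acceptable answer and 60 is not.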

Multi-dimensional top-k dominating queries (total citations: 1; self-citations: 0; citations by others: 1)
The top-k dominating query returns k data objects which dominate the highest number of objects in a dataset. This query is an important tool for decision support since it provides data analysts an intuitive way for finding significant objects. In addition, it combines the advantages of top-k and skyline queries without sharing their disadvantages: (i) the output size can be controlled, (ii) no ranking functions need to be specified by users, and (iii) the result is independent of the scales at different dimensions. Despite their importance, top-k dominating queries have not received adequate attention from the research community. This paper is an extensive study on the evaluation of top-k dominating queries. First, we propose a set of algorithms that apply to indexed multi-dimensional data. Second, we investigate query evaluation on data that are not indexed. Finally, we study a relaxed variant of the query which considers dominance in dimensional subspaces. Experiments using synthetic and real datasets demonstrate that our algorithms significantly outperform a previous skyline-based approach. We also illustrate the applicability of this multi-dimensional analysis query by studying the meaningfulness of its results on real data.
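
The query definition in the abstract above can be illustrated with a naive quadratic scan. This Python sketch is only a baseline that states the semantics; it is not one of the indexed or non-indexed algorithms the paper proposes. It assumes that smaller values are better in every dimension, and the names `dominates` and `top_k_dominating` are hypothetical.

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one.

    Assumption: smaller values are preferred in all dimensions.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def top_k_dominating(points, k):
    """Return (dominance_count, point) for the k most dominating points."""
    scored = [(sum(dominates(p, q) for q in points), p) for p in points]
    # Highest count first; ties are broken by the point for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k]
```

For the points (1,1), (2,2), (3,3), (1,3), (3,1), the point (1,1) dominates all four others, so it ranks first regardless of how the dimensions are scaled.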

The flexibility of the XML data model allows a more natural representation of uncertain data compared with the relational model. Matching twig patterns against XML data is a fundamental problem in querying information from XML documents. For a probabilistic XML document, each twig answer has a probabilistic value because of the uncertainty of data. The twig answers that have small probabilistic values are useless to the users, and usually users only want to get the answers with the k largest probabilistic values. To this end, existing algorithms for ordinary XML documents are not directly applicable, due to the need for handling probability distributional nodes and efficient calculation of top-k probabilities of answers in probabilistic XML. In this paper, we address the problem of finding twig answers with top-k probabilistic values against probabilistic XML documents directly. We propose a new encoding scheme called PEDewey for probabilistic XML. Based on this encoding scheme, we then design two algorithms for finding answers of top-k probabilities for twig queries. One is called ProTJFast, which processes probabilistic XML data based on element streams in document order, and the other is called PTopKTwig, which is based on element streams ordered by the path probability values. Experiments have been conducted to study the performance of these algorithms.

The top-k query is employed in a wide range of applications to generate a ranked list of data that have the highest aggregate scores over certain attributes. As the pool of attributes for selection by individual queries may be large, the data are indexed with per-attribute sorted lists, and a threshold algorithm (TA) is applied on the lists involved in each query. The TA executes in two phases: find a cut-off threshold for the top-k result scores, then evaluate all the records that could score above the threshold. In this paper, we focus on exact top-k queries that involve monotonic linear scoring functions over disk-resident sorted lists. We introduce a model for estimating the depths to which each sorted list needs to be processed in the two phases, so that (most of) the required records can be fetched efficiently through sequential or batched I/Os. We also devise a mechanism to quickly rank the data that qualify for the query answer and to eliminate those that do not, in order to reduce the computation demand of the query processor. Extensive experiments with four different datasets confirm that our schemes achieve substantial performance speed-up of between two times and two orders of magnitude over existing TAs, at the expense of a memory overhead of 4.8 bits per attribute value. Moreover, our scheme is robust to different data distributions and query characteristics.
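
As background for the abstract above, the following Python sketch shows a simple in-memory threshold algorithm over per-attribute sorted lists. It interleaves sorted access with random access and stops once the k-th best score reaches the threshold. It is not the disk-oriented, depth-estimating scheme the paper proposes. It assumes every record appears in every list, that weights are non-negative (so the linear score is monotonic), and the name `threshold_topk` is hypothetical.

```python
import heapq

def threshold_topk(lists, weights, k):
    """Top-k records by weighted sum over per-attribute sorted lists.

    lists[i] holds (record_id, value) pairs sorted by value, descending.
    Assumptions: every record appears in every list; weights are >= 0.
    """
    lookup = [dict(lst) for lst in lists]  # simulates random access

    def score(rid):
        return sum(w * lk[rid] for w, lk in zip(weights, lookup))

    seen = set()
    top = []  # min-heap of (score, record_id), at most k entries
    for depth in range(len(lists[0])):
        for lst in lists:
            rid = lst[depth][0]
            if rid not in seen:
                seen.add(rid)
                heapq.heappush(top, (score(rid), rid))
                if len(top) > k:
                    heapq.heappop(top)
        # Upper bound on the score of any record not yet seen.
        threshold = sum(w * lst[depth][1] for w, lst in zip(weights, lists))
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)
```

The early exit is what makes the sorted lists useful: in the small example used to check this sketch, the top-1 answer is confirmed after reading two of the three positions in each list.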

Frequent pattern mining in data streams is an important research topic in the data mining community. In previous studies, a minimum support threshold was assumed to be available for mining frequent patterns. However, setting such a threshold is typically difficult. Hence, it is more reasonable to ask users to set a bound on the result size. The present study considers mining top-k frequent patterns from data streams using a sliding window technique. A single-pass algorithm, called MSWTP, is developed for the generation of top-k frequent patterns without a threshold. In the method, the content of the transactions in the sliding window is incrementally maintained in a summary data structure, named SWTP-tree, by scanning the stream only once. To make the mining operation efficient, insignificant patterns are distinguished from others by applying the Chernoff bound. Two kinds of obsolete pattern and one kind of insignificant pattern are periodically pruned from the pattern tree. Whenever necessary, the k most frequent patterns can be selected from SWTP-tree in order of their descending frequency. The performance of the proposed technique is evaluated via simulation experiments. The results show that the proposed method is both efficient and scalable, and that it outperforms comparable algorithms.
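
The sliding-window idea in the abstract above can be sketched in a much simpler setting: exact counts of single items, rather than patterns, over the last W arrivals. This Python sketch is not MSWTP or the SWTP-tree, and it does no Chernoff-bound pruning; it only shows how arrivals and expirations are handled in one pass. The class name `SlidingTopK` is hypothetical.

```python
from collections import Counter, deque

class SlidingTopK:
    """Exact top-k frequent items over the most recent `window_size` items."""

    def __init__(self, window_size):
        self.size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, item):
        """Process one arrival and expire the oldest item if needed."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]  # keep the summary free of dead items

    def top_k(self, k):
        """Return up to k (item, count) pairs, most frequent first."""
        return self.counts.most_common(k)
```

This version stores the whole window, so its memory grows with W; a summary structure of the kind the abstract describes exists precisely to avoid that cost.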

The fast development of GPS-equipped devices has led to widespread use of spatial keyword querying in location-based services. Existing spatial keyword query methodologies mainly focus on spatial and textual similarities, while ignoring the semantic understanding of keywords in spatial Web objects and queries. To address this issue, this paper studies the problem of semantic-based spatial keyword querying. It seeks to return the k objects most similar to the query, subject to not only their spatial and textual properties, but also the coherence of their semantic meanings. To achieve that, we propose novel indexing structures, which integrate spatial, textual and semantic information in a hierarchical manner, so as to prune the search space effectively in query processing. Extensive experiments are carried out to evaluate and compare them with other baseline algorithms.

Recently, due to intrinsic characteristics in many underlying data sets, a number of probabilistic queries on uncertain data have been investigated. Top-k dominating queries are very important in many applications including decision making in a multidimensional space. In this paper, we study the problem of efficiently computing top-k dominating queries on uncertain data. We first formally define the problem. Then, we develop an efficient, threshold-based algorithm to compute the exact solution. To overcome some inherent computational deficiency in an exact computation, we develop an efficient randomized algorithm with an accuracy guarantee. Our extensive experiments demonstrate that both algorithms are quite efficient, while the randomized algorithm is quite scalable against data set sizes, object areas, k values, etc. The randomized algorithm is also highly accurate in practice.

Computer Networks, 2008, 52(14): 2605-2622
Top-k queries are desired aggregation operations on datasets. Examples of queries on network data include finding the top 100 source Autonomous Systems (AS), top 100 ports, or top domain names over IP packets or over IP flow records. Since the complete dataset is often not available or not feasible to examine, we are interested in processing top-k queries from samples. If all records can be processed, the top-k items can be obtained by counting the frequency of each item. Even when the full dataset is observed, however, resources are often insufficient for such counting, so techniques were developed to overcome this issue. When we can observe only a random sample of the records, an orthogonal complication arises: the top frequencies in the sample are biased estimates of the actual top-k frequencies. This bias depends on the distribution and must be accounted for when seeking the actual value. We address this by designing and evaluating several schemes that derive rigorous confidence bounds for top-k estimates. Simulations on various datasets, including IP flow data, show that schemes exploiting more of the structure of the sample distribution produce much tighter confidence intervals with an order of magnitude fewer samples than simpler schemes that utilize only the sampled top-k frequencies. The simpler schemes, however, are more efficient in terms of computation.

Preference query processing is important for a wide range of applications involving distributed databases, such as network monitoring, web-based systems, and market analysis. In such applications, data objects are generated frequently and massively, which presents an important and challenging problem of continuous query processing over distributed data stream environments. A top-k dominating query, which has been receiving much research attention recently, returns the k data objects that dominate the highest number of data objects in a given dataset, and due to its dominance-based ranking function, we can easily obtain superior data objects. An emerging requirement in distributed stream environments is an efficient technique for continuously monitoring top-k dominating data objects. Despite this, no study has addressed the problem. In this paper, therefore, we address the problem of continuous top-k dominating query processing over distributed data stream environments. We present two algorithms that monitor the exact top-k dominating data and efficiently eliminate data objects that cannot qualify for the result, which reduces both communication and computation costs. In addition to these algorithms, we present an approximate algorithm that further reduces both communication and computation costs. Extensive experiments on both synthetic and real data have demonstrated the efficiency and scalability of our algorithms.

As data of an unprecedented scale are becoming accessible, it becomes more and more important to help each user identify the ideal results of a manageable size. As such a mechanism, skyline queries have recently attracted a lot of attention for their intuitive query formulation. This intuitiveness, however, has a side effect of retrieving too many results, especially for high-dimensional data. This paper aims to support personalized skyline queries that identify “truly interesting” objects based on user-specific preference and retrieval size k. In particular, we abstract personalized skyline ranking as a dynamic search over skyline subspaces guided by user-specific preference. We then develop a novel algorithm that navigates on a compressed structure itself, to reduce the storage overhead. Furthermore, we also develop novel techniques to interleave cube construction with navigation for some scenarios without a priori structure. Finally, we extend the proposed techniques for user-specific preferences including equivalence preference. Our extensive evaluation results validate the effectiveness and efficiency of the proposed algorithms on both real-life and synthetic data.

Given a relation that contains main products and a set of relations corresponding to accessory products that can be combined with a main product, the Exploratory Top-k Join query retrieves the k best combinations of main and accessory products based on user preferences. As a result, the user is presented with a set of k combinations of distinct main products, where a main product is combined with accessory products only if the combination has a better score than the single main product. We model this problem as a rank-join problem, where each combination is represented by a tuple from the main relation and a set of tuples from (some of) the accessory relations. The nature of the problem is challenging because the inclusion of accessory products is not predefined by the user, but instead all potential combinations (joins) are explored during query processing in order to identify the highest scoring combinations. Existing approaches cannot be directly applied to this problem, as they are designed for joining a predefined set of relations. In this paper, we present algorithms for processing exploratory top-k joins that adopt the pull-bound framework for rank-join processing. We introduce a novel algorithm (XRJN) which employs a more efficient bounding scheme and allows earlier termination of query processing. We also provide theoretical guarantees on the performance of this algorithm, by proving that XRJN is instance-optimal. In addition, we consider a pulling strategy that boosts the performance of query processing even further. Finally, we conduct a detailed experimental study that demonstrates the efficiency of the proposed algorithms in various setups.

Consider a database consisting of a set of tuples, each of which contains an interval, a type and a weight. These tuples are called typed intervals and are used to support applications involving diverse intervals. In this paper, we study top-k queries on typed intervals. The query reports the k intervals that intersect the query time, contain a particular type and have the largest weights. The query time can be a point or an interval. Further, we define top-k continuous queries that return qualified intervals at each time point during the query interval. To efficiently answer such queries, a key challenge is to build an index structure to manage typed intervals. Employing the standard interval tree, we build the structure in a compact way to reduce the I/O cost, and provide analytically derived partitioning methods to manage the data. Query algorithms are proposed to support point, interval and continuous queries. An auxiliary main-memory structure is developed to report continuous results. Using large real and synthetic datasets, extensive experiments are performed in a prototype database system to demonstrate the effectiveness, efficiency and scalability. The results show that our method significantly outperforms alternative methods in most settings.
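
The point and interval queries described in the abstract above can be stated as a linear scan. This Python sketch only pins down the query semantics; it is not the interval-tree index or the partitioning methods the paper proposes. It assumes closed intervals with integer endpoints and exactly one type per tuple, and the names `TypedInterval` and `topk_typed` are hypothetical.

```python
import heapq
from typing import NamedTuple

class TypedInterval(NamedTuple):
    start: int     # closed interval [start, end] (assumption)
    end: int
    kind: str      # one type per tuple (assumption)
    weight: float

def topk_typed(intervals, q_start, q_end, q_kind, k):
    """k heaviest intervals of type q_kind intersecting [q_start, q_end].

    A point query is expressed by passing q_start == q_end.
    """
    hits = (iv for iv in intervals
            if iv.kind == q_kind
            and iv.start <= q_end and q_start <= iv.end)
    return heapq.nlargest(k, hits, key=lambda iv: iv.weight)
```

A continuous query in the abstract's sense would repeat the point query at each time point in the query interval, which is where an index and an auxiliary in-memory structure become worthwhile.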

Efficient monitoring of skyline queries over distributed data streams (total citations: 1; self-citations: 0; citations by others: 1)
Data management and data mining over distributed data streams have received considerable attention within the database community recently. This paper is the first work to address skyline queries over distributed data streams, where streams derive from multiple horizontally split data sources. A skyline query returns a set of interesting objects which are not dominated by any other objects within the base dataset. Previous work has concentrated on skyline computations over static data or centralized data streams. We present an efficient and effective algorithm called BOCS to handle this issue under the more challenging environment of distributed streams. BOCS consists of an efficient centralized algorithm, GridSky, and an associated communication protocol. Based on the strategy of progressive refinement in BOCS, the skyline is incrementally computed in two phases. In the first phase, local skylines on remote sites are maintained by GridSky. At each time, only skyline increments on remote sites are sent to the coordinator. In the second phase, a global skyline is obtained by integrating remote increments with the latest global skyline. A theoretical analysis shows that BOCS is communication-optimal among all algorithms which use a share-nothing strategy. Extensive experiments demonstrate that our proposals are efficient, scalable, and stable.
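
The two-phase structure in the abstract above rests on a simple property: the global skyline is contained in the union of the local skylines, so a coordinator only needs the local results. The following Python sketch shows that property on static data. It is not BOCS or GridSky, it sends full local skylines rather than increments, and it assumes that smaller values are better in every dimension. The names `skyline` and `global_skyline` are hypothetical.

```python
def dominates(p, q):
    """p dominates q: no worse everywhere, strictly better somewhere.

    Assumption: smaller values are preferred in all dimensions.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Points not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def global_skyline(sites):
    """Phase 1: each site computes a local skyline.
    Phase 2: the coordinator merges the local skylines."""
    candidates = [p for site in sites for p in skyline(site)]
    return skyline(candidates)
```

Merging local skylines gives the same answer as computing the skyline over all points at once, while only the local skyline points cross the network.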
