Similar Documents
20 similar documents found (search time: 46 ms)
1.
Filtering is a generic technique for skyline retrieval in sensor networks, aimed at reducing communication cost, the dominant component of energy consumption. The vast majority of existing filtering approaches suit uniform and correlated datasets, whereas in many applications the data distribution is clustered or anti-correlated. The only work that considers anti-correlated datasets requires significant energy for filter construction and is hard to adapt efficiently to clustered datasets. In this paper, we propose a new filtering algorithm that addresses the problem by exploiting individual node characteristics and generating personalized filters. Given a fraction k, a personalized filter prunes at least k percent of the points on its assigned nodes. A novel scheme for representing data clusters and a sampling method are then proposed to reduce the filtering cost and maximize the benefit of filtering. Extensive simulation results show the superiority of our approach over existing techniques.
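The core of any such filter is the dominance test. Below is a minimal, generic sketch (not the paper's personalized-filter construction) showing how a filter tuple prunes locally stored points and how its pruning rate, the quantity the fraction k constrains, could be measured; all names are illustrative.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is <= b on every dimension and < on at least one
    (smaller values preferred)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def prune_with_filter(points: List[Sequence[float]],
                      filter_point: Sequence[float]) -> List[Sequence[float]]:
    """Drop every local point dominated by the filter tuple; only the survivors
    would need to be transmitted toward the base station."""
    return [p for p in points if not dominates(filter_point, p)]

def pruning_rate(points: List[Sequence[float]],
                 filter_point: Sequence[float]) -> float:
    """Fraction of points a candidate filter removes on this node; a
    personalized filter would be chosen so that this reaches at least k."""
    if not points:
        return 0.0
    return 1.0 - len(prune_with_filter(points, filter_point)) / len(points)
```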

2.
Skyline computation has many applications, including multi-criteria decision making. In this paper, we study the problem of efficiently computing the skyline over sliding windows on uncertain data elements against probability thresholds. First, we characterize the properties of the elements that must be kept in our computation. We then analyze the size of the dynamically maintained candidate set and the size of the skyline. Novel, efficient techniques are developed to process continuous probabilistic skyline queries over sliding windows. Finally, we extend our techniques to applications where multiple probability thresholds are given, "top-k" skyline data objects are retrieved, or elements have individual life-spans. Our extensive experiments demonstrate that the proposed techniques are very efficient and can handle a high-speed data stream in real time.
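As a concrete illustration of the query semantics (not of the paper's incremental techniques), the sketch below computes skyline probabilities over the current window under a common tuple-level uncertainty model, where an element's skyline probability is its own existence probability times the probability that none of its dominators exists; elements below the threshold q are discarded.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def probabilistic_skyline(window, q: float) -> List[Tuple[Sequence[float], float]]:
    """Brute-force pass over the window: P_sky(e) = p(e) * prod(1 - p(d)) over
    all window elements d that dominate e; keep elements with P_sky(e) >= q."""
    result = []
    for i, (val_e, p_e) in enumerate(window):
        prob = p_e
        for j, (val_d, p_d) in enumerate(window):
            if j != i and dominates(val_d, val_e):
                prob *= (1.0 - p_d)
        if prob >= q:
            result.append((val_e, prob))
    return result

# Sliding window holding the N most recent (value vector, existence probability) pairs.
window: Deque[Tuple[Tuple[float, ...], float]] = deque(maxlen=5)
for item in [((1.0, 4.0), 0.9), ((2.0, 2.0), 0.5), ((3.0, 1.0), 0.8), ((2.5, 2.5), 0.7)]:
    window.append(item)
print(probabilistic_skyline(window, q=0.3))
```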

3.
Given a D-dimensional dataset P and a query point q, a reverse skyline query (RSQ) returns all data objects in P whose dynamic skyline contains q. It is important for many real-life applications such as business planning and environmental monitoring. Currently, the state-of-the-art algorithm for answering the RSQ is the reverse skyline using skyline approximations (RSSA) algorithm, which is based on precomputed approximations of the skylines. Although RSSA has some desirable features, e.g., applicability to arbitrary data distributions and dimensionalities, it requires multiple accesses of the same nodes, incurring redundant I/O and CPU costs. In this paper, we propose several efficient algorithms for exact RSQ processing over multidimensional datasets. Our methods utilize a conventional data-partitioning index (e.g., an R-tree) on the dataset P, and employ precomputation, reuse, and pruning techniques to boost query performance. In addition, we extend our techniques to a natural variant of the RSQ, the constrained reverse skyline query (CRSQ), which retrieves the reverse skyline inside a specified constrained region. Extensive experimental evaluation using both real and synthetic datasets demonstrates that our algorithms outperform RSSA by several orders of magnitude under all experimental settings.
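For readers unfamiliar with the semantics, the brute-force sketch below (a baseline, not RSSA or the index-based algorithms proposed here) spells out what it means for q to lie in the dynamic skyline of a point p: every data point is mapped to its coordinate-wise absolute difference from p, and q qualifies if no mapped point dominates the mapped q.

```python
from typing import List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def in_dynamic_skyline(q: Sequence[float], p: Sequence[float],
                       data: List[Sequence[float]]) -> bool:
    """Is q in the dynamic skyline of p?  Transform every point x into |x - p|
    component-wise and check that no transformed data point dominates the
    transformed q."""
    tq = [abs(qi - pi) for qi, pi in zip(q, p)]
    for x in data:
        if x is p:
            continue
        tx = [abs(xi - pi) for xi, pi in zip(x, p)]
        if dominates(tx, tq):
            return False
    return True

def reverse_skyline(q: Sequence[float],
                    data: List[Sequence[float]]) -> List[Sequence[float]]:
    """Brute-force RSQ: all points p whose dynamic skyline contains q."""
    return [p for p in data if in_dynamic_skyline(q, p, data)]
```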

4.
Scaling skyline queries over high-dimensional datasets remains challenging because most existing algorithms assume dimensional independence when establishing worst-case complexity, discarding the correlation distribution of the data. In this paper, we present HashSkyline, a systematic and correlation-aware approach for scaling skyline queries over high-dimensional datasets with three novel features. First, it offers a fast hash-based method that prunes non-skyline points by exploiting data correlation characteristics, speeding up the overall skyline evaluation for correlated datasets. Second, we develop HashSkyline_GPU, which can dramatically reduce the response time for anti-correlated and independent datasets by capitalizing on the parallel processing power of GPUs. Third, HashSkyline uses a pivot-cell-based mechanism combined with a correlation threshold to determine the correlation characteristics of a given dataset, enabling adaptive configuration of HashSkyline for skyline query evaluation by automatically switching between HashSkyline_CPU and HashSkyline_GPU. We evaluate the validity of HashSkyline using both synthetic and real datasets. Our experiments show that HashSkyline incurs significantly less pre-processing cost and achieves significantly higher overall query performance than existing state-of-the-art algorithms.

5.
A skyline query returns a set of candidate records that satisfy several preferences. It is an operation commonly performed to aid decision making. Since executing a skyline query is expensive and a query plan may combine skyline queries with other data operations such as joins, it is important that the query optimizer can quickly yield an accurate cardinality estimate for a skyline query. Log Sampling (LS) and Kernel-Based (KB) estimation are the two state-of-the-art skyline cardinality estimation methods. LS is based on a hypothetical model of the form A(log n)^B. Since this model is originally derived under strong assumptions such as independence between dimensions, it does not apply well to arbitrary datasets; consequently, LS can yield large estimation errors. KB relies on integrating an estimated probability density function (PDF) to derive a scale factor. As the estimation of the PDF and the ensuing integration both involve complex mathematical calculations, KB is time consuming. In view of these problems, we propose an innovative purely sampling-based (PS) method for skyline cardinality estimation. PS is non-parametric: it does not assume any particular data distribution and is thus more robust than LS. PS does not require complex mathematical calculations, so it is much simpler to implement and much faster at producing estimates than KB. Extensive empirical studies show that for a variety of real and synthetic datasets, PS outperforms LS in estimation speed, estimation accuracy, and estimation variability under the same space budget, and outperforms KB in estimation speed and estimation variability under the same performance mark.
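To make the A(log n)^B model concrete, here is a small sketch (in the spirit of log sampling, not the PS method proposed in the paper) that fits A and B by linear regression on log m = log A + B·log(log n), using exact skyline counts of random samples; all function names are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

def skyline_count(points: List[Sequence[float]]) -> int:
    def dom(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return sum(1 for p in points if not any(dom(q, p) for q in points if q is not p))

def fit_log_model(data: List[Sequence[float]],
                  sample_sizes: List[int]) -> Tuple[float, float]:
    """Fit m ~ A * (log n)^B by least squares on the log-transformed model,
    using the exact skyline size of a random sample of each requested size
    (sizes must be distinct and >= 2)."""
    xs, ys = [], []
    for n in sample_sizes:
        sample = random.sample(data, n)
        m = max(skyline_count(sample), 1)     # guard against log(0)
        xs.append(math.log(math.log(n)))
        ys.append(math.log(m))
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Estimated skyline cardinality of the full dataset of size n: a * math.log(n) ** b
```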

6.
We devise a skyline algorithm that can efficiently mitigate the enormous overhead of processing millions of tuples on totally- and partially-ordered domains (henceforth, TODs and PODs). With massive datasets, existing techniques spend a significant amount of time on dominance comparisons because of both the large number of skyline points and the non-progressive way skylines are computed with PODs. (High dimensionality aggravates the situation further.) Progressiveness turns out to be the key property for addressing these problems. This article presents FAST-SKY, an algorithm that successfully deals with these two obstacles and strikingly improves skyline query processing time, even with high-dimensional data. Progressive skyline evaluation with PODs is guaranteed by new index structures and a topological sorting order. A stratification technique is adopted to index data on PODs, and we propose two new index structures: stratified R-trees (SR-trees) for low-dimensional data and stratified MinMax treaps (SM-treaps) for high-dimensional data. Fast dominance comparison is achieved by using a reporting query instead of a dominance query, together with a dimensionality reduction technique. Experimental results suggest that in general cases (anti-correlated and uniform distributions) FAST-SKY is orders of magnitude faster than existing algorithms.

7.
Skyline query processing over uncertain data streams has recently attracted considerable attention in the database community, due to its importance in helping users make intelligent decisions over complex data in many real applications. Although much recent effort has been devoted to skyline computation over data streams in a centralized environment, typically with one processor, these methods cannot be readily adapted to skyline queries over complex uncertain streaming data, due to the computational complexity of the query and the limited processing capability of a single machine. Furthermore, none of the existing studies on parallel skyline computation can effectively address skyline queries over uncertain data streams, as they are all designed for parallel skyline queries over static, certain datasets. In this paper, we formally define the parallel skyline query problem over uncertain data streams under the sliding-window streaming model. For the first time, we propose an effective framework, a distributed parallel framework, that addresses the problem based on sliding-window partitioning. Furthermore, we propose an efficient approach, parallel streaming skyline, to further optimize the parallel skyline computation with an optimized streaming item-mapping strategy and a grid index. Extensive experiments with real deployments over synthetic and real data demonstrate the effectiveness and efficiency of the proposed techniques.

8.
Elastic caching platforms (ECPs) play an important role in accelerating the performance of Web applications. Several cache strategies have been proposed for ECPs to manage data access and distribution while maintaining service availability. In our earlier research, we demonstrated that there is no "one-fits-all" strategy for heterogeneous scenarios and that the selection of the optimal strategy depends on workload patterns, cluster size, and the number of concurrent users. In this paper, we present a new reconfiguration framework named PRESC^2. It applies machine learning approaches to determine an optimal cache strategy and supports online optimization of the performance model through trace-driven simulation or semi-supervised classification. We also propose a robust cache-entry synchronization algorithm and a new optimization mechanism to further lower adaptation costs. In our experiments, we find that PRESC^2 improves the elasticity of ECPs and brings large performance gains compared with static configurations.

9.
Robust Classification for Imprecise Environments
In real-world environments it usually is difficult to specify target operating conditions precisely, for example, target misclassification costs. This uncertainty makes building robust classification systems problematic. We show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions. In some cases, the performance of the hybrid actually can surpass that of the best known classifier. This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and workforce utilization. The hybrid also is efficient to build, to store, and to update. The hybrid is based on a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull (ROCCH) method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. Finally, we point to empirical evidence that a robust hybrid classifier indeed is needed for many real-world problems.
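The geometric heart of the method is the convex hull of classifiers in ROC space. The sketch below (a generic computation, not the authors' full ROCCH machinery) returns the upper hull of a set of (FPR, TPR) operating points anchored at (0,0) and (1,1); classifiers strictly below this hull are suboptimal for every class distribution and cost matrix.

```python
from typing import List, Tuple

def roc_convex_hull(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Upper-left convex hull of (FPR, TPR) points, anchored at (0,0) and (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull: List[Tuple[float, float]] = []
    for p in pts:
        # pop while the last two hull points and p do not make a right (clockwise) turn
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# Example: the classifier at (0.5, 0.2) lies below the hull and is never optimal.
print(roc_convex_hull([(0.1, 0.6), (0.3, 0.8), (0.5, 0.2)]))
```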

10.
Top-k similarity joins have been extensively studied and used in a wide spectrum of applications such as information retrieval, decision making, spatial data analysis, and data mining. Given two sets of objects U and V, a top-k similarity join returns the k pairs of most similar objects from U × V. In the conventional model of top-k similarity join processing, an object is usually regarded as a point in a multi-dimensional space and similarity is measured by a simple distance metric such as Euclidean distance. However, in many applications an object may be described by multiple values (instances), and the conventional model is not applicable because it does not capture the distributions of object instances. In this paper, we study top-k similarity joins over multi-valued objects. We apply two types of quantile-based distance measures, the φ-quantile distance and the φ-quantile group-based distance, to explore the relative instance distributions among the multiple instances of objects. Efficient and effective techniques to process top-k similarity joins over multi-valued objects are developed following a filtering-refinement framework. Novel distance-, statistic-, and weight-based pruning techniques are proposed. Comprehensive experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of our techniques.
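As an illustration of the quantile idea (one plausible reading, not necessarily the exact definition used in the paper), the sketch below computes a φ-quantile distance between two multi-valued objects: all instance-pair distances are weighted by the product of instance probabilities, and the smallest distance whose accumulated probability mass reaches φ is returned.

```python
import math
from typing import List, Sequence, Tuple

Instance = Tuple[Sequence[float], float]   # (value vector, instance probability)

def quantile_distance(u: List[Instance], v: List[Instance], phi: float) -> float:
    """Weight each pairwise Euclidean distance by the product of the two
    instance probabilities, then return the smallest distance at which the
    accumulated probability mass reaches phi."""
    pairs = []
    for xu, pu in u:
        for xv, pv in v:
            pairs.append((math.dist(xu, xv), pu * pv))
    pairs.sort()
    mass = 0.0
    for d, w in pairs:
        mass += w
        if mass >= phi:
            return d
    return pairs[-1][0]   # fall back to the largest pairwise distance

# Example: two objects with two weighted instances each, phi = 0.5.
u = [((0.0, 0.0), 0.5), ((1.0, 1.0), 0.5)]
v = [((0.5, 0.5), 0.7), ((3.0, 3.0), 0.3)]
print(quantile_distance(u, v, phi=0.5))
```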

11.
As powerful tools, machine learning and data mining techniques have been widely applied in various areas. However, in many real-world applications, besides building accurate black-box predictors, we are also interested in white-box mechanisms, such as discovering predictive patterns in data that enhance our understanding of underlying physical, biological, and other natural processes. For these purposes, sparse representation and its variations have been a major focus. More recently, structural sparsity has attracted increasing attention. In previous research, structural sparsity was often achieved by imposing convex but non-smooth norms such as the ℓ2/ℓ1 and group ℓ2/ℓ1 norms. In this paper, we present the explicit ℓ2/ℓ0 and group ℓ2/ℓ0 norms to directly pursue structural sparsity. To tackle the intractable ℓ2/ℓ0 optimization, we develop a general Lipschitz auxiliary function that leads to simple iterative algorithms. In each iteration, an optimal solution is achieved for the induced subproblem, and a guarantee of convergence is provided. Furthermore, the local convergence rate is also theoretically bounded. We test our optimization techniques on the multitask feature learning problem. Experimental results suggest that our approaches outperform alternative approaches on both synthetic and real-world datasets.
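For reference, these are the standard definitions being contrasted (a sketch; the exact grouping used in the paper may differ), with the weights of W partitioned into groups g ∈ G:

```latex
% Convex, non-smooth surrogate: sum of per-group Euclidean norms.
\|W\|_{2,1} \;=\; \sum_{g \in G} \bigl\| w_g \bigr\|_2

% Explicit group \ell_2/\ell_0 norm: the number of active (nonzero) groups.
\|W\|_{2,0} \;=\; \sum_{g \in G} \mathbf{1}\!\left[\, \bigl\| w_g \bigr\|_2 \neq 0 \,\right]
```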

12.
Uncertain data are inevitable in many applications due to factors such as the limitations of measuring equipment and delays in data updates. Although modeling and querying uncertain data have recently attracted considerable attention from the database community, many critical issues remain to be resolved with respect to conducting advanced analysis on uncertain data. In this paper, we study the execution of probabilistic skyline queries over uncertain data streams. We propose a novel sliding-window skyline model in which each uncertain tuple has a probability of being in the skyline at a given timestamp t. Formally, a Wp-Skyline(p, t) contains all tuples whose probability of being in the skyline is at least p at timestamp t. In a streaming environment, however, computing a probabilistic skyline over a large number of uncertain tuples within the sliding window is a daunting task in practice. To calculate the Wp-Skyline efficiently, we propose an efficient and effective approach, the candidate list approach, which maintains lists of candidates that might become skyline tuples in future sliding windows. We also propose algorithms that continuously monitor newly arriving and expired data to maintain the skyline candidate set incrementally. To further reduce the cost of deciding whether a candidate tuple belongs to the skyline, we propose an enhanced refinement strategy based on a multi-dimensional indexing structure combined with a grouping-and-conquer strategy. To validate the effectiveness of our approach, we conduct extensive experiments on both real and synthetic datasets and compare against baseline techniques.

13.
In a number of emerging streaming applications, the data values that are produced have an associated time interval for which they are valid. A useful computation over such streaming data is to produce a continuous and valid skyline summary. Previous work on skyline algorithms has focused only on evaluating skylines over static datasets, and there are no known algorithms for skyline computation in the continuous setting. In this paper, we introduce the continuous time-interval skyline operator, which continuously computes the current skyline over a data stream. We present a new algorithm called LookOut for evaluating such queries efficiently, and empirically demonstrate its scalability. In addition, we examine the effect of the underlying spatial index structure when evaluating skylines. Whereas previous work on skyline computation has considered only the R-tree index structure, we show that using an underlying quadtree for skyline computation has significant performance benefits over an R-tree index.
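Before any index-based optimization, the operator's semantics can be pinned down with a naive baseline (shown below; this is not LookOut): tuples carry an expiry time, expired tuples are dropped from a heap ordered by end time, and the skyline of the surviving tuples is recomputed on demand.

```python
import heapq
from typing import List, Tuple

Point = Tuple[float, ...]

def dominates(a: Point, b: Point) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

class NaiveTimeIntervalSkyline:
    """Naive baseline: keep live tuples in a min-heap keyed on end time and
    recompute the skyline of the still-valid tuples from scratch per query."""

    def __init__(self) -> None:
        self._live: List[Tuple[float, Point]] = []   # (end_time, point)

    def insert(self, point: Point, end_time: float) -> None:
        heapq.heappush(self._live, (end_time, point))

    def skyline(self, now: float) -> List[Point]:
        # drop tuples whose validity interval has ended
        while self._live and self._live[0][0] <= now:
            heapq.heappop(self._live)
        pts = [p for _, p in self._live]
        return [p for p in pts if not any(dominates(q, p) for q in pts if q != p)]

# Usage: insert (value, expiry) pairs as they arrive, then query skyline(now).
s = NaiveTimeIntervalSkyline()
s.insert((1.0, 4.0), end_time=10.0)
s.insert((2.0, 2.0), end_time=3.0)
print(s.skyline(now=5.0))
```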

14.
Many recent applications involve processing and analyzing uncertain data. In this paper, we combine the feature of top-k objects with that of the skyline to model the problem of top-k skyline objects over uncertain data. Efficiently computing top-k skyline objects on large uncertain datasets is challenging in both the discrete and the continuous case. We first develop an efficient exact algorithm for computing the top-k skyline objects in the discrete case. To address applications where each object may have a massive set of instances or a continuous probability density function, we also develop an efficient randomized algorithm with an ε-approximation guarantee. Moreover, our algorithms can be immediately extended to efficiently compute the p-skyline, that is, to retrieve the uncertain objects with skyline probabilities above a given threshold. Our extensive experiments on synthetic and real data demonstrate the efficiency of both algorithms and show that the randomized algorithm is highly accurate. They also show that our techniques significantly outperform existing techniques for computing the p-skyline.

15.
In many applications involving multi-criteria optimal decision making, users may want to make a personal trade-off among all optimal solutions in order to select the one object that best fits their needs. As a key feature, the skyline in a multi-dimensional space provides a minimal set of candidates for such purposes by removing every object that is not preferred by any (monotonic) utility/scoring function; that is, the skyline removes all objects not preferred by any user, no matter how user preferences vary. Due to its importance, the problem of skyline computation and its variants have been extensively studied in the database literature. In this paper, we provide a comprehensive survey of skyline computation techniques. Specifically, we first introduce skyline computation algorithms on traditional (exact) data, where each object corresponds to a point in a multi-dimensional space. Then, we discuss skyline models and efficient algorithms for handling uncertain data, which is inherent in many important applications. Finally, we briefly describe a few variants of the skyline (e.g., the skycube, the k-skyband, and the reverse skyline).

16.
Skyline operation is a typical multi-criteria decision-making technique that is well documented in data engineering. It assumes a settled human preference, an assumption that faces serious challenges in practical decision-making applications because it oversimplifies preference scenarios, which are usually dynamic. This study establishes a mathematical formulation of dynamic preference in real settings. A decision approach called tolerant skyline operation (T-skyline) is fully developed, including its conceptual modeling, computation methods, and a skyline maintenance mechanism on a database. The method and its computation mechanism are evaluated through an empirical study of personnel selection and evaluation, and we also analyze computational efficiency and system stability. The decision targets are fully achieved, the computation results are satisfactory, and the computational efficiency is reasonable. The effectiveness and advantages of the approach are illustrated in different real-world settings. Experiments examine the design and development of the T-skyline operation using real, public datasets to evaluate players in the National Basketball Association in the United States; the results validate the practical viability of our decision model, which can inspire discussions in the sport industry. The methodology used in this study is valuable for further academic research, particularly for interdisciplinary investigation of decision making and data engineering.

17.
Many database applications that need to disseminate dynamic information from a server to various clients can suffer from heavy communication costs. Data caching at a client can help mitigate these costs, particularly when individual PUSH-PULL decisions are made for the different semantic regions in the data space. The server is responsible for notifying the client about updates in the PUSH regions, while the client must contact the server for queries that ask for data in the PULL regions. We call the idea of partitioning the data space into PUSH-PULL regions to minimize communication cost data gerrymandering. In this paper, we present solutions to the technical challenges in adopting this simple but powerful idea. We give a provably cost-optimal dynamic programming algorithm for gerrymandering on a single query attribute, and propose a family of efficient heuristics for gerrymandering on multiple query attributes. We also handle the dynamic case in which the workloads of queries and updates evolve over time. We validate our methods through extensive experiments on real and synthetic datasets.
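To make the single-attribute case concrete, here is a small dynamic program under a hypothetical cost model (the paper's exact cost formulation may differ): each bucket of the attribute's domain pays its update-notification cost if labelled PUSH or its pull/query cost if labelled PULL, and every maximal run of equally labelled buckets pays one fixed per-region overhead.

```python
from functools import lru_cache
from typing import List, Tuple

def gerrymander_1d(update_cost: List[float], query_cost: List[float],
                   region_overhead: float) -> Tuple[float, List[str]]:
    """Label each bucket of a single query attribute PUSH or PULL so that the
    total cost (bucket costs plus one overhead per maximal region) is minimal."""
    n = len(update_cost)

    @lru_cache(maxsize=None)
    def best(i: int, prev_label: str) -> Tuple[float, Tuple[str, ...]]:
        if i == n:
            return 0.0, ()
        options = []
        for label in ("PUSH", "PULL"):
            local = update_cost[i] if label == "PUSH" else query_cost[i]
            if label != prev_label:
                local += region_overhead          # starting a new region
            rest, labels = best(i + 1, label)
            options.append((local + rest, (label,) + labels))
        return min(options)

    total, labels = best(0, "")
    return total, list(labels)

# Example: buckets 0 and 3 are update-heavy, buckets 1 and 2 are query-heavy.
print(gerrymander_1d([5, 1, 1, 6], [1, 4, 5, 1], region_overhead=0.5))
```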

18.
Skyline queries can effectively support multi-objective optimization, and OLAP in a data warehouse likewise analyzes multi-dimensional data. Therefore, targeting the application of skyline queries in data warehouses, this paper proposes a Skyline-Join query algorithm for the snowflake schema of a data warehouse. The algorithm first M-Joins the child dimension tables with their parent dimension tables, and then joins the fact table with the parent dimension tables in a progressive, selective manner. Before each join, the fact table is partitioned into groups and a skyline is computed within each group; non-skyline tuples within each group are deleted, which avoids many unnecessary join operations and greatly improves query efficiency. Experiments show that, as the number of fact-table tuples grows and the number of dimension tables increases, the proposed algorithm clearly outperforms the naive algorithm that performs the join first and the skyline computation afterwards.

19.
Skyline points and queries are important in the context of processing datasets with multiple dimensions. Since skyline points can be viewed as representing marketable products that are useful for clients and business owners, one may also consider non-skyline points that are highly competitive with the current skyline points. We address the problem of continuously finding such potential products in a dynamic d-dimensional dataset, and formally define a potential product and its upgrade promotion cost. In this paper, we propose the CP-Sky algorithm, an efficient approach for continuously evaluating potential products by utilizing a second-order skyline set, which consists of candidate points that are closest to the regular skyline points (also termed the first-order skyline set), to facilitate efficient computation and updates of potential products. With the knowledge of the second-order skyline set, CP-Sky enables the system to (1) efficiently find substitute skyline points from the second-order skyline set only when a first-order skyline point is removed, and (2) continuously retrieve the top-k potential products. Within this context, the Approximate Exclusive Dominance Region (AEDR) algorithm is proposed to reduce the computational complexity of determining a candidate set for second-order skyline updates over a dynamic dataset without affecting result accuracy. Additionally, we extend CP-Sky to support the computation of top-k potential products. Finally, we present experimental results on datasets with various distributions to demonstrate the performance and utility of our approach.

20.
Current skyline evaluation techniques mainly find the outstanding tuples in a large dataset. In this paper, we generalize the concept of the skyline query and introduce a novel type of query, the combinatorial skyline query, which finds the outstanding combinations among all combinations of the given tuples. The conventional skyline query is a special case of the combinatorial skyline query. This generalized concept is semantically richer when used in decision making, market analysis, business planning, and quantitative economics research. We first introduce the concept of the combinatorial skyline query (CSQ) and explain the difficulty of resolving this type of query. We then propose two algorithms to solve the problem. The experiments demonstrate the effectiveness and efficiency of the proposed algorithms.
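A brute-force rendering of the semantics (not the paper's algorithms) makes the generalization explicit: each k-combination is aggregated into a single vector, here by a component-wise sum (one plausible aggregate), and the skyline is taken over those aggregated vectors.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def combinatorial_skyline(tuples: List[Sequence[float]], k: int) -> List[Tuple[int, ...]]:
    """Brute-force CSQ: aggregate every k-combination by the component-wise sum
    of its members and return the combinations (as index tuples) whose
    aggregated vectors are not dominated by any other combination's vector."""
    combos = list(combinations(range(len(tuples)), k))
    agg = {c: tuple(sum(tuples[i][d] for i in c) for d in range(len(tuples[0])))
           for c in combos}
    return [c for c in combos
            if not any(dominates(agg[o], agg[c]) for o in combos if o != c)]

# Example: with k = 1 this degenerates to the conventional skyline.
print(combinatorial_skyline([(1.0, 4.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)], k=2))
```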
