1.

DAWN: an efficient framework of DCT for data with error estimation

Ming-Jyh Hsieh Wei-Guang Teng Ming-Syan Chen Philip S. Yu 《The VLDB Journal The International Journal on Very Large Data Bases》2008,17(4):683-702

On-line analytical processing (OLAP) has become an important component in most data warehouse systems and decision support systems in recent years. To deal with huge amounts of data, highly complex queries, and increasingly strict response-time requirements, approximate query processing has been deemed a viable solution. Most work in this area, however, focuses on space efficiency and cannot provide quality-guaranteed answers to queries. To remedy this, we propose an efficient framework of DCT for dAta With error estimatioN, called DAWN, which focuses on answering range-sum queries from compressed OP-cubes transformed by DCT. Specifically, using geometric series and Euler's formula, we devise a robust summation function, called the GE function, that answers range queries in constant time regardless of the number of data cells involved. Because the GE function computes the summation of cosine functions precisely, the quality of the answers is superior to that of previous works. Furthermore, an error estimator based on the Brown noise assumption (BNA) is devised to provide tight bounds for answers to range-sum queries. Our experimental results show that the DAWN framework is scalable with respect to the selectivity of queries and the available storage space. With GE functions and the BNA method, the DAWN framework not only delivers high-quality answers to range-sum queries, but also achieves shorter query response times owing to its effective error estimation.
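
The constant-time summation of cosines mentioned in the abstract can be illustrated in one dimension. The sketch below is not the paper's GE function; it is a minimal example that assumes an orthonormal DCT-II and uses the telescoping identity sum_{x=a}^{b} cos((2x+1)θ) = (sin(2(b+1)θ) - sin(2aθ)) / (2 sin θ), which follows from a geometric series of complex exponentials. All function names here (`dct_coefficients`, `cosine_range_sum`, `range_sum`) are hypothetical.

```python
import math


def dct_coefficients(values):
    """Orthonormal 1-D DCT-II of a list of numbers (O(n^2), for clarity)."""
    n = len(values)
    coeffs = []
    for u in range(n):
        scale = math.sqrt((1.0 if u == 0 else 2.0) / n)
        total = sum(v * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                    for x, v in enumerate(values))
        coeffs.append(scale * total)
    return coeffs


def cosine_range_sum(u, a, b, n):
    """Closed form of sum_{x=a}^{b} cos((2x+1) * u * pi / (2n)) for 0 <= u < n."""
    if u == 0:
        return float(b - a + 1)
    theta = u * math.pi / (2 * n)
    # Telescoping identity; sin(theta) > 0 because 0 < theta < pi/2.
    return (math.sin(2 * (b + 1) * theta) - math.sin(2 * a * theta)) / (2 * math.sin(theta))


def range_sum(kept, a, b, n):
    """Estimate sum(values[a..b]) from a dict {frequency u: DCT coefficient}.

    The cost depends on the number of kept coefficients, not on b - a.
    """
    total = 0.0
    for u, c in kept.items():
        scale = math.sqrt((1.0 if u == 0 else 2.0) / n)
        total += scale * c * cosine_range_sum(u, a, b, n)
    return total


data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
full = dict(enumerate(dct_coefficients(data)))
# Lossy synopsis (an assumption of this sketch): keep the 4 largest coefficients.
top4 = dict(sorted(full.items(), key=lambda kv: abs(kv[1]), reverse=True)[:4])
print(sum(data[2:6]), range_sum(full, 2, 5, len(data)), range_sum(top4, 2, 5, len(data)))
```

With all coefficients kept, the estimate matches the exact range sum up to rounding; with a truncated synopsis it becomes an approximation whose error the paper's BNA estimator is designed to bound.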

2.

Approximate range–sum query answering on data cubes with probabilistic guarantees

Alfredo Cuzzocrea Wei Wang 《Journal of Intelligent Information Systems》2007,28(2):161-197

Approximate range aggregate queries are among the most frequent and useful kinds of queries for Decision Support Systems (DSS), as they are widely used in many data analysis tasks. Traditionally, sampling-based techniques have been proposed to tackle this problem. However, their effectiveness degrades when the underlying data distribution is skewed. Another approach, based on outlier management, can limit the effect of data skew but fails to address other requirements of approximate range aggregate queries, such as error guarantees and query processing efficiency. In this paper, we present a technique that efficiently provides approximate answers to range aggregate queries on OLAP data cubes, with theoretical guarantees on the errors. Our basic idea is to build different data structures to manage outliers and the rest of the data. Carefully chosen outliers are organized in a quad-tree based indexing data structure to provide efficient access for query processing. A query-workload adaptive, tree-like synopsis data structure, called Tunable Partition-Tree (TP-Tree), is proposed to organize samples extracted from non-outlier data. Our experiments clearly demonstrate the merits of our technique in comparison with previous well-known techniques.
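
The split between exactly stored outliers and sampled non-outlier data can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: plain dictionaries stand in for the quad-tree and the TP-Tree, outliers are chosen by distance from the mean, the data are one-dimensional, and no error guarantee is computed. The names `build_synopsis` and `estimate_range_sum` are hypothetical.

```python
import random


def build_synopsis(values, num_outliers, sample_rate, seed=0):
    """Store the most extreme cells exactly and Bernoulli-sample the rest."""
    mean = sum(values) / len(values)
    by_deviation = sorted(range(len(values)),
                          key=lambda i: abs(values[i] - mean), reverse=True)
    outliers = {i: values[i] for i in by_deviation[:num_outliers]}
    rng = random.Random(seed)
    sample = {i: v for i, v in enumerate(values)
              if i not in outliers and rng.random() < sample_rate}
    return outliers, sample, sample_rate


def estimate_range_sum(synopsis, lo, hi):
    """Exact contribution of outliers plus a scaled-up sum of sampled cells."""
    outliers, sample, rate = synopsis
    exact_part = sum(v for i, v in outliers.items() if lo <= i <= hi)
    sampled_part = sum(v for i, v in sample.items() if lo <= i <= hi)
    return exact_part + sampled_part / rate


cells = [10.0] * 20
cells[7] = 1000.0  # a single skewed cell that would distort a pure sample
synopsis = build_synopsis(cells, num_outliers=1, sample_rate=0.5, seed=42)
print(estimate_range_sum(synopsis, 0, 19), sum(cells))
```

Because the skewed cell is stored exactly, the sampling error comes only from the well-behaved remainder of the data.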

3.

Online aggregation with tight error bounds in dynamic environments

《Information and Software Technology》2006,48(9):869-875

OLAP is a category of database technology that allows analysts to gain insight into the aggregation of data by enabling them to gain access to a variety of different views of the information contained in a database. It is very important to provide analysts with guaranteed error bounds for approximate results to aggregation queries in enterprise applications such as decision support systems. We propose a general method of providing tight error bounds for approximate results to OLAP range-sum queries. We perform an extensive experiment on diverse data sets and examine the effectiveness of the proposed method for various data cube dimensions and query sizes.

4.

Design and Implementation of an OLAP Application System

周龙 郑诚 《微机发展》2006,16(6):101-103

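By analyzing the concepts and architecture of data warehouses and OLAP, this paper describes a design for an OLAP application system and presents its concrete implementation. Queries against a data warehouse are generally ad hoc: they must execute complex operations within strict response-time limits, scanning millions to hundreds of millions of records while performing potentially complex searches, joins, and aggregations. Query throughput and response time are the key measures of data warehouse performance. CUBE computation is the foundation of ad hoc OLAP queries, and improving query speed requires precomputation for OLAP. The paper systematically compares several cube computation algorithms and applies them in a concrete system.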

5.

Answering form-based web queries using the data-mining approach

Xiaochun Yang Yiu-Kai Ng 《Journal of Intelligent Information Systems》2008,30(1):1-32

Web users often post queries through form-based interfaces on the Web to retrieve data from the Web; however, answers to these queries are mostly computed according to keywords entered into different fields specified in a query interface, and their precision and recall could be low. The precision and recall ratios in answering this type of query can be improved by considering closely related previous queries submitted through the same interface, along with their answers. In this paper, we present an approach for enhancing the retrieval of relevant answers to a form-based Web query by adopting the data-mining approach using previous, relevant queries and their answers. Experimental results on a randomly selected set of 3,800 documents retrieved from various Web sites show that our data-mining, query-rewriting approach achieves average precision and true positive ratios on rewritten queries in the upper 80% range, whereas the average false positive ratio is less than 2.0%. Work partially done during a visit to BYU and partially supported by National Natural Science Foundation of China No. 60503036 and Fok YingTong Education Foundation No. 104027.

6.

Revisiting the cube lifecycle in the presence of hierarchies

Konstantinos Morfonios Yannis Ioannidis 《The VLDB Journal The International Journal on Very Large Data Bases》2010,19(2):257-282

On-line analytical processing (OLAP) typically involves complex aggregate queries over large datasets. The data cube has been proposed as a structure that materializes the results of such queries in order to accelerate OLAP. A significant fraction of the related work has been on Relational-OLAP (ROLAP) techniques, which are based on relational technology. Existing ROLAP cubing solutions mainly focus on “flat” datasets, which do not include hierarchies in their dimensions. Nevertheless, as shown in this paper, the nature of hierarchies introduces several complications into the entire lifecycle of a data cube including the operations of construction, storage, indexing, query processing, and incremental maintenance. This fact renders existing techniques essentially inapplicable in a significant number of real-world applications and mandates revisiting the entire cube lifecycle under the new perspective. In order to overcome this problem, the CURE algorithm has been recently proposed as an efficient mechanism to construct complete cubes over large datasets with arbitrary hierarchies and store them in a highly compressed format, compatible with the relational model. In this paper, we study the remaining phases in the cube lifecycle and introduce query-processing and incremental-maintenance algorithms for CURE cubes. These are significantly different from earlier approaches, which have been proposed for flat cubes constructed by other techniques and are inadequate for CURE due to its high compression rate and the presence of hierarchies. Our methods address issues such as cube indexing, query optimization, and lazy update policies. Especially regarding updates, such lazy approaches are applied for the first time on cubes. We demonstrate the effectiveness of CURE in all phases of the cube lifecycle through experiments on both real-world and synthetic datasets. 
Among the experimental results, we distinguish those that have made CURE the first ROLAP technique to complete the construction and usage of the cube of the highest-density dataset in the APB-1 benchmark (12 GB). CURE was in fact quite efficient on this, showing great promise with respect to the potential of the technique overall.

7.

Enabling OLAP in mobile environments via intelligent data cube compression techniques

Alfredo Cuzzocrea Filippo Furfaro Domenico Saccà 《Journal of Intelligent Information Systems》2009,33(2):95-143

The main drawbacks of handheld devices (small storage space, small display size, intermittent connection to the WLAN, etc.) are often incompatible with the need to query and browse information extracted from enormous amounts of data accessible through the network. In this application scenario, data compression and summarization play a leading role: data in a lossy compressed format can be transmitted more efficiently than the original data, and can be stored effectively on handheld devices (by setting the compression ratio accordingly). In this paper, we introduce a very effective compression technique for multidimensional data cubes, and the system Hand-OLAP, which exploits this technique to allow handheld devices to extract and browse compressed two-dimensional OLAP views derived from multidimensional data cubes stored on a remote OLAP server located on the wired network. Hand-OLAP effectively and efficiently enables OLAP in mobile environments, and also extends the potential of Decision Support Systems by taking advantage of the "naturally" decentralized nature of such environments. The idea on which the system is based is that, rather than querying the original multidimensional data cubes, it may be more convenient to generate a compressed OLAP view of them, store this view on the handheld device, and query it locally (off-line), thus obtaining approximate answers that are suitable for OLAP applications.

8.

Continually Answering Constraint k-NN Queries in Unstructured P2P Systems

Bin Wang Xiao-Chun Yang Guo-Ren Wang Ge Yu Lei Chen X. Sean Wang and Xue-Min Lin 《计算机科学技术学报》2008,23(4):538-556

We consider the problem of efficiently computing distributed geographical k-NN queries in an unstructured peer-to-peer (P2P) system, in which each peer is managed by an individual organization and can only communicate with its logical neighboring peers. Such queries are based on local filter query statistics and require as little communication cost as possible, which makes the problem more difficult than existing distributed k-NN queries. In particular, we aim to reduce both the number of candidate peers and the communication cost. In this paper, we propose an efficient pruning technique to minimize the number of candidate peers to be processed to answer the k-NN queries. Our approach is especially suitable for continuous k-NN queries under peer updates, including changes to the ranges of peers, peers dynamically leaving or joining, and data updates within a peer. In addition, simulation results show that the proposed approach outperforms the existing Minimum Bounding Rectangle (MBR)-based query approaches, especially for continuous queries.

9.

Coding-based Join Algorithms for Structural Queries on Graph-Structured XML Document

Hongzhi Wang Jianzhong Li Wei Wang Xuemin Lin 《World Wide Web》2008,11(4):485-510

In many applications, XML documents need to be modelled as graphs. Query processing on graph-structured XML documents brings new challenges. In this paper, we design a method based on a labelling scheme for structural query processing on graph-structured XML documents. We assign each node labels under a reachability labelling scheme. By extending an interval-based reachability labelling scheme for DAGs by Rakesh et al., we design labelling schemes that support judging reachability relationships in general graphs. Based on the labelling schemes, we design graph structural join algorithms to efficiently answer structural queries that involve only ancestor-descendant relationships. For processing subgraph queries, we design a subgraph join algorithm. With an efficient data structure, the subgraph join algorithm can process subgraph queries with various structures efficiently. Experimental results show that our algorithms have good performance and scalability. Supported by the Key Program of the National Natural Science Foundation of China under Grant No. 60533110; the National Grand Fundamental Research 973 Program of China under Grant No. 2006CB303000; and the National Natural Science Foundation of China under Grants No. 60773068 and No. 60773063.

10.

Approximate query processing using wavelets (cited by: 7; self-citations: 0; citations by others: 7)

Kaushik Chakrabarti Minos Garofalakis Rajeev Rastogi Kyuseok Shim 《The VLDB Journal The International Journal on Very Large Data Bases》2001,10(2-3):199-223

Approximate query processing has emerged as a cost-effective approach for dealing with the huge data volumes and stringent response-time requirements of today's decision support systems (DSS). Most work in this area, however, has so far been limited in its query processing scope, typically focusing on specific forms of aggregate queries. Furthermore, conventional approaches based on sampling or histograms appear to be inherently limited when it comes to approximating the results of complex queries over high-dimensional DSS data sets. In this paper, we propose the use of multi-dimensional wavelets as an effective tool for general-purpose approximate query processing in modern, high-dimensional applications. Our approach is based on building wavelet-coefficient synopses of the data and using these synopses to provide approximate answers to queries. We develop novel query processing algorithms that operate directly on the wavelet-coefficient synopses of relational tables, allowing us to process arbitrarily complex queries entirely in the wavelet-coefficient domain. This guarantees extremely fast response times since our approximate query execution engine can do the bulk of its processing over compact sets of wavelet coefficients, essentially postponing the expansion into relational tuples until the end-result of the query. We also propose a novel wavelet decomposition algorithm that can build these synopses in an I/O-efficient manner. Finally, we conduct an extensive experimental study with synthetic as well as real-life data sets to determine the effectiveness of our wavelet-based approach compared to sampling and histograms. Our results demonstrate that our techniques: (1) provide approximate answers of better quality than either sampling or histograms; (2) offer query execution-time speedups of more than two orders of magnitude; and (3) guarantee extremely fast synopsis construction times that scale linearly with the size of the data. 
Received: 7 August 2000 / Accepted: 1 April 2001 / Published online: 7 June 2001
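
The idea of a wavelet-coefficient synopsis can be illustrated with a one-dimensional Haar decomposition. The sketch below is a simplification and not the paper's algorithm: it uses the unnormalized averaging-and-differencing form, thresholds by plain coefficient magnitude, and reconstructs the data instead of processing queries in the coefficient domain. The function names are hypothetical.

```python
def haar_decompose(values):
    """Unnormalized Haar transform; len(values) must be a power of two."""
    coeffs = []
    current = list(values)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coeffs = details + coeffs  # coarser detail levels go in front
        current = averages
    return current + coeffs  # [overall average, coarse details, ..., fine details]


def haar_reconstruct(coeffs):
    """Invert haar_decompose level by level."""
    current = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        expanded = []
        for avg, det in zip(current, details):
            expanded.extend([avg + det, avg - det])
        current = expanded
    return current


def threshold(coeffs, k):
    """Keep the k largest-magnitude coefficients and zero the others."""
    keep = set(sorted(range(len(coeffs)),
                      key=lambda i: abs(coeffs[i]), reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


data = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(data)
synopsis = threshold(coeffs, 4)
print(coeffs)
print(haar_reconstruct(synopsis))
```

Keeping all coefficients reproduces the data exactly; dropping small coefficients yields a compact synopsis from which approximate values, and hence approximate aggregates, can be derived.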

11.

EXODuS: Exploratory OLAP over Document Stores

《Information Systems》2019

OLAP has been extensively used for a couple of decades as a data analysis approach to support decision making on enterprise structured data. Now, with the wide diffusion of NoSQL databases holding semi-structured data, there is a growing need to enable OLAP on document stores as well, to allow non-expert users to gain new insights and make better decisions. Unfortunately, due to their schemaless nature, document stores are hardly accessible via direct OLAP querying. In this paper we propose EXODuS, an interactive, schema-on-read approach to enable OLAP querying of document stores in the context of self-service BI and exploratory OLAP. To discover multidimensional hierarchies in document stores we adopt a data-driven approach based on the mining of approximate functional dependencies; to ensure good performance, we incrementally build local portions of hierarchies for the levels involved in the current user query. Users execute an analysis session by expressing well-formed multidimensional queries related by OLAP operations; these queries are then translated into the native query language of MongoDB, one of the most popular document-based DBMSs. An experimental evaluation on real-world datasets shows the efficiency of our approach and its compatibility with a real-time setting.

12.

Optimization in Data Cube System Design

Edward Hung David W. Cheung Ben Kao 《Journal of Intelligent Information Systems》2004,23(1):17-45

The design of an OLAP system for supporting real-time queries is one of the major research issues. One approach is to use data cubes, which are materialized precomputed multidimensional views of data in a data warehouse. We can derive a set of data cubes to answer each frequently asked query directly. However, there are two practical problems: (1) the maintenance cost of the data cubes, and (2) the query cost to answer those queries. Maintaining a data cube requires disk storage and CPU computation, so the maintenance cost is related to the total size as well as the total number of data cubes materialized. In most cases, materializing all data cubes is impractical. The maintenance cost may be reduced by merging some data cubes. However, the resulting larger data cubes will increase the query cost of answering some queries. If the bounds on the maintenance cost and the query cost are too strict, we help the user decide which queries should be sacrificed and left out of consideration. We have defined an optimization problem in data cube system design. Given a maintenance-cost bound, a query-cost bound and a set of frequently asked queries, it is necessary to determine a set of data cubes such that the system can answer the largest possible subset of the queries without violating the two bounds. This is an NP-hard problem. We propose approximate greedy algorithms GR, 2GM and 2GMM, which are shown to be both effective and efficient by experiments on a census data set and a forest-cover-type data set.

13.

A Highly Condensed and Semantics-Preserving Data Cube

向隆刚 龚健雅 《计算机研究与发展》2007,44(5):837-844

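Quotient Cube and QC-tree attempt to condense the size of a data cube while preserving the semantics it implies; however, the former does not store semantic relationships, and the semantic relationships stored by the latter are obscure. To address this, the drill-down cube structure is proposed. It is the first to consider data cube storage from a semantic perspective: it stores not the contents of classes but the direct drill-down relationships between classes. The drill-down cube not only greatly reduces the storage size of a data cube but also clearly expresses the drill-down semantics implied by the original cube. In addition, the drill-down cube offers high query response performance, which is especially pronounced for range queries. Experiments and analysis show that the drill-down cube clearly outperforms QC-tree in storage size and query response, and is well suited to organizing and storing data cubes.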

14.

A Density-Based Tree Index Structure for Approximate Aggregate Queries

许俭 吴天轶 王晨 汪卫 施伯乐 《计算机科学》2005,32(11):99-103

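How to answer aggregate queries on data cubes approximately, quickly, and effectively is one of the core problems in data mining and data warehouse research. Most existing aggregate query algorithms support only one specific type of aggregate query on a given data cube rather than multiple types. This paper presents a new framework, AdenTS, a density-based adaptive tree structure that can answer various kinds of aggregate queries on the same data cube. It also proposes several approximation and heuristic techniques that improve query results and accuracy. Experimental results show that the method is superior in both the range of supported query types and performance.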

15.

Approximate Query Processing in Cube Streams

Ming-Jyh Hsieh Ming-Syan Chen Yu P.S. 《Knowledge and Data Engineering, IEEE Transactions on》2007,19(11):1557-1570

Data cubes have become important components in most data warehouse systems and decision support systems. In such systems, users usually pose very complex queries to the online analytical processing (OLAP) system, and systems usually have to deal with a huge amount of data because of the large dimensionality of the sets; thus, approximate query processing has emerged as a viable solution. Specifically, the applications of cube streams handle multidimensional data sets in a continuous manner in contrast to the traditional cube approximation. Such an application collects data events for cube streams online, generates snapshots with limited resources, and keeps the approximated information in a synopsis memory for further analysis. Compared to the OLAP applications, applications of cube streams are subject to many more resource constraints on both the processing time and the memory and cannot be dealt with by existing methods due to the limited resources. In this paper, we propose the DAWA algorithm, which is a hybrid algorithm of discrete cosine transform (DCT) for data and the discrete wavelet transform (DWT), to approximate cube streams. Our algorithm combines the advantages of the high compression rate of DWT and the low memory cost of DCT. Consequently, DAWA requires a much smaller working buffer and outperforms both DWT-based and DCT-based methods in execution efficiency. Also, it is shown that DAWA provides a good solution for approximate query processing of cube streams with a small working buffer and a short execution time. The optimality of the DAWA algorithm is theoretically proved and empirically demonstrated by our experiments.

16.

An OLAP Query Scheme for P2P Network Environments (cited by: 1; self-citations: 1; citations by others: 0)

周攀 杨科华 周利民 《计算机工程与应用》2011,47(8):108-111

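In both traditional network environments and P2P environments, clients submit OLAP queries to an OLAP server and obtain the results from it, so the server's load rises sharply as the number of clients grows. This paper designs a DQDC (Distributed Query Data Cube) algorithm based on P2P (peer-to-peer) technology that enables semantic-level sharing of data cube data across multiple nodes in a P2P network, thereby improving the overall decision-analysis performance of the system.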

17.

Progressive evaluation of nested aggregate queries

Kian-Lee Tan Cheng Hian Goh Beng Chin Ooi 《The VLDB Journal The International Journal on Very Large Data Bases》2000,9(3):261-278

In many decision-making scenarios, decision makers require rapid feedback to their queries, which typically involve aggregates. The traditional blocking execution model can no longer meet the demands of these users. One promising approach in the literature, called online aggregation, evaluates an aggregation query progressively as follows: as soon as certain data have been evaluated, approximate answers are produced with their respective running confidence intervals; as more data are examined, the answers and their corresponding running confidence intervals are refined. In this paper, we extend this approach to handle nested queries with aggregates (i.e., at least one inner query block is an aggregate query) by providing users with (approximate) answers progressively as the inner aggregation query blocks are evaluated. We address the new issues posed by nested queries. In particular, the answer space begins with a superset of the final answers and is refined as the aggregates from the inner query blocks are refined. For the intermediary answers to be meaningful, they have to be interpreted with the aggregates from the inner queries. We also propose a multi-threaded model for evaluating such queries: each query block is assigned to a thread, and the threads can be evaluated concurrently and independently. The time slice across the threads is nondeterministic in the sense that the user controls the relative rate at which these subqueries are being evaluated. For enumerative nested queries, we propose a priority-based evaluation strategy to present answers that are certainly in the final answer space first, before presenting those whose validity may be affected as the inner query aggregates are refined. We implemented a prototype system using Java and evaluated our system. Results for nested queries with a single level and with multiple levels of nesting are reported. 
Our results show the effectiveness of the proposed mechanisms in providing progressive feedback that reduces the initial waiting time of users significantly without sacrificing the quality of the answers. Received April 25, 2000 / Accepted June 27, 2000
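
The basic online aggregation loop that this paper extends, an estimate with a running confidence interval refined as more data are examined, can be sketched for a single non-nested AVG query. This is a generic illustration, not the paper's nested-query mechanism or its Java prototype. It assumes rows are read in random order, uses a normal-approximation interval with z = 1.96, and ignores the finite-population correction; `online_average` is a hypothetical name.

```python
import math
import random


def online_average(values, z=1.96, seed=0, report_every=100):
    """Yield (rows_seen, running_mean, half_width) while scanning in random order."""
    order = list(values)
    random.Random(seed).shuffle(order)
    count, mean, m2 = 0, 0.0, 0.0
    for v in order:
        # Welford's update keeps a numerically stable running mean and variance.
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
        if count % report_every == 0 or count == len(order):
            if count > 1:
                std = math.sqrt(m2 / (count - 1))
                half_width = z * std / math.sqrt(count)
            else:
                half_width = float("inf")
            yield count, mean, half_width


rows = [float(i % 50) for i in range(1000)]
for seen, estimate, half_width in online_average(rows, report_every=250):
    print(seen, round(estimate, 3), round(half_width, 3))
```

A user can stop the scan as soon as the reported interval is narrow enough, which is the source of the reduced waiting time discussed in the abstract.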

18.

Research on Querying Very Large Compressed Data Warehouses

张应龙 盛立琨 《计算机与现代化》2009,(6)

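Query speed is a key performance metric in online analytical processing. Query speed can be improved by generating all possible aggregates in advance, but such full materialization comes at the cost of storage space. Based on the data distribution characteristics of data cubes and on compression techniques, this paper describes how to perform full materialization while saving as much storage space as possible, and then studies query processing on that basis, with the aim of achieving minimal storage space together with good query speed.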

19.

Clustered Chain Path Index for XML Document: Efficiently Processing Branch Queries

Hongqiang Wang Jianzhong Li Hongzhi Wang 《World Wide Web》2008,11(1):153-168

Branch query processing is a core operation of XML query processing. In recent years, a number of stack-based twig join algorithms have been proposed to process twig queries based on a tag stream index. However, in a tag stream index, each element is labeled separately without considering the similarity among elements. Moreover, algorithms based on a tag stream index perform inefficiently on large documents. This paper proposes a novel index, named Clustered Chain Path Index (CCPI), based on a novel labeling scheme. This index provides efficient support for processing branch queries. It also has the same cardinality as the 1-index on tree-structured XML documents. Based on CCPI, two algorithms, KMP-Match-Path and Related-Path-Segment-Join, are proposed to process queries efficiently. Analysis and experimental results show that the proposed query processing algorithms based on CCPI outperform other algorithms and have good scalability. This paper is partially supported by Natural Science Foundation of Heilongjiang Province, Grant No. zjg03-05 and National Natural Science Foundation of China, Grant No. 60473075 and Key Program of the National Natural Science Foundation of China, Grant No. 60533110.

20.

Efficient Optimization of Multiple Subspace Skyline Queries

黄震华 郭建奎 孙圣力 汪卫 《计算机科学技术学报》2008,23(1):103-111

We present the first efficient, sound, and complete algorithm (i.e., AOMSSQ) for optimizing multiple subspace skyline queries simultaneously in this paper. We first identify three performance problems of the naïve approach (i.e., SUBSKY), which can be used to process an arbitrary single-subspace skyline query. Then we propose a cell-dominance computation algorithm (i.e., CDCA) to efficiently overcome the drawbacks of SUBSKY. Specifically, a novel pruning technique is used in CDCA to dramatically decrease the query time. Finally, based on the CDCA algorithm and the sharing mechanism between subspaces, we present and discuss the AOMSSQ algorithm and prove it sound and complete. We also present detailed theoretical analyses and extensive experiments that demonstrate that our algorithms are both efficient and effective.
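
For readers unfamiliar with the query type, a subspace skyline can be defined in a few lines. The sketch below is a quadratic-time reference implementation of the definition only; it is not SUBSKY, CDCA, or AOMSSQ, and it shares no work between subspaces. It assumes that smaller values are preferred in every dimension; the function names are hypothetical.

```python
def dominates(p, q, dims):
    """p dominates q on dims: no worse everywhere and strictly better somewhere."""
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))


def subspace_skyline(points, dims):
    """Points not dominated by any other point when only dims are compared."""
    return [p for p in points
            if not any(dominates(q, p, dims) for q in points)]


points = [(1, 9, 3), (2, 2, 8), (3, 1, 1), (4, 4, 4)]
print(subspace_skyline(points, dims=(0, 1)))  # compare dimensions 0 and 1 only
print(subspace_skyline(points, dims=(1, 2)))  # compare dimensions 1 and 2 only
```

Each subspace is evaluated independently here, so the dominance tests are repeated for every query; avoiding that repeated work across multiple subspaces is the problem the paper addresses.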