HEDC++: An Extended Histogram Estimator for Data in the Cloud
Citation: Ying-Jie Shi, Xiao-Feng Meng, Fusheng Wang, Yan-Tao Gan. HEDC++: An Extended Histogram Estimator for Data in the Cloud[J]. Journal of Computer Science and Technology, 2013, 28(6): 973-988.
Authors: Ying-Jie Shi, Xiao-Feng Meng, Fusheng Wang, Yan-Tao Gan
Affiliations: [1] School of Information, Renmin University of China, Beijing 100872, China [2] Department of Biomedical Informatics, Emory University, Atlanta 30322, U.S.A. [3] Department of Mathematics and Computer Science, Emory University, Atlanta 30322, U.S.A.
Funding: This research was partially supported by the National Natural Science Foundation of China under Grant Nos. 61070055, 91024032, and 91124001, the Fundamental Research Funds for the Central Universities of China, the Research Funds of Renmin University of China under Grant No. 11XNL010, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010701 and 2013AA013204.
Abstract: With the increasing popularity of cloud-based data management, improving the performance of queries in the cloud is an urgent issue. Summaries of data distribution and statistical information have been commonly used in traditional databases to support query optimization, and histograms are of particular interest. Naturally, histograms could also support query optimization and efficient utilization of computing resources in the cloud: they provide helpful reference information for generating optimal query plans, and basic statistics useful for load balancing of query processing. Since it is too expensive to construct an exact histogram on massive data, building an approximate histogram is a more feasible solution. This problem, however, is challenging in the cloud environment because of the special data organization and processing mode in the cloud. In this paper, we present HEDC++, an extended histogram estimator for data in the cloud, which provides efficient approximation approaches for both equi-width and equi-depth histograms. We design the histogram estimation workflow based on an extended MapReduce framework, and propose novel sampling mechanisms to balance sampling efficiency and estimation accuracy. We experimentally validate our techniques on Hadoop, and the results demonstrate that HEDC++ can provide promising histogram estimates for massive data in the cloud.

Keywords: estimation accuracy  HEDC  histogram  query optimization  statistical information  massive data  data management  data distribution

Keywords:histogram estimate  sampling  cloud computing  MapReduce
This article is indexed in databases including VIP (维普) and SpringerLink.