Similar Documents
Found 20 similar documents (search time: 62 ms)
1.
Parallelizing the Data Cube   (Cited by: 1; 0 self-citations, 1 by others)
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.
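The load-balancing idea here, assigning coarse subcube tasks to processors so that estimated costs even out, can be illustrated with a simple greedy scheduler. This is only a sketch with hypothetical group-by cost estimates, not the paper's actual tree-partitioning algorithm:

```python
# Illustrative only: greedy assignment of group-by tasks to processors,
# placing the most expensive subcubes first on the least-loaded processor.
import heapq

def assign_subcubes(groupby_costs, p):
    """groupby_costs: group-by (tuple of dimensions) -> estimated cost.
    Returns a dict: processor id -> list of assigned group-bys."""
    heap = [(0.0, proc) for proc in range(p)]          # (current load, processor)
    heapq.heapify(heap)
    assignment = {proc: [] for proc in range(p)}
    for gb, cost in sorted(groupby_costs.items(), key=lambda kv: -kv[1]):
        load, proc = heapq.heappop(heap)
        assignment[proc].append(gb)
        heapq.heappush(heap, (load + cost, proc))
    return assignment

# Hypothetical estimated group-by sizes for a three-dimensional cube.
costs = {('A', 'B', 'C'): 100, ('A', 'B'): 40, ('A', 'C'): 35,
         ('B', 'C'): 30, ('A',): 10, ('B',): 8, ('C',): 7, (): 1}
print(assign_subcubes(costs, 4))
```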

2.
Graphics processing units (GPUs) have an SIMD architecture and have been widely used recently as powerful general-purpose co-processors for the CPU. In this paper, we investigate efficient GPU-based data cubing because the most frequent operation in data cube computation is aggregation, which is an expensive operation well suited for SIMD parallel processors. H-tree is a hyper-linked tree structure used in both top-k H-cubing and the stream cube. Fast H-tree construction, update and real-time query response are crucial in many OLAP applications. We design highly efficient GPU-based parallel algorithms for these H-tree based data cube operations. This has been made possible by adopting effective techniques, such as parallel primitives for segmented data and efficient memory access patterns, to achieve load balance on the GPU while hiding memory access latency. As a result, our GPU algorithms can often achieve more than an order of magnitude speedup when compared with their sequential counterparts on a single CPU. To the best of our knowledge, this is the first attempt to develop parallel data cubing algorithms on graphics processors.
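The aggregation primitive mentioned here, reduction over segmented data, is easy to sketch on the CPU; a GPU version would express the same segmented reduction with parallel primitives. The segment layout below is a made-up example:

```python
import numpy as np

# CPU sketch of a segmented sum: values that belong to the same segment
# (e.g. the same H-tree node or group-by cell) are reduced together.
values = np.array([5, 3, 2, 7, 1, 4, 6], dtype=np.int64)
seg_starts = np.array([0, 2, 5])       # each segment begins at these offsets

segment_sums = np.add.reduceat(values, seg_starts)
print(segment_sums)                     # [ 8 10 10]
```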

3.
This paper presents a parallel volume rendering algorithm that can render a 256×256×225 voxel medical data set at over 15 Hz and a 512×512×334 voxel data set at over 7 Hz on a 32-processor Silicon Graphics Challenge. The algorithm achieves these results by minimizing each of the three components of execution time: computation time, synchronization time, and data communication time. Computation time is low because the parallel algorithm is based on the recently-reported shear-warp serial volume rendering algorithm which is over five times faster than previous serial algorithms. The algorithm uses run-length encoding to exploit coherence and an efficient volume traversal to reduce overhead. Synchronization time is minimized by using dynamic load balancing and a task partition that minimizes synchronization events. Data communication costs are low because the algorithm is implemented for shared-memory multiprocessors, a class of machines with hardware support for low-latency fine-grain communication and hardware caching to hide latency. We draw two conclusions from our implementation. First, we find that on shared-memory architectures data redistribution and communication costs do not dominate rendering time. Second, we find that cache locality requirements impose a limit on parallelism in volume rendering algorithms. Specifically, our results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets.

4.
The volume of XML data has become enormous and still grows very quickly, as much data is encoded in XML by virtue of its simplicity and extensibility. While a tree labeling algorithm has a crucial role in XML query processing, conventional algorithms are all sequential, so they fail to label a large volume of XML data in a timely manner. To address this issue, we devise parallel tree labeling algorithms for massive XML data. Specifically, we focus on how to efficiently label a single large XML file in parallel. We first propose parallel versions of two prominent tree labeling schemes based on the MapReduce framework. We then present techniques for runtime workload balancing and data repartition to solve performance issues caused by data skewness and the inherent limitations of MapReduce. Through extensive experiments with synthetic and real-world datasets on 15 nodes, we show that our parallel labeling algorithms are up to 17 times faster than conventional algorithms while remaining robust against data skewness.

5.
A k-tree core of a tree network is a subtree with exactly k leaves that minimizes the total distance from vertices to the subtree. A k-tree center of a tree network is a subtree with exactly k leaves that minimizes the distance from the farthest vertex to the subtree. In this paper, two efficient parallel algorithms are proposed for finding a k-tree core and a k-tree center of a tree network, respectively. Both proposed algorithms run on the EREW PRAM in O(log n log n) time using O(n) work (time-processor product). Besides being efficient on the EREW PRAM, in the sequential case, our algorithm for finding a k-tree core of a tree network also improves upon the two previously proposed algorithms.
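For reference, the quantity a k-tree core minimizes, the total distance from all vertices to a candidate subtree, can be evaluated with a multi-source BFS; the example tree below is hypothetical and the code does not implement the paper's parallel algorithm:

```python
from collections import deque

def total_distance_to_subtree(adj, subtree_nodes):
    """Sum, over all vertices, of the hop distance to the nearest vertex of
    the candidate subtree (multi-source BFS on an unweighted tree)."""
    dist = {v: 0 for v in subtree_nodes}
    q = deque(subtree_nodes)
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return sum(dist.values())

# Hypothetical tree: path 0-1-2-3-4 with an extra leaf 5 attached to node 2.
adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
print(total_distance_to_subtree(adj, {1, 2, 3}))   # vertices 0, 4, 5 are 1 hop away -> 3
```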

6.
An Effective Method for Parallel Processing of Multidimensional Join and Aggregation Operations   (Cited by: 1; 0 self-citations, 1 by others)
With the maturation of parallel computing algorithms and of inexpensive yet powerful multiprocessor systems, using multiprocessor systems to process the join and aggregation operations of multidimensional data warehouses in parallel has become the preferred technique for improving OLAP query performance. This paper proposes PJAMDDC (parallel join and aggregation for multi-dimensional data cube), a parallel algorithm that reduces the cost of join and aggregation operations. The algorithm takes full account of the storage mechanism of the multidimensional data cube and the structural characteristics of distributed multiprocessor systems. Building on the search-lattice logical structure used for aggregate computation of the data cube, it applies hierarchy combined surrogates of the multidimensional data warehouse and weights the cube's search lattice, so that cube data are distributed across the processors in a near-optimal way. Partitioning the multidimensional data in this manner improves the efficiency of parallel join and aggregation processing. Experimental evaluation shows that PJAMDDC is effective for parallel processing of join and aggregation operations in multidimensional data warehouses.

7.
The cgmCUBE project: Optimizing parallel data cube generation for ROLAP   (Cited by: 5; 0 self-citations, 5 by others)
On-line Analytical Processing (OLAP) has become one of the most powerful and prominent technologies for knowledge discovery in VLDB (Very Large Database) environments. Central to the OLAP paradigm is the data cube, a multi-dimensional hierarchy of aggregate values that provides a rich analytical model for decision support. Various sequential algorithms for the efficient generation of the data cube have appeared in the literature. However, given the size of contemporary data warehousing repositories, multi-processor solutions are crucial for the massive computational demands of current and future OLAP systems. In this paper we discuss the cgmCUBE Project, a multi-year effort to design and implement a multi-processor platform for data cube generation that targets the relational database model (ROLAP). More specifically, we discuss new algorithmic and system optimizations relating to (1) a thorough optimization of the underlying sequential cube construction method and (2) a detailed and carefully engineered cost model for improved parallel load balancing and faster sequential cube construction. These optimizations were key in allowing us to build a prototype that is able to produce data cube output at a rate of over one TeraByte per hour. Research supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

8.
In this paper, we analyze four parallel sorting algorithms (bitonic, column, radix, and sample sort) with the LogP model. LogP characterizes the performance of modern parallel machines with a small set of parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). We develop implementations of these algorithms in Split-C, a parallel extension to C, and compare the performance predicted by LogP to actual performance on a CM-5 of 32 to 512 processors for a range of problem sizes. We evaluate the robustness of the algorithms by varying the distribution and ordering of the key values. We also briefly examine the sensitivity of the algorithms to the communication parameters. We show that the LogP model is a valuable guide in the development of parallel algorithms and a good predictor of implementation performance. The model encourages the use of data layouts which minimize communication and balanced communication schedules which avoid contention. With an empirical model of local processor performance, LogP predictions closely match observed execution times on uniformly distributed keys across a broad range of problem and machine sizes. We find that communication performance is oblivious to the distribution of the key values, whereas the local processor performance is not; some communication phases are sensitive to the ordering of keys due to contention. Finally, our analysis shows that overhead is the most critical communication parameter in the sorting algorithms.
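As a back-of-the-envelope illustration of how the four parameters combine, the time for one processor to inject m small messages is often estimated as consecutive sends spaced max(g, o) apart, plus send overhead, latency, and receive overhead for the last message. The parameter values below are hypothetical, not measurements from the paper:

```python
def logp_send_time(m, L, o, g):
    """Rough LogP estimate for one processor sending m small messages to
    distinct destinations: sends are spaced max(g, o) apart; the final
    message still needs o (send) + L (latency) + o (receive)."""
    return (m - 1) * max(g, o) + 2 * o + L

# Hypothetical parameter values in microseconds.
print(logp_send_time(m=32, L=6.0, o=2.0, g=4.0))   # 134.0
```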

9.
Gao Lei, Hu Yupeng. Computer Science, 2017, 44(Z6): 300-304
To address the high latency of existing data aggregation algorithms for wireless sensor networks, this paper studies the minimum-latency data aggregation tree and transmission scheduling problems. A degree-constrained aggregation tree construction algorithm (DCAT) is proposed. The algorithm traverses the graph in BFS order; when visiting each node, it determines the set of potential parents by identifying which neighbors are closer to the sink, and then selects the potential parent with the smallest degree in the graph as the parent of the node being visited. In addition, to aggregate data efficiently on a given aggregation tree, two new greedy TDMA transmission scheduling algorithms, WIRES-G and DCAT-Greedy, are proposed. Using randomly generated sensor networks of various sizes, the performance of the proposed methods is evaluated comprehensively against state-of-the-art algorithms. The results show that, compared with the current best algorithms, combining the proposed scheduling algorithms with the proposed aggregation tree construction algorithm significantly reduces data aggregation latency.
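A minimal sketch of the construction rule described above: run a BFS from the sink, treat neighbors that are one hop closer to the sink as potential parents, and pick the potential parent with the smallest degree in the graph. The adjacency-list representation and the tie-breaking rule are assumptions of the sketch:

```python
from collections import deque

def build_dcat_tree(adj, sink):
    """adj: node -> list of neighbors. Returns node -> chosen parent,
    following the degree-constrained rule sketched in the abstract."""
    dist = {sink: 0}
    order = deque([sink])
    visited = []
    while order:                          # BFS from the sink
        u = order.popleft()
        visited.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                order.append(w)
    parent = {}
    for u in visited:
        if u == sink:
            continue
        closer = [w for w in adj[u] if dist[w] == dist[u] - 1]   # potential parents
        parent[u] = min(closer, key=lambda w: len(adj[w]))       # smallest degree wins
    return parent

# Hypothetical 5-node network with sink 0.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
print(build_dcat_tree(adj, sink=0))       # {1: 0, 2: 0, 3: 1, 4: 2}
```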

10.
In recent years, the use of multi-view data has attracted much attention, resulting in many multi-view batch learning algorithms. However, these algorithms prove expensive in terms of training time and memory when used on incremental data. In this paper, we propose Multi-view Incremental Discriminant Analysis (MvIDA), which updates the trained model to incorporate new data samples. MvIDA requires only the old model and newly added data to update the model. Depending on the nature of the increments, MvIDA is presented as two cases, sequential MvIDA and chunk MvIDA. We have compared the proposed method against the batch Multi-view Discriminant Analysis (MvDA) for its discriminability, order independence, the effect of the number of views, training time, and memory requirements. We have also compared our method with single-view Incremental Linear Discriminant Analysis (ILDA) for accuracy and training time. The experiments are conducted on four datasets with a wide range of dimensions per view. The results show that through order independence and faster construction of the optimal discriminant subspace, MvIDA addresses the issues faced by the batch multi-view algorithms in the incremental setting.

11.
This paper proposes using the dimension hierarchy aggregate tree (DHA-Tree) of a cube to maintain an aggregate cube incrementally. When data updates such as insertions and deletions occur in the dimension-hierarchy aggregate cube, the dimension-hierarchy prefixes in the DHA-Tree are exploited to propagate the difference between the old and new values bottom-up, incrementally updating all ancestor nodes affected by the updated node. When new dimension data are inserted, the aggregate cube can be updated incrementally without being rebuilt, which reduces cube update time. A performance analysis and comparison between the DHA-Tree-based aggregate cube and the traditional cube shows that the proposed incremental update algorithm performs best.
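The bottom-up propagation described here can be sketched as follows: when a base cell changes, the difference between the new and old values is added to every coarser aggregate that covers it, so no aggregate is recomputed from scratch. The flat dictionary of cells below is an illustration, not the paper's DHA-Tree structure:

```python
from itertools import product

def ancestor_cells(cell):
    """All coarser cells obtained by replacing any subset of coordinates
    with the wildcard '*' (the cell itself is excluded)."""
    for combo in product(*[(v, '*') for v in cell]):
        if combo != cell:
            yield combo

def apply_delta(aggregates, cell, new_value):
    """Incremental update: propagate (new - old) upward to all ancestors."""
    delta = new_value - aggregates.get(cell, 0)
    aggregates[cell] = new_value
    for anc in ancestor_cells(cell):
        aggregates[anc] = aggregates.get(anc, 0) + delta

# Hypothetical two-dimensional aggregate cube keyed by (region, month).
aggs = {('east', 'jan'): 10, ('east', '*'): 10, ('*', 'jan'): 10, ('*', '*'): 10}
apply_delta(aggs, ('east', 'jan'), 14)    # a delta of +4 flows to every ancestor
print(aggs[('*', '*')])                   # 14
```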

12.
Data communication patterns in wireless sensor networks are a current research focus. To address the high latency of existing data aggregation algorithms for wireless sensor networks, this paper studies the minimum-latency data aggregation tree and transmission scheduling problems. A degree-constrained aggregation tree construction algorithm (DCAT) is proposed. The algorithm traverses the graph in BFS order; when visiting each node, it determines the set of potential parents by identifying which neighbors are closer to the sink, and then selects the potential parent with the smallest degree in the graph as the parent of the node being visited. In addition, to aggregate data efficiently on a given aggregation tree, two new greedy TDMA transmission scheduling algorithms, WIRES-G and DCAT-Greedy, are proposed. Using randomly generated sensor networks of various sizes, the performance of the proposed methods is evaluated comprehensively against state-of-the-art algorithms. The results show that, compared with the current best algorithms, combining the proposed scheduling algorithms with the proposed tree construction algorithm significantly reduces data aggregation latency.

13.
This article explores data cube computation using the MapReduce parallel computing framework. Data cube computation faces two key problems: computation time and cube volume, and both grow exponentially as the number of dimensions increases. Although MapReduce is an excellent parallel computing framework, its partitioning algorithm handles data skew poorly, so some tasks run far longer than others and delay the completion of the whole job. This paper optimizes data partitioning through data sampling, and experimental results show that the performance of data cube computation improves significantly. To cope with the excessive volume of the data cube, the final results are written in the Reduce phase to the NoSQL database HBase, which scales out easily and also facilitates subsequent queries over the data cube.
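The skew mitigation described here can be sketched as a sample-driven range partitioner: sample the keys, take quantiles of the sample as split points, and route each key to the reducer that owns its range, so the boundaries follow the observed key distribution. This is a plain-Python illustration with made-up keys, not an actual Hadoop partitioner:

```python
import bisect
import random

random.seed(7)    # deterministic example

def sample_split_points(keys, num_reducers, sample_size=1000):
    """Sample the keys and return num_reducers - 1 quantile boundaries."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]

def partition(key, split_points):
    """Reducer index for a key: its position among the sampled boundaries."""
    return bisect.bisect_right(split_points, key)

# Hypothetical skewed keys: 90% of the records fall in a narrow range.
keys = [random.randint(0, 9) for _ in range(9000)] + \
       [random.randint(10, 9999) for _ in range(1000)]
splits = sample_split_points(keys, num_reducers=4)
loads = [0] * 4
for k in keys:
    loads[partition(k, splits)] += 1
print(splits, loads)   # boundaries follow the key distribution, evening out reducer loads
```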

14.
The ability to model the temporal dimension is essential to many applications. Furthermore, the rate of increase in database size and stringency of response time requirements has out-paced advancements in processor and mass storage technology, leading to the need for parallel temporal database management systems. In this paper, we introduce a variety of parallel temporal aggregation algorithms for the shared-nothing architecture; these algorithms are based on the sequential Aggregation Tree algorithm. We are particularly interested in developing parallel algorithms that can maximally exploit available memory to quickly compute large-scale temporal aggregates without intermediate disk writes and reads. Via an empirical study, we found that the number of processing nodes, the partitioning of the data, the placement of results, and the degree of data reduction effected by the aggregation impacted the performance of the algorithms. For distributed result placement, we discovered that Greedy Time Division Merge was the obvious choice. For centralized results and high data reduction, Pairwise Merge was preferred for a large number of processing nodes; for low data reduction, it only performed well up to 32 nodes. This led us to a centralized variant of Greedy Time Division Merge which was best for the remaining cases. We present a cost model that closely predicts the running time of Greedy Time Division Merge.

15.
By dividing the dimensions of a data cube into partitioning and non-partitioning dimensions, the data in each view are split into two parts, stored in a relation and in a multidimensional array respectively. For this hybrid storage structure we design a data cube generation algorithm that combines the advantages of pipelined aggregation and multidimensional-array aggregation, greatly reducing the number of pipelines and the required storage space and speeding up computation. Experiments on a real data set show that the algorithm is well suited to computing high-dimensional data cubes.

16.
In this paper, we propose a new parallel algorithm, named PMSPX, which mines maximal frequent sequences by using multiple samples to exclude infrequent candidates effectively. A frequent sequence is maximal if none of its supersequences is frequent. Unlike the traditional single-sample methods developed for mining frequent itemsets, PMSPX uses multiple samples. Thus, it can avoid or alleviate some problems inherent in the single-sample methods. We theoretically analyzed how to increase the minimum support level to prevent misestimating infrequent candidates as frequent in the mining of samples. PMSPX is a parallel version of our sequential MSPX algorithm, and it is developed on a cluster of workstations. In PMSPX, each processing node uses MSPX to find a candidate set of local maximal frequent sequences first, independently from other processing nodes. Then, a top-down search is performed, starting with all the candidates, in a synchronous manner to identify real maximal frequent sequences. This asynchronous local mining followed by synchronous global mining approach minimizes the synchronization and communication among the processing nodes. Three database partitioning methods are proposed to distribute the database across the processing nodes, so that their workloads are balanced and the data skewness of the whole database is preserved in the data partition of each node. A comprehensive analysis was performed on PMSPX and existing parallel sequence mining algorithms, and extensive experiments were conducted on PMSPX. PMSPX demonstrates very good speedup and scaleup properties. It also requires less communication and synchronization than other parallel algorithms.

17.
Pal Singh, J., Gupta, A., Levoy, M. Computer, 1994, 27(7): 45-55
Recently, a new class of scalable, shared-address-space multiprocessors has emerged. Like message-passing machines, these multiprocessors have a distributed interconnection network and physically distributed main memory. However, they provide hardware support for efficient implicit communication through a shared address space, and they automatically exploit temporal locality by caching both local and remote data in a processor's hardware cache. In this article, we show that these architectural characteristics make it much easier to obtain very good speedups on the best known visualization algorithms. Simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management. We demonstrate our claims through parallel versions of three state-of-the-art algorithms: a recent hierarchical radiosity algorithm by Hanrahan et al. (1991), a parallelized ray-casting volume renderer by Levoy (1992), and an optimized ray-tracer by Spach and Pulleyblank (1992). We also discuss a new shear-warp volume rendering algorithm that provides the first demonstration of interactive frame rates for a 256×256×256 voxel data set on a general-purpose multiprocessor.

18.
We provide a theoretical analysis of the communication requirements of parallel algorithms for molecular dynamics simulations. We describe two commonly used algorithms, space decomposition and force decomposition, and analyze their communication requirements; each is better in a distinct computation regime. We next introduce a new hybrid algorithm that further reduces communication. We show that the new algorithm is optimal, by providing a matching lower bound.

19.

Community detection (or clustering) in large-scale graphs is an important problem in graph mining. Communities reveal interesting organizational and functional characteristics of a network. The Louvain algorithm is an efficient sequential algorithm for community detection. However, such sequential algorithms fail to scale for emerging large-scale data. Scalable parallel algorithms are necessary to process large graph datasets. In this work, we show a comparative analysis of our different parallel implementations of the Louvain algorithm. We design parallel algorithms for the Louvain method in shared-memory and distributed-memory settings. Developing distributed-memory parallel algorithms is challenging because of inter-process communication and load-balancing issues. We incorporate dynamic load balancing in our final algorithm DPLAL (Distributed Parallel Louvain Algorithm with Load-balancing). DPLAL overcomes the performance bottleneck of the previous algorithms and shows around 12-fold speedup scaling to a larger number of processors. We also compare the performance of our algorithm with some other prominent algorithms in the literature and achieve better or comparable performance. We identify the challenges in developing distributed-memory algorithms and provide an optimized solution, DPLAL, showing a performance analysis of the algorithm on large-scale real-world networks from different domains.
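For context, the quantity that the Louvain local-move phase (sequential or parallel) greedily improves is Newman modularity. A minimal unweighted computation of it on a hypothetical two-triangle graph is sketched below; this is not DPLAL itself:

```python
def modularity(adj, community):
    """Newman modularity of a partition of an unweighted graph:
    sum over communities of internal_edges/m - (community_degree/(2m))^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2     # number of edges
    internal, degree_sum = {}, {}
    for u, nbrs in adj.items():
        c = community[u]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        for w in nbrs:
            if community[w] == c and u < w:             # count each internal edge once
                internal[c] = internal.get(c, 0) + 1
    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Hypothetical graph: two triangles joined by one bridge edge (2-3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(modularity(adj, {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}))  # ~0.357
```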


20.
This paper considers the problem of constructing data aggregation trees in wireless sensor networks (WSNs) for a group of sensor nodes to send collected information to a single sink node. The data aggregation tree contains the sink node, all the source nodes, and some other non-source nodes. Our goal in constructing such a data aggregation tree is to minimize the number of non-source nodes to be included in the tree so as to save energy. We prove that the data aggregation tree problem is NP-hard and then propose an approximation algorithm with a performance ratio of four and a greedy algorithm. We also give a distributed version of the approximation algorithm. Extensive simulations are performed to study the performance of the proposed algorithms. The results show that the proposed algorithms can find a tree that is a good approximation to the optimal tree and have a high degree of scalability.
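One simple greedy of the flavor mentioned above (illustrative only, not necessarily the paper's algorithm with its ratio-four guarantee): grow the tree from the sink and repeatedly attach the source whose shortest path to the current tree adds the fewest non-source relay nodes.

```python
from collections import deque

def greedy_aggregation_tree(adj, sink, sources):
    """Illustrative greedy: repeatedly connect the source whose shortest path
    to the current tree introduces the fewest new non-source relay nodes."""
    tree = {sink}
    remaining = set(sources) - tree
    while remaining:
        best = None
        for s in remaining:
            parent, q, hit = {s: None}, deque([s]), None
            while q and hit is None:                 # BFS until the tree is touched
                u = q.popleft()
                for w in adj[u]:
                    if w not in parent:
                        parent[w] = u
                        if w in tree:
                            hit = w
                            break
                        q.append(w)
            path, v = [], hit
            while v is not None:                     # reconstruct s -> tree path
                path.append(v)
                v = parent[v]
            cost = sum(1 for v in path if v not in sources and v not in tree)
            if best is None or cost < best[0]:
                best = (cost, path)
        tree.update(best[1])
        remaining -= tree
    return tree

# Hypothetical network: sink 0, sources {4, 5}; node 3 becomes the shared relay.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4, 5], 4: [3], 5: [3]}
print(greedy_aggregation_tree(adj, sink=0, sources={4, 5}))   # e.g. {0, 1, 3, 4, 5}
```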
