共查询到20条相似文献,搜索用时 46 毫秒
1.
A flexible processor mapping technique toward data localization for block-cyclic data redistribution
Array redistribution is usually needed for more efficiently executing a data-parallel program on distributed memory multicomputers.
To minimize the redistribution data transfer cost, processor mapping techniques were proposed to reduce the amount of redistributed
data elements. Theses techniques demand that the beginning data elements on a processor not be redistributed in the redistribution.
On the other hand, for satisfying practical computation needs, a programmer may require other data elements to be un-redistributed
(localized) in the redistribution. In this paper, we propose a flexible processor mapping technique for the Block-Cyclic redistribution
to allow the programmer to localize the required data elements in the redistribution. We also present an efficient redistribution
method for the redistribution employing our proposed technique. The data transfer cost reduction and system performance improvement
for the redistributions with data localization are analyzed and presented in our experimental results. 相似文献
2.
Ching-Hsien Hsu Yeh-Ching Chung Don-Lin Yang Chyi-Ren Dow 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(7):743-757
In many scientific applications, array redistribution is usually required to enhance data locality and reduce remote memory access in many parallel programs on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a generalized processor mapping technique to minimize the amount of data exchange for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) array redistribution and vice versa. The main idea of the generalized processor mapping technique is first to develop mapping functions for computing a new rank of each destination processor. Based on the mapping functions, a new logical sequence of destination processors can be derived. The new logical processor sequence is then used to minimize the amount of data exchange in a redistribution. The generalized processor mapping technique can handle array redistribution with arbitrary source and destination processor sets and can be applied to multidimensional array redistribution. We present a theoretical model to analyze the performance improvement of the generalized processor mapping technique. To evaluate the performance of the proposed technique, we have implemented the generalized processor mapping technique on an IBM SP2 parallel machine. The experimental results show that the generalized processor mapping technique can provide performance improvement over a wide range of redistribution problems 相似文献
3.
4.
Array redistribution is usually required for more efficiently executing a data-parallel program on distributed memory multi-computers.
In performing array redistribution using synchronous communication mode, data communications among the processors should be
properly arranged to avoid incurring higher data transfer cost. Some efficient communication scheduling methods for the Block-Cyclic
redistribution have been proposed. On the other hand, the processor mapping technique can help reduce the data transfer cost
of redistribution. To avoid degrading the benefit of data transfer cost reduction, it is needed to construct optimal communication
schedules for the redistribution in which the processor mapping technique is applied. In this paper, we present a unified
approach to constructing optimal communication schedules for the processor mapping technique applied Block-Cyclic redistribution.
The proposed method is founded on the processor mapping technique and can more efficiently construct the required communication
schedules than other optimal scheduling methods. 相似文献
5.
6.
Yeh-Ching Chung Ching-Hsien Hsu Sheng-Wen Bai 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(4):359-377
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a basic-cycle calculation technique to efficiently perform BLOCK-CYCLIC(S) to BLOCK-CYCLIC(t) redistribution. The main idea of the basic-cycle calculation technique is, first, to develop closed forms for computing source/destination processors of some specific array elements in a basic-cycle, which is defined as icm(s,t)/gcd(s,t). These closed forms are then used to efficiently determine the communication sets of a basic-cycle. From the source/destination processor/data sets of a basic-cycle, we can efficiently perform a BLOCK-CYCLIC(s) to BLOCK-CYCLIC(t) redistribution. To evaluate the performance of the basic-cycle calculation technique, we have implemented this technique on an IBM SP2 parallel machine, along with the PITFALLS method and the multiphase method. The cost models for these three methods are also presented. The experimental results show that the basic-cycle calculation technique outperforms the PITFALLS method and the multiphase method for most test samples 相似文献
7.
Array redistribution is usually required to enhance algorithm performance in many parallel programs on distributed memory multicomputers. Since it is performed at run-time, there is a performance tradeoff between the efficiency of new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present efficient algorithms for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) and BLOCK-CYCLIC(r) to BLOCK-CYCLIC(kr) redistribution. The most significant improvement of our methods is that a processor does not need to construct the send/receive data sets for a redistribution. Based on the packing/unpacking information that derived from the BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) redistribution and vice versa, a processor can pack/unpack array elements into (from) messages directly. To evaluate the performance of our methods, we have implemented our methods along with the Thakur's methods and the PITFALLS method on an IBM SP2 parallel machine. The experimental results show that our algorithms outperform the Thakur's methods and the PITFALLS method for all test samples. This result encourages us to use the proposed algorithms for array redistribution. 相似文献
8.
Ching-Hsien Hsu Ming-Hao Chen Chao-Tung Yang Kuan-Ching Li 《Parallel and Distributed Systems, IEEE Transactions on》2006,17(11):1226-1241
Dynamic data redistribution is used to enhance data locality and algorithm performance by reducing interprocessor communication in many parallel scientific applications on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a processor replacement scheme to minimize the cost of interprocessor data exchange during runtime. The main idea of the proposed technique is to develop a replacement function for reordering logical processors in the destination phase. Based on the replacement function, a realigned sequence of destination processors can be derived and is then used to perform data decomposition in the receiving phase. Together with local matrix and compressed CRS vectors transposition schemes, the interprocessor communication can be eliminated during runtime. A significant improvement of this approach is that the realignment of data can be performed without interprocessor communication for special cases. The second contribution of the present technique is that the complicated communication sets generation could be simplified by applying local matrix transposition. Consequently, the indexing cost could be reduced significantly. The proposed techniques can be applied in both dense and sparse applications. A generalized symmetric redistribution algorithm is also presented in this work. To analyze the efficiency of the proposed technique, the theoretical analysis proves that up to (p-1)/p data transmission cost can be saved. For general cases, the symmetric redistribution algorithm saves 1/p communication overheads compared with the traditional method. Experimental results also show that the proposed techniques provide superior performance in most data redistribution instances 相似文献
9.
In many scientific applications, dynamic data redistribution is used to enhance algorithm performance and achieve data locality in parallel programs on distributed memory multi-computers. In this paper, we present a new method, Compressed Diagonals Remapping (CDR) technique aims to the efficiency of runtime data redistribution on banded sparse matrices. The main idea of the proposed technique is first to compress the source matrix into a Compressed Diagonal Matrix (CDM) form. Based on the compressed diagonal matrix, a one-dimensional local and global index transformation method can be applied to carry out data redistribution on the compressed diagonal matrix. This process is identical to redistribute data in the two-dimensional banded sparse matrix. The CDR technique uses an efficient one-dimensional indexing scheme to perform data redistribution on banded sparse matrix. A significant improvement of this approach is that a processor does not need to determine the complicated sending or receiving data sets for dynamic data redistribution. The indexing cost is reduced significantly. The second advantage of the present techniques is the achievement of optimal packing/unpacking stages consequent upon the consecutive attribute of column elements in a compressed diagonal matrix. Another contribution of our methods is the ability to handle sparse matrix redistribution under two disjoint processor grids in the source and destination phases. A theoretical model to analyze the performance of the proposed technique is also presented in this paper. To evaluate the performance of our methods, we have implemented the present techniques on an IBM SP2 parallel machine along with the v2m algorithm and a dense redistribution strategy. The experimental results show that our technique provides significant improvement for runtime data redistribution of banded sparse matrices in most test samples. 相似文献
10.
Modifying the data distribution over the course of a program to adapt to variations in the data access patterns may leads to significant computational benefits in many scientific applications. Therefore, dynamic realignment of data is used to enhance algorithm performance in parallel programs on distributed memory machines. This paper presents a new method aims to the efficiency of block-cyclic data realignment of sparse matrix. The main idea of the proposed technique is first todevelop closed forms for generating the Vector Index Set (VIS) of each source/destination processor. Based on the vector index set and the nonzero structure of sparse matrix, two efficient algorithms,vector2message (v2m) and message2vector (m2v) can be derived. The proposed technique uses v2m to extract nonzero elements from source compressed structures and packs them into messages in the source stage; and uses m2v to unpack each received messages and construct the destination matrix in the destination stage. A significant improvement of this approach is that a processor does not need to determine the complicated sending or receiving data sets for dynamic data redistribution. The indexing cost is reduced obviously. The second advantage of the present techniques is the achievement of optimal packing/unpacking stages consequent upon the informative VIS tables. Another contribution of our methods is the ability to handle sparse matrix redistribution under two disjoint processor grids in the source and destination phases. A theoretical model to analyze the performance of the proposed technique is also presented in this work. To evaluate the performance of our methods, we have implemented the present algorithms on an IBM SP2 parallel machine along with the Histogram method and a dense redistribution strategy. The experimental results show that our technique provides significant improvement for runtime data redistribution of sparse matrices in most test samples. 相似文献
11.
The design of specialized processing array architectures, capable of executing any given arbitrary algorithm, is proposed. An approach is adopted in which the algorithm is first represented in the form of a dataflow graph and then mapped onto the specialized processor array. The processors in this array execute the operations included in the corresponding nodes (or subsets of nodes) of the dataflow graph, while regular interconnections of these elements serve as edges of the graph. To speed up the execution, the proposed array allows the generation of computation fronts and their cancellation at a later time, depending on the arriving data operands; thus it is called a data-driven array. The structure of the basic cell and its programming are examined. Some design details are presented for two selected blocks, the instruction memory and the flag array. A scheme for mapping a dataflow graph (program) onto a hexagonally connected array is described and analyzed. Two distinct performance measures-mapping efficiency and array utilization-and some performance results are discussed 相似文献
12.
Experience with a Hybrid Processor: K-Means Clustering 总被引:2,自引:0,他引:2
Maya Gokhale Jan Frigo Kevin Mccabe James Theiler Christophe Wolinski Dominique Lavenier 《The Journal of supercomputing》2003,26(2):131-148
We discuss hardware/software co-processing on a hybrid processor for a compute- and data-intensive multispectral imaging algorithm, k-means clustering. The experiments are performed on two models of the Altera Excalibur board, the first using the soft IP core 32-bit NIOS 1.1 RISC processor, and the second with the hard IP core ARM processor. In our experiments, we compare performance of the sequential k-means algorithm with three different accelerated versions. We consider granularity and synchronization issues when mapping an algorithm to a hybrid processor. Our results show that speedup of 11.8X is achieved by migrating computation to the Excalibur ARM hardware/software as compared to software only on a Gigahertz Pentium III. Speedup on the Excalibur NIOS is limited by the communication cost of transferring data from external memory through the processor to the customized circuits. This limitation is overcome on the Excalibur ARM, in which dual-port memories, accessible to both the processor and configurable logic, have the biggest performance impact of all the techniques studied. 相似文献
13.
Neungsoo Park Prasanna V.K. Raghavendra C.S. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(12):1217-1240
Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercomputers. In this paper, we present an efficient algorithm for array redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors. The algorithm reduces the overall time for communication by considering the data transfer, communication schedule, and index computation costs. The proposed algorithm is based on a generalized circulant matrix formalism. Our algorithm generates a schedule that minimizes the number of communication steps and eliminates node contention in each communication step. The network bandwidth is fully utilized by ensuring that equal-sized messages are transferred in each communication step. Furthermore, the time to compute the schedule and the index sets is significantly smaller. It takes O(max(P, Q)) time and is less than 1 percent of the data transfer time. In comparison, the schedule computation time using the state-of-the-art scheme (which is based on the bipartite matching scheme) is 10 to 50 percent of the data transfer time for similar problem sizes. Therefore, our proposed algorithm is suitable for run-time array redistribution. To evaluate the performance of our scheme, we have implemented the algorithm using C and MPI on an IBM SP2. Results show that our algorithm performs better than the previous algorithms with respect to the total redistribution time, which includes the time for data transfer, schedule, and index computation 相似文献
14.
《Journal of Parallel and Distributed Computing》2000,60(3):297-319
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The parameters of the performance for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of optimizing throughput with a latency constraint. The problem formulation uses a general and realistic model of intertask communication and addresses the entire problem of mapping, which includes clustering tasks into modules, assigning of processors to modules, and possible replicating of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran. 相似文献
15.
视觉跟随是机器人领域中一个比较重要的部分,可以应用在仓储搬运、安防、军事等多种领域。由于传统算法存在当背景比较复杂的情况下无法有效跟踪目标、对跟踪目标和外部环境的分辨率要求高所以只能进行辅助跟踪、计算量大无法满足实时性要求等问题。本文采用KCF算法设计,设计了基于ROS的移动机器人视觉跟随系统,利用循环矩阵在傅里叶空间可对角化的性质,从而使得矩阵运算被转化成元素的点乘,减少了计算量从而提高了运算速度,满足了算法的实时性要求。经过实验和数据分析,本移动机器人可以实时有效的跟随指定目标实现视觉跟随功能。 相似文献
16.
《Journal of Parallel and Distributed Computing》2000,60(2):189-216
This paper presents compilation techniques used to compress holes, which are caused by the nonunit alignment stride in a two-level data-processor mapping. Holes are the memory locations mapped by useless template cells. To fully utilize the memory space, memory holes should be removed. In a two-level data-processor mapping, there is a repetitive pattern for array elements mapped onto processors. We classify blocks into classes and use a class table to record the distribution of each class in the first repetitive data distribution pattern. Similarly, data distribution on a processor also has a repetitive pattern. We use a compression table to record the distribution of each block in the first repetitive data distribution pattern on a processor. By using a class table and a compression table, hole compression can be easily and efficiently achieved. Compressing holes can save memory usage, improve spatial locality and further improve system performance. The proposed method is efficient, stable, and easy to implement. The experimental results do confirm the advantages of our proposed method over existing methods. 相似文献
17.
1.引言目前分布式处理系统的并行编译技术还处于研究阶段,对于实际应用程序还难以获得高性能。在程序中充分发掘并行性和极小化通信的开销是需要协同考虑的两个方面。考虑全局数据分布和计算分割时,在应用程序中仍然可能存在这样一些情况:在程序中按全局优化数据分布策略,某数组采用一种数据分布方式会取得好的效果,但对某些循环而言,该数 相似文献
18.
19.
Two techniques to perform an irregularly structured Gröbner basis computation (a basic method used in symbolic polynomial manipulation) on distributed memory machines are developed. The first technique, based on relaxation of dependencies present in the sequential computation, exploits coarse-grain parallelism. In this so-called relaxation approach, at every step, each processor reduces a local pair if available, communicates the result and status information from other processors, and updates the local set of pairs and basis. The basis is replicated on each processor while the set of pairs is distributed across the processors. The computation terminates when no pairs are left to be reduced on each processor. A relaxation algorithm based on this approach, along with its experimental results, are provided. The other technique, named quasi-barrier, is developed to enhance the performance of the relaxation algorithm. Using this technique, load balance and performance can be improved by synchronizing p processors when a fraction of the active tasks are completed. The performance enhancement is significant for large numbers of processors when the distribution of pair reduction times is close to exponential. The experimental results obtained on Intel Paragon and IBM SP2 demonstrate the effectiveness of these techniques. 相似文献
20.
Optimizations for Efficient Array Redistribution on Distributed Memory Multicomputers 总被引:2,自引:0,他引:2
Shankar Ramaswamy Barbara Simons Prithviraj Banerjee 《Journal of Parallel and Distributed Computing》1996,38(2):217
Appropriate data distribution has been found to be critical for obtaining good performance on distributed memory multicomputers such as the Thinking Machines CM-5, Intel Paragon, and IBM SP-1/SP-2. It has also been found that some programs need to change their distributions during execution for better performance (redistribution). This work focuses on automatically generating efficient routines for redistribution. We present a new mathematical representation for regular distributions called FALLS and then discuss algorithms for redistribution based on this representation. One of the significant contributions of this work is being able to handle arbitrary source and target processor sets while performing redistribution. Another important contribution is the ability to handle an arbitrary number of dimensions for the array involved in the redistribution in a scalable manner. Our implementation of these techniques is based on the MPI communication library. The results presented show the efficiency and scalability of our redistribution algorithm. 相似文献