首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
2.
3.
Automatic data allocation to minimize communication on SIMD machines   总被引:1,自引:0,他引:1  
Straightforward compilation of array operations onto massively parallel SIMD machines results in a significant amount of interprocessor data motion. Careful allocation of data across the processors eliminates much of this interprocessor data motion. Researchers are working on extending programming languages to include user directives for specifying good data allocation. Our focus is to automate the data allocation through compiler techniques to achieve portability without sacrificing efficiency. These techniques can be used to fully automate the data allocation process or can be integrated with alignment directives.We present here a complete compiler algorithm for the automatic layout of data to minimize interprocessor data motion. Arrays are aligned by mapping them onto the processors based on their usage. Arrays may be mapped differently in different sections of the program, eliminating much of the interprocessor data motion resulting from a static mapping of arrays. We describe an integrated technique for determining the alignment of arrays locally within regions of the program and minimizing communication globally among these regions. This technique starts with the alignments specified by the directives, if any, and determines the alignment for the remaining arrays.The algorithms proposed in this paper were used in the SIMD compilers at Compass, Inc. Preliminary results from the initial implementation of the data optimization techniques described here suggest a significant decrease of the interprocessor data motion. More analysis is required to better understand the range of expected gains and the conditions under which those gains are achieved.The research described in this paper was supported in part by the Defense Advanced Research Projects Agency under contracts N00014-87K-0825, F19628-92-C-0045, and N00014-91-J-1698 and in part by a National Science Foundation Presidential Young Investigator Award, grant MIP-8657531, with matching funds from General Electric Corporation, IBM Corporation, and AT&T.  相似文献   

4.
PeiZong Lee 《Parallel Computing》1995,21(12):1895-1923
It is widely accepted that distributed memory parallel computers will play an important role in solving computation-intensive problems. However, the design of an algorithm in a distributed memory system is time-consuming and error-prone, because a programmer is forced to manage both parallelism and communication. In this paper, we present techniques for compiling programs on distributed memory parallel computers. We will study the storage management of data arrays and the execution schedule arrangement of Do-loop programs on distributed memory parallel computers. First, we introduce formulas for representing data distribution of specific data arrays across processors. Then, we define communication cost for some message-passing communication operations. Next, we derive a dynamic programming algorithm for data distribution. After that, we show how to improve the communication time by pipelining data, and illustrate how to use data-dependence information for pipelining data. Jacobi's iterative algorithm and the Gauss elimination algorithm for linear systems are used to illustrate our method. We also present experimental results on a 32-node nCUBE-2 computer.  相似文献   

5.
一种面向分布主存多处理机的有效数据分布方法   总被引:1,自引:0,他引:1       下载免费PDF全文
本文针对分布主存多处理机中的数据分布问题,在程序已经过并行性分析的基础之上,提出了一种基于数据变换技术的有效数据分布方法。该方法能对多个嵌套循环中具有一般仿射数组下标的任意维数组进行有效的数据分布,并且该方法还考虑了偏移常量的对准问题,从而能使得数据通信量尽量小。实验结果表明了该方法的有效性。  相似文献   

6.
本文分析了面向分布存储SIMD/MIMD并行机的并行程序的优化数据安放问题,在FORALL程序模型和MESH通信模型之上,研究了数据分解过程中减少通信代价的优化要求.我们使用维偏好图描述并行数组之间的对准需求,通过消除维偏好图中的冲突,可得到维对准图.一个维对准图就对应一个数据安放方案.维对准图的总代价越大,对应的通信代价就越小.文中给出了求最大代价维对准目的一个近似算法.  相似文献   

7.
In this paper, we propose a new automatic data alignment model called segmented alignment. The conventional data alignment model, such as that used in High-Performance Fortran (HPF), aligns arrays with the whole index domain. The principle of our proposed segmented alignment is to allow alignment relations within delimited index domains. We first provide motivating examples to illustrate how code fragments of HPF with EOSHIFT or CSHIFT operations, or produced by synthesis operations can benefit from our enhanced alignment scheme. Second, we show that this new model can be implemented in HPF-like languages by adding WHEN and IN constructs to them. In addition, we show that the new proposed schemes for WHEN and IN constructs can be emulated using standard HPF syntax. Finally, we address issues related to automatic data alignment for the new proposed model, and present an algorithm to automatically align programs using our segmented alignment scheme. Since the optimal algorithm to do this is NP-hard, a practical heuristic is also given. Our experiments were performed on a DEC Alpha Farm with HPF environments. Our experiments confirm our theory that our proposed alignment scheme can significantly enhance not only the performance of HPF code fragments with EOSHIFT or CSHIFT operations, but also that of codes produced by synthesis operations.  相似文献   

8.
基于分簇的传感器网络数据聚集估算机制   总被引:2,自引:0,他引:2  
谢磊  陈力军  陈道蓄  谢立 《软件学报》2009,20(4):1023-1037
提出一种基于簇结构的传感器网络数据聚集估算机制CASA(clustering-based approximate scheme for data aggregation).在保证用户对数据精确度需求的前提下,CASA 通过最小化网络通信开销以及协调节点间的负载均衡,有效地提高了估算机制的节能性能.CASA 采用最优的分簇规模参数,在基于分簇的网内聚集估算架构中能够最小化网络节点的总体通信开销.此外,CASA 考虑到部署区域感知数据变化率的差异性,采用自适应的误差分配方案来进一步降低网络节点的通信开销,维护节点间的负载均衡.模拟实验结果表明,CASA 估算机制能够显著地提升传感器网络网内数据聚集机制的节能性能,同时保证聚集数据的精确程度.  相似文献   

9.
Partitioning and mapping of nested loops for linear array multicomputers   总被引:1,自引:1,他引:0  
In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project ann-dimensional computational structure, representing ann-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.  相似文献   

10.
In distributed memory multicomputers, local memory accesses are much faster than those involving interprocessor communication. For the sake of reducing or even eliminating the interprocessor communication, the array elements in programs must be carefully distributed to local memory of processors for parallel execution. We devote our efforts to the techniques of allocating array elements of nested loops onto multicomputers in a communication-free fashion for parallelizing compilers. We first analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without interblock communication. The arrays can be partitioned under the communication-free criteria with nonduplicate or duplicate data. Finally, a heuristic method for mapping the partitioned array elements and iterations onto the fixed-size multicomputers under the consideration of load balancing is proposed. Based on these methods, the nested loops can execute without any communication overhead on the distributed memory multicomputers. Moreover, the performance of the strategies with nonduplicate and duplicate data for matrix multiplication is studied  相似文献   

11.
基于环结构的传感器网络多分辨率数据存储机制   总被引:2,自引:0,他引:2  
谢磊  陈力军  陈道蓄  谢立 《软件学报》2009,20(12):3163-3178
提出了一套基于环结构的传感器网络多分辨率数据存储机制,结合层次结构的存储查询方案,有效地利用了环结构的特性高效、节能地支持事件信息的不同分辨率的存储和查询操作,并采用优化的环结构参数,在基于环的层次结构数据存储架构中能够最小化网络节点的总体通信能耗.同时,对环结构多分辨率数据存储机制的相关性能从节能性、负载均衡性等多个角度进行了具体理论分析.模拟实验结果表明,基于环的层次结构存储机制能够高效、节能地支持传感器网络事件数据的多分辨率存储和查询操作.  相似文献   

12.
Dynamic data redistribution is used to enhance data locality and algorithm performance by reducing interprocessor communication in many parallel scientific applications on distributed memory multicomputers. Since the redistribution is performed at runtime, there is a performance tradeoff between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a processor replacement scheme to minimize the cost of interprocessor data exchange during runtime. The main idea of the proposed technique is to develop a replacement function for reordering logical processors in the destination phase. Based on the replacement function, a realigned sequence of destination processors can be derived and is then used to perform data decomposition in the receiving phase. Together with local matrix and compressed CRS vectors transposition schemes, the interprocessor communication can be eliminated during runtime. A significant improvement of this approach is that the realignment of data can be performed without interprocessor communication for special cases. The second contribution of the present technique is that the complicated communication sets generation could be simplified by applying local matrix transposition. Consequently, the indexing cost could be reduced significantly. The proposed techniques can be applied in both dense and sparse applications. A generalized symmetric redistribution algorithm is also presented in this work. To analyze the efficiency of the proposed technique, the theoretical analysis proves that up to (p-1)/p data transmission cost can be saved. For general cases, the symmetric redistribution algorithm saves 1/p communication overheads compared with the traditional method. Experimental results also show that the proposed techniques provide superior performance in most data redistribution instances  相似文献   

13.
Homogeneous processor arrays are emerging in tera-scale computation and effective fault tolerance techniques are essential to improving the reliability of such complex integrated circuits. We study the degradable processor arrays to achieve fault tolerance by employing reconfiguration. Three bypass schemes and three rerouting schemes are proposed to reconfigure three-dimensional processor arrays with defective processors to achieve target arrays without faults. A heuristic algorithm is proposed to construct a target array on the selected rows and columns. It is also proved that the proposed greedy plane rerouting algorithm (GPR) produces maximum target array. In addition, the problem of constructing the communication efficient array is considered in this paper. An algorithm is proposed to refine the communication among processors within the target array constructed by GPR. Experimental study shows that the proposed algorithm GPR produces target arrays with higher harvest and lower degradation on the host arrays with fault density no more than 5%. In addition, the communication performance is significantly optimized by reducing the number of long interconnects, and the average improvement is about 34% for all cases considered in this paper.  相似文献   

14.
Extraction of frequent patterns in transaction-oriented database is crucial to several data mining tasks such as association rule generation, time series analysis, classification, etc. Most of these mining tasks require multiple passes over the database and if the database size is large, which is usually the case, scalable high performance solutions involving multiple processors are required. This paper presents an efficient scalable parallel algorithm for mining frequent patterns on parallel shared nothing platforms. The proposed algorithm is based on one of the best known sequential techniques referred to as Frequent Pattern (FP) Growth algorithm. Unlike most of the earlier parallel approaches based on different variants of the Apriori Algorithm, the algorithm presented in this paper does not explicitly result in having entire counting data structure duplicated on each processor. Furthermore, the proposed algorithm introduces minimum communication (and hence synchronization) overheads by efficiently partitioning the list of frequent elements list over processors. The experimental results show scalable performance over different machine and problem sizes. The comparison of implementation results with existing parallel approaches show significant gains in the speedup. On an 8-processor machine, we report an average speedup of 6 for different problem sizes.  相似文献   

15.
Optimizing communication is a key issue in generating efficient SPMD codes in compiling distributed arrays on data parallel languages, such as High Performance Fortran. In HPF, the array distribution may involve alignment and cyclic(k)-distribution such that the enumeration of the local set and the enumeration of the communication set exhibit regular patterns which can be modeled as integer lattices. In the special case of unit-strided alignment, many techniques of the communication set enumeration have been proposed, while in the general case of the non-unit-strided alignment, inspector-like run-time codes are needed to build repeating pattern table or to scan over local elements such that the communication set can be constructed. Unlike other works on this problem of the general alignment and cyclic(k) distribution, our approach derives an algebraic solution for such an integer lattice that models the communication set by using the Smith-Normal-Form analysis, therefore, efficient enumeration of the communication set can be generated. Based on the integer lattice, we also present our algorithm for the SPMD code generation. In our approach, when the parameters are known, the SPMD program can be efficiently constructed without any inspector-like run-time codes  相似文献   

16.
This paper considers the problem of assigning the tasks of a distributed application to the processors of a distributed system such that the sum of execution and communication costs is minimized. Previous work has shown this problem to be tractable for a system of two processors or a linear array of N processors, and for distributed programs of serial parallel structures. Here we focus on the assignment problem on a homogeneous network, which is composed of N functionally-identical processors, each with its own memory. Some processors in the network may have unique resources, such as data files or certain peripheral devices. Certain tasks may have to use these unique resources; they are called attached tasks. The tasks of a distributed program should therefore be assigned so as to make use of specific resources located at certain processors in the network while minimizing the amount of interprocessor communication. The assignment problem in such a homogeneous network is known to be NP-hard even for N=3, thus making it intractable for a network with a medium to large number of processors. We therefore focus on task assignment in general array networks, such as linear arrays, meshes, hypercubes, and trees. We first develop a modeling technique that transforms the assignment problem in an array or tree into a minimum-cut maximum-flow problem. The assignment problem is then solved for a general array or tree network in polynomial time  相似文献   

17.
Multi-dimensional sparse array operations can be used in the atmosphere and ocean sciences, the image processing, and etc., and have been an extensively investigated problem. Therefore, it becomes an important issue to propose efficient data distribution schemes for multi-dimensional sparse arrays. In our previous work, we have proposed two data distribution schemes Compress Followed Send (CFS) and Encoding-Decoding (ED) for sparse arrays based on the traditional matrix representation (TMR) scheme. We have proposed another scheme, called extended Karnaugh map representation (EKMR), to represent sparse arrays. The EKMR scheme can obtain better performance than the TMR scheme for some sparse array operations. Hence, in this paper, we want to propose efficient data distribution schemes for EKMR-based sparse arrays. We extend the CFS and the ED schemes for TMR-based sparse arrays to EKMR-based sparse arrays first. Then, we compare the performance of these two schemes with that of the Send Followed Compress (SFC), which is an intuitive data distribution scheme for sparse arrays. Finally, we compare these three schemes for EKMR-based sparse arrays with those of TMR-based sparse arrays, respectively. Both the theoretical analysis and the experimental tests were conducted. From the theoretical analysis and the experimental results, we can see that the ED scheme is superior to the CFS scheme that is superior to the SFC scheme for most of testing EKMR-based sparse arrays; the performance of these three schemes for EKMR-based sparse arrays is better than that of TMR-based sparse arrays for all of testing cases, respectively.  相似文献   

18.
采用加矩形窗的积累互相关法和基于Fourier变换频域移位性质的最小熵法进行一维距离像包络对齐。针对包络对齐算法数据量大、复杂度高、运行时间长等缺点,提出一种应用于多核处理器的包络对齐并行算法。该方法利用OpenMP编译指导指令#pragma omp section和#pragma omp for对积累互相关算法和最小熵算法进行多线程并行优化。理论分析和仿真实验表明,该方法大大提升了算法的执行效率。  相似文献   

19.
In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. This paper presents a library called CHAOS, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5 and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient index translation mechanisms and provides users high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of computation. To demonstrate the efficacy of CHAOS, two challenging real-life adaptive applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC. Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper demonstrates how CHAOS can be effectively used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been automatically parallelized.  相似文献   

20.
Aggregate data objects (such as arrays) are distributed across the processor memories when compiling a data-parallel language for a distributed-memory machine. The mapping determines the amount of communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: analignmentthat maps all the objects to an abstract template, followed by adistributionthat maps the template to the processors. This paper describes algorithms for solving the various facets of the alignment problem: axis and stride alignment, static and mobile offset alignment, and replication labeling. We show that optimal axis and stride alignment is NP-complete for general program graphs and give a heuristic method that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. We also show how local graph contractions can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. We show how to model the static offset alignment problem using linear programming, and we show that loop-dependent mobile offset alignment is sometimes necessary for optimum performance. We describe an algorithm with for determining mobile alignments for objects withindoloops. We also identify situations in which replicated alignment is either required by the program itself or can be used to improve performance. We describe an algorithm based on network flow that replicates objects so as to minimize the total amount of broadcast communication in replication.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号