首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 45 毫秒
1.
Cross-iterations data dependences in DOACROSS loops require explicit data synchronizations to enforce them. However, the composite effect of some data synchronizations may cover the other dependences and make the enforcement of those covered dependences redundant. In this paper, we propose an efficient and general algorithm to identify redundant synchronizations in multiply nested DOACROSS loops which may have multiple statements and loop-exit control branches. Eliminating redundant synchronizations in DOACROSS loops allows more efficient execution of such loops. We also address the issues of enforcing data synchronizations in iterations near the boundary of the iteration space. Because some dependences may not exist in those boundary iterations, it adds complexity in determining the redundant synchronizations for those boundary iterations. The necessary and sufficient condition under which the synchronization is uniformly redundant is also studied. These results allow a parallelizing compiler to generate efficient data synchronization instructions for DOACROSS loops  相似文献   

2.
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: they only handle nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimizes the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.  相似文献   

3.
An efficient algorithm to remove redundant dependences in simple loops with constant dependences is presented. Dependences constrain the parallel execution of programs and are typically enforced by synchronization instructions. The synchronization instructions represent a significant part of the overhead in the parallel execution of a program. Some program dependences are redundant because they are covered by other dependences. It is shown that unlike with single loops, in the case of nested loops, a particular dependence may be redundant at some iterations but not redundant at others, so that the redundancy of a dependence may not be uniform over the entire iteration space. A sufficient condition for the uniformity of redundancy in a doubly nested loop is developed  相似文献   

4.
5.
6.
7.
We examine the effectiveness of optimizations aimed to allowing distributed machine to efficiently compute inner loops over globally defined data structures. Our optimizations are specifically targeted toward loops in which some array references are made through a level of indirection. Unstructured mesh codes and sparse matrix solvers are examplese of programs with kernels of this sort. Experimental data that quantify the performance obtainable using the methods discussed here are included.  相似文献   

8.
9.
The mathematical model for the parallelization, or “space-time mapping,” of loop nests is the polyhedron model. The presence ofwhile loops in the nest complicates matters for two reasons: (1) the parallelized loop nest does not correspond to a polyhedron but instead to a subset that resembles a (multidimensional) comb and (2) it is not clear when the entire loop nest has terminated. We describe a communication scheme which can deal with both problems and which can be added to the parallel target loop nest by a compiler. This work has been presented in part at the conference CONPAR 94-VAPP VI (see Ref. 1)  相似文献   

10.
An instruction window that can tolerate latencies to DRAM memory is prohibitively complex and power hungry. To avoid having to build such large windows, runahead execution uses otherwise-idle clock cycles to achieve an average 22 percent performance improvement for processors with instruction windows of contemporary sizes. This technique incurs only a small hardware cost and does not significantly increase the processor's complexity.  相似文献   

11.
刘晓娴  赵荣彩  赵捷  徐金龙 《软件学报》2014,25(6):1154-1168
发掘DOACROSS 循环中蕴含的并行性,选择合适的策略将其并行执行,对提升程序的并行性能非常重要.流水并行方式是规则DOACROSS 循环并行的重要方式.自动生成性能良好的流水并行代码是一项困难的工作,并行编译器对程序自动并行时常常对DOACROSS 循环作保守处理,损失了DOACROSS 循环包含的并行性,限制了程序的并行性能.针对上述问题,设计了一种选择计算划分循环层和循环分块层的启发式算法,给出了一个基于流水并行代价模型的循环分块大小计算公式,并使用计数信号量进行并行线程之间的同步,实现了基于OpenMP 的规则DOACROSS 循环流水并行代码的自动生成.通过对有限差分松弛法(finite difference relaxation,简称FDR)的波前(wavefront)循环和时域有限差分法(finite difference time domain,简称FDTD)中典型循环以及程序Poisson,LU 和Jacobi 的测试,算法自动生成的流水并行代码能够在多核处理器上获得明显的性能提升,使用的流水分块大小计算公式能够较为精确地计算出循环流水并行时的最佳分块大小.自动生成的流水并行代码与基于手工选择的最优分块大小的流水并行代码相比,加速比达到手工选择加速比的89%.  相似文献   

12.
The wavelength-based machine, or simply w-machine, is an optical computational model, which is designed based on simultaneous movement of several wavelengths in a single light ray, and simultaneous effect of simple optical devices on them. In this paper, we investigate nonuniform complexity classes of w-machine, based on three complexity measures, namely, size, time, and word length. We show that the class of languages which can be generated by constant size nonuniform w-machines contain infinitely many Turing undecidable languages. Also, we show that polynomial size nonuniform w-machines generate all NP languages, and every NP-hard language requires at least polynomial time and polynomial size nonuniform w-machines to be generated. We prove that the class of languages which can be generated by polynomial size nonuniform w-machines is equal to NP/poly, and almost all languages require exponential size and polynomial time nonuniform w-machines to be generated.  相似文献   

13.
This paper presents a time stamp algorithm for runtime parallelization of general DOACROSS loops that have indirect access patterns. The algorithm follows the INSPECTOR/EXECUTOR scheme and exploits parallelism at a fine-grained memory reference level. It features a parallel inspector and improves upon previous algorithms of the same generality by exploiting parallelism among consecutive reads of the same memory element. Two variants of the algorithm are considered: One allows partially concurrent reads (PCR) and the other allows fully concurrent reads (FCR). Analyses of their time complexities derive a necessary condition with respect to the iteration workload for runtime parallelization. Experimental results for a Gaussian elimination loop, as well as an extensive set of synthetic loops on a 12-way SMP server, show that the time stamp algorithms outperform iteration-level parallelization techniques in most test cases and gain speedups over sequential execution for loops that have heavy iteration workloads. The PCR algorithm performs best because it makes a better trade-off between maximizing the parallelism and minimizing the analysis overhead. For loops with light or unknown iteration loads, an alternative speculative runtime parallelization technique is preferred  相似文献   

14.
Computing and Visualization in Science - We present a technique for proving convergence of h and hp adaptive finite element methods through comparison with certain reference refinement schemes...  相似文献   

15.
The paper discusses problems related to the field called program profiling. It presents a retrospective review of solving the code profiling problem, which reduces essentially to the frequency analysis of the sequential program code execution (SPCE). A new approach to solving this problem is proposed. It is based on the Monte Carlo method, which makes it possible to assess the number of program runs required for the estimation of the frequency of execution of its commands with a given accuracy.  相似文献   

16.
We provide a formalism for comparing the average execution time of distributed protocols in a manner independent of the properties of the network on which the protocols are executed. The formalism takes into account computation time, the time to transfer information, the time spent by a site waiting to synchronize and the overlap among them. We define transformations on protocols which may change the synchronization structure, the information transferred or the amount of local computation. We show that if a sequence of such transformations can be applied to a protocol to obtain another protocol, then the final protocol runs at least as fast as the initial protocol. Different sets of tranformations give rise to different notions of comparison. We develop two such notions and explore their properties. We also give results which relate our notion of protocol transformation to transformational techniques for protocol design.This work was supported by NSF under Grant CCR87-01671 and CCR89-01966. This work was performed while the first author was at SUNY, Stony Brook.  相似文献   

17.
18.
International Journal on Software Tools for Technology Transfer - Program termination is a fundamental research topic in program analysis. In this paper, we present a new complete polynomial-time...  相似文献   

19.
《Advanced Robotics》2013,27(10):1059-1079
Acquiring models of the environment belongs to the fundamental tasks of mobile robots. In the past, several researchers have focused on the problem of simultaneous localization and mapping (SLAM). Classical SLAM approaches are passive in the sense that they only process the perceived sensor data and do not influence the motion of the mobile robot. In this paper, we present a novel integrated approach that combines autonomous exploration with simultaneous localization and mapping. Our method uses a grid-based version of the FastSLAM algorithm and considers at each point in time actions to actively close loops during exploration. By re-entering already visited areas, the robot reduces its localization error and in this way learns more accurate maps. Experimental results presented in this paper illustrate the advantage of our method over previous approaches that lack the ability to actively close loops.  相似文献   

20.
Three independent research groups have proposed three apparently different architectures for an improved phase-locked loop (PLL) structure through three entirely different approaches. It is shown that all three PLLs are structurally and mathematically the same. A linear analysis is performed and concludes that the available theory for the design of conventional PLLs can be leveraged into the design of the new class of PLLs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号