Similar Documents
20 similar documents found (search time: 230 ms)
1.
A Parallel Gauss-Seidel Algorithm Based on Staggered Tiling of the Iteration Space   Cited in total: 1, self-citations: 0, cited by others: 1
胡长军  张纪林  王珏  李建江 《软件学报》2008,19(6):1274-1282
To address the poor data locality and the high synchronization and communication overhead of parallel GS (Gauss-Seidel) iteration, the traditional GS iteration is first improved into a multi-layer symmetric GS iteration algorithm. A serial execution model is then given that uses the tile order of the iteration space as the execution order; by applying a time-skewed partitioning of the iteration space and performing multiple iterations inside each tile, it improves the data locality of the algorithm. Finally, a parallel execution model based on iteration-space tiles is proposed; it improves the grid partitioning of the iteration space and reduces the cache miss rate and the number of communication start-ups and synchronizations by reordering the grid tiles. Experimental results show that the staggered-tile parallel algorithm achieves better parallel efficiency and scalability than traditional domain decomposition and red-black ordering parallel algorithms.
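The red-black ordering used as a baseline in the comparison can be made concrete with a short sketch. The following C/OpenMP fragment is a minimal red-black Gauss-Seidel sweep for a 2D Poisson-type problem; the grid size, right-hand side, and sweep count are illustrative assumptions, and the code shows only the conventional baseline, not the staggered-tile algorithm proposed in the paper.

/* Minimal red-black Gauss-Seidel sweep for -laplace(u) = f on a unit grid.
 * Illustrative baseline only; the paper's staggered-tile algorithm
 * reorders and tiles these sweeps for locality. */
#include <stdio.h>
#include <omp.h>

#define N 512          /* interior grid points per dimension (assumption) */
#define SWEEPS 100

static double u[N + 2][N + 2], f[N + 2][N + 2];

int main(void) {
    /* simple initialization: zero initial guess, constant right-hand side */
    for (int i = 0; i < N + 2; i++)
        for (int j = 0; j < N + 2; j++) { u[i][j] = 0.0; f[i][j] = 1.0; }

    for (int s = 0; s < SWEEPS; s++) {
        /* color 0 ("red") points depend only on color 1 ("black") points and
         * vice versa, so each half-sweep is fully parallel */
        for (int color = 0; color < 2; color++) {
            #pragma omp parallel for
            for (int i = 1; i <= N; i++)
                for (int j = 1 + ((i + color) & 1); j <= N; j += 2)
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                      u[i][j - 1] + u[i][j + 1] + f[i][j]);
        }
    }
    printf("u[N/2][N/2] = %f\n", u[N / 2][N / 2]);
    return 0;
}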

2.
Efficient parallel finite-difference stencil algorithms are essential for solving large linear systems. To address the poor data locality and the high synchronization and communication overhead of parallel finite-difference stencil algorithms, the traditional stencil algorithm is first improved into a multi-layer symmetric-sweep finite-difference stencil algorithm. A serial algorithm is then given that uses the tile order of the iteration space as the execution order; by time-skewing the iteration space along the time axis, it performs multiple iterations inside each tile without changing the nature of the iterative algorithm, thereby improving data locality. Finally, a parallel algorithm based on iteration-space tiles is proposed; it partitions the iteration-space grid using an improved polyhedral model and reduces the cache miss rate and the number of communication start-ups and synchronizations by reordering the grid tiles. Theoretical analysis and experimental results show that this parallel model achieves better data locality, parallel efficiency, and scalability than traditional domain decomposition and red-black ordering parallel algorithms.
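For orientation, the following C sketch shows plain spatial cache blocking of a 2D five-point Jacobi stencil. It illustrates only the tiling idea; the time-axis skewing and multi-layer symmetric traversal described in the abstract are not reproduced, and the grid and tile sizes are assumptions.

/* Spatially blocked 2D five-point Jacobi sweep (cache blocking only).
 * The paper additionally tiles along the time axis (time skewing),
 * which this simple sketch does not do. */
#include <stdio.h>

#define N 1024       /* grid points per dimension (assumption) */
#define B 64         /* tile edge chosen to fit in cache (assumption) */

static double a[N][N], b[N][N];

static void jacobi_blocked(double (*src)[N], double (*dst)[N]) {
    for (int ii = 1; ii < N - 1; ii += B)
        for (int jj = 1; jj < N - 1; jj += B)
            for (int i = ii; i < ii + B && i < N - 1; i++)
                for (int j = jj; j < jj + B && j < N - 1; j++)
                    dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] +
                                        src[i][j - 1] + src[i][j + 1]);
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double v = (i == 0 || j == 0) ? 1.0 : 0.0;  /* boundary data */
            a[i][j] = v;
            b[i][j] = v;
        }

    for (int t = 0; t < 50; t++) {        /* alternate source and destination */
        jacobi_blocked(a, b);
        jacobi_blocked(b, a);
    }
    printf("a[N/2][N/2] = %f\n", a[N / 2][N / 2]);
    return 0;
}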

3.
Partitioning is an optimization technique that assigns the different computations and data of a program to different processors of a parallel system in order to make full use of the system's computing resources and speed up the program. The quality of the partitioning has a decisive impact on how efficiently a program executes on a parallel system, so partitioning has long been a hot topic in parallel computing research. However, certain program characteristics, such as non-perfectly nested loops, overlap among multiple references of one statement to a non-read-only array, and references by different statements to the same array with different strides, pose major obstacles to solving the partitioning problem effectively. Existing partitioning algorithms cannot automatically partition programs with these characteristics. Although intuitive partitioning strategies exist when such programs are optimized by hand, these strategies cannot be incorporated into a compiler to guide automatic partitioning. Based on the characteristics of this class of programs, this paper proposes a representative-based partitioning algorithm. The algorithm constructs the various partitioning constraints using, as representative elements, the array references that actually affect the partitioning decision, and thereby completes the partitioning of the program. By searching for the data-partitioning direction with maximal consistency, it also effectively reduces the data-reorganization communication incurred during partitioning. The algorithm has been implemented in AFT2004 and achieves good results on application programs.

4.
Data partitioning, which comprises data decomposition and processor assignment, is a key technique for efficiently parallelizing applications on today's mainstream high-performance computing platforms. The line-sweep computation pattern is used by many scientific and engineering kernels, and its parallelization currently relies mainly on multipartitioning. Multipartitioning equalizes the computation, memory-access, and communication volume across processors, but in some cases it also incurs excessive memory accesses and communication, so it cannot guarantee optimal performance. To remedy this, this paper proposes balanced data partitioning, which further relaxes non-essential constraints on data decomposition and processor assignment so as to reach the best trade-off among the three costs of computation, memory access, and communication. An algorithm for generating the optimal balanced data partitioning is given, based on three key techniques. First, a performance model is built in which the performance of a balanced data partitioning depends only on the data decomposition. Second, based on this model, the search space of data decompositions is pruned and the best-performing decomposition is selected using the model as the criterion. Third, a processor-assignment function is designed to satisfy the conditions of balanced data partitioning. Balanced data partitioning was applied to the SP program in the NPB parallel benchmark suite and to the polymer-materials code LineABC. Experimental results show that when balanced data partitioning and multipartitioning choose the same data decomposition, their performance is essentially identical; when the two decompositions differ (which happens in as many as 38.7% of the cases for SP and 37.9% for LineABC), the parallel efficiency of SP and LineABC with balanced data partitioning is on average 44.45% and 22.15% higher, respectively, than with multipartitioning.

5.
Based on the Hadoop distributed computing platform, a parallel mining algorithm suitable for large data sets is presented. The algorithm vertically partitions the unstructured raw data set and the intermediate result files so that the complete set of frequent itemsets can still be obtained, and assigns the vertical partitions to different Hadoop compute nodes. This reduces the amount of data stored on each node and hence the number of intersection operations each node must perform, improving parallel mining efficiency. Experimental results show that the algorithm alleviates the heavy data communication, large intermediate results, and numerous intersection operations that arise when mining large data sets, and that it is efficient and scalable.
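The central operation in a vertical (tidset) layout is intersecting sorted transaction-ID lists to obtain the support of an itemset. The small C sketch below shows that single step on made-up in-memory data; it only illustrates the vertical representation (in the style of Eclat) and is not the Hadoop-based algorithm of the paper.

/* Support counting in a vertical layout: each item owns a sorted list of
 * transaction IDs (a tidset); the support of {x, y} is the size of the
 * intersection of the two tidsets.  The data below is illustrative. */
#include <stdio.h>

/* number of common elements of two sorted int arrays */
static int intersect_count(const int *a, int na, const int *b, int nb) {
    int i = 0, j = 0, count = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else { count++; i++; j++; }
    }
    return count;
}

int main(void) {
    int tids_x[] = {1, 3, 4, 7, 9, 12};   /* transactions containing item x */
    int tids_y[] = {2, 3, 7, 8, 12, 15};  /* transactions containing item y */
    int support = intersect_count(tids_x, 6, tids_y, 6);
    printf("support({x,y}) = %d\n", support);   /* prints 3 */
    return 0;
}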

6.
A Global Computation and Data Partitioning Method Based on Data-Space Fusion   Cited in total: 2, self-citations: 1, cited by others: 2
夏军  杨学军 《软件学报》2004,15(9):1311-1327
Computation and data partitioning is a major factor affecting the performance of parallel programs on distributed-memory multiprocessors and is a focus of parallelizing-compiler optimization. This paper proposes a theoretical framework for data-space fusion and, based on it, an effective global computation and data partitioning method for solving the computation and data partitioning problem in distributed-memory environments. The method exploits as much parallelism of the computation space as possible, uses data-fusion techniques to optimize data distribution, and searches for an optimized global computation and data partitioning. It also combines naturally with data replication and with the alignment of offset constants, keeping the communication volume as small as possible. Experimental results demonstrate the effectiveness of the proposed method.

7.
The successive over-relaxation (SOR) method is widely used to solve the pressure equation in reservoir numerical simulation, and parallelizing it is an important way to speed up simulation. Most traditional parallel schemes only partition the data within a single iteration and do not further combine data partitioning with partitioning of the iteration space. Targeting the characteristics of the SOR algorithm and SMP (symmetric multi-processor) systems, this paper proposes a parallel SOR method for SMPs (parallelSOR), implemented with OpenMP. By changing the order in which data points are updated in different iteration steps, the method lets data points in different regions execute several iterations in parallel. Relatively good strategies for data-space partitioning and iteration-space merging of a three-dimensional reservoir region are summarized, and the growth shape of the grid blocks during iteration is analyzed. Compared with traditional parallel strategies, the method reduces synchronization overhead, improves data locality, and achieves a high cache hit rate. Experimental results show that it attains good speedup and efficiency.

8.
Existing polyhedral compilation tools usually rely on simple heuristics to find the best statement fusion, so for each program to be optimized the loop-fusion strategy must be tuned by hand to reach the best performance. To address this, targeting multi-core CPUs, this paper proposes a loop-fusion strategy based on data-reuse analysis. The strategy avoids unnecessary fusion constraints that would hinder the exploitation of data locality: it introduces parallelism-oriented fusion constraints for different levels of parallelism at different scheduling stages, and, for statements with relatively complex array-access relations, tiling-oriented fusion constraints aimed at CPU cache optimization. Unlike earlier fusion strategies, it accounts for changes in spatial locality when computing the profit of a fusion. The strategy is implemented in Polly, the polyhedral optimization module of the LLVM compiler framework, and evaluated on test cases from the Polybench suite and other benchmarks. Experimental results show that, compared with several existing fusion strategies, the test cases gain average performance improvements of 14.9% to 62.5% in single-core runs and 19.7% to 94.9% in multi-core runs, with the best single test case achieving speedups of 1.49x to 3.07x.
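The locality benefit that a reuse-driven fusion strategy weighs can be seen in a minimal C example: the unfused pair of loops streams the shared array through the cache twice, while the fused loop touches each element once per pass. The array names and the computation are illustrative, not taken from the paper.

/* Loop fusion for data reuse: the unfused version streams a[] through the
 * cache twice, the fused version reuses a[i] and b[i] while they are hot.
 * A reuse-driven strategy would fuse only when such reuse actually exists. */
#include <stddef.h>

void unfused(const double *a, double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) b[i] = 2.0 * a[i];
    for (size_t i = 0; i < n; i++) c[i] = b[i] + a[i];   /* a[] reloaded */
}

void fused(const double *a, double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) {      /* a[i], b[i] reused in registers */
        b[i] = 2.0 * a[i];
        c[i] = b[i] + a[i];
    }
}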

9.
Loop skewing is a loop transformation used in program optimization. It changes the shape of the iteration space so that the cross-iteration parallelism present in a loop can be expressed as conventional parallelism, allowing the loop to execute in parallel. After skewing, however, the data accessed by the parallel iterations is scattered in memory, and the number of iterations executed at each step varies. To use SIMD more effectively, this paper proposes a loop-skewing optimization based on global data reorganization. It first analyzes the skewing transformation; to deal with the scattered data, it performs a global data reorganization that improves data locality and makes the loop easy to vectorize; to deal with the varying iteration counts, it implements partially filled (non-full) vector operations so that the remainder loop can also be vectorized. A wavefront program is used for evaluation; after optimization, the computation achieves an average speedup of 10.73x.
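As a reminder of the transformation itself, the C fragment below shows a typical wavefront recurrence and its skewed form, in which the inner loop carries no dependence. Note that the skewed inner loop walks an anti-diagonal and therefore touches non-contiguous memory, which is exactly the problem the proposed data reorganization addresses. The array name and bounds are placeholders.

/* Loop skewing exposes wavefront parallelism.  The inner loop of the skewed
 * form is dependence-free, but A[i][k-i] walks an anti-diagonal, so the
 * parallel iterations touch non-contiguous memory.  N and M are placeholders. */
#define N 1024
#define M 1024

static double A[N][M];

void original(void) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            A[i][j] = A[i - 1][j] + A[i][j - 1];   /* carries (1,0) and (0,1) */
}

void skewed(void) {
    for (int k = 2; k <= N + M - 2; k++) {         /* one anti-diagonal per step */
        int lo = (k - M + 1 > 1) ? k - M + 1 : 1;
        int hi = (k - 1 < N - 1) ? k - 1 : N - 1;
        #pragma omp parallel for                   /* iterations are independent */
        for (int i = lo; i <= hi; i++)
            A[i][k - i] = A[i - 1][k - i] + A[i][k - i - 1];
    }
}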

10.
In parallelizing compilation, code generation belongs to the compiler back end and determines the execution efficiency of the generated parallel program. Data partitioning maps the data that is redefined, or never read, in a computation loop onto processors; generating communication code directly from this data partitioning produces redundant communication. This paper proposes an optimization algorithm that uses array data-flow analysis to compute exposed sets, builds a system of inequality constraints over the computation partitioning, the loop iterations, and the exposed sets, and finally generates data-distribution code by Fourier-Motzkin elimination (FME). Test results show that the algorithm clearly improves the data distribution.

11.
A Spatial-Locality Optimization Method for Nested Loops Based on Projection-Layering   Cited in total: 3, self-citations: 0, cited by others: 3
Starting from data-access traces, this paper examines the essence of using data transformations to improve data-access locality, and proposes a new projection-layering technique for optimizing data accesses together with a data-transformation framework based on it. The framework mainly uses projection to optimize the spatial locality of data accesses and uses data layering to resolve the data-overlap problem introduced by projection. The framework handles not only affine array subscripts but also many more complex non-affine subscripts; it can also directly determine the optimal storage layout of data elements and the data-transformation matrix that optimizes data accesses, keeping access strides as small as possible. Experimental results show that it is effective.

12.
In this paper, we present a fine-grained parallel implementation of the MPEG-2 video encoder on the Intel Paragon XP/S parallel computer. We use a data-parallel approach and exploit parallelism within each frame, unlike some previous approaches that obtain parallelism by processing several disjoint video sequences simultaneously. This makes our encoder suitable for real-time applications where the complete video sequence may not be present on disk and may become available frame by frame over time. The Express parallel programming environment is employed as the underlying message-passing system, making our encoder portable across a wide range of parallel and distributed architectures. The encoder also provides control over various parameters such as the number of processors in each dimension, the size of the motion search window, buffer management, and bitrate. Moreover, it has the flexibility to allow fast and new algorithms for different stages of the codec to be included in the program, replacing the current algorithms. Comparisons of execution times, speedups, and frame encoding rates using different numbers of processors are provided. An analysis of frame data distribution among multiple processors is also presented. In addition, our study reveals the degrees of parallelism and the bottlenecks in the various computational modules of the MPEG-2 algorithm. We have used two motion estimation techniques and five different video sequences for our experiments. Using maximum parallelism, with one block assigned per processor, an encoding rate higher than 30 frames/s has been achieved.

13.
Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which prevent parallelism. In addition, performance losses result from cache conflicts in fused loops. In this paper, we present new techniques to: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) maintain parallelism and allow the parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We describe algorithms for implementing these techniques in compilers. The techniques are evaluated on a 56-processor KSR2 multiprocessor and on an 18-processor Convex SPP-1000 multiprocessor. The results demonstrate performance improvements for both kernels and complete applications. The results also indicate that careful evaluation of the profitability of fusion is necessary as more processors are used.
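A minimal C/OpenMP sketch of the synchronization argument: the unfused form pays two fork/join barriers per call, while the legally fused form needs only one. The loops and array names are illustrative and are not taken from the paper.

/* Fusion reduces synchronization: the unfused form opens two parallel
 * regions (two fork/join barriers), the fused form only one.  Fusion is
 * legal here because iteration i of the second loop reads only b[i],
 * which iteration i of the first loop writes. */
#include <stddef.h>

void unfused(const double *a, double *b, double *c, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) b[i] = a[i] * a[i];
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) c[i] = b[i] + 1.0;
}

void fused(const double *a, double *b, double *c, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        b[i] = a[i] * a[i];
        c[i] = b[i] + 1.0;
    }
}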

14.
This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
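A simple instance of the layout problem treated here: if a loop traverses an array column by column, the default row-major layout of C gives stride-N accesses, whereas permuting the array dimensions, one of the simplest layouts expressible by hyperplanes, restores unit-stride access. The hand-written C illustration below assumes a square array and is not the paper's framework.

/* Column-wise traversal under two memory layouts.  In C (row major),
 * colsum_rowmajor strides by N doubles per access; storing the array
 * transposed (a dimension permutation, one simple hyperplane layout)
 * makes the same logical traversal unit-stride. */
#define N 2048

static double a[N][N];        /* a[i][j]: row-major layout          */
static double at[N][N];       /* at[j][i] holds logical a[i][j]     */

double colsum_rowmajor(int j) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i][j];    /* stride N */
    return s;
}

double colsum_transposed(int j) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += at[j][i];   /* stride 1 */
    return s;
}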

15.
Parallel Computing, 1997, 23(10): 1405-1420
Performance prediction is necessary in order to deal with multi-dimensional performance effects on parallel systems. The compiler-generated analytical model developed in this paper accounts for the effects of cache behavior, CPU execution time and message passing overhead for real programs written in high level data-parallel languages. The performance prediction technique is shown to be effective in analyzing several non-trivial data-parallel applications as the problem size and number of processors vary. We leverage technology from the Maple symbolic manipulation system and the S-PLUS statistical package in order to present users with critical performance information necessary for performance debugging, architectural enhancement and procurement of parallel systems. The usability of these results is improved through specifying confidence intervals as well as predicted execution times for data-parallel applications.

16.
The computational speed of individual processors in distributed memory computers is increasing faster than the communication speed of the interconnection networks. This has led to the general perception among developers of compilers for data-parallel languages that overlapping communications with computations is an important optimization. We demonstrate that communication–computation overlap has limited utility. Overlapping communications with computations can never more than double the speed of a parallel application, and in practice the relative improvement in speed is usually far less than that. Most parallel algorithms have computational requirements that grow faster than their communication requirements. When this is the case, the gain from communication–computation overlap asymptotically approaches zero as the problem size increases.
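The factor-of-two ceiling quoted above follows from a simple two-term cost model. Assume a per-step computation time T_comp and communication time T_comm, and that overlap is perfect:

T_no_overlap = T_comp + T_comm
T_overlap   >= max(T_comp, T_comm) >= (T_comp + T_comm) / 2
speedup      = T_no_overlap / T_overlap <= 2

Equality requires T_comp = T_comm. If, as for most algorithms, T_comp grows faster than T_comm with problem size, then T_comm / T_comp tends to 0, so the attainable speedup 1 + T_comm / T_comp tends to 1 and the relative gain from overlap vanishes, which is the asymptotic claim made in the abstract.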

17.
I report on a new version of the magnetohydrodynamics code NIRVANA1 which is targeted at the study of astrophysical problems. The new version allows for distributed-memory simulations supporting adaptive mesh refinement. Numerical algorithms include dissipative terms (viscosity, Ohmic diffusion, thermal heat conduction) in a conservative form. Domain decomposition is preferably block-wise in case of unigrid applications but adopts space-filling curve techniques for adaptive mesh applications with a hierarchical block-structured mesh. The code architecture facilitates workload balancing among processors for arbitrary mesh refinement depths maintaining intra-level data locality via space-filling curve mappings and, at the same time, ensuring inter-level data locality by applying a novel technique called block sharing. This way, it is demonstrated that comparable performance can be achieved for problems with locally highly refined grid. The data transfer between processors extensively utilizes the coarse-granularity concept of parallel computing and makes use of the MPI library. Conservation properties of the numerical method carry over to the parallel framework. In particular, the solenoidality condition for the magnetic field is preserved to roundoff precision applying the constrained transport machinery. This paper has its focus of discussion on implementation details related to the parallelization and on a code performance analysis.
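The space-filling-curve mapping mentioned in the abstract can be illustrated with the simplest such curve, a Morton (Z-order) index, sketched in C below. NIRVANA's actual curve and its block-sharing technique are not reproduced; the sketch only shows how interleaving coordinate bits yields a locality-preserving 1D ordering of mesh blocks.

/* Morton (Z-order) index for a 2D block position: interleaving the bits of
 * (x, y) gives a 1D key such that blocks close along the curve are usually
 * close in space, which is why space-filling curves are used to assign
 * blocks to processors while preserving locality. */
#include <stdint.h>
#include <stdio.h>

static uint64_t spread_bits(uint32_t v) {     /* insert a 0 between each bit */
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFULL;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFULL;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FULL;
    x = (x | (x << 2))  & 0x3333333333333333ULL;
    x = (x | (x << 1))  & 0x5555555555555555ULL;
    return x;
}

static uint64_t morton2d(uint32_t x, uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}

int main(void) {
    /* blocks of a 4x4 mesh listed with their Morton keys (a Z-shaped order) */
    for (uint32_t y = 0; y < 4; y++)
        for (uint32_t x = 0; x < 4; x++)
            printf("block (%u,%u) -> key %llu\n", x, y,
                   (unsigned long long)morton2d(x, y));
    return 0;
}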

18.
An Automatic Computation and Data Decomposition Algorithm Based on Linear Algebra   Cited in total: 1, self-citations: 0, cited by others: 1
In parallelization techniques targeting distributed-memory architectures, decomposing computation and data sensibly, so as to increase the locality of data references and reduce inter-processor communication, is key to the performance of parallel programs. By completing the Anderson-Lam decomposition algorithm, this paper gives a computation-partitioning and data-distribution algorithm that can achieve communication-free execution, and discusses several optimizations of the algorithm for engineering practice.

19.
Sequential minimal optimization (SMO) is a popular algorithm for training support vector machines (SVMs), but it still requires a large amount of computation time when solving large problems. This paper proposes a parallel implementation of SMO for training SVMs. The parallel SMO is developed using the message passing interface (MPI). Specifically, the parallel SMO first partitions the entire training data set into smaller subsets and then runs multiple CPU processors simultaneously, each dealing with one of the partitioned data sets. Experiments show great speedup on the adult data set and the MNIST (Modified National Institute of Standards and Technology) data set when many processors are used, and satisfactory results on the Web data set.
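The data-partitioned structure described here can be sketched at the level of its communication pattern: each MPI rank holds a slice of the training set together with its error values, finds its local extreme KKT candidates, and a reduction selects the global working pair for the next SMO step. The C/MPI fragment below shows only that reduction step on placeholder data; it is not a complete SVM trainer, and the details are assumptions rather than the paper's exact algorithm.

/* Sketch of the communication pattern of a data-partitioned parallel SMO:
 * every rank keeps the error values of its own slice of the training set,
 * locates its local extremes, and MPI_Allreduce with MINLOC / MAXLOC selects
 * the globally best pair of indices to optimize next.  Placeholder data. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int local_n = 1000;                 /* samples per rank (assumption) */
    double *err = malloc(local_n * sizeof *err);
    for (int i = 0; i < local_n; i++)         /* placeholder error values */
        err[i] = (double)((rank * local_n + i) % 97) - 48.0;

    /* local extremes over this rank's slice, with global sample indices */
    struct { double val; int idx; } lo = { err[0], rank * local_n },
                                    hi = { err[0], rank * local_n }, glo, ghi;
    for (int i = 1; i < local_n; i++) {
        if (err[i] < lo.val) { lo.val = err[i]; lo.idx = rank * local_n + i; }
        if (err[i] > hi.val) { hi.val = err[i]; hi.idx = rank * local_n + i; }
    }

    /* one collective per extreme picks the global working pair */
    MPI_Allreduce(&lo, &glo, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
    MPI_Allreduce(&hi, &ghi, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);

    if (rank == 0)
        printf("across %d ranks: i_low=%d (%.1f), i_up=%d (%.1f)\n",
               nprocs, glo.idx, glo.val, ghi.idx, ghi.val);
    free(err);
    MPI_Finalize();
    return 0;
}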

20.
Modern processors and computer systems implement many advanced techniques, and exploiting them to obtain high performance requires compiler support. The Tree-SSA optimization framework in GCC provides a powerful program-analysis infrastructure. Enhanced data-dependence information allows the compiler to transform an algorithm for greater locality and higher resource utilization, increasing throughput and improving performance. This paper studies data dependence, matrix transformations, and loop transformations, analyzes their characteristics, algorithms, and performance, describes the current status of loop transformations in GCC, and gives an outlook on future research.
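One loop transformation that such dependence information enables is interchange for spatial locality; a textbook C example follows (the array name and size are illustrative).

/* Loop interchange for locality: C stores arrays row-major, so the j-i order
 * walks a[] with stride N while the interchanged i-j order is unit-stride.
 * With dependence information from Tree-SSA the compiler can prove the
 * interchange legal and apply it automatically. */
#define N 1024

static double a[N][N];

void column_order(void) {                 /* poor spatial locality */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = a[i][j] * 2.0;
}

void interchanged(void) {                 /* unit-stride inner loop */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = a[i][j] * 2.0;
}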
