Similar Documents
Found 20 similar documents (search took 125 ms).
1.
The efficiency of shared-cache use on multicore processors, that is, program locality, is one of the key factors in parallel program performance. This paper proposes a footprint-based locality theory, describes the conversions among miss ratio, reuse distance, and footprint, and uses the composability of footprints to build a locality prediction model for parallel programs.
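For orientation, the conversions the abstract mentions are commonly written as follows (a sketch of the standard footprint theory, which the abstract itself does not spell out):

```latex
% fp(w): average number of distinct data blocks accessed in an execution
% window of length w (the footprint).
\[
  mr(c) \;=\; \left. \frac{d\,fp(w)}{dw} \right|_{fp(w)=c}
  \quad \text{(miss ratio of a fully associative LRU cache of size $c$)}
\]
\[
  fp_{\text{co-run}}(w) \;\approx\; \sum_i fp_i(w)
  \quad \text{(composability: co-running programs sharing one cache)}
\]
```

The second relation is what makes prediction for parallel programs tractable: per-program footprints measured in isolation can be composed into a co-run footprint.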

2.
One of the key bottlenecks in optimizing and parallelizing large-scale parallel applications is the increasingly deep and complex memory hierarchy of multicore CPUs. This paper systematically analyzes and summarizes the locality design methods of current mainstream multicore CPUs and parallel programming languages, and identifies two kinds of locality: horizontal locality and vertical locality. From these two viewpoints it analyzes in depth the locality mechanisms of today's major parallel programming languages, summarizes and compares their strengths and weaknesses, and points out the characteristics a new generation of parallel programming languages should have, arguing in particular that a new language should be designed with integrated support for both kinds of locality.

3.
Compared with sequential programs, parallel programs pose new debugging problems. First, parallel programs often need to run for a long time, which makes debugging especially time-consuming; second, an error that appears in one debugging run may not appear in the next, which makes error tracking very difficult. Addressing these two problems, this paper designs and implements a middleware system that enables the BLCR checkpoint system inside the parallel debugging tool XMPI. With this middleware, when XMPI is used to debug large MPI parallel programs, the running time during the debugging phase is reduced, errors can be tracked more effectively, and the development efficiency of parallel programs improves.
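A minimal sketch of the checkpointing step such a middleware can build on, assuming BLCR's command-line tools are installed (the middleware's actual interface to XMPI is not described in the abstract, and the flag spelling should be checked against the local BLCR installation):

```c
/* Checkpoint a long-running MPI debuggee so a failing run can later be
 * restarted near the fault (via cr_restart <context-file>) instead of
 * from the beginning. Hypothetical helper around BLCR's cr_checkpoint CLI. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int checkpoint_process(pid_t pid, const char *context_file) {
    char cmd[256];
    snprintf(cmd, sizeof(cmd), "cr_checkpoint --file %s %d",
             context_file, (int)pid);
    return system(cmd);   /* 0 on success */
}
```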

4.
高岚  王锐  钱德沛 《软件学报》2013,24(6):1390-1402
Deterministic replay of parallel programs on multicore processors is an effective means of debugging parallel programs and is important to parallel programming. However, because shared-memory accesses on multicore architectures are not synchronized, research on deterministic replay still faces many challenges; this makes parallel programs hard to debug and seriously hinders their adoption and development on multicore architectures. This paper analyzes the key factors that make deterministic replay difficult on multicore processors, summarizes evaluation metrics for deterministic replay, and surveys recent academic research on the topic. Using the summarized metrics, existing methods are analyzed and compared along two lines, software-only and hardware-assisted, and future research trends and application prospects of deterministic replay for multicore parallel programs are discussed.
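As one concrete illustration of the software-only direction surveyed here, a record phase can log a total order of synchronization events and a replay phase can force the same order; a minimal sketch, where the log format and helper names are assumptions of this sketch rather than any particular paper's design:

```c
/* Minimal sketch of one software-only building block: recording a total
 * order of lock acquisitions so a later run can be forced to repeat it. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static FILE *replay_log;               /* opened once at program start */
static unsigned long global_seq;       /* total order of sync events */

void record_init(void) { replay_log = fopen("sync.log", "w"); }

/* Wrapper used in place of pthread_mutex_lock during the record phase. */
int logged_mutex_lock(pthread_mutex_t *m, int thread_id) {
    int rc = pthread_mutex_lock(m);
    pthread_mutex_lock(&log_lock);
    fprintf(replay_log, "%lu thread=%d mutex=%p\n",
            global_seq++, thread_id, (void *)m);
    pthread_mutex_unlock(&log_lock);
    return rc;
}
```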

5.
To make Petri-net techniques applicable to verifying the correctness and performance of MPI parallel programs, an algorithm is proposed that builds the Petri net of an MPI program through the shared composition of Petri nets. The structure and message-passing behavior of MPI programs on distributed parallel systems are analyzed, Petri nets for the basic statements and transfer functions of parallel programs are given, and shared composition is generalized from the composition of two Petri nets to the composition of the many Petri nets of a parallel program, together with the generalization theorem and its proof. An algorithm for constructing the Petri net of an MPI program by shared composition is presented, with an application example on a message-passing parallel system. Experimental results show that shared composition is an effective way to build Petri-net models of MPI parallel programs.
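A minimal sketch of the shared-composition idea (an illustration of the general operation, not the paper's algorithm): transitions that carry the same label in both nets, such as a matching send/receive pair, are merged into a single shared transition.

```c
/* Compose two Petri nets over their shared (same-label) transitions.
 * Places and arcs are elided; only transition merging is shown. */
#include <stdio.h>
#include <string.h>

#define MAX_T 32

typedef struct {
    char label[16];
    int net;          /* 0 = shared after composition, 1 or 2 = origin net */
} Transition;

int shared_compose(Transition *a, int na, Transition *b, int nb, Transition *out) {
    int n = 0;
    for (int i = 0; i < na; i++) out[n++] = a[i];
    for (int j = 0; j < nb; j++) {
        int merged = 0;
        for (int i = 0; i < na; i++)
            if (strcmp(out[i].label, b[j].label) == 0) {
                out[i].net = 0;   /* shared transition: joins the two nets */
                merged = 1;
                break;
            }
        if (!merged) out[n++] = b[j];
    }
    return n;
}

int main(void) {
    Transition p1[] = {{"send_msg", 1}, {"compute", 1}};
    Transition p2[] = {{"send_msg", 2}, {"reduce", 2}};
    Transition net[MAX_T];
    int n = shared_compose(p1, 2, p2, 2, net);
    for (int i = 0; i < n; i++)
        printf("%s%s\n", net[i].label, net[i].net == 0 ? " (shared)" : "");
    return 0;
}
```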

6.
Deterministic parallelism techniques
Because of synchronization, races, and interference among executing entities, the execution of a parallel program is nondeterministic: the same input may yield different results across runs. Nondeterminism challenges the development, debugging, testing, fault tolerance, and security of parallel programs, seriously reduces their reliability, and hinders their development. Deterministic parallelism controls the synchronization, races, and interference among executing entities so that a program's result depends only on its input. It can fundamentally resolve many of the problems that parallel programs face today, improve their reliability, and open new opportunities for their development. This paper surveys, analyzes, and compares the mainstream deterministic parallelism techniques and methods, analyzes the impact of weak memory consistency on deterministic parallel systems, and discusses future trends in deterministic parallelism.

7.
This paper adopts speedup as the performance metric for parallel programs and analyzes the performance differences of parallel programs running in guest virtual machines of different types and numbers. Experiments show that MPI parallel programs running in the xVM virtualized environment come close to the performance of a non-virtualized native host, and that parallel programs perform better under paravirtualization than under full virtualization.
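For reference, the speedup used as the metric here is the standard ratio (the abstract does not define it explicitly):

```latex
\[
  S(p) = \frac{T_1}{T_p}
\]
% T_1: execution time on one processor (the serial time);
% T_p: execution time on p processors (or p virtual machines).
```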

8.
This paper describes the design and implementation of PPCDS (Parallel Program Conceptually Designing System). The system supports a conceptual-design method for parallel programs, effectively reducing the complexity of parallel programming and improving the efficiency of parallel program development.

9.
Because of their inherent complexity, parallel programs are much harder to debug than sequential ones, so visual performance-analysis tools are important debugging aids: they help programmers find performance bottlenecks and provide guidance for optimization. Based on a study of the MPE performance-analysis mechanism, this paper presents a practical method for visual performance analysis of MPI parallel programs and uses examples to walk through both real-time and post-mortem visual performance analysis.
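A minimal sketch of the MPE logging calls such an analysis builds on, assuming MPE as distributed with MPICH; the log written at the end can be inspected post-mortem with a viewer such as Jumpshot:

```c
/* Instrument a compute phase with MPE logging (link with -lmpe). */
#include <mpi.h>
#include <mpe.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPE_Init_log();

    int ev_start = MPE_Log_get_event_number();
    int ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_start, ev_end, "compute", "red");

    MPE_Log_event(ev_start, 0, "start");
    /* ... the compute phase being profiled ... */
    MPE_Log_event(ev_end, 0, "end");

    MPE_Finish_log("run");   /* writes the log file for post-mortem viewing */
    MPI_Finalize();
    return 0;
}
```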

10.
张延园  刘敏 《微机发展》1997,7(5):17-19
During the development of parallel programs, factors such as load imbalance, excessive communication overhead, and synchronization waiting often degrade system performance. To overcome these problems, timely performance analysis of parallel programs is essential. Although [1], [2], and [3] studied the performance analysis of parallel programs, none of them achieved global analysis of a parallel program. Building on an analysis of the runtime behavior of parallel programs, the authors developed a performance-analysis system that automatically extracts real data describing a program's execution, derives various performance metrics from these data, and gives an intuitive graphical presentation of the causes that affect parallel program performance.

11.
In a multiprogrammed system, when the operating system switches contexts, the cache performance of processors can be affected in addition to the cost of handling the processes being swapped out and in. If frequent context switching evicts data loaded into cache memory before they are completely reused, programs suffer cache misses because their cache locality is damaged. In particular, for programs with good cache locality, such as blocked programs, a scheduling mechanism that preserves cache locality across context switches is essential for good processor utilization. To meet this requirement, we propose a preemption-safe policy that exploits the cache locality of blocked programs in a multiprogrammed system. The proposed policy delays context switching until a block is fully reused, but also compensates for the monopolized processor time in the processor scheduling mechanism. Our simulation results show that when blocked programs run on multiprogrammed shared-memory multiprocessors, the proposed policy improves their performance by decreasing cache misses; in such situations it also benefits overall system performance through enhanced processor utilization.

12.
Exploiting cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and targeted architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimCS-simulated SMP. Our simulation and measurement results show that our runtime approach can achieve performance comparable to compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our experimental results show that our approach is able to significantly improve the memory performance of applications with irregular computation and dynamic memory access patterns. These types of programs are usually hard to optimize by static compiler optimizations.
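A minimal sketch of the memory-layout-oriented idea (illustrative only; Cacheminer's actual partitioning algorithm is more involved): tasks are binned by the address region their data occupy, so each cache mostly serves one region.

```c
/* Bin loop tasks by the memory region they touch; one bin per worker,
 * so data reuse stays within one cache and cross-cache sharing shrinks. */
#include <stdint.h>
#include <stddef.h>

#define REGION_BITS 16   /* 64 KiB regions; tune to the target cache size */

/* Map the first address a task touches to a worker bin. */
static inline size_t task_bin(const void *addr, size_t nworkers) {
    return ((uintptr_t)addr >> REGION_BITS) % nworkers;
}
```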

13.
This paper presents the results of a simulation study of cache coherency issues in parallel implementations of functional programming languages. Parallel graph reduction uses a heap shared between processors for all synchronisation and communication. We show that a high degree of spatial locality is often present and that the rate of synchronisation is much greater than for imperative programs. We propose a modified coherency protocol with static cache line ownership and show that this allows locality to be exploited to at least the level of a conventional protocol, but without the unnecessary serialisation and network transactions this usually causes. The new protocol avoids false sharing, and makes it possible to reduce the number of messages exchanged, but relies on increasing the size of the cache lines exchanged to do so. It is, therefore, of most benefit with a high-bandwidth interconnection network with relatively high communication latencies or message handling overheads.

14.
The stream programming model has been studied extensively in recent years. It has been applied successfully on stream architectures with software-managed streaming memories, such as stream register files, but research has also shown that it suits architectures with hardware-managed coherent caches. GPGPU, currently the stream model's most important application target, has also gradually introduced general-purpose data caches, so exploiting the cache locality of stream programs has become the key to improving their performance on such architectures. Because of the stream model's particular execution model, the way reuse converts into locality differs from that of traditional sequential programs, so traditional locality analysis cannot be applied directly to stream programs. Based on an in-depth analysis of this reuse-to-locality conversion, this paper introduces the notion of "iteration order" to describe how stream and sequential programs differ in the conversion, extends traditional locality analysis toward parallelism in light of the stream execution model, and presents an iteration-order-based locality analysis method. Two cache-locality optimizations for stream programs are also proposed on top of the analysis model. Validation on the GPGPU-Sim platform shows that the quantitative locality analysis is effective and that the proposed optimizations improve the cache locality and performance of stream programs.

15.
Cache optimization for inter-loop data reuse in multiprocessor systems
Caches ease the large speed gap between the CPU and main memory, and the cache hit rate has therefore become an important factor in how well a multiprocessor system performs. How to strengthen data locality and raise hit rates so that multiprocessor systems perform better has been explored actively. Past work, however, concentrated on strengthening data locality within parallel loops and on reducing or eliminating the cache thrashing caused by true and false sharing of cache lines inside parallel loops; the exploitation of inter-loop data reuse in multiprocessor systems has received little discussion. This paper analyzes and discusses how to exploit such inter-loop data reuse and proposes some practical, easily implemented methods. These methods …
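As one classic transformation in this space (the abstract does not enumerate the paper's own methods), fusing adjacent loops lets data loaded by the first loop be reused while still resident in cache; a minimal sketch:

```c
#include <stddef.h>

void separate(double *a, double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) b[i] = a[i] * 2.0;   /* loop 1 streams a */
    for (size_t i = 0; i < n; i++) c[i] = a[i] + b[i];  /* loop 2 reloads a, b */
}

void fused(double *a, double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++) {   /* a[i], b[i] reused while still cached */
        b[i] = a[i] * 2.0;
        c[i] = a[i] + b[i];
    }
}
```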

16.
The traditional dynamic random-access memory (DRAM) storage medium can be integrated on chips via modern emerging 3D-stacking technology to architect a DRAM shared cache in multicore systems. Compared with static random-access memory (SRAM), DRAM is larger but slower. In the existing research, a lot of work has been devoted to improving the workload performance using SRAM and stacked DRAM together in shared cache systems, ranging from SRAM structure improvement to optimizing cache tags and data access. However, little attention has been paid to designing a shared cache scheduling scheme for multiprogrammed workloads with different memory footprints in multicore systems. Motivated by this, we propose a hybrid shared cache scheduling scheme that allows a multicore system to utilize SRAM and 3D-stacked DRAM efficiently, thus achieving better workload performance. This scheduling scheme employs (1) a cache monitor, which is used to collect cache statistics; (2) a cache evaluator, which is used to evaluate the cache information during the process of programs being executed; and (3) a cache switcher, which is used to self-adaptively choose SRAM or DRAM shared cache modules. A cache data migration policy is naturally developed to guarantee that the scheduling scheme works correctly. Extensive experiments are conducted to evaluate the workload performance of our proposed scheme. The experimental results show that our method can improve the multiprogrammed workload performance by up to 25% compared with state-of-the-art methods (including conventional and DRAM cache systems).
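A minimal sketch of the switcher's decision step under assumed statistics and an assumed threshold (the paper's evaluator is not specified in the abstract):

```c
/* Choose between the fast-but-small SRAM module and the large-but-slow
 * stacked-DRAM module from monitor statistics. */
typedef enum { USE_SRAM, USE_DRAM } CacheModule;

typedef struct {
    unsigned long accesses;
    unsigned long misses;
    unsigned long footprint_bytes;   /* working-set size from the monitor */
} CacheStats;

CacheModule choose_module(const CacheStats *s, unsigned long sram_capacity) {
    double miss_rate = s->accesses ? (double)s->misses / s->accesses : 0.0;
    /* Footprints that overflow SRAM and miss often favor the larger DRAM
     * despite its higher latency; 0.10 is an assumed threshold. */
    if (s->footprint_bytes > sram_capacity && miss_rate > 0.10)
        return USE_DRAM;
    return USE_SRAM;
}
```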

17.
The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely-explored approach to improve cache performance is hardware prefetching, which allows the pre-loading of data in the cache before they are referenced. However, existing hardware prefetching approaches are unable to exploit the potential improvement in performance, since they are not tailored to multimedia locality. In this paper we propose novel effective approaches to hardware prefetching to be used in image processing programs for multimedia. Experimental results are reported for a suite of multimedia image processing programs including MPEG-2 decoding and encoding, convolution, thresholding, and edge chain coding.
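The paper studies hardware prefetching; purely as an illustration of the pre-loading idea, here is a software sketch using GCC/Clang's __builtin_prefetch in an image-filter loop (the look-ahead distance of 64 pixels is an assumed tuning parameter):

```c
/* 1-D smoothing filter over a row of pixels, with a prefetch hint that
 * pulls a future pixel toward the cache before it is referenced. */
void row_filter(const unsigned char *src, unsigned char *dst, int width) {
    for (int x = 1; x < width - 1; x++) {
        __builtin_prefetch(&src[x + 64]);   /* hint only; never faults */
        dst[x] = (unsigned char)((src[x - 1] + 2 * src[x] + src[x + 1]) / 4);
    }
}
```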

18.
Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore, and subsequently many-core architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs)—an architectural innovation in the multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. The TMAs are suitable for these embedded applications due to low-power design features in many of these TMAs. We discuss the performance optimizations on a single tile (processor core) as well as parallel performance optimizations, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication for TMAs. We elaborate compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate the performance and performance per watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.
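A minimal sketch of cache blocking for dense matrix multiplication, one of the optimizations the case study applies (the block size B is an assumed tuning parameter, chosen so the working tiles fit in a tile's cache; C is assumed zero-initialized):

```c
#define B 32   /* block size; tune per cache capacity */

void matmul_blocked(int n, const double *A, const double *Bm, double *C) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++) {
                        double a = A[i * n + k];   /* reused across the j loop */
                        for (int j = jj; j < jj + B && j < n; j++)
                            C[i * n + j] += a * Bm[k * n + j];
                    }
}
```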

19.
In recent years, processor technology has evolved towards multicore processors, which include multiple processing units (cores) in a single package. Those cores, having their own private caches, often share a higher-level cache memory dedicated to each processor die. This multi-level cache hierarchy in multicore processors raises the importance of the cache utilization problem. Assigning parallel-running software components with common data to processor cores that do not share a common cache increases the number of cache misses. In this paper we present a novel approach that uses model-based information to guide the OS scheduler in assigning appropriate core affinities to software objects at run-time. We build graph models of the software and of the processors' cache hierarchies and devise a graph-matcher algorithm that provides a mapping between these two graphs. Using this mapping we obtain candidate core sets that each software object can be affiliated with at run-time. These affiliations are determined based on the idea that software components with the potential to share common data at run-time should run on cores that share a common cache. We also develop an object-dispatcher algorithm that keeps track of object affiliations at run-time and dispatches objects using the information from the compile-time graph matcher. We apply our approach to design pattern implementations and two different application programs running on servers using CFS scheduling. Our results show that cache-aware dispatching based on information obtained from the software model decreases the number of cache misses significantly and improves CFS's scheduling performance.
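A minimal sketch of the dispatch step on Linux (the paper's graph matcher decides which cores share a cache; this sketch takes that core set as given):

```c
/* Pin the calling thread to a candidate core set so objects that share
 * data run under a shared cache. Linux-specific (pthread_setaffinity_np). */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>

int pin_to_cores(const int *cores, int ncores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < ncores; i++)
        CPU_SET(cores[i], &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```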

20.
Many large programs operate on collection types. Extensive libraries are available in many programming languages, such as the C++ Standard Template Library, which make programming with collections convenient. Extending programming languages to provide collection queries as first-class constructs would not only allow programmers to write queries explicitly in their programs but would also allow compilers to leverage the wealth of experience available from the database domain to optimize such queries. This paper describes an approach to reducing the run time of programs involving explicit collection queries by performing run-time query optimization that is effective for single runs of a program; in addition, it leverages a cache to store previously computed results. The proposed approach relies on histograms built from the data at run time to estimate the selectivity of joins and predicates in order to construct query plans. Information from earlier executions of the same query during run time is leveraged during the construction of the query plans, even when the data has changed between these executions. An effective cache policy is also determined for caching the results of join (sub)queries. The cache is maintained incrementally when the underlying collections change, and use of the cache space is optimized by a cache replacement policy. Our approach has been implemented within the Java Query Language (JQL) framework using AspectJ. Run-time query optimization integrated with caching of subquery results significantly improves the run time of programs with explicit queries over equivalent programs that perform collection operations by iterating over those collections. This paper evaluates our approach using synthetic as well as real-world Robocode programs, comparing against JQL as a benchmark. Experimental results show that our approach performs better than the JQL approach with respect to program run time.
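A minimal sketch of histogram-based selectivity estimation, the statistic such an optimizer builds at run time to order joins and predicates (a standard technique; the bucket count and layout are assumptions of this sketch):

```c
/* Equi-width histogram over an observed value range; selectivity of a
 * "value < v" predicate is the fraction of counts in buckets below v. */
#define NBUCKETS 64

typedef struct {
    long   count[NBUCKETS];
    long   total;
    double lo, hi;           /* observed value range */
} Histogram;

double selectivity_lt(const Histogram *h, double v) {
    if (h->total == 0 || v <= h->lo) return 0.0;
    if (v >= h->hi) return 1.0;
    int b = (int)((v - h->lo) / (h->hi - h->lo) * NBUCKETS);
    long below = 0;
    for (int i = 0; i < b; i++) below += h->count[i];
    return (double)below / (double)h->total;
}
```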
