首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 140 毫秒
1.
推测多线程(speculative multithreading,简称SpMT)技术是一种实现非规则程序自动并行化的有效途径.然而,基于控制流图和分支预测技术的线程划分方法,不可避免地会受到划分路径上所存在的控制依赖和数据依赖的制约.目前,在传统的线程划分算法中存在的一个重要问题是,在对划分路径进行选取时只考虑了控制依赖影响却不能有效地综合考虑数据依赖的影响,进而导致不能选取最佳的划分路径.因此,针对传统方法中这种依赖评估方法效率低下的问题,设计并实现了一种基于路径优化的线程划分算法.该算法通过引入基于程序切片技术的预计算方法,建立一种路径评估方法来评估程序间的控制和数据依赖.同时,引入控制线程体大小的启发式规则,以便有效地解决负载不平衡的问题.基于Olden测试集的测试结果表明,所提出的算法可以有效地对非规则程序进行划分,其平均加速比可以达到1.83.  相似文献   

2.
研究人脑的工作记忆系统的测试程序涉及到两个同时执行的并行动作.为了保证测试结果的精度和有效性.测试程序采用多线程技术实现了在工作记忆测试过程中显示字符序列和记录被测试者的反应时间这两个动作的并行执行.测试程序在主线程记录测试者的反应时间,而在另外一个线程中显示测试序列,两个线程的并行执行保证了测试结果的精度.  相似文献   

3.
推测多线程(Speculative Multithreading,SpMT)技术是一种实现非规则程序自动并行化的有效途径.然而,如何有效评估由诸如控制、数据依赖等因素导致的多种并行开销并实现最优线程划分一直是制约加速比性能提升的关键问题.基于启发式规则的传统划分方法虽然可以取得一定的加速效果,但由于启发式规则只能对多种并行开销进行定性评估,因而导致只能得到经验上较优的线程划分.针对传统划分方法的局限性,文中首次提出并实现了一种基于模糊聚类的线程划分方法.在该方法中,作者首先提出一种评估模型来定量评估各种并行开销,然后通过深入分析各种并行开销来确定最佳的线程解搜索空间,最终利用聚类方法实现有效线程解空间搜索以求取更优的线程划分.基于Olden程序集的测试结果表明,文中提出的线程划分方法可以有效地对非规则程序进行划分,其平均加速比可达到1.85.  相似文献   

4.
针对子程序结构的线程级推测并行性分析   总被引:3,自引:0,他引:3  
线程级推测技术为开发更多的线程级并行性,充分利用多核加速传统上难以手工或自动并行化的串行程序提供可行的技术途径.然而,这种技术的性能严重地依赖于线程划分方案.有研究表明,仅推测执行循环所产生的并行性是不够的.但推测执行子程序结构比循环结构要难.本文提出寻找适于推测并行执行的子程序结构的基本判定依据;通过运行由Simplescalar工具集改造得到的动态剖析工具ProRV、ProFun和SPEC CPU2000基准测试程序,我们对子程序结构线程化推测执行的适合性进行详细分析,给出具有指导意义的实验分析方法和实验数据.我们发现:①无返回值的子程序结构占据程序整体执行时间的大约40%;返回稀疏整型的子程序结构占据了程序整体执行时间的大约10%,对其返回值的预测成功率在70%左右.对于其他返回值类型的子程序结构,由于对其返回值的预测成功率过低,我们认为不适合作为线程划分的对象.②简单的last-value的值预测方案对于返回值的预测是简单而且足够有效的.③访存数据依赖普遍存在于子程序与其后继代码之间,显式同步机制对于针对子程序结构的线程级推测是必要的.  相似文献   

5.
一种基于线程的数据预取方法   总被引:1,自引:0,他引:1       下载免费PDF全文
多线程、多核处理器的推广受限于应用。目前,大部分应用尤其是桌面应用都是单线程程序,不能充分利用多线程处理器提供的多个现场并行执行来提高速度。使用空闲现场加速单线程应用是目前研究的一个热点,研究主要集中在提高传统串行应用存储访问的效率和分支预测的精度。在基于线程的数据预取方法中,数据预取线程是从主线程的执执行踪迹中提取的。它们使用空闲的现场,和主线程并行执行,在主线程需要数据之前把数据取到离处理器更近的存储层次。基于线程的数据预取方法能够有效地解决传统数据预取方法难以处理的诸多问题,如不规则内存访问模式。本文具体分析了应用程序中访存行为的特点,结合控制流处理,设计并验证了一种基于线程的数据预取方法TDP。模拟结果显示,使用TDP可以获得7%左右的性能提升。  相似文献   

6.
研究人脑的工作记忆系统的测试程序涉及到两个同时执行的并行动作。为了保证测试结果的精度和有效性,测试程序采用多线程技术实现了在工作记忆测试过程中显示字符序列和记录被测试者的反应时间这两个动作的并行执行。测试程序在主线程记录测试者的反应时间,而在另外一个线程中显示测试序列,两个线程的并行执行保证了测试结果的精度。  相似文献   

7.
基于事务性执行的投机并行多线程是一种适合未来多核微处理器架构的新型并行程序设计和编译技术.但在此基础上的并行程序执行过程更为复杂,程序执行过程的模拟成为关键问题之一.本文提出利用二进制代码级动态插桩技术对投机并行多线程程序进行功能性模拟,设计并实现了完整的软件平台,可精确地模拟和监控并行程序的线程级投机执行过程,检测访存冲突,从而实现投机并行多线程的语义.该软件平台同时可以作为进一步研究投机多线程并行程序真实执行过程的基础,并有效支持投机并行多线程编译器的设计和分析.  相似文献   

8.
该文介绍了线程集成,一种在通用单片微处理器或微控制器上低耗并行执行的新方法,后级编译技术有效地插入多个控制线程,并提供细粒度的多个线程而不用上下文切换的方法,这样允许用软件完成实时的功能来代替专用外围硬件。该文研究了在主线程中集成实时客户线程时的代码转移,生成的集成线程能满足所有的实时性,线程集成的概念和代码转移被应用到实际中来检验这种方法的可行性。  相似文献   

9.
在并行优化编译器的并行识别过程中,许多串行代码无法找到全局一致的分解结果,数据重分布无可避免,有必要寻找一种有效的方法求解计算和数据的动态分解。该文研究了单个嵌套循环计算与数据分解算法以及分解结果表示方法,提出一种在多个嵌套循环间求解数据线性一致分布的动态分解算法,结合程序的结构分析和程序的控制流信息,用于通用串行代码的并行分解过程,可以同时给出串行代码的计算划分和数据分布结果。  相似文献   

10.
多线程处理器的推广受限于应用,目前大部分应用尤其是桌面应用都是单线程程序,不能充分利用多线程处理器提供的多个现场,并行执行以提高速度.使用空闲现场加速单线程应用是目前研究的一个热点,研究主要集中在提高传统串行应用存储访问的效率和分支预测的精度.在基于线程的数据预取方法TDP中,数据预取线程是从主线程的执行踪迹中提取的,它们使用空闲的现场,和主线程并行执行.由于数据预取线程仅仅包括和预取相关的指令,它们比主线程执行要快,可以在主线程需要数据之前,把数据取到离处理器更近的存储层次.基于线程的数据预取方法能够有效地解决传统数据预取方法难以处理的诸多问题,如不规则内存访问模式.研究控制相关对TDP的影响,具体分析使用错误前瞻的数据预取方法:通过在预取线程中加入分支指令,并用它们控制预取线程的执行过程.通过研究发现,在某些情况下即使控制前瞻已经被证实是错误的,继续执行预取线程可以获得更好的预取效果.模拟结果显示,使用错误前瞻可以获得5%的性能提升.  相似文献   

11.
Efficient performance tuning of parallel programs is often hard. Optimization is often done when the program is written as a last effort to increase the performance. With sequential programs each (executed) code segment will affect the completion time. In the case of a parallel program executed on a multiprocessor this is not always true, due to dependencies between the different threads. Thus, certain code segments of the execution may not affect the completion time of the program. Optimization of such code segments will not increase the performance. In this paper we present an approach to optimize performance by finding the extended critical path of the multithreaded program. The extended critical path analysis is a generalization of the critical path analysis in the sense that it also deals with more threads than processors. We have implemented the extended critical path analysis in a performance optimization tool. The tool allows the user to determine the extended critical path of a multithreaded application written for the Solaris operating system for any number of processors based on execution on a single processor workstation.  相似文献   

12.
Speculative multithreading (SpMT) promises to be an effective mechanism for parallelizing nonnumeric programs, which tend to have irregular and pointer-intensive data structures and complex flows of control. Proper thread formation is crucial for obtaining good speedup in an SpMT system. This paper presents a compiler framework for partitioning a sequential program into multiple threads for parallel execution in an SpMT system. This framework is very general and supports speculative threads, nonspeculative threads, loop-centric threads, and out-of-order thread spawning. It is therefore useful for compiling for a wide variety of SpMT architectures. For effective partitioning of programs, the compiler uses profiling, interprocedural pointer analysis, data dependence information, and control dependence information. The compiler is implemented on the SUIF-MachSUIF platform. A simulation-based evaluation of the generated threads shows that the use of nonspeculative threads and nonloop speculative threads provides a significant increase in speedup for nonnumeric programs.  相似文献   

13.
With the trend of growing number of integrated processing cores on Chip Multiprocessors, researchers are working hard to increase the available parallelism of software programs so as to efficiently harness the growing computing power. One noticeable direction among these efforts is speculative multi-threading (SpMT), a.k.a thread level speculation, which aims to extract thread level parallelism by splitting a sequential execution thread into several finer ones and execute them on parallel. A SpMT thread is in speculative status before it “knows” all its input data are correct. A speculative thread needs to write to the L1 cache, but its output might be discarded if the speculation eventually fails. However, another speculative thread may have already read in such speculative output. Therefore, some mechanism is needed to support speculative read and write. And because the SpMT threads are extracted from a single thread, they usually share lots of data, thus there might be intense data coherence among the L1 caches. It would be very complicated to support data coherence and speculation together. This Paper proposes a shared write buffer among the SpMT cores. We are able to confine the speculative read and write in the SWB, thus the speculation will not interference with coherence, and the L1 cache design could be drastically simplified. Experiments show that the SWB can capture a big portion of inter-core data sharing, reduce cache coherence, and drastically improve data access performance of SpMT threads.  相似文献   

14.
The construction of large software systems is always achieved through assembly of independently written components — program modules. For these software components to work together, they must share a common set of data types and principles for representing structured data such as arrays of values and files. This common set of tools for creating and operating on data objects is provided by the infrastructure of the computer system: the hardware, operating system and runtime code. Because the nature and properties of these tools are crucial for correct operation of software components and their inter-operation, it is essential to have a precise specification that may be used for verifying correctness of application software on one hand, and to verify correctness of system behavior on the other. We call such a specification a program execution model (PXM). It is evident that the properties of the PXM implemented by a computer system can have serious impact on the ability of application programmers to practice modular software construction. This paper discusses the concept of program execution models and presents a set of principles that a PXM must satisfy to provide a sound basis for modular software construction. Because parallel program execution on computer systems with many processing units is an essential part of contemporary computing environments, the expression of parallelism and modular software construction using components involving parallel operations is included in this treatment. The conclusion is that it is possible to build computer systems that implement a PXM within which any parallel program may be used, unmodified, as a component for building more substantial parallel programs.  相似文献   

15.
Kasi Anantha  Fred Long 《Software》1990,20(6):537-554
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.  相似文献   

16.
Speculative multithreading (SpMT) is a thread-level automatic parallelization technique, which partitions sequential programs into multithreads to be executed in parallel. This paper presents different thread partitioning strategies for nonloops and loops. For nonloops, we propose a cost estimation based on combined run-time effects of various speculation factors to predict the resulting performance of candidate threads to guide the thread partitioning. For loops, we parallelize all the profitable loops that can potentially offer additional performance benefits by multilevel spawning in loop bodies, loop iterations, and inner loops. Then we select a proper thread boundary located in the front of loop branch instruction to reduce invalid spawning threads that waste core resources. Experimental results show that the proposed approach can obtain a significant increase in speedup and Olden benchmarks reach a performance improvement of 6.62 % on average.  相似文献   

17.
Although considerable technology has been developed for debugging and developing sequential programs, producing verifiably correct parallel code is a much harder task. In view of the large number of possible scheduling sequences, exhaustive testing is not a feasible method for determining whether a given parallel program is correct; nor have there been sufficient theoretical developments to allow the automatic verification of parallel programs. PTOOL, a tool being developed at Rice University in collaboration with users at Los Alamos National Laboratory, provides an alternative mechanism for producing correct parallel code. PTOOL is a semi-automatic tool for detecting implicit parallelism in sequential Fortran code. It uses vectorizing compiler techniques to identify dependences preventing the parallelization of sequential regions. According to the model supported by PTOOL, a programmer should first implement and test his program using traditional sequential debugging techniques. Then, using PTOOL, he can select loop bodies that can be safely executed in parallel. At Los Alamos, we have been interested in examining the role of dependence-analysis tools in the parallel programming process. Therefore, we have used PTOOL as a static debugging tool to analyze parallel Fortran programs. Our experiences using PTOOL lead us to conclude that dependence-analysis tools are useful to today's parallel programmers. Dependence-analysis is particularly useful in the development of asynchronous parallel code. With a tool like PTOOL, a programmer can guarantee that processor scheduling cannot affect the results of his parallel program. If a programmer wishes to implement a partially parallelized region through the use of synchronization primitives, however, he will find that dependence analysis is less useful. While a dependence-analysis tool can greatly simplify the task of writing synchronization code, the ultimate responsibility of correctness is left to the programmer.This work was performed under the auspices of the U.S. Department of Energy.  相似文献   

18.
PC clusters have emerged as viable alternatives for high-performance, low-cost computing. In such an environment, sharing data among processes is essential. Accessing the shared data, however, may often stall parallel executing threads. We propose a novel data representation scheme where an application data entity can be incarnated into a set of objects that are distributed in the cluster. The runtime support system manages the incarnated objects and data access is possible only via an appropriate interface. This distributed data representation facilitates parallel accesses for updates. Thus, tasks are subject to few limitations and application programs can harness high degrees of parallelism. Our PC cluster experiments prove the effectiveness of our approach.  相似文献   

19.
Using GPUs as general-purpose processors has revolutionized parallel computing by providing, for a large and growing set of algorithms, massive data-parallelization on desktop machines. An obstacle to their widespread adoption, however, is the difficulty of programming them and the low-level control of the hardware required to achieve good performance. This paper proposes a programming approach, SafeGPU, that aims to make GPU data-parallel operations accessible through high-level libraries for object-oriented languages, while maintaining the performance benefits of lower-level code. The approach provides data-parallel operations for collections that can be chained and combined to express compound computations, with data synchronization and device management all handled automatically. It also integrates the design-by-contract methodology, which increases confidence in functional program correctness by embedding executable specifications into the program text. We present a prototype of SafeGPU for Eiffel, and show that it leads to modular and concise code that is accessible for GPGPU non-experts, while still providing performance comparable with that of hand-written CUDA code. We also describe our first steps towards porting it to C#, highlighting some challenges, solutions, and insights for implementing the approach in different managed languages. Finally, we show that runtime contract-checking becomes feasible in SafeGPU, as the contracts can be executed on the GPU.  相似文献   

20.
There are billions of lines of sequential code inside nowadays’ software which do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving such goal. We created a new parallelizing compiler that analyses the read and write instructions, and control-flow modifications in programs to identify a set of dependencies between the instructions in the program. Afterwards, the compiler, based on the generated dependencies graph, rewrites and organizes the program in a task-oriented structure. Parallel tasks are composed by instructions that cannot be executed in parallel. A work-stealing-based parallel runtime is responsible for scheduling and managing the granularity of the generated tasks. Furthermore, a compile-time granularity control mechanism also avoids creating unnecessary data-structures. This work focuses on the Java language, but the techniques are general enough to be applied to other programming languages. We have evaluated our approach on 8 benchmark programs against OoOJava, achieving higher speedups. In some cases, values were close to those of a manual parallelization. The resulting parallel code also has the advantage of being readable and easily configured to improve further its performance manually.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号