1.

Thread-Sensitive Instruction Issue for SMT Processors

《Computer Architecture Letters》2004,3(1):5-5

Simultaneous multithreading (SMT) is a processor design method in which concurrent hardware threads share processor resources such as functional units and memory. The scheduling complexity and performance of an SMT processor depend on the topology used in the fetch and issue stages. In this paper, we propose a thread-sensitive issue policy for a partitioned SMT processor, based on a per-thread metric: the number of ready-to-issue instructions of each thread serves as the priority metric. To evaluate our method, we have developed a reconfigurable SMT simulator on top of the SimpleScalar Toolset. We simulated our modeled processor under several workloads composed of SPEC benchmarks. Experimental results show around 30% improvement compared to the conventional OLDEST_FIRST mixed-topology issue policy. Additionally, the hardware implementation of our architecture with this metric in the issue stage is quite simple.
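The idea of ranking threads by their number of ready-to-issue instructions can be sketched as below. This is only an illustrative sketch, not the paper's implementation: the abstract does not say whether a larger ready count means higher or lower priority, so the sketch assumes "more ready instructions first", and the names `Instr`, `oldest_first`, and `thread_sensitive` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    thread: int  # hardware thread id
    age: int     # smaller value = older instruction


def oldest_first(ready, width):
    """Baseline: issue the oldest ready instructions regardless of thread."""
    return sorted(ready, key=lambda i: i.age)[:width]


def thread_sensitive(ready, width):
    """Rank threads by their ready-instruction count, then issue oldest-first
    within that thread order.

    ASSUMPTION: more ready instructions means higher priority; the abstract
    does not state the direction of the metric.
    """
    counts = {}
    for i in ready:
        counts[i.thread] = counts.get(i.thread, 0) + 1
    # Sort key: thread with most ready instructions first, ties by thread id,
    # then by instruction age within a thread.
    ordered = sorted(ready, key=lambda i: (-counts[i.thread], i.thread, i.age))
    return ordered[:width]
```

With one ready instruction in thread 0 and three in thread 1, an issue width of 2 makes the baseline pick the two oldest instructions across threads, while the sketched policy fills both slots from thread 1.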

2.

The Impact of Speculative Execution on SMT Processors

Dongsoo Kang Chen Liu Jean-Luc Gaudiot 《International journal of parallel programming》2008,36(4):361-385

By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures are able to exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions that are fetched from multiple threads. However, due to incorrect control speculations, a significant number of these in-flight instructions are discarded from the pipelines of SMT processors (which is a direct consequence of these pipelines getting wider and deeper). Although increasing the accuracy of branch predictors may reduce the number of instructions so discarded from the pipelines, the prediction accuracy cannot be easily scaled up, since aggressive branch prediction schemes strongly depend on the particular predictability inherent to the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling): it is easy to implement and allows an SMT processor to selectively perform speculative execution of threads according to the confidence level on branch predictions, hence preventing wrong-path instructions from being fetched. SAFE-T provides an average reduction of 57.9% in the number of discarded instructions and improves the instructions per cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate. This paper is an extended version of the paper, “Speculation Control for Simultaneous Multithreading,” which appeared in the Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.
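A confidence-gated fetch choice of the kind the abstract describes might look roughly like the following. This is a sketch under stated assumptions, not SAFE-T itself: the gating rule (skip a thread while it has unresolved low-confidence branches), the `max_low_conf` threshold, and the names `ThreadState`, `icount_pick`, and `throttled_pick` are all hypothetical. The ICOUNT-style baseline simply favors the thread with the fewest in-flight front-end instructions.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    tid: int
    in_flight: int          # ICOUNT-style count of front-end/queued instructions
    low_conf_branches: int  # unresolved branches predicted with low confidence


def icount_pick(threads):
    """ICOUNT-style baseline: fetch from the thread with the fewest
    in-flight instructions (ties broken by thread id)."""
    return min(threads, key=lambda t: (t.in_flight, t.tid)).tid


def throttled_pick(threads, max_low_conf=0):
    """Hypothetical speculation-aware variant: a thread is eligible to fetch
    only while its unresolved low-confidence branches do not exceed
    max_low_conf. Returns None when every thread is throttled."""
    eligible = [t for t in threads if t.low_conf_branches <= max_low_conf]
    if not eligible:
        return None
    return icount_pick(eligible)
```

In this sketch a thread that ICOUNT would favor is passed over while it is likely on a wrong path, which is the effect the abstract attributes to front-end throttling.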

3.

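An Effective Instruction Fetch Control Mechanism for Simultaneous Multithreading Processors (Cited by: 1; self-citations: 0; citations by others: 1)

何立强 刘志勇 《计算机学报》2006,29(4):535-543

Simultaneous multithreading (SMT) processors fetch and execute instructions from multiple running threads every clock cycle, which greatly improves processor performance. Branch predictor accuracy and fetch policy efficiency are the key factors in SMT processor performance. This paper proposes a new fetch control mechanism that combines a value-based branch predictor with a fetch policy based on thread progress rate. The design has low hardware overhead and low implementation complexity. Experimental results show that the mechanism effectively improves processor performance: its speedup over a conventional fetch control mechanism is 28%, which is also higher than that of current fetch control mechanisms based on stream buffers and on branch classifiers.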

4.

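An Instruction Fetch Policy for EPIC-Based Simultaneous Multithreading Processors

贾小敏 孙彩霞 张民选 《计算机工程》2007,33(4):256-258

EPIC hardware is simple, and simultaneous multithreading (SMT) readily exploits thread-level parallelism; implementing SMT on EPIC combines the advantages of both. The fetch policy has a significant impact on SMT processor performance. This paper introduces several representative fetch policies for superscalar SMT processors, analyzes their applicability to EPIC SMT processors, and proposes SICOUNT, a new fetch policy suited to EPIC. The analysis shows that SICOUNT can take full advantage of EPIC's hardware-software cooperation: by using compiler-provided stall information when selecting the thread to fetch from, it estimates each thread's progress rate more accurately and fetches higher-quality instructions.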

5.

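A Fetch Policy for Simultaneous Multithreading Processors Based on the Instantaneous IPC of the Workload

何立强 刘志勇 《计算机学报》2007,30(4):629-637

Simultaneous multithreading (SMT) processors fetch and execute instructions from multiple threads every clock cycle, greatly improving instruction throughput. This paper briefly introduces SMT, discusses commonly used fetch policies, and compares their strengths and weaknesses in improving performance. It gives the theoretically optimal fetch policy for a specific workload and, on that basis, proposes IPCBFP, a dynamic fetch policy based on the workload's instantaneous IPC. Experiments show that the policy effectively improves workload performance, with an average speedup of up to 17% for two-thread workloads and up to 8% for four-thread workloads. The policy also occupies fewer instruction queue entries on average and has a lower instruction queue conflict rate, and it helps somewhat in reducing SMT cache and TLB miss rates.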

6.

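Using a Fetch Policy to Control the Performance of Individual Threads in Simultaneous Multithreading Processors

孙彩霞 张民选 《计算机学报》2008,31(2):309-317

Current research on fetch policies for simultaneous multithreading (SMT) processors mostly focuses on optimizing overall performance. This paper proposes a novel SMT fetch policy, Controlling Performance of Individual Thread (CPIT), for controlling the execution of an individual thread. Results show that, across all simulated workloads, CPIT ensures that the controlled thread reaches its target performance in more than 94% of cases. In the failing cases, the average performance deviation of the controlled thread does not exceed 1.25%. Moreover, CPIT has little impact on overall processor performance: compared with ICOUNT, a fetch policy that aims to optimize performance, overall performance drops by no more than 3% on average, and the performance of the threads other than the controlled thread drops by only 1.75% on average.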

7.

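An SMT Processor Fetch Policy with Differentiated Allocation of Floating-Point and Integer Resources

《计算机工程》2017,(4):46-51

In a simultaneous multithreading (SMT) processor, threads differ in their demand for floating-point and integer resources, and allocating the shared resources sensibly among threads is an important factor in improving overall processor performance. To this end, this paper proposes a fetch policy that allocates floating-point and integer resources separately, so that each thread's use of these two resource types is apportioned sensibly. Experimental results show that, compared with policies such as ICOUNT and STALL, the proposed policy achieves some improvement in both arithmetic-mean IPC and harmonic-mean IPC, and it also has an advantage on workloads that mix floating-point and integer programs.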

8.

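A Method for Improving Fetch Unit Throughput in Simultaneous Multithreading VLIW Processors

万江华 陈书明 《计算机工程与科学》2007,29(6):97-101

In a simultaneous multithreading processor, raising fetch unit throughput intensifies cache contention among threads, and that contention in turn limits further gains in fetch unit throughput. Targeting new features of current very long instruction word (VLIW) architectures, this paper proposes a method that improves both fetch unit and processor throughput. By invalidating useless addresses in the fetch pipeline as early as possible, the method reduces the program cache conflicts caused by useless fetches and also improves overall processor performance. Experimental results show that the method raises both processor and fetch unit throughput by a relative 12% to 23%, while the level-1 program cache miss rate increases only slightly or even decreases. In addition, it reduces level-1 program cache read accesses by 10% to 25%, thereby lowering processor power consumption.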

9.

Accelerating DAG-Style Job Execution via Optimizing Resource Pipeline Scheduling

Duan Yubin Wang Ning Wu Jie 《计算机科学技术学报》2022,37(4):852-868

Journal of Computer Science and Technology - The volume of information that needs to be processed in big data clusters increases rapidly nowadays. It is critical to execute the data analysis in a...

10.

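An Analysis of the Characteristics of Shared Resources in Simultaneous Multithreading Processors

黄彩霞 《计算机工程与科学》2009,31(8)

Threads executing concurrently on a simultaneous multithreading (SMT) processor share the processor's resources, and how these limited shared resources are distributed among threads determines both the performance of each thread and the overall performance of the processor. How to allocate different classes of shared resources sensibly and effectively according to their characteristics has become an important topic in SMT research. This paper studies and analyzes in depth the characteristics of the various classes of shared resources in SMT processors. The analysis shows that the way queue-type shared resources are allocated has a decisive effect on per-thread performance and on overall SMT processor performance. The key to shared resource allocation in SMT processors therefore lies in controlling the allocation of queue-type shared resources.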

11.

Scheduling MapReduce Jobs on Identical and Unrelated Processors

Fotakis Dimitris Milis Ioannis Papadigenopoulos Orestis Vassalos Vasilis Zois Georgios 《Theory of Computing Systems》2020,64(5):754-782

Theory of Computing Systems - We consider non-preemptive scheduling of MapReduce jobs consisting of multiple map-reduce rounds so as to minimize their average weighted completion time on identical...

12.

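Register Renaming in Instruction Scheduling

张军超 张兆庆 《计算机工程》2005,31(23):8-10

Dependences between instructions are the main obstacle that keeps instruction scheduling from being effective and thereby limits instruction-level parallelism. Register renaming is an important technique for resolving control dependences and data dependences. This paper studies and implements a register renaming technique for instruction scheduling. It achieves speedups of about 5% and 3% on 164.gzip and 186.crafty, respectively.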
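As a generic illustration of register renaming (not the specific technique of this paper, which the abstract does not detail), the sketch below gives every destination a fresh register name. In its textbook form, renaming removes name dependences (write-after-read and write-after-write) while preserving true read-after-write dependences. The function `rename` and the `p0`, `p1`, ... naming scheme are hypothetical.

```python
def rename(instrs):
    """Rename destinations in a straight-line instruction sequence.

    instrs: list of tuples (dest, src1, src2, ...) of architectural
    register names. Each destination receives a fresh name ("p0", "p1", ...;
    a hypothetical scheme that assumes no input register uses these names),
    and later reads of that register are redirected to the fresh name.
    """
    latest = {}   # architectural register -> most recent renamed register
    counter = 0
    out = []
    for dest, *srcs in instrs:
        # Sources read the latest renamed value, or the original register
        # if it has not been written yet in this sequence.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = f"p{counter}"
        counter += 1
        latest[dest] = new_dest
        out.append((new_dest, *new_srcs))
    return out
```

For example, in the sequence `r1 = r2 op r3; r2 = r4 op r5; r1 = r2 op r6`, the second instruction no longer conflicts with the first instruction's read of `r2`, and the two writes to `r1` go to different names, so a scheduler is free to reorder the first two instructions.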

13.

Extra Processors versus Future Information in Optimal Deadline Scheduling

Chiu-Yuen Koo Tak-Wah Lam Tsuen-Wan "Johnny" Ngan Kar-Keung To 《Theory of Computing Systems》2004,37(3):323-341

This paper is concerned with the design of online scheduling algorithms that exploit extra resources. In particular, it studies how to make use of multiple processors to counteract the lack of future information in online deadline scheduling. Our results extend previous work, which is primarily based on using a faster processor to obtain a performance guarantee. The challenge arises from the fact that jobs are sequential in nature and cannot be executed on more than one processor at the same time. Thus, a faster processor can speed up a job while multiple unit-speed processors cannot.

14.

An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors

Pierre Michaud André Seznec Stéphan Jourdan 《International journal of parallel programming》2001,29(1):35-58

The performance of superscalar processors depends on many parameters with correlated effects. This paper explores the relations between some of these parameters, and more particularly, the requirement in instruction fetch bandwidth. We introduce new enhancements to increase the bandwidth of conventional instruction fetch engines. However, experiments show that the performance does not increase proportionally to the fetch. Once the measured IPC is half the instruction fetch bandwidth, increasing the fetch bandwidth brings very little improvement. In order to better understand this behavior, we develop a model from the empirical observation that the available instruction parallelism grows as the square root of the instruction window size. From the model, we derive that the fetch bandwidth requirement grows as the square root of the distance between mispredicted branches. We also verify experimentally that, to double the IPC, one should both double the fetch bandwidth and decrease the number of mispredicted branches fourfold.
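The square-root relation stated in the abstract is consistent with its closing rule of thumb, as a short calculation shows. The symbols here are assumptions introduced for illustration: $D$ is the average distance (in instructions) between mispredicted branches, $F$ is the required fetch bandwidth, and $c$ is an unspecified proportionality constant.

```latex
F(D) = c\,\sqrt{D}
\quad\Longrightarrow\quad
F(4D) = c\,\sqrt{4D} = 2\,c\,\sqrt{D} = 2\,F(D)
```

Cutting mispredictions fourfold quadruples $D$, which under this relation goes together with a doubled fetch bandwidth requirement, matching the abstract's statement that doubling the IPC calls for both changes at once.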

15.

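Global Instruction Scheduling in ORC

杨书鑫 张兆庆 《计算机学报》2004,27(5):577-586

IA-64 is a brand-new architecture. It provides rich hardware support for exploiting the latent instruction-level parallelism in programs, such as a large register file, (control/data) speculation, and predication. Itanium is a concrete implementation of IA-64. The authors apply Bernstein's global instruction scheduling algorithm, which was designed for superscalar processors, to the explicitly parallel (EPIC) Itanium processor. While taking Itanium's characteristics into account, the authors make two innovations to Bernstein's algorithm: (1) Hierarchical regions. Compared with traditional flat regions, such regions are highly flexible and give the scheduler a suitably sized scheduling scope, so that it can make full use of hardware resources while keeping the time and space cost of scheduling under control. (2) Integrated P-Ready instruction scheduling. P-Ready was proposed in a context very different from Bernstein's algorithmic framework. P-Ready scheduling can schedule a high-priority instruction as early as possible even if its data dependences have not been resolved on every execution path passing through it. Integrating P-Ready scheduling into Bernstein's framework is therefore worthwhile. The authors implemented the algorithm in ORC, an open-source compiler for the Itanium processor. Experimental results show that the global instruction scheduler yields an average run-time speedup of 8.4% on the CPU2000int benchmarks. Reflecting the advantage of hierarchical regions, scheduling instructions across nested loops yields a run-time speedup of up to 12.9%. In addition, P-Ready scheduling yields an average run-time speedup of 1.37% on CPU2000int, and up to 7.6%.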

16.

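Integrating Code Optimization with Instruction Scheduling (Cited by: 1; self-citations: 0; citations by others: 1)

连瑞琦 吴承勇 张兆庆 《计算机学报》2001,24(7):694-701

In compilers that exploit instruction-level parallelism, performing code optimization and instruction scheduling independently of each other reduces the effectiveness of the optimizations and can even produce adverse side effects. To address this problem, this paper proposes integrating code optimization with instruction scheduling. On that basis, it presents an instruction scheduling algorithm framework suited to integrating code optimizations; analyzes optimizations in terms of their effectiveness, their reversibility, and the optimization opportunities they create; and selects the traditional optimizations that are suitable for integration into instruction scheduling. Finally, it gives concrete methods for integrating these optimizations. The proposed approach has been tested on an instruction-level parallel compiler, and the experimental data show that it clearly improves the effectiveness of the optimizations.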

17.

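Predication and Speculation Techniques in Instruction Scheduling

叶崴 马杰 侯朝焕 《微计算机应用》2006,27(6):691-693

The main obstacles to a compiler's improving program parallelism are frequent control transfers and ambiguous memory accesses. Predication and speculation are new features of VLIW processor architectures, introduced to remove the effect of branches and memory accesses on the identification of instruction-level parallelism. Instruction scheduling is one of the key techniques by which a compiler exploits a program's instruction-level parallelism. This paper discusses how to use predication and speculation effectively in instruction scheduling to improve program performance.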

18.

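Research and Implementation of a Dynamic VLIW Scheduling Mechanism (Cited by: 2; self-citations: 0; citations by others: 2)

李云照 王志英 沈立 《计算机工程与科学》2008,30(7):90-93

The VLIW architecture is an important means of exploiting ILP; its advantages are a regular, simple structure and low hardware complexity. However, relying entirely on the compiler for instruction scheduling limits further performance gains for VLIW. This paper proposes a dynamic VLIW scheduling mechanism based on deterministic instruction latencies. Exploiting the fact that most instructions have a fixed execution time, the mechanism reschedules the order of instruction execution according to run-time information in order to exploit more ILP. Experimental results on an FPGA show that the mechanism has linear hardware complexity.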

19.

Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery

《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1080-1093

Dual-core execution (DCE) is an execution paradigm proposed to utilize chip multiprocessors to improve the performance of single-threaded applications. Previous research has shown that DCE provides a complexity-effective approach to building a highly scalable instruction window and achieves significant latency-hiding capabilities. In this paper, we propose to optimize DCE for power efficiency and/or transient-fault recovery. In DCE, a program is first processed (speculatively) in the front processor and then reexecuted by the back processor. Such reexecution is the key to eliminating the centralized structures that are normally associated with very large instruction windows. In this paper, we exploit the computational redundancy in DCE to improve its reliability and its power efficiency. The main contributions include: 1) DCE-based redundancy checking for transient-fault tolerance and a complexity-effective approach to achieving full redundancy coverage and 2) novel techniques to improve the power/energy efficiency of DCE-based execution paradigms. Our experimental results demonstrate that, with the proposed simple techniques, the optimized DCE can effectively achieve transient-fault tolerance or significant performance enhancement in a power/energy-efficient way. Compared to the original DCE, the optimized DCE has similar speedups (34 percent on average) over single-core processors while reducing the energy overhead from 93 percent to 31 percent.

20.

Experiences with Cooperating Register Allocation and Instruction Scheduling

Cindy Norris Lori L. Pollock 《International journal of parallel programming》1998,26(3):241-283

Compile-time reordering of low level instructions is successful in achieving large increases in performance of programs on fine grain parallel machines. However, because of the interdependences between instruction scheduling and register allocation, a lack of cooperation between the scheduler and register allocator can result in generating code that contains excess register spills and/or a lower degree of parallelism than actually achievable. This paper describes a strategy for providing cooperation between register allocation and both global and local instruction scheduling. We experimentally compare this strategy with other cooperative and uncooperative scenarios.