1.

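陆伯鹰, 尹宝林 《计算机工程与应用》(Computer Engineering and Applications) 2001,37(12):121-124

Instruction scheduling is a key technique in optimizing compilers, and it is especially important for CPUs with a VLIW architecture. It is an optimization that, while preserving program semantics, reorders instructions to reduce idle pipeline cycles and thereby improve CPU performance. This paper analyzes the instruction scheduling problem in optimizing compilation and proposes an instruction scheduling algorithm together with a method for simplifying the DAG. It proves the algorithm correct, analyzes its efficiency, and compares the total execution time of the generated instruction sequence with that of the optimal sequence. It also proposes a better solution to problems in the instruction scheduling algorithm of the widely used compiler GCC.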
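
The abstract above does not give the paper's algorithm, so the following is only a minimal sketch of generic greedy list scheduling over a dependence DAG, the textbook technique this line of work builds on. The function name `list_schedule`, the latency table, and the `width` parameter (issue slots per cycle) are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

def list_schedule(latency, deps, width=1):
    """Greedy list scheduling (generic sketch, not the paper's algorithm).

    latency: {instr: cycles until its result is available}
    deps:    list of (producer, consumer) edges of an acyclic dependence graph
    width:   number of instructions that may issue per cycle
    Returns {instr: issue_cycle}.
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for p, c in deps:
        succs[p].append(c)
        preds[c].append(p)

    # Priority: longest latency-weighted path from the instruction to a DAG leaf.
    prio = {}
    def height(n):
        if n not in prio:
            prio[n] = latency[n] + max((height(s) for s in succs[n]), default=0)
        return prio[n]
    for n in latency:
        height(n)

    start, cycle = {}, 0
    while len(start) < len(latency):
        # Ready: not yet issued, and every producer's result is available.
        ready = [n for n in latency if n not in start
                 and all(p in start and start[p] + latency[p] <= cycle
                         for p in preds[n])]
        ready.sort(key=lambda n: (-prio[n], n))  # highest priority first
        for n in ready[:width]:
            start[n] = cycle
        cycle += 1
    return start

# Hypothetical example: `mul` is independent and fills the stall after `load`.
lat = {"load": 2, "add": 1, "mul": 1, "store": 1}
deps = [("load", "add"), ("add", "store")]
print(list_schedule(lat, deps))
```

With a single issue slot, the independent `mul` is placed in the cycle where `add` would otherwise wait for `load`, which is the kind of idle-cycle reduction the abstract describes.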

2.

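A VLIW architecture with interconnected register files and its instruction scheduling algorithm

周志雄, 何虎, 杨旭, 张延军, 孙义和 《计算机学报》(Chinese Journal of Computers) 2008,31(1):127-132

Very long instruction word (VLIW) processors generally adopt a bus-interconnected multi-cluster structure: the functional units in each cluster share a local register file, and data is transferred between clusters over a bus. This avoids the rapid growth in delay, area, and power consumption that a fully connected structure suffers as the number of functional units increases, but the copies and delays incurred when clusters share data reduce processor performance. This paper proposes a multi-cluster VLIW structure with interconnected register files, in which register files connect the clusters, avoiding both the inter-cluster transfer delay and the extra data copy operations. It also proposes an instruction scheduling algorithm for this structure to improve scheduling performance. Experimental results show that, compared with a fully connected VLIW structure, the register-file-interconnected structure loses only about 13% in performance while code size remains essentially unchanged; on both counts it outperforms the bus-interconnected multi-cluster structure.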

3.

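A register allocation algorithm for VLIW instruction compression

朱少波, 姚庆栋, 洪享, 史册 《计算机工程》(Computer Engineering) 2003,29(20):154-156

Targeting instruction compression methods for VLIW architectures, and based on an in-depth analysis of compiler intermediate code, this paper proposes an improved register allocation algorithm. Building on linear scan, the algorithm adds constraints to register selection so that the register numbers in the target code are as close to one another as possible, which yields better compression.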
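
The abstract does not state which constraints the paper adds, so the sketch below only illustrates the general idea with one assumed policy: a linear-scan allocator that always hands out the lowest-numbered free register, which tends to keep the register numbers in use clustered. The name `linear_scan`, the interval format, and the lowest-number policy are assumptions for illustration; spilling is deliberately left out.

```python
import heapq

def linear_scan(intervals, num_regs):
    """Linear-scan allocation that prefers the lowest free register number.

    intervals: {var: (start, end)} live ranges, end inclusive
    Returns {var: register_number}. Spilling is not modelled in this sketch.
    """
    free = list(range(num_regs))
    heapq.heapify(free)   # min-heap, so the lowest register number pops first
    active = []           # min-heap of (end, var) for intervals still live
    assignment = {}
    for var, (start, end) in sorted(intervals.items(),
                                    key=lambda kv: (kv[1][0], kv[0])):
        # Release registers whose interval ended before this one starts.
        while active and active[0][0] < start:
            _, old = heapq.heappop(active)
            heapq.heappush(free, assignment[old])
        if not free:
            raise RuntimeError(f"no free register for {var}")
        assignment[var] = heapq.heappop(free)
        heapq.heappush(active, (end, var))
    return assignment

# Hypothetical live ranges: four variables fit into registers 0 and 1.
ranges = {"a": (0, 3), "b": (1, 2), "c": (3, 5), "d": (4, 6)}
print(linear_scan(ranges, num_regs=4))
```

Although four registers are available, this example only ever uses registers 0 and 1. Keeping register numbers close together in this way is what the abstract credits for the improved compression.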

4.

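A branch scheduling optimization algorithm for VLIW processors

时磊, 吴潇, 涂登彪, 程工, 刘峰, 余翠玲, 任彦 《计算机工程与应用》(Computer Engineering and Applications) 2012,48(21):41-44

Branch scheduling is an instruction scheduling technique that effectively eliminates branch instruction delay, and it is very important for improving the performance of VLIW-class processors. This paper proposes an instruction scheduling optimization algorithm for branch delay slots. Targeting the VLIW architecture, the algorithm selects suitable candidate instruction sequences according to the program dependence graph, and uses a cost-benefit model to produce a high-benefit instruction schedule for the branch delay slots. Experimental data show that the branch scheduling algorithm improves application performance by 12.9% on average.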

5.

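An ILP-based pipeline scheduling algorithm

郝勇, 樊晓桠 《微机发展》(Microcomputer Development) 1998,8(1):6-8

This paper proposes a pipeline scheduling algorithm based on integer programming (ILP) that optimizes resource requirements under given pipeline delay slots and guarantees minimal pipeline length. In addition, the operations in each control step are independent of the order in which they are scheduled. The overall performance is good, chaining is supported, and the algorithm is well suited to the synthesis of pipelined data paths.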

6.

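A register-pressure-sensitive list scheduling algorithm for VLIW architectures

王红梅, 王敏, 张铁军, 单睿, 侯朝焕 《计算机应用研究》(Application Research of Computers) 2009,26(11):4039-4041

To alleviate the register pressure problem, this paper proposes a register-pressure-sensitive instruction scheduling algorithm. Building on traditional list scheduling, the algorithm uses the critical path as its priority function and adjusts when non-critical nodes are scheduled within regions of high register pressure, reducing register pressure without loss of application performance.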
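
To make the notion of register pressure concrete, here is a small sketch that measures the peak number of simultaneously live values for a given instruction order. It is not the paper's scheduler. The function `max_pressure`, the convention that each instruction defines one value named after itself, and the example instructions are all illustrative assumptions.

```python
def max_pressure(order, uses):
    """Peak number of simultaneously live values for an instruction order.

    order: list of instruction names; each defines one value of the same name
    uses:  {instr: [names of the values it reads]}
    A value is live from its definition until its last use.
    """
    pos = {ins: i for i, ins in enumerate(order)}
    last_use = {}
    for ins in order:
        for v in uses.get(ins, []):
            last_use[v] = max(last_use.get(v, -1), pos[ins])
    peak = 0
    for i in range(len(order)):
        # Live after step i: already defined and still read by a later step.
        live = sum(1 for v, lu in last_use.items() if pos[v] <= i < lu)
        peak = max(peak, live)
    return peak

# Hypothetical block: s1 = a op b, s2 = s1 op c. With unit latencies, c is
# off the critical path a -> s1 -> s2, so it can be scheduled later.
uses = {"s1": ["a", "b"], "s2": ["s1", "c"]}
eager = ["a", "b", "c", "s1", "s2"]
delayed = ["a", "b", "s1", "c", "s2"]
print(max_pressure(eager, uses), max_pressure(delayed, uses))
```

Both orders have the same length, but delaying the non-critical `c` until after `s1` lowers the peak from three live values to two. This is the kind of adjustment to non-critical nodes that the abstract describes.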

7.

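A high-performance variable-length instruction issue mechanism based on the VLIW architecture

杨惠, 陈书明 《计算机研究与发展》(Journal of Computer Research and Development) 2013,50(10)

Instruction compression overcomes the low density of long instruction words in the instruction cache of traditional very long instruction word (VLIW) architectures by packing the instructions of each long instruction word tightly into cache lines. However, a long instruction word may then be split across two cache lines, so that its parts cannot be fetched and issued together, which becomes a performance bottleneck for the processor. Because of such split cache lines, software pipelining, the traditional method for improving loop efficiency, loses performance. A high-performance variable-length instruction issue window mechanism solves the fetch and issue problem caused by split instruction words and supplies the fetch pipeline with an efficient, continuous instruction stream. In particular, the mechanism buffers one iteration of a loop and provides hardware support for software pipelining of loops, effectively improving the performance of VLIW digital signal processors (DSPs). A cycle-accurate processor simulation model was built and simulated on the DSP/IMG library; the results show that the two-level instruction issue window mechanism improves performance by about 21.89% on average.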

8.

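A method for optimizing function calls

陆伯鹰, 谭文安 《计算机工程与应用》(Computer Engineering and Applications) 2002,38(18):99-101

Optimizing function calls is very important for the C language, because function calls are expensive and occur frequently in code. This paper discusses an approach to optimizing function calls, several key problems in its implementation, and their solutions.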

9.

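Research on an instruction synthesis algorithm for microprocessors

郝勇, 李贵山 《微处理机》(Microprocessors) 1997,(1):46-48

Instructions are the interface between computer software and hardware, and the quality of the instruction set definition directly affects overall system performance. The authors present a method for automatically generating an instruction set that is particularly suited to designing instructions for pipelined microprocessors.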

10.

A fast job-oriented scheduling algorithm

黄启春, 陈奇, 俞瑞钊 《软件学报》(Journal of Software) 1999,10(10):93

Job-oriented scheduling (JOS) has been the most commonly used technique in actual job shop scheduling. It loads jobs one by one onto machines. This paper presents a fast scheduling algorithm for a computer-based JOS system. The algorithm assigns feasible start and finish times to the operations of a job by loading them forward or backward onto capacity-constrained machines. The computation time needed to find a feasible time slot on a machine is reduced by logging and modifying each machine's feasible time slots, which substantially improves computational efficiency. Experimental testing shows that the algorithm has significant merits for large problems.
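
The forward-loading step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: each machine processes one operation at a time, its occupied intervals are kept in a sorted list, and the names `earliest_slot` and `load_jobs_forward` are hypothetical. Backward loading and the paper's faster slot bookkeeping are not modelled.

```python
import bisect

def earliest_slot(busy, ready, dur):
    """Earliest start >= ready such that [start, start + dur) avoids `busy`.

    busy: sorted list of non-overlapping (start, end) intervals on one machine
    """
    t = ready
    for s, e in busy:
        if t + dur <= s:      # fits in the gap before this busy interval
            break
        t = max(t, e)         # otherwise try again after it
    return t

def load_jobs_forward(jobs):
    """Load jobs one by one, placing each operation as early as feasible.

    jobs: {job: [(machine, duration), ...]} in routing order
    Returns {(job, op_index): (start, finish)}.
    """
    busy = {}   # machine -> sorted list of occupied (start, end) intervals
    plan = {}
    for job, ops in jobs.items():
        ready = 0   # an operation cannot start before its predecessor ends
        for k, (machine, dur) in enumerate(ops):
            slots = busy.setdefault(machine, [])
            start = earliest_slot(slots, ready, dur)
            bisect.insort(slots, (start, start + dur))
            plan[(job, k)] = (start, start + dur)
            ready = start + dur
    return plan

# Hypothetical shop: J3's short operation fills the gap left on M2.
jobs = {"J1": [("M1", 3), ("M2", 2)],
        "J2": [("M2", 2), ("M1", 2)],
        "J3": [("M2", 1)]}
for key, span in load_jobs_forward(jobs).items():
    print(key, span)
```

The linear scan in `earliest_slot` is the simplest possible search. The abstract's claimed speed-up comes from maintaining each machine's feasible slots more cleverly, which this sketch does not attempt.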

11.

An FPGA-based low-cost VLIW floating-point processor for CNC applications

《Microprocessors and Microsystems》2017

In high-speed free-form surface machining, real-time motion planning and interpolation is a challenging task. This paper presents the design and implementation of a dedicated processor for the interpolation task in computerized numerical control (CNC) machine tools. A jerk-limited look-ahead motion planning and interpolation algorithm has been integrated into the interpolation processor to achieve smooth motion in high-speed machining. The processor features a compactly designed floating-point parallel computing architecture, which employs a 3-stage pipelined reduced instruction set computer (RISC) core and a very long instruction word (VLIW) floating-point arithmetic unit. A new asynchronous execution mechanism has been employed in the processor to allow multi-cycle instructions to be performed in parallel. The proposed processor has been verified on a low-cost field programmable gate array (FPGA) chip in a prototype controller. Experimental results demonstrate a significant improvement in computing performance with the interpolation processor in free-form surface machining.

12.

Instruction scheduling and transformation for a VLIW unified reduced instruction set computer/digital signal processor processor with shared register architecture

Cheng‐Yu Lee Min‐Chin Hung Rong‐Guey Chang 《Concurrency and Computation》2014,26(1):134-151

The popularity of multimedia applications has made them a major theme in embedded systems. The key component for supporting multimedia applications well is the embedded processor. Thus, we have designed and implemented an embedded processor, called the UniDual processor, to achieve this objective. Its key features are the integration of reduced instruction set computer (RISC) and digital signal processor (DSP) instructions, as well as support for a special instruction set and a shared-based clustered register architecture. However, an important open issue for UniDual is how to allocate registers efficiently. In this paper, we present a scheduling and instruction transformation approach to resolve this issue. The proposed approach schedules instructions and then transforms overlapped instructions into RISC and DSP instructions, taking communication overhead and hardware limitations into account. Compared with the greedy approach, the evaluation shows that our work is relatively effective in performance and code size reduction. Copyright © 2012 John Wiley & Sons, Ltd.

13.

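Optimization of instruction scheduling on the clustered VLIW DSP 魂芯

《微型机与应用》(Microcomputer & Its Applications) 2017,(11)

The 魂芯 DSP is a 32-bit, statically scheduled superscalar, clustered VLIW processor that supports SIMD. The chip has 4 execution clusters and 3 memory blocks, but inter-cluster data transfer and addressing consume bus bandwidth. Each cluster contains a large number of computing units, yet the instruction scheduling algorithms in the existing compiler framework target non-clustered architectures and cannot fully exploit the chip's clustered structure to generate efficient instruction-level-parallel code. Based on the clustered nature of the 魂芯 architecture, this paper proposes heuristic algorithms for cluster assignment and instruction scheduling on the 魂芯 DSP and implements them in the open-source Open64 compiler framework. Experimental results show that the implementation in the 魂芯 DSP compiler significantly improves the performance of some compute-intensive programs on the DSP.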

14.

Pragmatic integrated scheduling for clustered VLIW architectures

Rahul Nagpal Y. N. Srikant 《Software》2008,38(3):227-257

Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). Various clustered VLIW configurations, connectivity types, and inter-cluster communication models present different performance trade-offs to a scheduler. The scheduler is responsible for resolving the conflicting requirements of exploiting the parallelism offered by the hardware and limiting the communication among clusters to achieve better performance. In this paper, we describe our experience with developing a pragmatic scheme and also a generic graph-matching-based framework for cluster scheduling based on a generic and realistic clustered machine model. The proposed scheme effectively utilizes the exact knowledge of available communication slots, functional units, and load on different clusters as well as future resource and communication requirements known only at schedule time. The proposed graph-matching-based framework for cluster scheduling resolves the phase-ordering and fixed-ordering problems associated with earlier schemes for scheduling clustered VLIW architectures. The experimental evaluation in the context of a state-of-the-art commercial clustered architecture (using real-world benchmark programs) reveals a significant performance improvement over the earlier proposals, which were mostly evaluated using compiled simulation of hypothetical clustered architectures. Our results clearly highlight the importance of considering the peculiarities of commercial clustered architectures and of hard-nosed performance measurement. Copyright © 2007 John Wiley & Sons, Ltd.

15.

Pipelining and bypassing in a VLIW processor

Abnous A. Bagherzadeh N. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(6):658-664

This short note describes issues involved in the bypassing mechanism for a very long instruction word (VLIW) processor and its relation to the pipeline structure of the processor. The authors first describe the pipeline structure of their processor, analyze its performance, and compare it to typical RISC-style pipeline structures in the context of a processor with multiple functional units. Next they study the performance effects of various bypassing schemes in terms of their effectiveness in resolving pipeline data hazards and their effect on the processor cycle time.

16.

An efficient list scheduling algorithm for time placement problem

Abdellatif Mtibaa, Bouraoui Ouni 《Computers & Electrical Engineering》2007,33(4):285-298

17.

A time-predictable VLIW processor and its compiler support

Jun Yan Wei Zhang 《Real-Time Systems》2008,38(1):67-84

Time predictability is an important requirement for real-time embedded application domains such as automotive, air transportation, and multimedia processing. However, the architectural design of modern microprocessors mainly concentrates on improving the average-case performance, which can significantly compromise the time predictability and can make accurate worst-case performance analysis extremely difficult if not impossible. This paper studies the time predictability of VLIW (Very Long Instruction Word) processors and its compiler support. We analyze the impediments to time predictability for VLIW processors and propose compiler-based techniques to address these problems with minimal disturbance on the VLIW hardware design. The VLIW compiler is enhanced to support full if conversion, hyperblock scheduling, and intra-block nop insertion to enable efficient WCET (Worst Case Execution Time) analysis for VLIW processors. Our experiments indicate that the time-predictability of VLIW processor can be improved significantly.

18.

Shared processor scheduling

Dariusz Dereniowski Wiesław Kubiak 《Journal of Scheduling》2018,21(6):583-593

We study the shared processor scheduling problem with a single shared processor to maximize total weighted overlap, where an overlap for a job is the amount of time it is processed on its private and shared processor in parallel. A polynomial-time optimization algorithm has been given for the problem with equal weights in the literature. This paper extends that result by showing an \(O(n \log n)\)-time optimization algorithm for a class of instances in which non-decreasing order of jobs with respect to processing times provides a non-increasing order with respect to weights—this instance generalizes the unweighted case of the problem. This algorithm also leads to a \(\frac{1}{2}\)-approximation algorithm for the general weighted problem. The complexity of the weighted problem remains open.