1.

Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization

Hamid Noori Farhad Mehdipour Koji Inoue Kazuaki Murakami 《The Journal of supercomputing》2012,60(2):196-222

Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance and energy efficiency of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market and significant non-recurring engineering and design costs remain issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a multi-cycle matrix of functional units with conditional execution capability. To generate more effective custom instructions, they are extended across basic blocks; hence, multi-exit custom instructions and the intuition behind them are introduced. Conditional execution capability has been added to the RFU to support the multi-exit feature of custom instructions. Because the proposed RFU has limited hardware resources (i.e., connections and processing elements), an integrated mapping and temporal-partitioning framework is proposed to guarantee that the generated custom instructions can be mapped onto the RFU (mappable custom instructions). Experimental results show that multi-exit custom instructions enhance performance and energy efficiency by an average of 32% and 3%, respectively, compared to custom instructions limited to one basic block. A maximum speedup of 4.9 and an average speedup of 1.9, compared to a single-issue embedded processor, were achieved on the MiBench benchmark suite. The maximum and average energy savings are 56% and 22%, respectively. These performance and energy-efficiency gains are obtained at the cost of a 30% area overhead.

2.

An architecture framework for an adaptive extensible processor

Hamid Noori Farhad Mehdipour Kazuaki Murakami Koji Inoue Morteza Saheb Zamani 《The Journal of supercomputing》2008,45(3):313-340

To improve the performance of embedded processors, an effective technique is collapsing critical computation subgraphs as application-specific instruction set extensions and executing them on custom functional units. The problem with this approach is the immense cost and the long time required to design a new processor for each application. As a solution to this issue, we propose an adaptive extensible processor in which custom instructions (CIs) are generated and added after chip fabrication. To support this feature, custom functional units are replaced by a reconfigurable matrix of functional units (FUs). A systematic quantitative approach is used for determining the appropriate structure of the reconfigurable functional unit (RFU). We also introduce an integrated framework for generating mappable CIs on the RFU. Using this architecture, performance is improved by up to 1.33×, with an average improvement of 1.16×, compared to a 4-issue in-order RISC processor. By partitioning the configuration memory, detecting similar/subset CIs, and merging small CIs, the size of the configuration memory is reduced by 40%.

3.

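Rapid Implementation of a Customized RISC-V Processor on FPGA

陆松 蒋句平 任会峰 《计算机工程与科学》2022,44(10):1747-1752

With the growing popularity of the RISC-V instruction set, a number of open-source and commercial soft IP cores have emerged for fields such as IoT smart hardware, embedded systems, AI chips, security devices, and high-performance computing. Balancing performance, power, and area requires an instruction set that can be trimmed and easily extended, along with a matching software development environment. To this end, a customized RISC-V processor is rapidly implemented on an FPGA by following a flow of adding custom instructions, extending the ALU functional units, connecting control signals and datapaths, FPGA prototype verification, customizing the cross-compilation environment, and testing with application programs. Taking matrix-operation acceleration as an example, a custom instruction that computes the vector inner product is designed on the open-source Hummingbird E203 (蜂鸟E203) IP core and verified in an FPGA prototype. Application tests show that the customized RISC-V processor significantly improves computing performance, with a speedup of 5.3 to 7.6 for matrix multiplication.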
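
As a rough illustration of why a vector inner-product instruction helps matrix multiplication, the sketch below models the idea in software. The function name `vdot` and the data layout are assumptions made for this example; they are not taken from the paper, which implements the operation in hardware on the E203 core.

```python
def vdot(a, b):
    """Software stand-in for a hypothetical vector inner-product instruction."""
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same length")
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # one multiply-accumulate step per element pair
    return acc


def matmul(A, B):
    """Multiply A (n x k) by B (k x m), issuing one vdot per output element."""
    cols = list(zip(*B))  # transpose B so each column is a contiguous vector
    return [[vdot(row, col) for col in cols] for row in A]


if __name__ == "__main__":
    print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each output element costs one call to `vdot`, so replacing its inner loop with a single hardware operation removes most of the per-element instruction overhead. The actual gain depends on vector length and memory access, which this model ignores.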

4.

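Cache Reliability Evaluation Based on Instruction Behavior

周学海 余洁 李曦 王志刚 《计算机研究与发展》2007,44(4):553-559

Soft errors are caused by strikes from high-energy particles and seriously harm processor reliability. As processor design goals shift toward low power, high performance, and low supply voltage, soft errors occur increasingly often, and processor reliability research is receiving growing attention. To address the low efficiency of traditional reliability evaluation based on fault-injection simulation, a systematic cache reliability evaluation method is proposed that takes one reliability metric, the architectural vulnerability factor (AVF), as its object of study. On the one hand, instruction behavior is used to identify instructions that do not affect the final result during application execution, and thereby to determine which instructions contribute to the cache's AVF. On the other hand, based on the cache's storage type and write policy, and on the characteristics of its data/instruction arrays and address tag arrays, the effect on AVF of the various combinations of adjacent operations on the cache is studied, completing the information analysis that AVF evaluation requires. In the experiments, AVF is evaluated for the instruction array of the instruction cache in the PISA architecture, demonstrating the method's effectiveness.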
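
For orientation, the AVF of a storage structure is commonly defined as the fraction of its bit-cycles that hold state required for architecturally correct execution (ACE). The sketch below computes that ratio from a list of ACE residency intervals. The interval format and the function name `avf` are assumptions for this example, and the paper's own analysis of instruction behavior and cache operations is not reproduced here.

```python
def avf(ace_intervals, num_bits, total_cycles):
    """Return ACE bit-cycles divided by all bit-cycles of the structure.

    ace_intervals: iterable of (start_cycle, end_cycle, bits) tuples, each
    saying that `bits` bits held ACE state from start_cycle up to end_cycle.
    Intervals are assumed not to double-count the same bits.
    """
    if num_bits <= 0 or total_cycles <= 0:
        raise ValueError("num_bits and total_cycles must be positive")
    ace_bit_cycles = 0
    for start, end, bits in ace_intervals:
        if end < start:
            raise ValueError("interval ends before it starts")
        ace_bit_cycles += (end - start) * bits
    return ace_bit_cycles / (num_bits * total_cycles)


if __name__ == "__main__":
    # A 64-bit structure observed for 100 cycles:
    # 32 bits are ACE for 50 cycles, 16 bits are ACE for 40 cycles.
    print(avf([(0, 50, 32), (60, 100, 16)], num_bits=64, total_cycles=100))
```

In this toy case the ACE bit-cycles are 50·32 + 40·16 = 2240 out of 6400, so the AVF is 0.35.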

5.

Reliability of data processing and fault compensation in unreliable arithmetic processors

《Microprocessors and Microsystems》2016

In logic circuits, such as those performing arithmetic operations in a processor system, arbitrary faults will become an increasingly significant concern in the future. Modern manufacturing processes lead to lower reliability and higher vulnerability of software execution to soft errors. The correctness of results is especially important for safety-critical applications, whose reliability depends on the fault-free execution of every single instruction and on the dependencies between instructions. The more complex the software, the less reliable its outcome. There is, however, a countervailing effect: as the probability of multiple faults increases, so does the chance that two faults compensate each other and the result is correct again. This paper presents the basic ideas for such a reliability evaluation of a software's data flow under arbitrary soft errors and the effect of fault compensation. Further, this evaluation makes it possible to compare different implementations of a data flow with respect to reliability. This is shown by comparing two different error codes as alternatives for coded data processing.

6.

Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration

Clark Nathan Zhong Hongtao Tang Wilkin Mahlke Scott 《International journal of parallel programming》2003,31(6):429-449

General-purpose processors are often incapable of meeting the challenging cost, performance, and power demands of high-performance applications. To meet these demands, most systems employ a number of hardware accelerators to off-load the computationally demanding portions of the application. As an alternative to this strategy, we examine customizing the computation capabilities of a processor for a particular application. The processor is extended with hardware in the form of a set of custom function units and instruction set extensions. To effectively identify opportunities for creating custom hardware, a dataflow graph design space exploration engine heuristically identifies candidate computation subgraphs without artificially constraining their size or shape. The engine combines estimates of performance gain, cost, and inherent limitations of the processor to grow candidate graphs in profitable directions while pruning unprofitable paths. This paper describes the dataflow graph exploration engine and evaluates its effectiveness across a set of embedded applications.

7.

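A Soft-Error Vulnerability Control Method Based on Matching the Instruction-Stream Mix Ratio to the Functional Units

唐柳 黄樟钦 张会兵 《计算机应用研究》2017,34(1)

The issue queue is a key structure in the processor pipeline, and reducing its sensitivity to soft errors has become a problem that microprocessor reliability design cannot ignore. This paper proposes a soft-error vulnerability control method applied at the front end of the processor pipeline. Without changing the functional units, the method adjusts the proportion of instruction types in the issue queue according to how well the instruction-stream mix ratio matches the functional-unit configuration. This shortens the time instructions wait in the issue queue, lowers the issue queue's architectural vulnerability factor (AVF), and mitigates soft-error sensitivity. Experimental results show that the method reduces the issue queue's AVF by about 2.8% on average and improves IPC/AVF by about 4.9%.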

8.

Using Underutilized CPU Resources to Enhance Its Reliability

Timor Avi Mendelson Avi Birk Yitzhak Suri Neeraj 《Dependable and Secure Computing, IEEE Transactions on》2010,7(1):94-109

Soft errors (or transient faults) are temporary faults that arise in a circuit due to a variety of internal noise and external sources such as cosmic particle hits. Though soft errors still occur infrequently, they are rapidly becoming a major impediment to processor reliability. This is due primarily to processor scaling characteristics. In the past, systems designed to tolerate such faults utilized costly customized solutions, entailing the use of replicated hardware components to detect and recover from microprocessor faults. As the feature size keeps shrinking and with the proliferation of multiprocessors on a die in all segments of computer-based systems, the capability to detect and recover from faults is also desired for commodity hardware. For such systems, however, performance and power constitute the main drivers, so the traditional solutions prove inadequate and new approaches are required. We introduce two independent and complementary microarchitecture-level techniques: Double Execution and Double Decoding. Both exploit the typically low average resource utilization of modern processors to enhance processor reliability. Double Execution protects the out-of-order part of the CPU by executing each instruction twice. Double Decoding uses a second, low-performance, low-power instruction decoder to detect soft errors in the decoder logic. These simple-to-implement techniques are shown to improve the processor's reliability with relatively low performance, power, and hardware overheads. Finally, the resulting “excessive” reliability can even be traded back for performance by increasing the clock rate and/or reducing the voltage, thereby improving upon single-execution approaches.

9.

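Optimizing Circuit Soft Error Rate and Overhead Based on Sensitive Register Replacement

孙岩 张民选 李少青 高昌垒 《计算机研究与发展》2011,48(1)

As integrated circuits evolve, logic circuits are becoming increasingly sensitive to soft errors induced by radiation particles. Existing circuit hardening techniques usually incur large area overhead. Considering both the soft error rate and the area overhead of a circuit, a new hardening evaluation metric, FAP, is proposed, together with a register replacement technique based on a greedy algorithm that makes the circuit immune to soft errors by replacing some of its sensitive registers with redundant registers. Because the greedy algorithm sometimes fails to reach the overall optimum of reliability and overhead, a reliability-overhead-optimal heuristic replacement algorithm is further proposed. Experimental results show that the greedy register replacement technique reduces the circuit soft error rate by 90% with only 50% area overhead, while the reliability-overhead-optimal heuristic replacement algorithm reduces it by more than 90% with only about 45% area overhead. Compared with other existing techniques, the circuit soft-error immunity technique achieves a better trade-off between reliability and area overhead.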
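
The general shape of a greedy register-replacement pass can be sketched as follows. This is a generic illustration, not the paper's algorithm: the abstract does not define the FAP metric, so the ranking below uses a plain ratio of soft-error contribution to added area, and it assumes that hardening a register removes its whole contribution. The function name, data layout, and numbers are all hypothetical.

```python
def greedy_replace(registers, target_reduction):
    """Pick registers to harden until the target SER reduction is reached.

    registers: dict mapping name -> (ser_contribution, extra_area), where
    extra_area > 0 is the assumed cost of swapping in a redundant register.
    target_reduction: desired fraction of total SER to remove (0..1).
    Returns (chosen_names, achieved_reduction_fraction, total_extra_area).
    """
    total_ser = sum(ser for ser, _ in registers.values())
    if total_ser <= 0:
        raise ValueError("total soft error rate must be positive")
    # Greedy order: largest SER removed per unit of added area first.
    ranked = sorted(registers,
                    key=lambda r: registers[r][0] / registers[r][1],
                    reverse=True)
    chosen, removed, area = [], 0.0, 0.0
    for name in ranked:
        if removed >= target_reduction * total_ser:
            break  # target met, stop adding overhead
        ser, cost = registers[name]
        chosen.append(name)
        removed += ser
        area += cost
    return chosen, removed / total_ser, area


if __name__ == "__main__":
    regs = {"r0": (50.0, 2.0), "r1": (30.0, 1.0),
            "r2": (15.0, 3.0), "r3": (5.0, 2.0)}
    print(greedy_replace(regs, target_reduction=0.9))
```

A pass like this commits to each locally best choice and never revisits it, which is one way a greedy method can miss the overall best balance of reliability and overhead that the abstract mentions.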

10.

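Research and Design of an Embedded RISC-V Out-of-Order Execution Processor

李雨倩 焦继业 刘有耀 郝振和 《计算机工程》2021,47(2):261-267,284

To meet the small-area, high-performance requirements of embedded devices, a 32-bit synthesizable out-of-order processor based on the open-source RISC-V instruction set is designed. The processor incorporates key techniques such as branch prediction and dependency handling, and supports the RISC-V base integer, multiply/divide, and compressed instruction sets. It adopts a three-stage pipeline with in-order single issue, out-of-order execution, and out-of-order write-back, and uses the Harvard architecture and the AHB bus protocol to allow parallel access to instructions and data. Functional verification is completed on an Artix-7 (XC7A35T-L1CSG324I) FPGA development board at a 50 MHz clock frequency, with a measured power consumption of 7.9 mW. Experimental results from synthesis at the SMIC 110 nm ASIC technology node, compared under the same conditions with processors such as the ARM Cortex-M3, show that the system reduces area by 64% and power by 0.57 mW, making it suitable for small-area, low-power embedded applications.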

11.

The QC-2 parallel Queue processor architecture

Ben A. Abderazek Arquimedes Canedo Tsutomu Yoshinaga Masahiro Sowa 《Journal of Parallel and Distributed Computing》2008

Queue-based instruction set architecture processors offer an attractive option in the design of embedded systems. In our previous work, we proposed a novel queue processor architecture as a starting point for hardware/software design space exploration for embedded applications. In this paper, we present a high-performance 32-bit Synthesizable QueueCore (QC-2), an improved and optimized version of the produced-order parallel Queue processor (PQP), with single-precision floating-point support. The QC-2 core also implements a novel technique used to extend immediate values and memory instruction offsets that were otherwise not representable because of bit-width constraints in the PQP processor.

12.

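Design and Implementation of a RISC-V Microprocessor Based on the AHB Bus

郝振和 焦继业 李雨倩 《计算机工程与应用》2020,56(20):52-58

To meet small-area, low-power design requirements in embedded applications, a microprocessor supporting the RISC-V instruction set architecture is designed. The system uses a two-stage pipeline and implements the RV32IMAC instruction set. The processor uses the AHB bus as its on-chip interconnect, which makes it easy to attach external IP cores for functional extension. The microprocessor's logic functions are verified in the VCS environment, and simulation results show that it runs normally and stably. Its area, power, and performance are compared with the Hummingbird E203 (蜂鸟E203) processor and the ARM Cortex-M series processors: the design is 6% smaller in area than the Hummingbird E203 and comparable in power and performance to the Cortex-M0. The analysis shows that the processor is well suited to small-area, low-power embedded applications.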

13.

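Software Verification Methods in Hardware/Software Co-Verification of Embedded Systems (cited by: 1; self-citations: 0; citations by others: 1)

王世好 王歆民 刘明业 《计算机研究与发展》2005,42(3):514-519

With the development of integrated circuits and computer technology, embedded system design is becoming increasingly complex. For complex embedded system designs, verification is usually used to check design correctness. Hardware verification usually builds, from the hardware design description, a hardware simulator that models the hardware's function. A common approach to software verification is to build a functional model of the processor (an instruction set simulator, ISS) that interprets, instruction by instruction, how the embedded software executes on the target machine, producing simulated output that drives the peripheral circuitry (that is, the hardware design). The ISS simulates the embedded software running on the target CPU at the level of low-level timing. For complex embedded system designs, ISS simulation speed usually becomes the bottleneck of co-simulation. An RTOS-based fast verification method for embedded software can effectively raise software simulation speed: it extends the RTOS's functionality to suit co-simulation, builds hardware simulation drivers, and implements the communication link between the software and the hardware simulator as well as co-simulation synchronization control. The RTOS-based method builds on a compiled-code model and verifies embedded software function at the system behavior level, so verification is fast. In practice, combining this method with ISS verification enables more effective and faster embedded system co-verification. Finally, control software is written for several typical hardware designs and hardware/software co-verification experiments are carried out; the experimental data show that the method is practical, effective, and fast.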

14.

Customized pipeline and instruction set architecture for embedded processing engines

Amir Yazdanbakhsh Mostafa E. Salehi Sied Mehdi Fakhraie 《The Journal of supercomputing》2014,68(2):948-977

Custom instructions potentially improve execution speed and code compression of embedded applications. However, more efficient custom instructions need a higher number of simultaneous register file accesses. Larger register files are more power hungry and require complex forwarding interconnects. Therefore, due to the limited ports of the base processor's register file, the size and efficiency of custom instructions may be limited. Recent research has focused on overcoming this limitation through innovative architectural techniques supplemented with customized compilation. However, to the best of our knowledge, few studies take into account the complete pipeline design and implementation considerations. This paper proposes a customized instruction set and pipeline architecture for an optimized embedded engine. The proposed architecture increases performance by enhancing the available register file data bandwidth through register access pipelining. The improvements are achieved by introducing double-word custom instructions whose register file accesses are overlapped in the pipeline. Potential hazards in such instructions are resolved by the introduced pipeline backwarding concept, yielding higher performance and code compression. While we study the effectiveness of the proposed architecture on domain-specific workloads from packet-processing benchmarks, the developed framework and architecture are applicable to other embedded application domains.

15.

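Design and Optimization Techniques of an ARMv4 Instruction Set Simulator (cited by: 3; self-citations: 0; citations by others: 3)

严迎建 刘明业 《小型微型计算机系统》2005,26(2):315-317

The instruction set simulator is one of the important tools in processor, compiler, and embedded system design. This paper first discusses the classification and characteristics of instruction set simulators, then describes the implementation of an ARMv4 instruction set simulator that the authors developed using interpretation, and finally discusses several performance optimization techniques for improving simulation efficiency.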

16.

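Research and Implementation of Custom Instructions on the Open-Source Soft-Core Processor OpenRISC

陈俊 陈更生 《计算机应用与软件》2010,27(1):68-69,113

This paper describes in detail how to implement custom instructions on OpenRISC. It first briefly explains the advantages of soft cores and then, building on these advantages, discusses two methods for optimizing computation-intensive program segments. After comparing the two methods, it selects custom instructions and presents the implementation steps.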

17.

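Design of a Hybrid Superscalar and VLIW Processor Based on the MIPS Instruction Set

李源 马海林 何虎 《计算机应用研究》2016,33(6)

In response to the increasingly strong demand for high-performance, low-power processors in embedded and mobile devices, a design is proposed for a hybrid in-order superscalar and very long instruction word (VLIW) processor based on the MIPS instruction set. It makes it possible to replace the heterogeneous CPU-plus-DSP structure commonly used in industry with a homogeneous multicore architecture, reducing power and area, while obtaining good DSP performance in VLIW mode. A cycle-accurate software simulator of the processor is built in the LISA language on the PD (Processor Designer) platform. General-purpose performance is verified with the Dhrystone and CoreMark benchmarks, and DSP performance with the EEMBC telecom test programs. The results show that the hybrid architecture achieves high digital signal processing performance at low hardware cost and is well suited to high-performance, low-power processor applications.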

18.

Evaluating the impact of reissued instructions on data speculative processor performance

Toshinori 《Microprocessors and Microsystems》2002,25(9-10):469-482

In this paper, we investigate the impact of instructions reissued due to misspeculated data dependences on processor performance. Recently, speculation in resolving data dependences has been studied as a means of extracting more instruction-level parallelism. When a misspeculation occurs, the processor state must be reverted to a safe point where the speculation was initiated, and an instruction reissue mechanism is used for that purpose. Instruction reissue incurs a smaller misspeculation penalty than instruction squashing, which handles misspeculated control flow in current-generation processors, but causes redundant instruction dispatching, i.e., multiple copies of an instruction are in flight in the functional units. The effectiveness of data speculation would be diminished if reissued instructions caused serious structural hazards. Therefore, we evaluate how instruction reissue affects processor performance using an execution-driven simulator. We find that the overhead due to instruction reissue is small enough to allow data speculation to contribute to processor performance.

19.

Supporting multiple-input, multiple-output custom functions in configurable processors

《Journal of Systems Architecture》2007,53(5-6):263-271

Configurable processors have emerged as a promising solution for high-performance embedded systems. Many of these processors extend a RISC core with configurable functional units that execute dual-input, single-output (DISO) custom functions. Although studies have shown that supporting multiple-input, multiple-output (MIMO) custom functions can lead to significant speedups, mechanisms to efficiently achieve this have not been adequately addressed. The underlying reason is that a custom function is normally invoked by a single instruction, which usually transfers only two inputs and one output. Attempts to transfer more inputs and outputs in one instruction are impeded by the instruction length and the register file’s R/W ports. This paper proposes a simple extension to transfer multiple inputs and outputs of the custom functions using repeated instructions. While transferring the inputs and outputs may take a few extra cycles, our experiments show that the MIMO extension can still achieve an average 51% increase in speedup compared to a DISO extension and an average 27% increase in speedup compared to a multiple-input, single-output (MISO) extension.

20.

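Research and Design of a Cache in a Cryptographic Embedded Processor

王晓燕 杨先文 陈海民 《计算机工程与设计》2012,33(8):3000-3005

To improve the running efficiency of a cryptographic embedded processor, a Harvard-architecture cache design is presented, comprising an instruction cache (iCache) and a data cache (dCache). The tag memory and the instruction/data memory are designed using dual-port RAM with low hardware overhead, and the iCache and dCache control flows are described. In the implementation, the iCache is configured with a capacity of 4 KB and the dCache with 8 KB, and the cache is integrated into the cryptographic embedded processor. FPGA verification shows that it meets the processor's application requirements; performance analysis shows that using the cache is at least 5.26 times faster than having the processor access main memory directly.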