
1.

Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization

Hamid Noori Farhad Mehdipour Koji Inoue Kazuaki Murakami 《The Journal of supercomputing》2012,60(2):196-222

Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance and energy efficiency of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market and significant non-recurring engineering and design costs remain issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a multi-cycle matrix of functional units with the capability of conditional execution. To generate more effective custom instructions, they are extended over basic blocks; hence, multi-exit custom instructions and the intuition behind them are introduced. Conditional execution capability has been added to the RFU to support the multi-exit feature of custom instructions. Because the proposed RFU has limitations on hardware resources (i.e., connections and processing elements), an integrated mapping-temporal partitioning framework is proposed to guarantee that the generated custom instructions can be mapped on the RFU (mappable custom instructions). Experimental results show that multi-exit custom instructions enhance the performance and energy efficiency by an average of 32% and 3%, respectively, compared to custom instructions limited to one basic block. A maximum speedup of 4.9, compared to a single-issue embedded processor, and an average speedup of 1.9 were achieved on the MiBench benchmark suite. The maximum and average energy savings are 56% and 22%, respectively. These performance and energy-efficiency gains are obtained at the cost of a 30% area overhead.

2.

Thread-Sensitive Instruction Issue for SMT Processors

《Computer Architecture Letters》2004,3(1):5-5

Simultaneous multithreading (SMT) is a processor design method in which concurrent hardware threads share processor resources such as functional units and memory. The scheduling complexity and performance of an SMT processor depend on the topology used in the fetch and issue stages. In this paper, we propose a thread-sensitive issue policy for a partitioned SMT processor that is based on a thread metric. We propose the number of ready-to-issue instructions of each thread as the priority metric. To evaluate our method, we have developed a reconfigurable SMT simulator on top of the SimpleScalar Toolset. We simulated our modeled processor under several workloads composed of SPEC benchmarks. Experimental results show around 30% improvement compared to the conventional OLDEST_FIRST mixed-topology issue policy. Additionally, the hardware implementation of our architecture with this metric in the issue stage is quite simple.
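The priority metric named in this abstract (each thread's number of ready-to-issue instructions) can be illustrated with a minimal Python sketch. The function name `select_issue`, the `issue_width` parameter, and the tie-breaking rule (lower thread id first) are assumptions made for illustration only; the abstract does not specify them, and this is not the authors' simulator.

```python
def select_issue(ready_counts, issue_width):
    """Hand out issue slots to threads, highest ready count first.

    ready_counts: dict mapping thread id -> number of ready-to-issue instructions.
    issue_width:  total instructions that may issue in this cycle (assumed parameter).
    Returns a dict mapping thread id -> instructions issued this cycle.
    """
    # Rank threads by ready count, descending; ties go to the lower thread id
    # (an assumed tie-break, not taken from the abstract).
    order = sorted(ready_counts, key=lambda tid: (-ready_counts[tid], tid))
    issued = {}
    slots = issue_width
    for tid in order:
        if slots == 0:
            break
        n = min(ready_counts[tid], slots)
        if n > 0:
            issued[tid] = n
            slots -= n
    return issued
```

For instance, with three threads holding 2, 5, and 3 ready instructions and an issue width of 6, the sketch gives five slots to the second thread and the remaining slot to the third.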

3.

A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths

《Journal of Systems Architecture》2014,60(7):592-614

In this paper we present a framework for realizing arbitrary instruction set extensions (IE) that are identified post-silicon. The proposed framework has two components viz., an IE synthesis methodology and the architecture of a reconfigurable data-path for realization of such IEs. The IE synthesis methodology ensures maximal utilization of resources on the reconfigurable data-path. In this context we present the techniques used to realize IEs for applications that demand high throughput or those that must process data streams. The reconfigurable hardware called HyperCell comprises a reconfigurable execution fabric. The fabric is a collection of interconnected compute units. A typical use case of HyperCell is where it acts as a co-processor with a host and accelerates execution of IEs that are defined post-silicon. We demonstrate the effectiveness of our approach by evaluating the performance of some well-known integer kernels that are realized as IEs on HyperCell. Our methodology for realizing IEs through HyperCells permits overlapping of potentially all memory transactions with computations. We show significant improvement in performance for streaming applications over general purpose processor based solutions, by fully pipelining the data-path.

4.

An analytical method for reliability aware instruction set extension

Ali Azarpeyvand Mostafa E. Salehi Sied Mehdi Fakhraie 《The Journal of supercomputing》2014,67(1):104-130

Random variations and low reliability of new nanometer-scale silicon are the most important concerns for the fault-tolerant design of large-area, powerful integrated circuits. Logic faults in the form of soft errors or transient faults are now serious problems for embedded processing cores. Recently, augmenting an embedded processor with application-specific custom instructions has been widely used to improve processor performance. Although area, power, and performance of an augmented processor have been considered for efficient custom instruction selection, its reliability also needs to be considered. This is impeded by the need for exhaustive fault injection and lengthy, expensive simulations. This demand becomes more serious in the case of many-core, larger-area and, therefore, more fault-prone integrated circuits, e.g., tera-computing processors. In this work, we propose an analytical modeling solution for such a demanding problem. First, a simple analytical method is introduced that can evaluate the vulnerability of a custom instruction in a time-saving manner. Using this method and our configurable custom instruction vulnerability analysis framework, the effects of type, order, and word length of various operations of different custom instruction subgraphs on the vulnerability of an extensible processor have been explored analytically and experimentally. Based on our results, for example, changing the order of operators in custom functional units could yield different vulnerabilities to soft errors. Therefore, our approach enables designers to optionally constrain the operand types and also the custom functional unit structures to reach an acceptable vulnerability level at low computational and design time costs.

5.

PipeRench: a reconfigurable architecture and compiler (total citations: 1; self-citations: 0; citations by others: 1)

Goldstein S.C. Schmit H. Budiu M. Cadambi S. Moe M. Taylor R.R. 《Computer》2000,33(4):70-77

With the proliferation of highly specialized embedded computer systems has come a diversification of workloads for computing devices. General-purpose processors are struggling to efficiently meet these applications' disparate needs, and custom hardware is rarely feasible. According to the authors, reconfigurable computing, which combines the flexibility of general-purpose processors with the efficiency of custom hardware, can provide the alternative. PipeRench and its associated compiler comprise the authors' new architecture for reconfigurable computing. Combined with a traditional digital signal processor, microcontroller or general-purpose processor, PipeRench can support a system's various computing needs without requiring custom hardware. The authors describe the PipeRench architecture and how it solves some of the pre-existing problems with FPGA architectures, such as logic granularity, configuration time, forward compatibility, hard constraints and compilation time.

6.

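TTAH: a reconfigurable hash function processor based on the transport triggered architecture

赵学秘 王志英 戴葵 陆洪毅 《计算机工程与科学》2007,29(3):66-69

Hash functions are an effective means of ensuring data integrity in cryptography, and performance requirements force some applications to implement them in hardware. By analyzing the algorithmic similarities among commonly used hash functions, this paper designs dedicated reconfigurable units and couples them into a transport triggered architecture, yielding a reconfigurable hash function processor, TTAH. Mapping results for common hash algorithms on TTAH show that, compared with fine-grained reconfigurable architectures, it is faster and achieves higher resource utilization; compared with ASICs, it effectively supports multiple common hash functions with only a small increase in overhead.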

7.

Toward advanced parallel processing: exploiting parallelism at task and instruction levels

Fukuda A. Murakami K. Tomita S. 《Micro, IEEE》1991,11(4)

The status of two projects that entail the development of a reconfigurable parallel processor system with 128 Sparc microprocessors and a superscalar processor with four operations proceeding in parallel is discussed. The design principles, system configuration, processing element, network architecture, and memory architecture of the reconfigurable processors (called KRPP) are described. The operating system for KRPP is discussed. The architecture for the superscalar (called a dynamically hazard-resolved, statically code-scheduled, nonuniform superscalar) is presented.

8.

Performance modeling of distributed memory architectures

S. Lennart Johnsson 《Journal of Parallel and Distributed Computing》1991,12(4)

We provide performance models for several primitive operations on data structures distributed over memory units interconnected by a Boolean cube network. In particular, we model single-source and multiple-source concurrent broadcasting or reduction, concurrent gather and scatter operations, shifts along several axes of multidimensional arrays, and emulation of butterfly networks. We also show how the processor configuration, the data aggregation, and the encoding of the address space affect the performance for two important basic computations: the multiplication of arbitrarily shaped matrices and the Fast Fourier Transform. We also give an example of the performance behavior for local matrix operations for a processor with a single path to local memory and a set of processor registers. The analytic models are verified by measurements on the Connection Machine Model CM-2.

9.

Efficient mapping and acceleration of AES on custom multi‐core architectures

Amit Pande Joseph Zambreno 《Concurrency and Computation》2011,23(4):372-389

Multi‐core processors can deliver significant performance benefits for multi‐threaded software by adding processing power with minimal latency, given the proximity of the processors. Cryptographic applications are inherently complex and involve large computations. Most cryptographic operations can be translated into logical operations, shift operations, and table look‐ups. In this paper we design a novel processor (called mu‐core) with a reconfigurable Arithmetic Logic Unit, and design custom two‐dimensional multi‐core architectures on top of it to accelerate cryptographic kernels. We propose an efficient mapping of instructions from the multi‐core grid to the individual processor cores and illustrate the performance of the AES‐128E algorithm over custom‐sized grids. The model was developed using Simulink, and the performance analysis suggests a positive trend towards development of large multi‐core (or multi‐µ‐core) architectures to achieve high throughputs in cryptographic operations. Copyright © 2010 John Wiley & Sons, Ltd.

10.

Mapping of nomadic multimedia applications on the ADRES reconfigurable array processor

Mladen Berekovic Andreas Kanstein Bingfeng Mei Bjorn De Sutter 《Microprocessors and Microsystems》2009,33(4):290-294


11.

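Rapid implementation of a customized RISC-V processor on FPGA

陆松 蒋句平 任会峰 《计算机工程与科学》2022,44(10):1747-1752

With the growing popularity of the RISC-V instruction set, a number of open-source and commercial soft IP cores have emerged for fields as diverse as IoT smart hardware, embedded systems, artificial intelligence chips, security devices, and high-performance computing. Balancing performance, power, and area requires an instruction set that can be trimmed and easily extended, together with matching support from the software development environment. To this end, a customized RISC-V processor is rapidly implemented on an FPGA by following a flow of adding custom instructions, extending the ALU functional unit, connecting control signals and data paths, FPGA prototype verification, customizing the cross-compilation environment, and testing application programs. Taking the acceleration of matrix operations as an example, a custom instruction that computes the vector inner product is designed on the open-source IP 蜂鸟E203 (Hummingbird E203) and prototyped on an FPGA. Application tests show that the customized RISC-V processor delivers a significant improvement in computing performance, with a speedup of 5.3 to 7.6 for matrix multiplication.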
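The abstract above describes accelerating matrix multiplication with a custom instruction that computes a vector inner product. A minimal Python sketch of that decomposition follows. The function `vdot` is a hypothetical software stand-in for such an instruction; its name, operand format, and the row-by-column loop structure are assumptions for illustration and say nothing about the actual instruction encoding or hardware used in the paper.

```python
def vdot(a, b):
    """Software stand-in for a hypothetical inner-product instruction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def matmul(A, B):
    """Multiply matrices (lists of rows) using one vdot per output element."""
    # Transpose B once so that each column is available as a single vector.
    cols = list(zip(*B))
    return [[vdot(row, col) for col in cols] for row in A]
```

Each element of the result costs one call to `vdot`, which is the operation a single custom instruction would replace in this sketch.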

12.

Real-time image processing on a custom computing platform

Athanas P.M. Abbott A.L. 《Computer》1995,28(2):16-25

The authors explore the utility of custom computing machinery for accelerating the development, testing, and prototyping of a diverse set of image processing applications. We chose an experimental custom computing platform called Splash-2 to investigate this approach to prototyping real-time image processing designs. Custom computing platforms are emerging as a class of computers that can provide near application-specific computational performance. We developed a real-time image processing system called VTSplash, based on the Splash-2 general-purpose platform. Splash-2 is an attached processor featuring programmable processing elements (PEs) and communication paths. The Splash-2 system uses arrays of RAM-based field programmable gate arrays (FPGAs), crossbar networks, and distributed memory to accomplish the needed flexibility and performance tasks. Such platforms let designers customize specific operations for function and size, and data paths for individual applications.

13.

Selecting profitable custom instructions for reconfigurable processors

Tao Li Wu Jigang Siew-Kei Lam Thambipillai Srikanthan Xicheng Lu 《Journal of Systems Architecture》2010,56(8):340-351

Custom-instruction selection is an essential phase in instruction set extension for reconfigurable processors. It determines the most profitable custom-instruction candidates for implementing in the reconfigurable fabric of a reconfigurable processor. In this paper, a practical computing model is proposed for the custom-instruction selection problem that takes into account the area constraint of the reconfigurable fabric. Based on the new computing model, two heuristic algorithms and an exact algorithm are proposed. The first heuristic algorithm, denoted as HEA, dynamically assigns priorities to the custom instruction candidates and incorporates efficient strategies to select custom instructions with the highest priority. The second heuristic algorithm, denoted as TSA, employs an efficient tabu search algorithm to refine the results of HEA to near-optimal ones. Also, a branch-and-bound algorithm (BnB) is proposed to produce exact solutions for relatively small-sized problems or problems with stringent area-constraints. Experimental results show that HEA can produce more specific approximate solutions with a difference of only about 3% when compared to the optimal solutions produced by BnB. This difference is further reduced to about 0.6% by TSA. In addition, for large-sized problems where the exact algorithm becomes prohibitive, HEA and TSA can still produce solutions within reasonable time.
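The selection problem in this abstract (choose the most profitable candidates subject to an area constraint) can be illustrated with a small Python sketch. It is not HEA, TSA, or BnB from the paper: the greedy rule below (rank by gain per unit area) and the exhaustive search used as the exact reference are simple substitutes chosen for illustration, and the candidate tuples `(name, gain, area)` with strictly positive areas are assumed, hypothetical data.

```python
from itertools import combinations


def greedy_select(candidates, area_budget):
    """Pick candidates by descending gain/area ratio while they still fit."""
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    for name, gain, area in ranked:
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen


def exact_select(candidates, area_budget):
    """Try every subset; only practical for a handful of candidates."""
    best_gain, best = 0, []
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            area = sum(c[2] for c in subset)
            gain = sum(c[1] for c in subset)
            if area <= area_budget and gain > best_gain:
                best_gain, best = gain, [c[0] for c in subset]
    return best


# Hypothetical candidates: (name, gain, area).
CANDIDATES = [("ci_a", 60, 10), ("ci_b", 100, 20), ("ci_c", 120, 30)]
```

With an area budget of 50, the greedy rule picks `ci_a` and `ci_b` (total gain 160), while the exhaustive search finds `ci_b` and `ci_c` (total gain 220), which shows why a heuristic result is usually compared against an exact solver on small instances.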

14.

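Research and design of a low-rate reconfigurable vocoder

荆涛 王沁 赵宏智 《计算机工程》2008,34(7):235-237

To meet the demands of digital voice communication for high performance and high flexibility, a vocoder based on the SELP speech coding and decoding algorithm is designed that is reconfigurable, highly parallel, programmable, and secure. The paper introduces the vocoder's functions, design goals, feature design, and architecture design, and describes the design of functional components such as the four-level reconfigurable ALU unit and the data path unit, as well as the variable-length VLIW application-specific instruction set.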

15.

On the hardware implementation of RIPEMD processor: Networking high speed hashing, up to 2 Gbps

N. Sklavos O. Koufopavlou 《Computers & Electrical Engineering》2005,31(6):361-379

The continued growth of both wired and wireless communications has triggered the revolution for high speed security implementations. RIPEMD hash functions are widely used in many applications of cryptography. A reconfigurable processor architecture and the VLSI implementation of these functions are proposed in this work. The introduced processor is reconfigurable in the sense that it performs alternatively all RIPEMD hash functions. In order to indicate the advantages of the proposed design, each one of these hash functions has also been implemented in a separate hardware device (FPGA). The proposed processor FPGA implementation achieves high speed hashing up to 2 Gbps. Compared with previously published hardware designs, the proposed processor has higher performance in the range from 22 to 30 times. It also performs much better than the assembly language implementations of the RIPEMD-128 and RIPEMD-160. The proposed processor could be used for the implementation of data integrity units, and in many other sensitive cryptographic applications, such as digital signatures, message authentication codes and random number generators.

16.

Data-parallel C on a reconfigurable logic array

Maya Gokhale Brian Schott 《The Journal of supercomputing》1995,9(3):291-313


17.

Virtualization of reconfigurable coprocessors in HPRC systems with multicore architecture

Ivan Gonzalez Sergio Lopez-Buedo Gustavo Sutter Diego Sanchez-Roman Francisco J. Gomez-Arribas Javier Aracil 《Journal of Systems Architecture》2012,58(6-7):247-256

HPRC (High-Performance Reconfigurable Computing) systems include multicore processors and reconfigurable devices acting as custom coprocessors. Due to economic constraints, the number of reconfigurable devices is usually smaller than the number of processor cores, thus preventing a 1:1 mapping between cores and coprocessors. This paper presents a solution to this problem, based on the virtualization of reconfigurable coprocessors. A Virtual Coprocessor Monitor (VCM) has been devised for the XtremeData XD2000i In-Socket Accelerator, and a thread-safe API is available for user applications to communicate with the VCM. Two reference applications, an IDEA cipher and an Euler CFD solver, have been implemented in order to validate the proposed architecture and execution model. Results show that the benefits arising from coprocessor virtualization outweigh its overhead, especially when code has a significant software weight.

18.

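A dynamically reconfigurable cache design for an embedded processor (total citations: 1; self-citations: 0; citations by others: 1)

张毅 汪东升 《计算机工程与应用》2004,40(8):94-96,232

Processor chips generally have an on-chip cache, usually consisting of a fixed-size level-1 cache (L1) and level-2 cache (L2). This article introduces a dynamically reconfigurable cache implemented in an embedded processor design. The idea of a dynamically reconfigurable cache was first proposed by researchers at the University of Rochester in a paper on the memory hierarchy [1], where it mainly targeted high-performance superscalar general-purpose processors. In designing this embedded processor, the authors creatively adopted this idea. By adding a small amount of hardware and with compiler support, the sizes of the L1 and L2 caches can be configured dynamically for the specific application while the total size of L1 and L2 in the embedded processor remains unchanged. Dynamically configuring the cache not only effectively improves the cache hit rate but also effectively reduces the processor's power consumption.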

19.

An efficient scheme for interprocessor communication using dual-ported RAMs

Jagadish N. Kumar J.M. Patnaik L.M. 《Micro, IEEE》1989,9(5):10-19

An approach for interprocessor interconnection is described in which communication between the processor nodes involves writing into and reading from a common memory area. The communicating processors do not have to contend for a common bus as in the case of shared-memory systems, since they have independent access to the common memory units shared between them. Only the memory access time of the processors limits the communication speed. Processor-to-processor communication does not use intermediate buffers, input/output ports, or DMAs. The example of a three-dimensional cube is used to illustrate the advantages of this scheme. The implementation of the interprocessor communication scheme on a 64-node cube configuration is discussed.

20.

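Architecture-level power model and analysis of a reconfigurable processor

肖玮 臧斌宇 朱传琪 《计算机工程与应用》2007,43(26):34-37

A power model is built and implemented according to the architecture of a reconfigurable processor. The model abstracts the circuit-level characteristics of the processor, estimates static peak power from architecture-level attributes and process parameters, gathers dynamic power statistics using a performance simulator, and implements gating techniques under three conditional clocking schemes. Compared with a superscalar general-purpose microprocessor, the reconfigurable processor achieves an average speedup of 3.59, while the average growth in power consumption is only 1.48. Experiments also show that the simple CC1 gating technique effectively reduces the power consumption and hardware complexity of the reconfigurable system. The model lays a foundation for research on low-power design of reconfigurable processors and on compiler-level low-power optimization.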