Similar Documents
 20 similar documents found (search time: 31 ms)
1.
In embedded-system applications, program code contains a large number of local variables whose live ranges are usually very short. An instruction in the pipeline that needs the value of such a local variable can obtain it directly from the bypass logic, so all uses of the value complete inside the pipeline. For such local variables there is no need to write the pipeline result back to the register file (RF), which reduces the number of RF reads and writes and thus relaxes the port requirements on the register file. The key to deciding whether a result should be written back is determining the register's live range and the state of the bypass logic in the pipeline. Based on the media processor we designed, this paper proposes an algorithm for determining register live ranges in program code and uses instruction encoding to implement the corresponding hardware enable control, i.e., control over whether a pipeline result is written back to the register file. Software simulation shows that, across different DSP application programs, register-file writes are reduced by 94% on average.
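A minimal sketch of the writeback-elision idea described above, assuming a straight-line code block, a symbolic (dest, sources) instruction format, and an assumed bypass depth; it is not the paper's algorithm or instruction encoding:

```python
# Sketch: decide which results never need a register-file writeback because
# every consumer can take the value from the pipeline bypass network.
# Instruction format, register names and BYPASS_DEPTH are illustrative
# assumptions, not the paper's encoding.

BYPASS_DEPTH = 3  # assumed number of pipeline stages covered by the bypass logic

# (dest, [sources]) -- a straight-line block, one operation per line
code = [
    ("r1", ["r4", "r5"]),   # 0: r1 = r4 + r5
    ("r2", ["r1", "r6"]),   # 1: r2 = r1 * r6   (reads r1 via bypass)
    ("r1", ["r2", "r2"]),   # 2: r1 redefined -> the old r1 value is dead
    ("r3", ["r1", "r7"]),   # 3
    ("r8", ["r3", "r3"]),   # 4
]

def writeback_needed(code, bypass_depth=BYPASS_DEPTH):
    """Return one boolean per instruction: must its result be written back?"""
    need = []
    for i, (dest, _) in enumerate(code):
        consumers, redefined = [], False
        for j in range(i + 1, len(code)):
            d, srcs = code[j]
            if dest in srcs:
                consumers.append(j)
            if d == dest:              # redefinition ends this value's live range
                redefined = True
                break
        if not redefined:
            need.append(True)          # may be live past the block: keep the writeback
        else:
            # Skip the writeback only if every consumer is close enough to be
            # served by the bypass network.
            need.append(any(j - i > bypass_depth for j in consumers))
    return need

for i, nb in enumerate(writeback_needed(code)):
    print(f"op {i}: writeback {'kept' if nb else 'skipped'}")
```

In the paper's scheme the outcome of this kind of analysis is folded into the instruction encoding, so the pipeline knows at issue time whether to suppress the register-file write.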

2.
This paper describes a new architecture for embedded reconfigurable computing, based on a very-long instruction word (VLIW) processor enhanced with an additional run-time configurable datapath. The reconfigurable unit is tightly coupled with the processor, featuring an application-specific instruction-set extension. Mapping computation intensive algorithmic portions on the reconfigurable unit allows a more efficient elaboration, thus leading to an improvement in both timing performance and power consumption. A test chip has been implemented in a standard 0.18-μm CMOS technology. The test of a signal processing algorithmic benchmark showed speedups ranging from 4.3× to 13.5× and energy consumption reduced up to 92%.

3.
4.
In this paper, a power efficient vertex processor for mobile graphics applications is presented. A four-threaded and four-issue expanded VLIW datapath with a quad-float vertex texture fetcher is proposed by exploiting graphics specific characteristics after evaluation of several candidate architectures. Instruction-level power control methods such as operand sharing and writeback re-allocation along with operand isolations and gated clocks result in 40.4% and 82% reduction in energy dissipation and energy delay product compared to the most widely used single threaded SIMD. The proposed processor with the optimized datapath and vertex caches implemented in a 0.18-μm 1P4M CMOS process achieves 186-Mvertices/s geometry performance which is the best result among the processors that are IEEE-754 compliant.

5.
Much of the complexity in today's superscalar microprocessors stems from the need to maintain the speculatively produced results within the on-chip storage components until these results can be safely discarded without endangering the reconstruction of the precise state or impeding the recovery from possible branch misspeculations. For this, modern designs use large, heavily-ported physical register files (RFs) to increase the instruction throughput. The high complexity and power dissipation of such RFs mainly stem from the need to maintain each and every result for a large number of cycles after the result generation. We observed that a significant fraction (about 45%) of the result values are delivered to their consumers via the bypass network (consumed “on-the-fly”) and are never read out from the destination registers. In this paper, we first formulate conditions for identifying such transient values and describe their microarchitectural implementation; then we propose a technique to avoid the writeback of such transient values into the RF. With 64-entry integer and floating point register files, our technique achieves an 11% performance improvement and 29% reduction in the RF energy consumption compared to the baseline machine with the same number of registers. Furthermore, for the same performance target, the Selective Writeback scheme results in a 38% reduction in the energy consumption of the RF compared to the baseline machine.
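A small sketch of how the "consumed on-the-fly" fraction could be measured from a value trace; the trace format, the three-cycle bypass window, and the assumption that each physical register is written once per allocation are illustrative, not the paper's methodology:

```python
# Sketch: estimate what fraction of produced values are "transient", i.e. all
# of their reads happen soon enough after production to be served by the
# bypass network instead of the register file. Sample data are invented.

BYPASS_WINDOW = 3  # assumed number of cycles covered by the bypass network

# (cycle, event, physical_register): 'w' = value produced, 'r' = value read.
# Each physical register is assumed to be written once per allocation.
trace = [
    (0, "w", "p1"), (1, "r", "p1"),                  # read 1 cycle later -> transient
    (2, "w", "p2"), (9, "r", "p2"),                  # read 7 cycles later -> needs the RF
    (3, "w", "p3"),                                  # never read -> transient
    (4, "w", "p4"), (5, "r", "p4"), (6, "r", "p4"),  # all reads within the window
]

def transient_fraction(trace, window=BYPASS_WINDOW):
    produced, last_read = {}, {}
    for cycle, event, reg in trace:
        if event == "w":
            produced[reg] = cycle
        else:
            last_read[reg] = cycle
    transient = sum(
        1 for reg, t in produced.items()
        if last_read.get(reg, t) - t <= window
    )
    return transient / len(produced)

print(f"transient values: {transient_fraction(trace):.0%}")
```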

6.
Code size "bloating" in embedded very long instruction word (VLIW) processors is a major concern for embedded systems since memory is one of the most restricted resources. In this paper, we describe a code compression algorithm based on arithmetic coding, discuss how to design decompression architecture, and illustrate the tradeoffs between compression ratio and decompression overhead, by using different probability models. Experimental results for a VLIW embedded processor TMS320C6x show that compression ratios between 67% and 80% can be achieved, depending on the probability models used. A precache decompression unit design is implemented in TSMC 0.25 mum and a test chip is fabricated.  相似文献   

7.
8.
In embedded system design, memory is one of the most restricted resources, posing serious constraints on program size. Code compression has been used as a solution to reduce the code size for embedded systems. Lossless data compression techniques are used to compress instructions, which are then decompressed on-the-fly during execution. Previous work used fixed-to-variable coding algorithms that translate fixed-length bit sequences into variable-length bit sequences. In this paper, we present a class of code compression techniques called variable-to-fixed code compression (V2FCC), which uses variable-to-fixed coding schemes based on either Tunstall coding or arithmetic coding. Though the techniques are suitable for both reduced instruction set computer (RISC) and very long instruction word (VLIW) architectures, they favor VLIW architectures which require a high-bandwidth instruction prefetch mechanism to supply multiple operations per cycle, and fast decompression is critical to overcome the communication bottleneck between memory and CPU. Experimental results for a VLIW embedded processor TMS320C6x show that the compression ratios using memoryless V2FCC and Markov V2FCC are around 82.5% and 70%, respectively. Decompression unit designs for memoryless V2FCC and Markov V2FCC are implemented in TSMC 0.25-μm technology.
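For concreteness, a minimal sketch of Tunstall code construction, the variable-to-fixed scheme on which one variant of V2FCC is based; the two-symbol alphabet, the probabilities, and the 4-bit codeword length are illustrative assumptions, not the paper's model:

```python
# Tunstall (variable-to-fixed) code construction: grow a parse tree by
# repeatedly expanding the most probable source string until the number of
# leaves fits the fixed codeword length, then number the leaves.
import heapq
from itertools import count

def tunstall(probs, codeword_bits):
    """probs: {symbol: probability}; returns {source_string: fixed codeword}."""
    max_entries = 2 ** codeword_bits
    tie = count()                       # tie-breaker so the heap never compares tuples of strings
    # Max-heap of parse-tree leaves, keyed by the probability of the source string.
    heap = [(-p, next(tie), (sym,)) for sym, p in probs.items()]
    heapq.heapify(heap)
    # Expanding one leaf replaces it with len(probs) children.
    while len(heap) + len(probs) - 1 <= max_entries:
        neg_p, _, seq = heapq.heappop(heap)
        for sym, p in probs.items():
            heapq.heappush(heap, (neg_p * p, next(tie), seq + (sym,)))
    leaves = sorted(seq for _, _, seq in heap)
    return {seq: format(i, f"0{codeword_bits}b") for i, seq in enumerate(leaves)}

# Example: a skewed two-symbol "instruction bit" source, 4-bit fixed codewords.
codebook = tunstall({"0": 0.7, "1": 0.3}, codeword_bits=4)
for seq, cw in sorted(codebook.items(), key=lambda kv: kv[1]):
    print("".join(seq), "->", cw)
```

Decompression then reduces to one fixed-width table lookup per codeword, which is what makes variable-to-fixed schemes attractive for a high-bandwidth instruction prefetch path.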

9.
An Algorithm-Hardware-System Approach to VLIW Multimedia Processors   (total citations: 2; self-citations: 0; cited by others: 2)
Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware and system based point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads.

10.
The aim of this paper is to propose a high-level power exploration framework based on an instruction-level energy model for VLIW (Very Long Instruction Word) architectures. More specifically, the paper deals with reducing the complexity of the energy model of K-issue VLIW processors from exponential in the number of operations in the instruction set, O(|ISA|^K), to quadratic, O(K·|ISA|²). The complexity of the energy model has been further simplified by automatically clustering the operations in the ISA with respect to their average energy. Globally, the proposed approach reduces the complexity of the characterization problem for a K-issue VLIW processor to O(K·|C|²), quadratic in the number of operation clusters |C|. In this way, a more efficient characterization of the VLIW core power consumption can be achieved, while preserving the accuracy of the power estimates. The proposed model has been further extended to provide early power figures and energy/performance trade-offs for multi-cluster VLIW architectures composed of multiple data-path units and a single instruction cache control unit. The proposed high-level power estimation methodology has been applied to the Lx 4-issue VLIW pipelined processor provided by STMicroelectronics.
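The exponential-to-quadratic reduction amounts to characterizing a per-operation (or per-cluster) base energy plus a pairwise inter-instruction overhead, instead of measuring every possible K-tuple of operations. A toy sketch under assumed numbers and an assumed cluster map (not the paper's characterization data):

```python
# Quadratic instruction-level energy model: per-cluster base energy plus a
# pairwise switching overhead between consecutive long instructions, so
# characterization only needs O(K * |C|^2) measurements. All values invented.

CLUSTER = {            # operations grouped by similar average energy
    "add": "alu", "sub": "alu", "mul": "mul", "ld": "mem", "st": "mem", "nop": "nop",
}
BASE_ENERGY = {"alu": 1.0, "mul": 3.2, "mem": 2.5, "nop": 0.1}   # pJ per operation, assumed
OVERHEAD = {           # pairwise inter-instruction overhead between clusters, assumed
    ("alu", "mul"): 0.4, ("mul", "alu"): 0.4, ("alu", "mem"): 0.3,
    ("mem", "alu"): 0.3, ("mul", "mem"): 0.6, ("mem", "mul"): 0.6,
}

def long_instruction_energy(prev_bundle, bundle):
    """Energy of one K-wide bundle given the previously issued bundle."""
    energy = sum(BASE_ENERGY[CLUSTER[op]] for op in bundle)
    # Inter-instruction overhead is modelled per issue slot, pairwise only.
    for prev_op, op in zip(prev_bundle, bundle):
        energy += OVERHEAD.get((CLUSTER[prev_op], CLUSTER[op]), 0.0)
    return energy

trace = [
    ("add", "mul", "ld", "nop"),
    ("sub", "mul", "st", "add"),
    ("nop", "add", "ld", "add"),
]
total = sum(long_instruction_energy(p, b) for p, b in zip(trace, trace[1:]))
print(f"estimated energy for the trace tail: {total:.1f} pJ")
```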

11.
A four-way very long instruction word (VLIW), 312-MHz geometry processor with a peripheral component interconnect/accelerated graphics port bus bridge was implemented in a 0.21-μm, 2.5-V, three-layer-metal CMOS process. We adopted (1) a software bypass mechanism, (2) single-instruction multiple-data stream instructions, (3) four sets of floating-point multiply-add-and-accumulate execution units, (4) special condition-code registers and a branch-condition generator for the clipping operation, and (5) an automatic clock-delay tuning methodology. As a result of these features, we achieved a performance of 2.5 GFLOPS and 6.5 million polygons per second for a three-dimensional geometry processor, the highest performance published for a single geometry processor. The processor is applicable to computer-aided-design systems that require very high graphics performance.

12.
Mpact media processors enable powerful, flexible and cost-effective multimedia in a PC. A single chip replaces today's multiboard, multichip solutions for graphics, video, audio, and communications. The architecture combines a high-bandwidth RAMBUS memory, VLIW/SIMD (single instruction, multiple data) processing, standard buses, and software programmability for the cost of a modern graphics chip. The Mpact architecture uses a modified VLIW style with two RISC-like instructions per VLIW. The instructions are either executed sequentially or concurrently based on a tag in the VLIW. Classical VLIW suffers from low code density due to unused instruction fields, but the Mpact modified VLIW has the same code density as RISC instructions. Additionally, the SIMD instructions improve code density by increasing the work done by each instruction. An 8-byte word size was chosen to balance vector and scalar performance and also to balance data and instruction bandwidth. A 9-bit byte was chosen to represent color-component differences in one byte and to represent 18-bit color or 18-bit audio samples in two bytes. Hardware-dithered rounding of quantization noise allows most audio to be processed in two-byte precision. The maximal multiplier precision of 24×24 was chosen for audio requirements. The article reviews the first-generation Mpact media processor and then describes the multimedia performance goals and architecture of Chromatic's second-generation media processor architecture. It then presents newer modules of the architecture in more detail.

13.
Register files are widely used in the design of recent DSPs and media processors. To reduce the chip area, power consumption, and architectural complexity of the processor, the register-file structure must be designed carefully. By analyzing and comparing several register-file structures in current use, this paper proposes a new independent register-file unit structure in which the register file acts as a separate pipeline stage; through static compiler scheduling, the number of register-file ports is reduced and the bypass circuitry is simplified. The experimental results show that this structure not only meets the design targets of the media processor but is also of significant value for VLIW media processors.

14.
Cyclic redundancy check (CRC) is widely used for error detection. For optimal performance, a bit-parallel processing method has been developed, but it may not take full advantage of parallel processor architectures. Here, a method is proposed for using the full power of a very long instruction word (VLIW) digital signal processor (DSP) architecture in CRC computation. The method is at least four times faster for 8-, 16-, and 32-bit CRCs.
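For reference, a plain byte-at-a-time table-driven CRC-32 (reflected polynomial 0xEDB88320); this is only the baseline formulation the discussion starts from, not the paper's VLIW-specific schedule, which would keep the issue slots busy (for example by interleaving independent lookups across slots):

```python
# Minimal table-driven CRC-32 sketch (reflected polynomial 0xEDB88320),
# shown only to make the discussion concrete.

POLY = 0xEDB88320

TABLE = []
for byte in range(256):
    crc = byte
    for _ in range(8):
        crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
    TABLE.append(crc)

def crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ TABLE[(crc ^ b) & 0xFF]   # one table lookup per byte
    return crc ^ 0xFFFFFFFF

# Sanity check against Python's built-in implementation.
import zlib
msg = b"123456789"
assert crc32(msg) == zlib.crc32(msg)
print(hex(crc32(msg)))   # 0xcbf43926 for "123456789"
```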

15.
To meet the demands of mobile-terminal processors for low power, low cost, high efficiency, and flexible upgrades, and based on an analysis of the parallelism in LTE-A baseband algorithms, this paper proposes a vector processor with a hybrid Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) architecture as a software-baseband solution for terminals. The vector processor uses variable-length VLIW instruction words and provides seven vector datapaths, each capable of executing sixteen 16-bit fixed-point operations; banked coefficient memories improve flexibility, and register files with restricted access reduce circuit area. SHUF and ISHUF instructions are designed specifically for vectorized implementations of the fast Fourier transform (FFT) and Viterbi decoding algorithms. Finally, the FFT and Viterbi decoding algorithms are implemented and analyzed.
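As an illustration of why shuffle-style instructions matter for FFT vectorization, here is a plain radix-2 FFT whose only data movement is the even/odd de-interleave that a SHUF-type instruction would perform in one operation; this is illustrative only, not the paper's vectorized kernel:

```python
# Radix-2 decimation-in-time FFT: the recursive even/odd split is exactly the
# data "shuffle" that dedicated permutation instructions accelerate.
import cmath

def fft(x):
    n = len(x)                       # n must be a power of two
    if n == 1:
        return x
    evens = fft(x[0::2])             # even/odd de-interleave: the shuffle step
    odds  = fft(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odds[k] for k in range(n // 2)]
    return ([evens[k] + twiddled[k] for k in range(n // 2)] +
            [evens[k] - twiddled[k] for k in range(n // 2)])

# Quick check on an impulse: its FFT is all ones.
print([round(abs(v), 6) for v in fft([1, 0, 0, 0, 0, 0, 0, 0])])
```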

16.
Many software compilers for embedded processors produce machine code of insufficient quality. Since for most applications the software must meet tight code speed and size constraints, embedded software is still largely developed in assembly language. In order to eliminate this bottleneck and to enable the use of high-level language compilers for embedded software as well, new code generation and optimization techniques are required. This paper describes a novel code generation technique for embedded processors with irregular data path architectures, such as those typically found in fixed-point DSPs. The proposed code generation technique maps a data-flow graph representation of a program into highly efficient machine code for a target processor modeled by its instruction-set behavior. High code quality is ensured by tight coupling of the different code generation phases. In contrast to earlier work, which is mainly based on heuristics, our approach is constraint-based. An initial set of constraints on code generation is prescribed by the given processor model. Further constraints arise during code generation from decisions concerning code selection, register allocation, and scheduling. Whenever possible, decisions are postponed until sufficient information for a good decision has been collected. The constraints are active in the background and guarantee local satisfiability at any point in time during code generation. This mechanism makes it possible to cope simultaneously with special-purpose registers and instruction-level parallelism. We describe the detailed integration of the code generation phases. The implementation is based on the constraint logic programming (CLP) language ECLiPSe. For a standard DSP, we show that the quality of the generated code comes close to hand-written assembly code. Since the input processor model can be edited by the user, retargetability of the code generation technique is also achieved within a certain processor class.

17.
This paper describes a technique for modeling and estimating power consumption at the system level for embedded VLIW (Very Long Instruction Word) architectures. The method is based on a hierarchy of dynamic power estimation engines, from the instruction level down to the gate/transistor level. Power macro-models have been developed for the main components of the system: the VLIW core, the register file, and the instruction and data caches. The main goal is to define a system-level simulation framework for dynamic profiling of the power behavior during software execution, also providing a breakdown of the power contributions of the individual components of the system. The proposed approach has been applied to the Lx family of scalable embedded VLIW processors, jointly designed by STMicroelectronics and HP Labs. Experimental results, carried out over a set of benchmarks for embedded multimedia applications, have demonstrated an average accuracy of 5% for the instruction-level estimation engine with respect to the RTL engine, with an average speed-up of four orders of magnitude.

18.
《Microelectronics Journal》2015,46(7):637-655
This paper proposes a new processor architecture called VVSHP for accelerating data-parallel applications, which are growing in importance and demand increased performance from hardware. VVSHP merges VLIW and vector processing techniques into a simple, high-performance processor architecture. One key point of VVSHP is the execution of multiple scalar instructions within a VLIW and of vector instructions on unified parallel execution datapaths. Another key point is reducing the complexity of VVSHP by designing a two-part register file: (1) a shared scalar–vector part of 64×32-bit registers (64 scalar or 16×4 vector registers) with eight read/four write ports for storing scalar/vector data, and (2) a vector part of 48 vector registers, each storing 4×32-bit vector data, with two read/one write ports. Moreover, processing vector data with lengths varying from 1 to 256 is a key point for reducing loop overheads. VVSHP can issue up to four scalar/vector operations in each cycle, processing a set of operands in parallel and producing up to four results to be written back into the VVSHP register file. However, it cannot issue more than one memory operation at a time, which loads/stores 128-bit scalar/vector data from/to data memory. The design of the proposed VVSHP processor is implemented in VHDL targeting the Xilinx Virtex-5 FPGA, and its performance is evaluated.
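A data-structure sketch of the two-part register file described above, modeling only the aliasing between the scalar and vector views of the shared bank; port counts, timing, and names are assumptions:

```python
# Two-part register file: a shared bank of 64 x 32-bit registers readable as
# 64 scalars or as 16 four-element vectors, plus a separate bank of 48
# four-element vector registers.

class TwoPartRegisterFile:
    def __init__(self):
        self.shared = [0] * 64                         # shared scalar/vector part
        self.vector = [[0] * 4 for _ in range(48)]     # vector-only part

    # Shared part, scalar view (r in 0..63).
    def read_scalar(self, r):
        return self.shared[r]

    def write_scalar(self, r, value):
        self.shared[r] = value & 0xFFFFFFFF

    # Shared part, vector view: vector register v aliases scalars 4v..4v+3 (v in 0..15).
    def read_shared_vector(self, v):
        return self.shared[4 * v: 4 * v + 4]

    def write_shared_vector(self, v, values):
        self.shared[4 * v: 4 * v + 4] = [x & 0xFFFFFFFF for x in values]

    # Vector-only part (v in 0..47).
    def read_vector(self, v):
        return list(self.vector[v])

    def write_vector(self, v, values):
        self.vector[v] = [x & 0xFFFFFFFF for x in values]

rf = TwoPartRegisterFile()
rf.write_shared_vector(3, [10, 20, 30, 40])
print(rf.read_scalar(14))          # 30: scalar r14 aliases element 2 of shared vector v3
```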

19.
This paper investigates a possible solution to the problem of power consumption in superscalar, out-of-order processors by proposing a new microarchitecture, specifically designed to reduce increasing power requirements of high-end processors. More precisely, we show that by modifying the well-established superscalar processor architecture, significant savings can be achieved in terms of power consumption. Our approach aims at limiting the growing amount of power used in a typical processor for dynamic optimizations (including out-of-order scheduling and register renaming). Our proposed approach achieves significant power savings by reusing as much as possible from the work done by the front-end of a typical superscalar, out-of-order pipeline, via the use of a special cache nested deeply into the processor structure. By reusing instructions that are already decoded, reordered, and have their registers already renamed, the front end of the pipeline can be turned off for large periods of time with significant savings in the overall power consumption. Experimental results show up to 35% (30% on average) savings in average energy per committed instruction, and 35% (20% on average) savings in energy-delay product, with about 9% average performance loss, over a large spectrum of SPEC95 and SPEC2000 benchmarks.

20.
管茂林  何义  杨乾明  张春元  伍楠 《电子学报》2012,40(7):1379-1385
To address the pressure that VLIW code size places on instruction-memory capacity and power in stream architectures, this paper analyzes the instruction characteristics of stream processors and proposes a new field-partitioned (per-domain) VLIW compression technique. On this basis, a distributed on-chip instruction memory is designed for the stream architecture, and an SIMD-pipelined execution model is proposed. Experimental results show that the technique reduces off-chip instruction accesses by 38% and lowers the on-chip instruction-memory capacity requirement by about 65%; the distributed instruction memory reduces on-chip instruction-memory area by about 37%, lowering the overall system area of MASA by 8.92% and instruction-memory power by 61%.
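A toy sketch of the general per-field idea: split each wide instruction into its slot fields ("domains"), build a small dictionary per slot, and store indices instead of full fields. The field width, sample bundles, and dictionary scheme are illustrative assumptions, not the paper's compression technique:

```python
# Per-slot ("domain") VLIW compression sketch: each issue slot sees only a
# narrow set of field encodings, so a small per-slot dictionary plus per-bundle
# indices is much smaller than storing the full fields.
import math

FIELD_BITS = 32                      # assumed uncompressed width of one slot field

bundles = [                          # each bundle = one field per issue slot
    ("add", "nop", "ld",  "nop"),
    ("add", "mul", "nop", "nop"),
    ("sub", "nop", "ld",  "st"),
    ("add", "mul", "ld",  "nop"),
]

def per_slot_compress(bundles):
    slots = list(zip(*bundles))                        # group fields by slot (domain)
    dictionaries = [sorted(set(fields)) for fields in slots]
    index_bits = [max(1, math.ceil(math.log2(len(d)))) for d in dictionaries]
    compressed = FIELD_BITS * sum(len(d) for d in dictionaries)   # dictionary storage
    compressed += len(bundles) * sum(index_bits)                  # per-bundle indices
    original = len(bundles) * len(slots) * FIELD_BITS
    return dictionaries, compressed / original

dicts, ratio = per_slot_compress(bundles)
print("per-slot dictionaries:", dicts)
print(f"compressed / original = {ratio:.2f}")
```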
