期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A microprocessor with a 128-bit CPU, ten floating-point MAC's, fourfloating-point dividers, and an MPEG-2 decoder

Suzuoki M. Kutaragi K. Hiroi T. Magoshi H. Okamoto S. Oka M. Ohba A. Yamamoto Y. Furuhashi M. Tanaka M. Yutaka T. Okada T. Nagamatsu M. Urakawa Y. Funyu M. Kunimatsu A. Goto H. Hashimoto K. Ide N. Murakami H. Ohtaguro Y. Aono A. 《Solid-State Circuits, IEEE Journal of》1999,34(11):1608-1618

A 250-MHz microprocessor intended for home computer entertainment consists of a CPU core with 128-b multimedia extensions, two single-instruction, multiple-data (SIMD) very long instruction word (VLIW) vector processors containing ten floating-point multiplier accelerators and four floating-point dividers, an MPEG-2 decoder, a ten-channel direct memory access (DMA) controller, and other peripherals with 128 b internal buses on one die. The core is a two-way superscalar MIPS-compatible microprocessor with 16-kB scratch-pad RAM. Each vector processor is a five-way SIMD-VLIW architecture, which is tightly dedicated for specific applications concerning three-dimensional geometry calculation and physical simulation. A DMA controller connects between main memory and each processor's local memory to conceal memory access penalty. It contains 10.5 M transistors in 17×14.1 mm and dissipates 15 W at 1.8 V 相似文献

2.

Application Specific Processor Design for H.264 Decoder with a Configurable Embedded Processor

Jin Ho Han Mi Young Lee Younghwan Bae Hanjin Cho 《ETRI Journal》2005,27(5):491-496

An application specific processor for an H.264 decoder with a configurable embedded processor is designed in this research. The motion compensation, inverse integer transform, inverse quantization, and entropy decoding algorithm of H.264 decoder software are optimized. We improved the performance of the processor with instruction‐level hardware optimization, which is tailored to configurable embedded processor architecture. The optimized instructions for video processing can be used in other video compression standards such as MPEG 1, 2, and 4. A significant performance improvement is achieved with high flexibility. Experimental results show that we could achieve 300% performance for the H.264 baseline profile level 2 decoder. 相似文献

3.

Hardware/compiler codevelopment for an embedded media processor

Kozyrakis C. Judd D. Gebis J. Williams S. Patterson D. Yelick K. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2001,89(11):1694-1709

Embedded and portable systems running multimedia applications create a new challenge for hardware architects. A microprocessor for such applications needs to be easy to program like a general-purpose processor and have the performance and power efficiency of a digital signal processor. This paper presents the codevelopment of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner. The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt 相似文献

4.

A 450-MHz RISC microprocessor with enhanced instruction set andcopper interconnect

Nicoletta C. Alvarez J. Barkin E. Chai-Chin Chao Johnson B.R. Lassandro F.M. Patel P. Reid D. Sanchez H. Seigel J. Snyder M. Sullivan S. Taylor S.A. Minh Vo 《Solid-State Circuits, IEEE Journal of》1999,34(11):1478-1491

This superscalar microprocessor is the first implementation of a 32-bit RISC architecture specification incorporating a single-instruction, multiple-data vector processing engine. Two instructions per cycle plus a branch can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power for desktop, embedded, and multiprocessing systems. The processor features an enhanced memory subsystem, 128-bit internal data buses for improved bandwidth, and 32-KB eight-way instruction/data caches. The integrated L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with two-way set associativity. At 450 MHz, and with a 2-MB L2 cache, this processor is estimated to have a floating-point and integer performance metric of 20 while dissipating only 7 W at 1.8 V. The 10.5 million transistor, 83-mm² die is fabricated in a 1.8-V, 0.20-μm CMOS process with six layers of copper interconnect 相似文献

5.

Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit

Hoseok Chang Junho Cho Wonyong Sung 《Journal of Signal Processing Systems》2009,56(2-3):249-260

The single instruction multiple data (SIMD) architecture is very efficient for executing arithmetic intensive programs, but frequently suffers from data-alignment problems. The data-alignment problem not only induces extra time overhead but also hinders automatic vectorization of the SIMD compiler. In this paper, we compare three on-chip memory systems, which are single-bank, multi-bank, and multi-port, for the SIMD architecture to resolve the data-alignment problems. The single-bank memory is the simplest, but supports only the aligned accesses. The multi-bank memory requires a little higher complexity, but enables the unaligned accesses and the stride accesses with a bank-conflict limitation. The multi-port memory is capable of both the unaligned and stride accesses without any restriction, but needs quite much expensive hardware. We also developed a vectorizing compiler that can conduct dynamic memory allocation and SIMD code generation. The performances of the three memory systems with our SIMD compiler are evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that the multi-bank memory can carry out MPEG2 encoding 5.8 times faster, whereas the single-bank memory only achieves 2.9 times speed-up when employed in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor. The multi-port memory obviously shows the best performance, which is however an impractical improvement over the multi-bank memory when the hardware cost is considered. 相似文献

6.

A 32-b RISC/DSP microprocessor with reduced complexity 总被引：2，自引：0，他引：2

Dolle M. Jhand S. Lehner W. Muller O. Schlett M. 《Solid-State Circuits, IEEE Journal of》1997,32(7):1056-1066

This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general purpose microprocessor and in parallel as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel to an arithmetic logic unit (ALU)/barrelshifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow absolutely necessary in DSP applications. Besides offering a basis for general purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies compared to conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized transistor count and we reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous 相似文献

7.

An 80/20-MHz 160-mW multimedia processor integrated with embeddedDRAM, MPEG-4 accelerator and 3-D rendering engine for mobileapplications

Chi-Weon Yoon Ramchan Woo Jeengheon Kook Se-Joong Lee Kangmin Lee Hoi-Jun Yeo 《Solid-State Circuits, IEEE Journal of》2001,36(11):1758-1767

A low-power multimedia processor for mobile applications is presented. An 80-MHz 32-b RISC with enhanced multiplier, two 20-MHz hardware accelerators with 7.125-Mb embedded DRAM for MPEG-4 visual SP@L1 decoding and 3-D graphics processing, 2-kB dual-port SRAM, and peripheral blocks are integrated together on a single chip, MPEG-4 SP@L1 video decoding and 3-D graphics rendering with a 16-b depth-buffer alpha-blending double-buffering and gouraud-shading features at 2, 2-Mpolygons/s speed are realized with the help of the dedicated hardware accelerators/ The architecture of the processor is optimized in terms of power consumption and performance, and various low-power circuit techniques are adopted in each hardware block. The chip is implemented using 0.18-μm embedded memory logic (EML) technology. Its area is 84 mm², and power consumption is 160 mW when all of the functions are activated 相似文献

8.

嵌入式Flash CISC/DSP微处理器的研究与实现 总被引：1，自引：0，他引：1

下载免费PDF全文

卢结成丁丁丁晓兵朱少华《电子学报》2003,31(8):1252-1254

本文研究一种新的既具有微控制器功能,又有增强DSP功能的高性能微处理器的实现架构.在统一的增强CISC指令集下,我们将基于哈佛和寄存器-寄存器结构的微处理器模块和单周期乘法/累加器、桶形移位寄存器、无开销循环及跳转硬件支持模块、硬件地址产生器等DSP功能模块以及嵌入式Flash Memory和指令队列缓冲器有机的集成起来,在统一架构下通过单核实现CISC/DSP微处理器,有效地提高了处理器的性能.该微处理器采用0.35μm CMOS工艺实现,芯片面积为25mm².在80M工作频率下,动态功耗为425mW,峰值数据处理能力可达80MIPS.该处理器核可满足片上系统(SOC)对高性能处理器的需求. 相似文献

9.

A memory-based architecture for MPEG2 system protocol LSIs

Inamori M. Naganuma J. Endo M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1999,7(3):339-344

This paper proposes a memory-based architecture implementing the MPEG2 system protocol large scale integrations (LSIs), and demonstrates its flexibility and performance. The memory-based architecture implements the full functionality of the MPEG2 system protocol for both multiplexing and demultiplexing MPEG2-encoded streams. It consists of a core central processing unit, memories, and dedicated application-specific hardware. It is designed and optimized by hardware/software codesign techniques. The LSI's provide sufficient performance and flexibility for real-time application of the MPEG2 system protocol. They were fabricated with 0.5 μm CMOS embedded gate array process technology. They are now in use on MPEG2 codec systems for several multimedia communication and storage services 相似文献

10.

Execution cache-based microarchitecture for power-efficient superscalar processors

Talpes E. Marculescu D. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(1):14-26

This paper investigates a possible solution to the problem of power consumption in superscalar, out-of-order processors by proposing a new microarchitecture, specifically designed to reduce increasing power requirements of high-end processors. More precisely, we show that by modifying the well-established superscalar processor architecture, significant savings can be achieved in terms of power consumption. Our approach aims at limiting the growing amount of power used in a typical processor for dynamic optimizations (including out-of-order scheduling and register renaming). Our proposed approach achieves significant power savings by reusing as much as possible from the work done by the front-end of a typical superscalar, out-of-order pipeline, via the use of a special cache nested deeply into the processor structure. By reusing instructions that are already decoded, reordered, and have their registers already renamed, the front end of the pipeline can be turned off for large periods of time with significant savings in the overall power consumption. Experimental results show up to 35% (30% on average) savings in average energy per committed instruction, and 35% (20% on average) savings in energy-delay product, with about 9% average performance loss, over a large spectrum of SPEC95 and SPEC2000 benchmarks. 相似文献

11.

MICROTHREAD BASED （MTB） COARSE GRAINED FAULT TOLERANCE SUPERSCALAR PROCESSOR ARCHITECTURE 总被引：1，自引：0，他引：1

Fu Zhongchuan Chen Hongsong Cui Gang 《电子科学学刊(英文版)》2006,23(3):461-466

Fault tolerance in microprocessor systems has become a popular topic of architecture research. Much work has been done at different levels to accomplish reliability against soft errors, and some fault tolerance architectures have been proposed. But little attention is paid to the thread level superscalar fault tolerance. This letter introduces microthread concept into superscalar processor fault tolerance domain, and puts forward a novel fault tolerance architecture, namely, MicroThread Based （MTB） coarse grained transient fault tolerance superscalar processor architecture, then discusses some detailed implementations. 相似文献

12.

Architecture and design of NX-2700: a programmable single-chip HDTVall-format-decode-and-display processor

Dutta S. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2001,9(2):313-328

This paper describes the architecture, functionality, and design of NX-2700, a digital television and media processor chip from Philips Semiconductors. The NX-2700 is the second generation of an architectural family of programmable multimedia processors targeted at the digital television (DTV) markets, including the United States Advanced Television Systems Committee (ATSC) DTV-standard-based applications. The chip not only supports all of the 18 ATSC formats from standard-definition to wide-angle, high-definition video, but has also the power to handle high-definition television (HDTV) video and audio source decoding (high-level MPEG-5 AC-3 and ProLogic audio, closed captioning, etc.) as well as the flexibility to process advanced interactive services. NX-2700 is a programmable processor with a very powerful, general-purpose very long instruction word (VLIW) central processing unit (CPU) core that implements many nontrivial multimedia algorithms, coordinates all on-chip activities, and runs a small real-time operating system. The CPU core, aided by an array of peripheral devices (multimedia coprocessors and input-output units) and high-performance buses, facilitates concurrent processing of audio, video, graphics, and communication-data 相似文献

13.

一种基于数字信号处理器的媒体处理器结构及设计

高健陈杰《微电子学与计算机》2007,24(4):1-4

设计了一种针对图像、音频、视频等多媒体数据的处理新型结构的媒体处理器。该媒体处理器由一个通用数字信号处理器及多媒体协处理器构成,其指令集包含了通用的数字信号处理指令及扩展的多媒体处理指令。多媒体协处理器中包含了多个专用于多媒体处理的功能模块,可以加速多媒体处理的进行。该媒体处理器具有强大的多媒体处理能力,可实现对JPEG压缩图像、MP3音频流或MPEG2的MP@ML级别的压缩视频流的实时解码。相似文献

14.

The impact of Mpact 2

Purcell S. 《Signal Processing Magazine, IEEE》1998,15(2):102-107

Mpact media processors enable powerful, flexible and cost-effective multimedia in a PC. A single chip replaces today's multiboard, multichip solutions for graphics, video, audio, and communications. The architecture combines a high-bandwidth RAMBUS memory, VLIW/SIMD (single instruction, multiple data) processing, standard buses, and software programmability for the cost of a modern graphics chip. Mpact architecture uses a modified VLIW style with two RISC-like instructions per VLIW. The instructions are either executed sequentially or concurrently based on a tag in the VLIW. Classical VLIW suffers from low code density due to unused instruction fields, but the Mpact modified VLIW has the same code density as RISC instructions. Additionally, the SIMD instructions improve code density by increasing the work done by each instruction. An 8 byte word size was chosen to balance vector and scalar performance and also to balance data and instruction bandwidth. A 9 bit byte was chosen to represent color-component differences in one byte and to represent 18 bit color or 18 bit audio samples in two bytes. Hardware-dithered rounding of quantization noise allows most audio to be processed in two byte precision. The maximal multiplier precision of 24×24 was chosen for audio requirements. The article reviews the first-generation Mpact media processor and then describes the multimedia performance goals and architecture of Chromatic's second-generation media processor architecture. It then presents newer modules of the architecture in more detail 相似文献

15.

A Low Power 16‐Bit RISC Microprocessor Using ECRL Circuits

Youngjoon Shin Chanho Lee Yong Moon 《ETRI Journal》2004,26(6):513-519

This paper presents a low power 16‐bit adiabatic reduced instruction set computer (RISC) microprocessor with efficient charge recovery logic (ECRL) registers. The processor consists of registers, a control block, a register file, a program counter, and an arithmetic and logical unit (ALU). Adiabatic circuits based on ECRL are designed using a 0.35 µm CMOS technology. An adiabatic latch based on ECRL is proposed for signal interfaces for the first time, and an efficient four‐phase supply clock generator is designed to provide power for the adiabatic processor. A static CMOS processor with the same architecture is designed to compare the energy consumption of adiabatic and non‐adiabatic microprocessors. Simulation results show that the power consumption of the adiabatic microprocessor is about 1/3 compared to that of the static CMOS microprocessor. 相似文献

16.

A 64-bit microprocessor in 130-nm and 90-nm technologies with power management features

Rohrer N.J. Lichtenau C. Sandon P.A. Kartschoke P. Cohen E. Canada M.G. Pfluger T. Ringler M.I. Hilgendorf R.B. Geissler S. Zimmerman J.S. 《Solid-State Circuits, IEEE Journal of》2005,40(1):19-27

The first two members in a family of 64-bit superscalar microprocessors are presented. The 130-nm processor, which was introduced first, offers 5-way instruction dispatch, support for 4-way integer and floating-point single-instruction multiple-data (SIMD) operations, a 512-kB second level (L2) cache, and a high-speed external bus. The 90-nm processor is a technology remap of the 130-nm design. It retains the features of the 130-nm processor and adds others, including a new power management facility. The architecture, device characteristics, power management, and thermal details of these two processors are described. In addition, the dataflow layout, aspects of the circuit design, clocking, and timing are discussed. 相似文献

17.

Low Power Reconfiguration Technique for Coarse-Grained Reconfigurable Architecture

Yoonjin Kim Mahapatra R.N. Ilhyun Park Kiyoung Choi 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(5):593-603

Coarse-grained reconfigurable architectures (CGRAs) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of its PE array. Although this structure is meant for high performance and flexibility, it consumes significant power. Specially, power consumption by configuration cache is explicit overhead compared to other types of intellectual property (IP) cores. Reducing power is very crucial for CGRA to be more competitive and reliable processing core in embedded systems. In this paper, we propose a reusable context pipelining (RCP) architecture to reduce power-overhead caused by reconfiguration. It shows that the power reduction can be achieved by using the characteristics of loop pipelining, which is a multiple instruction stream, multiple data stream (MIMD)-style execution model. RCP efficiently reduces power consumption in configuration cache without performance degradation. Experimental results show that the proposed approach saves much power even with reduced configuration cache size. Power reduction ratio in the configuration cache and the entire architecture are up to 86.33% and 37.19%, respectively, compared to the base architecture. 相似文献

18.

Microarchitectural innovations: boosting microprocessor performancebeyond semiconductor technology scaling

Moshovos A. Sohi G.S. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2001,89(11):1560-1575

Semiconductor technology scaling provides faster and more plentiful transistors to build microprocessors, and applications continue to drive the demand for more powerful microprocessors. Weaving the "raw" semiconductor material into a microprocessor that offers the performance needed by modern and future applications is the role of computer architecture. This paper overviews some of the microarchitectural techniques that empower modem high-performance microprocessors. The techniques are classified into: 1) techniques meant to increase the concurrency in instruction processing, while maintaining the appearance of sequential processing and 2) techniques that exploit program behavior. The first category includes pipelining, superscalar execution, out-of-order execution, register renaming, and techniques to overlap memory-accessing instructions. The second category includes memory hierarchies, branch predictors, trace caches, and memory-dependence predictors. The paper also discusses microarchitectural techniques likely to be used in future microprocessors, including data value speculation and instruction reuse, microarchitectures with multiple sequencers and thread-level speculation, and microarchitectural techniques for tackling the problems of power consumption and reliability 相似文献

19.

The Circuits and Robust Design Methodology of the Massively Parallel Processor Based on the Matrix Architecture

Noda H. Tanizaki T. Gyohten T. Dosaka K. Nakajima M. Mizumoto K. Yoshida K. Iwao T. Nishijima T. Okuno Y. Arimoto K. 《Solid-State Circuits, IEEE Journal of》2007,42(4):804-812

Novel circuits and design methodology of the massively parallel processor based on the matrix architecture are introduced. A fine-grained processing elements (PE) circuit for high-throughput MAC operations based on the Booth's algorithm enhances the performance of a 16-bit fixed-point signed MAC, which operates up to 30.0GOPS/W. The dedicated I/O interface circuits are designed for converting the direction of data access and supporting the interleaved memory architecture, and they are implemented for maximizing the processor core efficiency. Power management techniques for suppressing current peaks and reducing average power consumption are proposed to enhance the robustness of the macro. The circuits and the design methodology proposal in this paper are attractive for achieving a high performance and robust massively parallel SIMD processor core employed in multimedia SoCs 相似文献

20.

An 80-MFLOPS (peak) 64-b microprocessor for parallel computer

Nakano H. Nakajima M. Nakakura Y. Yoshida T. Goi Y. Nakai Y. Segawa R. Kishida T. Kadota H. 《Solid-State Circuits, IEEE Journal of》1992,27(3):365-372

An 80-MFLOPS (peak) 64-b microprocessor that employs superscalar architecture to execute two instructions simultaneously in one 25-ns cycle, including the combination of 64-b floating-point add and multiply instructions, is described. The processor implemented in a 0.8-μm CMOS technology contains 1300 K transistors. The processor also employs a RISC architecture and Harvard-style bus organization. The authors provide an overview of the processor, especially focusing on processor architecture, floating-point hardware, and performance 相似文献