1.

A modularized processor LSI with a highly parallel structure for continuous speech recognition

Takahashi J. Hamaguchi S. Tansho K. Kimura T. 《Solid-State Circuits, IEEE Journal of》1991,26(6):833-843

A speech recognition processor CMOS LSI was developed as the processing element (PE) of a ring array processor, previously proposed by the authors as an architecture for highly parallel recognition processing with flexible array size. The LSI has three key features: (1) a highly parallel I/O structure, a triple buffer with cyclical-mode transition control, which addresses the serious problem of inter-PE data transfer overhead in array processing; (2) a control structure with two direct memory access (DMA) controllers that lets inter-PE data I/O and intra-PE processing proceed in parallel; and (3) pipelined recognition processing at a high execution rate, achieved with a pipelined structure and a balanced clock distribution design technique. These designs allow high-speed recognition processing without any inter-PE data transfer overhead in the ring array processor. By combining the PE-LSI architecture with the proposed array architecture for highly parallel dynamic time warping (DTW) processing, a real-time continuous speech recognition system for a 1000-word vocabulary, based on continuous dynamic programming matching using the SPLIT method, can be constructed from a ring array processor of 30 PEs.
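
The dynamic time warping (DTW) matching mentioned in this abstract can be illustrated in software. The following is a minimal single-threaded sketch of the textbook DTW recurrence, not the paper's parallel ring-array or SPLIT-based formulation; the function name `dtw_distance` and the absolute-difference local cost are assumptions made purely for illustration.

```python
def dtw_distance(a, b):
    """Textbook DTW between two 1-D sequences with |x - y| as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = minimum accumulated cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Each cell depends only on its left, upper, and upper-left neighbours, which is the kind of regular data dependence that makes DTW a natural fit for array processors.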

2.

Dynamically scalable dual-core pipelined processor

《International Journal of Electronics》2013,100(10):1754-1764

This article proposes the design and architecture of a dynamically scalable dual-core pipelined processor. The design methodology is core fusion: two independent cores can dynamically morph into a larger processing unit, or they can be used as distinct processing elements, to achieve both high sequential performance and high parallel performance. The processor provides two execution modes. Mode 1 is a multiprogramming mode for executing instruction streams of lower data width, i.e., each core performs 16-bit operations individually. Performance improves in this mode because instructions execute in parallel on both cores, at the cost of area. In mode 2, the two cores are coupled and behave like a single processing unit of higher data width, i.e., they can perform 32-bit operations. Additional core-to-core communication is needed to realise this mode. The mode can be switched dynamically, so a single design can serve multiple functions. The processor was designed and verified in Verilog on the Xilinx 14.1 platform, in both simulation and synthesis, with the help of test programs. The design is intended for implementation on a Xilinx Spartan 3E XC3S500E FPGA.
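
The two modes can be pictured with a small behavioural model of addition. This is only a sketch: the abstract does not say what the core-to-core communication carries, so forwarding a carry bit from the low core to the high core is an assumption, and the names `add16`, `mode1_add`, and `mode2_add` are hypothetical.

```python
MASK16 = 0xFFFF

def add16(x, y, carry_in=0):
    """One 16-bit core: returns (16-bit sum, carry-out)."""
    total = (x & MASK16) + (y & MASK16) + carry_in
    return total & MASK16, total >> 16

def mode1_add(pair_a, pair_b):
    """Mode 1: two independent 16-bit additions, one per core."""
    s0, _ = add16(pair_a[0], pair_b[0])
    s1, _ = add16(pair_a[1], pair_b[1])
    return s0, s1

def mode2_add(x, y):
    """Mode 2: fused 32-bit add; the carry passes from the low core to the high core."""
    lo, carry = add16(x, y)
    hi, _ = add16(x >> 16, y >> 16, carry)
    return (hi << 16) | lo
```

In this model the only extra cost of mode 2 is the carry link between the cores, which is one plausible reading of the "additional core-to-core communication" the abstract mentions.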

3.

CRISP: a pipelined 32-bit microprocessor with 13-kbit of cache memory

《Solid-State Circuits, IEEE Journal of》1987,22(5):776-782

The implementation and architecture of a 172,163-transistor single-chip general-purpose 32-b microprocessor are described. The 16-MHz chip is fabricated in a single-metal double-poly 1.75-μm CMOS technology and is capable of a peak execution rate of over one instruction per clock. Multiple on-chip caches, pipelining, and a one-cycle I/O protocol are used.

4.

Reduced instruction set computer architecture

《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1988,76(1):38-55

A tutorial on the reduced instruction set computer (RISC) approach is presented and the key design issues involved in RISC architecture are highlighted. The results of a number of studies on the instruction execution characteristics of compiled high-level-language programs are examined first. The results of these studies inspired the RISC movement. Approaches to three key RISC design issues are then summarized: optimized register usage, reduced instruction sets, and pipelining. As examples, an experimental system, the Berkeley RISC, and a commercial system, the MIPS R2000, are presented. The advantages and disadvantages of a RISC versus a CISC (complex instruction set computer) architecture are also discussed.

5.

A Processor-In-Memory Architecture for Multimedia Compression

Jasionowski B. J. Lay M. K. Margala M. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(4):478-483

This paper presents the design and development of a novel, low-complexity processor-in-memory (PIM) architecture for image and video compression. By integrating a novel processing element with SRAM, bandwidth is improved and latency is greatly reduced. This paper also presents PIM design techniques for reduced power, area, and complexity for rapid deployment and reduced cost. A design methodology is presented, followed by an analysis of the processing element's performance and capabilities. The proposed datapath solution delivers between 2 and 40 times higher performance than other presented solutions. The architecture executes discrete cosine and wavelet transforms, achieving up to 40% higher throughput per watt and occupying as little as 0.9% of the area compared with a commercial digital signal processor and other application-specific integrated circuit implementations, while maintaining precision. A comprehensive comparative analysis is also provided. The proposed processor-in-memory is implemented in 1.8-V 0.18-μm CMOS technology and operates with a 300-MHz clock.

6.

A flexible parallel architecture for relaxation labeling algorithms

Lin S.-Y. Chen Z. 《Signal Processing, IEEE Transactions on》1992,40(5):1231-1240

The design of a flexible parallel architecture for both the discrete relaxation labeling (DRL) algorithm and the probabilistic relaxation labeling (PRL) algorithm is addressed. Through the analysis of parallelism in the computational models of both algorithms, the parallel execution of the algorithms on a flexible parallel architecture is presented. Three basic types of parallel operations are performed in the architecture: simultaneous, pipeline, and systolic. An illustrative example is used to show how the DRL algorithm can be executed on the parallel architecture. In doing so, the processing element (PE) organization and the combiner organization of the architecture are described. The same architecture with programmable functional units is shown to be able to execute the PRL algorithm, too. The performance comparisons between the proposed architecture and some other existing ones are also given.

7.

An Adaptive Motion Estimation Architecture for H.264/AVC

Yang Song Ali Akoglu 《Journal of Signal Processing Systems》2013,73(2):161-179

We introduce a variable block size motion estimation architecture that is adaptive to the full search (FS) and the three-step search (3SS) algorithms. Early termination, intensive data reuse, a pipelined datapath with bit-serial execution, and memory access management tailored to the search patterns of the FS and 3SS form the key features of the architecture. The design was synthesized using Synopsys Design Compiler and a 45-nm standard cell library. The architecture sustains real-time processing of CIF-format video at an operational frequency as low as 17.6 MHz and consumes 1.98 mW at this clock rate. With its 500-MHz peak operational frequency, the architecture gives the end user the flexibility to trade video quality against throughput according to power consumption and processing speed constraints.

8.

A 24-b 50-ns digital image signal processor

Nakagawa S.-I. Terane H. Matsumura T. Segawa H. Yoshimoto M. Shinohara H. Kato S.-I. Hatanaka M. Ohira H. Kato Y. Iwatsuki M. Tabuchi K. Horiba Y. 《Solid-State Circuits, IEEE Journal of》1990,25(6):1484-1493

A 50-ns digital image signal processor (DISP), an image/video application-specific VLSI chip, is discussed. This chip integrates 538K transistors and dissipates 1.4 W at a 40-MHz clock. It is based on a 24-b fixed-point architecture with a five-stage pipeline. The DISP features a real-time processing capability realized by an enhanced parallel architecture, video-oriented data processing functions, and an instruction cycle time that is typically 35 ns, and 50 ns at worst. This 50-ns cycle time allows the DISP to execute more than 60 million operations per second (MOPS). High-density 1.0-μm CMOS technology allows numerous on-chip features, including specified resources optimized for image processing. This allows a flexible hardware implementation of various algorithms for picture coding. Several circuit design techniques that are intended to attain a fast instruction cycle are reviewed, including distributed instruction decoding and a hierarchical clocking circuit. The LSI has been designed by the extensive use of a cell-based design method. The processor incorporates a sophisticated testing function compatible with a cell-based design environment.

9.

A 32-b RISC/DSP microprocessor with reduced complexity (total citations: 2; self-citations: 0; citations by others: 2)

Dolle M. Jhand S. Lehner W. Muller O. Schlett M. 《Solid-State Circuits, IEEE Journal of》1997,32(7):1056-1066

This paper presents a new 32-b reduced instruction set computer/digital signal processor (RISC/DSP) architecture which can be used as a general-purpose microprocessor and, in parallel, as a 16-/32-b fixed-point DSP. This has been achieved by using RISC design principles for the implementation of DSP functionality. A DSP unit operates in parallel with an arithmetic logic unit (ALU)/barrel shifter on the same register set. This architecture provides the fast loop processing, high data throughput, and deterministic program flow that are essential in DSP applications. Besides offering a basis for general-purpose and DSP processing, the RISC philosophy offers a higher degree of flexibility for the implementation of DSP algorithms and achieves higher clock frequencies than conventional DSP architectures. The integrated DSP unit provides instruction set support for highly specialized DSP algorithms. Subword processing optimized for DSP algorithms has been implemented to provide maximum performance for 16-b data types. While creating a unified base for both application areas, we also minimized transistor count and reduced complexity by using a short instruction pipeline. A parallelism concept based on a varying number of instruction latency cycles made superscalar instruction execution superfluous.

10.

Bit-by-Bit Pipelined and Hybrid-Grained 2D Architecture for Motion Estimation of H.264/AVC

Yang Song Ali Akoglu 《Journal of Signal Processing Systems》2012,68(1):49-62

In H.264/AVC, the motion estimation (ME) routine supports variable block size and involves highly parallel sum of absolute difference (SAD) computations. In this study, we introduce a bit-serial, hybrid-grained processing element (PE) based 2D architecture that has both early termination and intensive data reuse capabilities. PEs operate on most-significant-bit-first arithmetic for early termination, and the 2D architecture enables on-chip data reuse between neighboring PEs in a bit-by-bit pipelined fashion. Hybrid-grained PEs reduce the hardware overhead of the conventional adder tree structures used for implementing variable block size ME. Our design reduces the gate count by 7× compared to its ASIC counterpart, operates at a comparable frequency while sustaining 30 fps and 60 fps, and outperforms bit-parallel and bit-serial architectures in terms of throughput and performance per gate for various video formats.
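
The idea of early termination in SAD-based block matching can be sketched in software. The paper's PEs work bit-serially, most significant bit first; the sketch below is a simplification that terminates at pixel granularity instead, and the names `sad_early_exit` and `best_match` are hypothetical.

```python
def sad_early_exit(block, candidate, best_so_far):
    """Return the SAD, or None once the partial sum reaches best_so_far."""
    total = 0
    for p, q in zip(block, candidate):
        total += abs(p - q)
        if total >= best_so_far:
            return None  # this candidate can no longer beat the current best
    return total

def best_match(block, candidates):
    """Index and SAD of the candidate block with the smallest SAD."""
    best_index, best_sad = None, float("inf")
    for i, cand in enumerate(candidates):
        sad = sad_early_exit(block, cand, best_sad)
        if sad is not None:
            best_index, best_sad = i, sad
    return best_index, best_sad
```

Because the partial sum never decreases, abandoning a candidate as soon as it reaches the best SAD found so far cannot change which candidate wins; it only skips work.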

11.

Issue logic for a 600-MHz out-of-order execution microprocessor

Farrell J.A. Fischer T.C. 《Solid-State Circuits, IEEE Journal of》1998,33(5):707-712

The logic and circuits are presented for a 20-entry instruction queue which scoreboards 80 registers and issues four instructions per cycle in a 600-MHz microprocessor. The request logic and arbiter circuits that control integer execution are described, in addition to a novel compaction scheme that maintains temporal order in the queue. The issue logic data path is implemented in 141,000 transistors, occupying 10 mm² in a 0.35-μm CMOS process.

12.

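An ultra-compact processor architecture for on-chip parallel computing arrays

周韧研 刘雷波 魏少军 《电路与系统学报》2012,17(2):1-5

An ultra-compact processing element architecture is proposed. The processing element is based on an operate-and-branch one-instruction processor architecture. Using instruction optimization and accelerators on the internal bus, the processing element can efficiently execute bit operations that are difficult for traditional arithmetic-type one-instruction processors, as well as data transfer operations that such processors execute inefficiently. A large-scale on-chip parallel computing array built from this processing element can be used for computing tasks with strong locality and demanding real-time requirements, such as image processing. A 16×16 prototype array containing this processing element architecture has been implemented on an FPGA, achieving 30.7 GOPS at 120 MHz with an average power consumption of 39.5 mW.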
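
The reported figure of 30.7 GOPS at 120 MHz is consistent with a simple back-of-the-envelope check for a 16×16 array. The sketch below assumes one operation per processing element per clock cycle; the abstract does not state this, so it is an assumption, and `peak_gops` is a hypothetical helper.

```python
def peak_gops(rows, cols, clock_hz, ops_per_pe_per_cycle=1):
    """Peak throughput in GOPS for a rows x cols array of processing elements."""
    return rows * cols * ops_per_pe_per_cycle * clock_hz / 1e9

print(peak_gops(16, 16, 120e6))  # 30.72
```

Under that assumption, 256 processing elements at 120 MHz give 30.72 GOPS, which matches the abstract's 30.7 GOPS to three significant figures.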

13.

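DMA interface design based on instruction parallelism and task parallelism in a DSP

沈戈 樊晓桠 高德远 段然 《微电子学与计算机》2004,21(7):160-163

In compute-intensive applications oriented to multimedia data streams, a DSP (digital signal processor) must offer not only very strong data processing capability but also high-bandwidth data input and output interfaces. Building on the enhanced Harvard architecture commonly used in traditional DSPs, this paper proposes a design for the DMA interface structure of a DSP processor that implements a DMA parallel transfer mode based on instruction parallelism and task parallelism. Experiments with six commonly used DSP algorithm programs verify that, with single-port RAM as the on-chip memory, when instructions containing on-chip memory accesses account for 42.2%-94.3% of all instructions, a DMA interface designed in this way can complete the necessary data transfers with zero overhead for the DSP. It also makes DMA data transfer operations transparent to the host-processor programmer, effectively improving the performance of the DSP system.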

14.

Application-specific instruction set processor for SoC implementation of modern signal processing algorithms

Zhaohui Liu Dickson K. McCanny J.V. 《IEEE transactions on circuits and systems. I, Regular papers》2005,52(4):755-765

A novel application-specific instruction set processor (ASIP) for use in the construction of modern signal processing systems is presented. This is a flexible device that can be used in the construction of array processor systems for the real-time implementation of functions such as singular-value decomposition (SVD) and QR decomposition (QRD), as well as other important matrix computations. It uses a coordinate rotation digital computer (CORDIC) module to perform arithmetic operations, and several approaches are adopted to achieve high performance, including pipelining of the micro-rotations, the use of parallel instructions, and a dual-bus architecture. In addition, a novel method for scale factor correction is presented which only needs to be applied once at the end of the computation. This also reduces computation time and enhances performance. Methods are described which allow this processor to be used in reduced-dimension (i.e., folded) array processor structures that allow tradeoffs between hardware and performance. The net result is a flexible matrix computational processing element (PE) whose functionality can be changed under program control for use in a wider range of scenarios than previous work. Details are presented of the results of a design study, which considers the application of this decomposition PE architecture in a combined SVD/QRD system and demonstrates that a combination of high performance and efficient silicon implementation is achievable.
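
The CORDIC micro-rotations and a single end-of-computation scale correction can be sketched in floating point. This is the textbook rotation-mode CORDIC, in which the accumulated gain is divided out once after all iterations; it is not the paper's own scale factor correction method, which the abstract does not describe, and `cordic_rotate` is a hypothetical name.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians (|angle| below about 1.74) using CORDIC."""
    z = angle
    gain = 1.0
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i
        # One micro-rotation: in hardware this needs only shifts and adds.
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)
        gain *= math.sqrt(1.0 + shift * shift)
    # Scale factor correction applied once, after all micro-rotations.
    return x / gain, y / gain
```

For example, `cordic_rotate(1.0, 0.0, math.pi / 4)` returns values close to the cosine and sine of 45 degrees, with accuracy improving as `iterations` grows.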

15.

A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance

Mladen Berekovic Tim Niggemeier 《Journal of Signal Processing Systems》2008,50(2):201-229

A scalable, distributed processor architecture is presented that emphasizes high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of parallel processing on a chip. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads [simultaneously multi-threaded (SMT)]. Consistent application of fine clustering reduces the cycle time for wire-sensitive building blocks of the processor, such as the register file and the scheduling window, and leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files, and memories are distributed across the chip and communicate with each other via a special network. A special communication protocol replaces broadcasting and associative comparison of destination tags in a centralised instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. This makes the performance of the architecture scalable with both the number of function units and the number of thread units without any impact on the processor's cycle time. The performance and scalability of the proposed microarchitecture are demonstrated with critical signal processing kernels from the MPEG-4 video coding standard on a cycle-true simulator.

16.

Parallel image processing with the block data parallel architecture (total citations: 2; self-citations: 0; citations by others: 2)

Alexander W.E. Reeves D.S. Gloster C.S. Jr. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》1996,84(7):947-968

Many digital signal and image processing algorithms can be sped up by executing them in parallel on multiple processors. The speed of parallel execution is limited by the need for communication and synchronization between processors. In this paper, we present a paradigm for parallel processing that we call the block data flow paradigm (BDFP). The goal of this paradigm is to reduce interprocessor communication and relax the synchronization requirements for such applications. We present the block data parallel architecture which implements this paradigm, and we present methods for mapping algorithms onto this architecture. We illustrate this methodology for several applications including two-dimensional (2-D) digital filters, the 2-D discrete cosine transform, QR decomposition of a matrix, and Cholesky factorization of a matrix. We analyze the resulting system performance for these applications with regard to speedup and efficiency as the number of processors increases. Our results demonstrate that the block data parallel architecture is a flexible, high-performance solution for numerous digital signal and image processing algorithms.

17.

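Design of a multithreaded non-blocking instruction cache

胡孔阳 陈鹏 桑红石 《微电子学与计算机》2012,29(5):143-147

A non-blocking cache is one that can continue to supply instructions and data while waiting for prefetched data to return. This paper first analyzes the processor requirements for a multithreaded non-blocking cache, then presents its timing requirements and an implementation scheme. The scheme is modeled at the RTL level and its performance is evaluated using SystemVerilog. Simulation results show that the scheme is well suited to the design of the instruction engine in multithreaded, out-of-order processors.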

18.

A hardware accelerator for two-dimensional image analysis

《Integration, the VLSI Journal》1988,6(3):329-344

This paper describes the architecture and operation of a new hardware accelerator called MultiRing for performing various geometrical operations on a two-dimensional image space. This hardware architecture is shown to be applicable to design rule checking in VLSI layout and to many image processing operations, including noise suppression and contour extraction. It has both a fast execution speed and extremely high flexibility. Each row of data stored in ring memory is processed in the corresponding processor in full parallelism. Each processor is simultaneously configured by the instruction decoder/controller to perform one of the 20 basic instructions each ring cycle, which gives MultiRing maximal flexibility in terms of design rule changes or instruction set enhancement. Correct functional behavior of MultiRing was confirmed by successfully running a software simulator having one-to-one structural correspondence to the MultiRing hardware.

19.

High-speed fiber-optic links for distribution of satellite traffic

Daryoush A.S. Ackerman E. Saedi R. Kunath R. Shalkhauser K. 《Microwave Theory and Techniques》1990,38(5):510-517

Low-loss fiber-optic links are designed for distribution of data and the frequency reference in large-aperture phased-array antennas based on the transmit/receive-level data mixing architecture. In particular, design aspects of a fiber-optic link satisfying the distribution requirements of satellite data traffic are presented. The design is addressed in terms of reactively matched optical transmitter and receiver modules. Analog and digital characterization of a 50-m fiber-optic link realized using these modules indicates the applicability of this architecture as the only viable alternative for distribution of data signals inside a satellite at present. It is demonstrated that the design of reactive matching modules enhances the link performance. A dynamic range of 88 dB/MHz was measured for analog data over a 500-1000-MHz bandwidth.

20.

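A reconfigurable heterogeneous multi-core parallel processing architecture for block ciphers

冯晓 李伟 戴紫彬 马超 李功丽 《电子学报》2017,45(6):1311-1320

In existing reconfigurable block cipher implementation structures, application-specific instruction processors have low throughput, while array structures have low resource utilization and a complex algorithm mapping process. To address this, a reconfigurable heterogeneous multi-core parallel processing architecture for block ciphers, RAMCA (Reconfigurable Asymmetrical Multi-Core Architecture), is designed, and the mapping onto RAMCA of typical algorithms with SP (AES-128), Feistel (SMS4), L-M (IDEA), and MISTY (KASUMI) structures is analyzed. Logic synthesis and functional simulation were completed in a 65 nm CMOS process. Experiments show that RAMCA can operate at 1 GHz with an area of about 1.13 mm²; after eliminating the effect of process differences, its speed on each block cipher algorithm is higher than that of existing application-specific instruction processors and of array-structured cipher processing systems such as Celator, RCPA, and BCORE.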