Similar articles
Found 20 similar articles; search time: 31 ms
1.
A 135K-transistor, uniformly pipelined 50-MHz CMOS 64-bit floating-point arithmetic processor chip is described. The execution unit is capable of sustaining pipelined performance of one 32-bit or 64-bit result every 20 ns for all operations except double-precision multiply (40 ns) and divide. The chip employs an exponent difference prediction scheme and a unified leading-one and sticky-bit computation logic for the addition and subtraction operations. A hardware multiplier using a radix-8 modified Booth algorithm and a divider using a radix-2 SRT algorithm are employed.
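
The radix-8 modified Booth recoding named in the abstract above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `booth_radix8_digits` and `booth_radix8_multiply` are hypothetical, and the code models the arithmetic of the recoding, not the chip's circuit. Each digit is formed from an overlapping 4-bit window of the multiplier and lies in the range -4 to 4, so an n-bit multiplier needs about n/3 partial products.

```python
def booth_radix8_digits(y: int, nbits: int) -> list[int]:
    """Recode a signed nbits-wide integer into radix-8 Booth digits in [-4, 4]."""
    def bit(j: int) -> int:
        # Python's >> is an arithmetic shift, so bits above nbits-1
        # behave as sign extension; the bit below position 0 is 0.
        return (y >> j) & 1 if j >= 0 else 0

    ndigits = -(-nbits // 3)  # ceil(nbits / 3)
    digits = []
    for i in range(ndigits):
        lo = 3 * i
        # Overlapping window: bits lo+2, lo+1, lo, and the bit just below.
        d = -4 * bit(lo + 2) + 2 * bit(lo + 1) + bit(lo) + bit(lo - 1)
        digits.append(d)
    return digits


def booth_radix8_multiply(x: int, y: int, nbits: int) -> int:
    """Multiply x by y by summing one partial product per radix-8 digit of y."""
    product = 0
    for i, d in enumerate(booth_radix8_digits(y, nbits)):
        # d * x is one of 0, +-x, +-2x, +-3x, +-4x; in hardware the 3x
        # multiple is the one that needs a real addition to form.
        product += (d * x) << (3 * i)
    return product
```

For example, `booth_radix8_digits(-3, 8)` yields three digits whose weighted sum (weights 1, 8, 64) is -3.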

2.
A 32×32-bit multiplier using multiple-valued current-mode circuits has been fabricated in 2-μm CMOS technology. For the multiplier based on the radix-4 signed-digit number system, 32×32-bit two's complement multiplication can be performed with only three-stage signed-digit full adders using a binary-tree addition scheme. The chip contains about 23600 transistors and the effective multiplier size is about 3.2×5.2 mm², which is half that of the corresponding binary CMOS multiplier. The multiply time is less than 59 ns. The performance is considered comparable to that of the fastest binary multiplier reported.

3.
In this article we consider a design of a multiplier for the multiplication of complex numbers. The complex numbers are packed into one 32-bit word. They are represented by two 13-bit parts with the same 6-bit exponent. Multiplication of complex numbers is examined from the perspectives of performance, complexity and silicon area. The design is unique and combines shared Booth encoding for the real and imaginary parts including only one combined modified Wallace tree of 4:2 adders for each part. The regular Wallace tree is compared with the tree of 4:2 adders. This design results in a more compact wiring structure and balanced delays resulting in a faster multiplier circuit. The number of adders used in the multiplier is also reduced. We consider VLSI CMOS technology and the relevant characteristics as they impact the implementation and performance.

4.
A 32-bit integer execution core containing a Han-Carlson arithmetic-logic unit (ALU), an 8-entry × 2 ALU instruction scheduler loop and a 32-entry × 32-bit register file is described. In a 130 nm six-metal, dual-VT CMOS technology, the 2.3 mm² prototype contains 160 K transistors. Measurements demonstrate capability for 5-GHz single-cycle integer execution at 25 °C. The single-ended, leakage-tolerant dynamic scheme used in the ALU and scheduler enables up to 9-wide ORs with 23% critical path speed improvement and 40% active leakage power reduction when compared to a conventional Kogge-Stone implementation. On-chip body-bias circuits provide additional performance improvement or leakage tolerance. Stack node preconditioning improves ALU performance by 10%. At 5 GHz, ALU power is 95 mW at 0.95 V and the register file consumes 172 mW at 1.37 V. The ALU performance is scalable to 6.5 GHz at 1.1 V and to 10 GHz at 1.7 V, 25 °C.
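
The abstract above compares its Han-Carlson ALU against a conventional Kogge-Stone adder. As a rough illustration of the Kogge-Stone baseline only (not of the Han-Carlson design or its dynamic circuits), the following word-level sketch computes all carries in log2(width) parallel-prefix steps. The function name `kogge_stone_add` and the default width are assumptions made for this example.

```python
def kogge_stone_add(a: int, b: int, width: int = 32) -> int:
    """Add two unsigned width-bit values with a Kogge-Stone prefix network (software model)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b        # per-bit generate
    p = a ^ b        # per-bit propagate
    half_sum = p     # sum bits before the carries are applied
    d = 1
    while d < width:
        # Combine each position with the group ending d bits below it.
        # g is updated first, so it still sees the old p.
        g = (g | (p & (g << d))) & mask
        p = (p & (p << d)) & mask
        d <<= 1
    carries = (g << 1) & mask   # carry into bit i is the group generate of bit i-1
    return half_sum ^ carries
```

The result wraps modulo 2^width, as a fixed-width hardware adder without carry-out would.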

5.
Over the last few years, the Logarithmic Number System (LNS) has played a pivotal role in Digital Signal Processing (DSP) and image processing. Multiplication is a ubiquitous and costly arithmetic operation in DSP applications, and researchers have found that LNS is a possible solution for performing multiplication in such applications. In this paper, we propose a novel approach based on Improved Operand Decomposition (IOD) to make an efficient logarithmic multiplier and subsequent achievement through scale realization. A pipelining technique and an efficient correction circuit are used for error minimization at the cost of minimal hardware and delay. The reported and proposed multipliers are evaluated and compared in terms of Data Arrival Time (DAT), area, power, Area Delay Product (ADP), and Energy per Sample (EPS) in 90 nm CMOS technology using Synopsys Design Compiler. Simulation results show that the proposed IOD method for logarithmic multiplication without pipelining gives a maximum of 35.39% less ADP and 11.15% less EPS for the 32-bit architecture than the reported logarithmic multiplier architecture. The proposed IOD-based logarithmic multiplier with pipelining gives a maximum of 20.17% less ADP for the 8-bit architecture and 21.72% for the 32-bit architecture than the reported iterative pipelined logarithmic multiplier architecture. Simulation results show that the optimized logarithmic converter gives 7.32% less ADP and the optimized antilogarithmic converter gives 41.59% less ADP than the reported logarithmic and antilogarithmic converter structures, respectively. The optimized antilogarithmic converter architecture gives a maximum of 43.94% less EPS than the reported antilogarithmic converter structure.
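
For readers unfamiliar with logarithmic multiplication, the basic idea can be sketched with Mitchell's classic approximation, in which log2 of each operand is approximated piecewise-linearly, the logarithms are added, and the antilogarithm is taken. This is not the IOD method of the abstract above; it is the well-known baseline such schemes refine, and the function name `mitchell_multiply` is hypothetical.

```python
def mitchell_multiply(a: int, b: int) -> int:
    """Approximate a * b for positive integers using Mitchell's log approximation."""
    if a <= 0 or b <= 0:
        raise ValueError("operands must be positive integers")
    k1 = a.bit_length() - 1          # position of the leading one of a
    k2 = b.bit_length() - 1
    x1 = a - (1 << k1)               # a = 2**k1 + x1, so log2(a) ~ k1 + x1 / 2**k1
    x2 = b - (1 << k2)
    # Sum of the two fractional parts, scaled by 2**(k1 + k2).
    cross = (x1 << k2) + (x2 << k1)
    if cross < (1 << (k1 + k2)):
        # Fractions sum to less than 1: antilog is 2**(k1+k2) * (1 + f1 + f2).
        return (1 << (k1 + k2)) + cross
    # Fractions sum to 1 or more: antilog is 2**(k1+k2+1) * (f1 + f2).
    return cross << 1
```

The approximation never overestimates; its worst case is an underestimate of about 11%, which is why practical designs add correction circuits of the kind the abstract mentions.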

6.
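Design of the Integer Unit of an Embedded RISC Microprocessor   Total citations: 3, self-citations: 2, citations by others: 1
This article presents the design of NPUARM, a 32-bit embedded RISC microprocessor core compatible with the ARM7TDMI. It focuses on the design of the integer execution unit, including key execution components such as the ALU, multiplier, barrel shifter, and register file. NPUARM was designed with a top-down methodology and described in Verilog HDL; after simulation, synthesis, and place-and-route, verification showed that the design fully meets the intended results.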

7.
3:2 counters and 4:2 compressors have been widely used for multiplier implementations. In this paper, a fast 5:3 compressor is derived for high-speed multiplier implementations. The fast 5:3 compression is obtained by applying two rows of fast 2-bit adder cells to five rows in a partial product matrix. As a design example, a 16-bit by 16-bit MAC (Multiply and Accumulate) design is investigated both in a purely logical gate implementation and in a highly customized design. For the partial product reduction, the use of the new 5:3 compression leads to 14.3% speed improvement in terms of XOR gate delay. In a dynamic CMOS circuit implementation using 0.225-μm bulk CMOS technology, 11.7% speed improvement is observed with 8.1% less power consumption for the reduction tree.
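
The 3:2 counters and 4:2 compressors that the abstract above takes as its starting point can be modeled at word level as carry-save additions. The sketch below covers only those two standard building blocks and a simple reduction loop for a 16-bit by 16-bit MAC; it does not attempt the paper's 5:3 compressor. All function names are hypothetical, and operands are assumed to be non-negative integers.

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 counter applied bitwise: three rows in, sum row and shifted carry row out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)   # majority function per bit
    return s, carry << 1                  # a + b + c == s + (carry << 1)


def compress_4to2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """4:2 compressor built from two cascaded 3:2 counters."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)


def reduce_rows(rows: list[int]) -> int:
    """Reduce partial-product rows to two with 3:2 counters, then do one final add."""
    rows = list(rows)
    while len(rows) > 2:
        s, c = csa_3to2(*rows[:3])
        rows = rows[3:] + [s, c]
    return sum(rows)   # stands in for the final carry-propagate adder


def mac16(x: int, y: int, acc: int) -> int:
    """Unsigned 16x16 multiply-accumulate using carry-save reduction."""
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(16)]
    return reduce_rows(rows + [acc])
```

Only the last step propagates carries across the word, which is why the depth of the reduction tree dominates multiplier delay.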

8.
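To address the high interconnect complexity of the Wallace tree and the resulting difficulty of layout implementation, a new adder array structure is proposed. This structure is superior to the ZM tree and the OS tree in regularity and interconnect complexity. A new CLA adder structure is also proposed to improve multiplier performance. The multiplier is implemented in a 1.5 μm CMOS process and completes one fixed-point multiplication in 56 ns and one floating-point multiplication in 76 ns.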

9.
In this paper a 2-μm CMOS, microprogrammable Signal Processor Core (SPC) is described, intended as the number crunching unit in single-chip general purpose digital signal processors. This core contains a 16 × 16 bit parallel multiplier, a 40-bit multiprecision accumulator, a 40-32-bit extractor, an overflow detection unit, a format adjuster, and a three-port register file for local storage of 15 operands. Its 100-ns throughput rate makes it highly suitable for signal processing systems with sample rates up to 50 kHz (speech, telecom, and HiFi audio). The architecture of this unit is discussed in detail. The design approach, using full-custom cells, bit-sliced functional blocks, and a complete bottom-up logical verification of mask data, is also described. The Signal Processor Core contains 19 200 transistors on a 15.5-mm² area. This corresponds to a packing density of 1200 transistors/mm².

10.
We describe a micropower 16×16-bit multiplier (18.8 μW/MHz @1.1 V) for low-voltage power-critical low speed (≤5 MHz) applications including hearing aids. We achieve the micropower operation by substantially reducing (by ~62% and ~79% compared to conventional 16×16-bit and 32×32-bit designs respectively) the spurious switching in the Adder Block in the multiplier. The approach taken is to use latches to synchronize the inputs to the adders in the Adder Block in a predetermined chronological sequence. The hardware penalty of the latches is small because the latches are integrated (as opposed to external latches) into the adder, termed the latch adder (LA). By means of the LAs and timing, the number of switchings (spurious and that for computation) is reduced from ~5.6 and ~10 per adder in the adder block in conventional 16×16-bit and 32×32-bit designs respectively to ~2 in our designs. Based on simulations and measurements on prototype ICs (0.35 μm three metal dual poly CMOS process), we show that our 16×16-bit design dissipates ~32% less power, is ~20% slower but has ~20% better energy-delay-product (EDP) than conventional 16×16-bit multipliers. Our 32×32-bit design is estimated to dissipate ~53% less power, be ~29% slower, but have ~39% better EDP than the conventional general multiplier.

11.
A 16-bit × 16-bit multiplier for two two's-complement binary numbers based on a new algorithm is described. This multiplier has been fabricated on an LSI chip using a standard n-E/D MOS process technology with a 2.7-μm design rule. This multiplier is characterized by use of a binary tree of redundant binary adders. In the new algorithm, n-bit multiplication is performed in a time proportional to log₂ n and the physical design of the multiplier is constructed of a regular cellular array. This new algorithm has been proposed by N. Takagi et al. (1982, 1983). The 16-bit × 16-bit multiplier chip size is 5.8 × 6.3 mm² using the new layout for a binary adder tree. The chip contains about 10600 transistors, and the longest logic path includes 46 gates. The multiplication time was measured as 120 ns. It is estimated that a 32-bit × 32-bit multiplication time is about 140 ns.
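
The key property behind the redundant binary adder tree in the abstract above is that two numbers with digits in {-1, 0, 1} can be added in constant time, because no carry travels more than one position. The sketch below uses a common textbook formulation of that carry-free rule; it is an assumption that this matches the chip's exact cell logic, and the names `rb_add` and `rb_value` are hypothetical. Digit lists are least significant digit first.

```python
def rb_value(digits: list[int]) -> int:
    """Integer value of a redundant binary digit list (least significant first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def rb_add(x: list[int], y: list[int]) -> list[int]:
    """Carry-free addition of two redundant binary numbers with digits in {-1, 0, 1}."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    w = [0] * (n + 1)   # intermediate sum digit at each position
    c = [0] * (n + 1)   # c[i] is the carry INTO position i
    for i in range(n):
        s = x[i] + y[i]
        # Looking one position down tells us the sign the incoming carry can have.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if s == 2:
            c[i + 1], w[i] = 1, 0
        elif s == 1:
            c[i + 1], w[i] = (1, -1) if lower_nonneg else (0, 1)
        elif s == 0:
            c[i + 1], w[i] = 0, 0
        elif s == -1:
            c[i + 1], w[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:  # s == -2
            c[i + 1], w[i] = -1, 0
    # w[i] and c[i] never share a sign, so each result digit stays in {-1, 0, 1}.
    return [w[i] + c[i] for i in range(n + 1)]
```

Every output digit depends only on positions i, i-1, and i-2 of the inputs, so addition time is independent of word length; a binary tree of such adders then sums n partial products in a number of levels proportional to log₂ n.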

12.
This paper describes a 32-bit address generation unit designed for 4-GHz operation in 1.2-V 130-nm technology. The AGU utilizes a 152-ps sparse-tree adder core to achieve 20% delay reduction, 80% lower interconnect complexity, and a low (1%) active energy leakage component. The dual-VT semidynamic implementation of the adder core provides the performance of a dynamic CMOS design with an average energy profile similar to static CMOS, enabling 71% savings in average energy with a good sub-130-nm scaling trend.

13.
A high speed redundant binary (RB) architecture, which is optimized for the fast CMOS parallel multiplier, is developed. This architecture enables one to convert a pair of partial products in normal binary (NB) form to one RB number with no additional circuit. We improved the RB adder (RBA) circuit so that it can make a fast addition of the RB partial products. We also simplified the converter circuit that converts the final RB number into the corresponding NB number. The carry propagation path of the converter circuit is carried out with only multiplexer circuits. A 54×54-bit multiplier is designed with this architecture. It is fabricated by 0.5 μm CMOS with triple level metal technology. The active area size is 3.0×3.08 mm² and the number of transistors is 78,800. This is the smallest number for all 54×54-bit multipliers ever reported. Under the condition of 3.3 V supply voltage, the chip achieves 8.8 ns multiplication time. The power dissipation of 540 mW is estimated for the operating frequency of 100 MHz. These are, so far, the fastest speed and the lowest power for 54×54-bit multipliers with 0.5-μm CMOS.

14.
In this paper, a design of a 16-bit asynchronous multiplier is presented. The multiplier core consists of small basic blocks. Each block includes handshake and computation logic and communicates with four neighbor cells in asynchronous handshake fashion using four-phase protocol. The computation logic is implemented in dual-rail coded domino logic. The input and output signals of the multiplier are single-rail coded. The single-rail coding allows communication with other single-rail coded asynchronous blocks using four-phase signaling. The design speed is self-adjusting to the technology parameters and supply voltage variations. The multiplier has low latency and achieves a throughput rate of 250 MHz. The multiplier was fabricated in a 0.6-μm CMOS process and has a core size of 4.3×2.1 mm.

15.
This paper deals with a new approach to the design of high-performance asynchronous pipelined datapaths. A novel methodology to implement the self-timed stages of a data-path is demonstrated. It is based on the use of both static and dynamic CMOS modules. The former act as overlapped execution circuits and anticipate their computation with respect to the dynamic blocks. An appropriate four-phase protocol able to orchestrate the proposed architecture and a new efficient handshake circuit are described. The above method, applied to a 32-bit addition stage, allows a performance gain to be obtained of up to about 40% and a reduction in power dissipation of about 33%, with a reasonable area overhead compared with conventional designs.

16.
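Based on a study of the architecture and design methodology of a 32-bit fixed-point DSP, this article explains the operating principles of the key logic blocks of the CPU in a 32-bit fixed-point DSP, including the ALU, MPY, ARAU, pipeline, instruction set, and bus interface, and analyzes the design approach and implementation of each block. Using a standard-cell-based forward design methodology, a fixed-point DSP circuit with a 32-bit instruction set was designed. The circuit adopts a Harvard bus architecture and can perform a 16×16-bit signed integer multiplication, a 32-bit accumulation, and arithmetic and logic operations on 32-bit data within a single cycle, providing high processing precision. The circuit was taped out in a 0.5 μm 1P3M CMOS process, with an integration level of 70,000 gates, an operating frequency of up to 36 MHz, and a dynamic power consumption of 594 mW.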

17.
A novel high-performance priority encoder design using standard CMOS library cells is proposed. The new encoder design implementation accommodates both high- and low-priority functionalities with a scalable design structure through a special prefixing scheme. The prefixing scheme is applied to minimize the entire propagation delay and exploit the shared hardware between the high- and low-priority evaluation logic circuitry. The proposed encoder shows significant improvement in terms of speed, robustness for top-level floor plan routing, and modularity with pattern structure compared to the existing encoder designs. Simulations are conducted for different encoder inputs in 0.15-μm TSMC CMOS technology, where a 32-bit priority encoder is used as a test vehicle for comparative improvement measurements. The expected results show that the 32-bit encoder operates at a maximum frequency of 667 MHz with a total count of 1106 transistors and a maximum total power consumption of 13.8 mW.

18.
In this paper, a double-precision carry-save adder (CSA)-based array multiplier is designed using the Dual Mode Logic (DML) approach in a commercial 65-nm low-power CMOS technology. DML typically allows on-the-fly controllable switching at the gate level between static and dynamic operation modes. The proposed multiplier exploits this unique ability of DML to efficiently trade performance and energy consumption when considering on-demand double-precision (8 × 8-bit or 16 × 16-bit) operations. This occurs in the DML multiplier working in a mixed operation mode, i.e., by employing the static and dynamic mode for lower and higher precision operations, respectively. In fact, the use of the dynamic mode for higher precision operations ensures higher performance as compared to the standard CMOS circuit (16% gain on average) at the cost of higher energy consumption. Such energy penalty is counterbalanced at lower precision operations where the static mode is enabled in the DML circuit. Overall, the adoption of the mixed operation mode in the proposed DML multiplier proves to be beneficial to achieve a better performance/energy trade-off with respect to the standard CMOS implementation and to the case when using either the static or the dynamic mode for both operations at the two different precisions. When compared to its CMOS counterpart, our DML design operating in the mixed mode exhibits an average improvement of 15% in terms of energy-delay product (EDP) under wide-range supply voltage scaling. Such benefit is maintained over process-voltage-temperature (PVT) variations.

19.
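This paper designs the datapath of an asynchronous LDPC decoder, using asynchronous circuits to reduce the power caused by glitches arising from unequal signal arrival times and by the clock. The main arithmetic units in the datapath are designed by exploiting the statistical properties of the input data, which reduces redundant computation. A synchronous datapath and a clock-gated datapath are also implemented for comparison. The three designs use similar architectures and implement the same function in a 0.18 μm CMOS process. Simulation results show that the proposed asynchronous design consumes the least power, saving 42.0% and 32.6% relative to the synchronous and clock-gated designs, respectively. Its performance is slightly below that of the synchronous design but better than that of the clock-gated design: the delay is 1.09 ns for the synchronous design, 1.61 ns for the clock-gated design, and 1.20 ns for the asynchronous design.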

20.
A multiplier architecture and encoding scheme well suited for programmable digital filtering applications is described. The multiplier's partial product recoding scheme uses only simple multiplexers and takes advantage of a RAM that stores filter coefficients. We use an optimized 20-transistor full-adder cell in the carry-save adder array, and a carry-select vector-merge adder produces the final output. An integrated circuit comprising an 11-b by 11-b multiplier using second-order recoding has been fabricated in 2-μm CMOS technology. It operates in 22 ns and its core occupies 1.53 mm². Also, an 11-b by 16-b multiplier using third-order recoding has been fabricated through MOSIS in 1.2-μm CMOS technology. Its core occupies 0.9 mm² and it operates in 19 ns.
