1.

Fast multiplier bit-product matrix reduction using bit-ordering and parity generation

Ben C. Drerup, Earl E. Swartzlander Jr. 《The Journal of VLSI Signal Processing》1994,7(3):249-257

The “Wallace Tree/Dadda Fast Multiplier” consists of the following three steps: 1) form a bit-product matrix; 2) reduce the bit-product matrix to two rows; and 3) sum the two rows. This article describes a novel approach to implementing the second step. The new second step is accomplished with sorting and parity generation logic. This is very different from the Wallace/Dadda method, which uses full and half adders to reduce the bit-product matrix. This approach yields a multiplier that is faster than a Wallace/Dadda multiplier when multiplying small numbers. However, this method also requires more gates to implement.
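
The three steps can be modeled at bit level. This Python sketch models the conventional full-adder reduction that the article contrasts with, not the authors' sorting/parity circuit. It assumes unsigned n-bit operands and uses a greedy, Wallace-style schedule with full adders only (real Wallace/Dadda trees also use half adders). All function names are hypothetical. Note that each full adder's sum output is already the parity (XOR) of its three inputs.

```python
def bit_product_matrix(a: int, b: int, n: int) -> list[list[int]]:
    """Step 1: column j collects every bit-product a_i AND b_k with i + k == j."""
    cols: list[list[int]] = [[] for _ in range(2 * n)]
    for i in range(n):
        for k in range(n):
            cols[i + k].append((a >> i) & (b >> k) & 1)
    return cols


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 counter: the sum is the parity of the inputs, the carry is the majority."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def reduce_to_two_rows(cols: list[list[int]]) -> list[list[int]]:
    """Step 2: each stage feeds every group of three bits in a column to a full
    adder (sum stays in column j, carry moves to column j + 1) and passes the
    leftover bits through. Every full adder turns 3 bits into 2, so the total
    bit count shrinks each stage and the loop terminates."""
    while any(len(col) > 2 for col in cols):
        nxt: list[list[int]] = [[] for _ in range(len(cols) + 1)]
        for j, col in enumerate(cols):
            groups = len(col) // 3
            for t in range(groups):
                s, c = full_adder(*col[3 * t:3 * t + 3])
                nxt[j].append(s)
                nxt[j + 1].append(c)
            nxt[j].extend(col[3 * groups:])
        cols = nxt
    return cols


def wallace_style_multiply(a: int, b: int, n: int) -> int:
    """Unsigned n-bit multiplication through steps 1 to 3."""
    cols = reduce_to_two_rows(bit_product_matrix(a, b, n))
    # Step 3: the (at most) two bits per column form two rows; add them.
    row0 = sum(col[0] << j for j, col in enumerate(cols) if len(col) > 0)
    row1 = sum(col[1] << j for j, col in enumerate(cols) if len(col) > 1)
    return row0 + row1


print(wallace_style_multiply(200, 123, 8))  # 24600
```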

2.

General algorithms for a simplified addition of 2's complement numbers

Salomon O., Green J.-M., Klar H. 《Solid-State Circuits, IEEE Journal of》1995,30(7):839-844

Two algorithms for both simplified carry-save and carry-ripple addition of 2's complement numbers are presented. The algorithms form the partial products so that they have exclusively positive coefficients, which eliminates the need for the usual sign-bit extension. This reduces circuit area by up to six full adders per row of adders when partial products are added in an N/2 or Wallace tree. Furthermore, the capacitive load of the intermediate sum and carry sign-bit signals decreases by up to a factor of seven, which leads to a corresponding reduction in delay. Although the algorithms are derived for multipliers, they can always be applied to appropriate adder circuits.
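
Giving every partial-product bit a positive weight can be illustrated with the well-known modified Baugh-Wooley form. This is a swapped-in standard technique, not necessarily the algorithm of this paper. Bits that pair one sign bit with one magnitude bit are complemented, and two constant ones at weights 2^n and 2^(2n-1) replace all sign extension. The sketch assumes n-bit two's-complement operands, and the function name is hypothetical.

```python
def baugh_wooley_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit two's-complement integers using only positive-weight
    partial-product bits plus a constant (modified Baugh-Wooley form)."""
    lo, hi = -(1 << (n - 1)), 1 << (n - 1)
    assert lo <= a < hi and lo <= b < hi, "operands must fit in n bits"
    total = 0
    for i in range(n):
        for j in range(n):
            bit = (a >> i) & (b >> j) & 1          # a_i AND b_j
            # A bit pairing exactly one sign bit with a magnitude bit has a
            # negative weight; -x*2^k is rewritten as (1 - x)*2^k - 2^k, so the
            # bit is complemented (NAND) and -2^k moves into the constant.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    # The collected -2^k terms sum to -2^(2n-1) + 2^n, which equals
    # +2^(2n-1) + 2^n modulo 2^(2n): two constant ones, no sign extension.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1
    # Read the 2n-bit result back as a signed integer.
    return total - (1 << (2 * n)) if total >> (2 * n - 1) else total


print(baugh_wooley_multiply(-8, 7, 4))  # -56
```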

3.

Optimization Techniques for FPGA-Based Wave-Pipelined DSP Blocks

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(7):783-793

In this paper, techniques for efficient implementation of field-programmable gate-array (FPGA)-based wave-pipelined (WP) multipliers, accumulators, and filters are presented. The performance of WP and pipelined systems is compared. The major contributions of this paper are the development of an on-chip clock generation scheme that permits finer tuning of the frequency, a synthesis technique that reduces area and latency by 25%, a placement utility that yields a 10%–40% increase in speed, and an interleaving scheme for filters that reduces the number of multipliers required by 50%. WP multipliers of size 2×6 and the filters using them are found to be 11% faster and to require lower power than those using pipelined multipliers. Filters with higher-order WP multipliers also operate with lower power, at the cost of speed. The delay-register products of such filters are found to be about 60% lower than those of filters using pipelined multipliers. The paper also outlines applications of these techniques to Spartan II FPGAs and a self-tuning scheme for optimizing the speed.

4.

Efficient Recursive Digital Filters using Combined Look-Ahead Denominator Distribution and Numerator Decomposition

J. Living, M. Moniri, S.B. Tennakoon 《The Journal of VLSI Signal Processing》2001,27(3):269-295

This paper presents a new, efficient method for designing stable look-ahead pipelined recursive digital filters with fewer multipliers. The multiplier savings are obtained by generating pipelined transfer functions that combine numerator decomposition with look-ahead denominator distribution. This is achieved by not restricting the denominator to either the clustered or the scattered form, while preserving the term count of the unpipelined filter transfer function. The coefficients of the pipelined transfer function are obtained from a running product, solved using matrices, and a two-stage algorithm (pre- and post-distribution) whose stages each have a multiplier cost that is minimised independently. The proposed method can produce pipelined filter designs requiring fewer multipliers than previously reported methods. For example, across a range of second-order transfer functions and pipelining levels, an average 40% reduction in multipliers can be achieved, and an 18% reduction in the multipliers needed for pipelining is obtained for a sixth-order filter. Furthermore, the proposed two-stage algorithm can accommodate pipelined adders as well as pipelined multipliers in the recursive filter structure, avoiding the delay penalties suffered by previously reported methods. A detailed analysis confirms that filters designed with the proposed method do not suffer increased noise.
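
The classic clustered look-ahead transformation that such methods build on can be checked numerically. A minimal sketch, assuming a first-order section y[n] = a·y[n-1] + x[n], zero initial state, and floating-point arithmetic; it shows only the textbook look-ahead identity, not the authors' combined distribution/decomposition algorithm, and the function names are hypothetical.

```python
def first_order(a: float, x: list[float]) -> list[float]:
    """Unpipelined recursion y[n] = a*y[n-1] + x[n], zero initial state."""
    y, prev = [], 0.0
    for xn in x:
        prev = a * prev + xn
        y.append(prev)
    return y


def first_order_lookahead(a: float, x: list[float], m: int) -> list[float]:
    """Clustered m-step look-ahead form of the same filter:
        y[n] = a**m * y[n-m] + sum_{k=0}^{m-1} a**k * x[n-k]
    The feedback loop now spans m delays, so up to m pipeline stages can be
    placed inside it; the price is the extra non-recursive numerator taps."""
    taps = [a ** k for k in range(m)]
    feedback = a ** m
    y: list[float] = []
    for n in range(len(x)):
        acc = sum(taps[k] * x[n - k] for k in range(m) if n - k >= 0)
        if n >= m:
            acc += feedback * y[n - m]
        y.append(acc)
    return y
```

When m is a power of two, the numerator sum factors as Π_i (1 + a^(2^i) z^(-2^i)), which needs log2(m) multipliers instead of m - 1; decomposition of this kind is a standard source of multiplier savings in look-ahead pipelined filters.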

5.

High-performance FIR filter design based on sharing multiplication

Jongsun Park, Muhammad K., Roy K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(2):244-253

Finite impulse response (FIR) filtering can be expressed as multiplications of vectors by scalars. We present high-speed designs for FIR filters based on a computation sharing multiplier, which specifically targets computation reuse in vector-scalar products. The performance of the proposed implementation is compared with implementations based on carry-save and Wallace tree multipliers in 0.35-μm technology. We show that the sharing multiplier scheme improves speed by approximately 52% and 33% with respect to FIR filter implementations based on the carry-save multiplier and the Wallace tree multiplier, respectively. In addition, the sharing multiplier scheme has a smaller power-delay product than the other multiplier schemes. Using voltage scaling, the power consumption of the FIR filter based on the computation sharing multiplier can be reduced to 41% of that of the FIR filter based on the Wallace tree multiplier at the same operating frequency.
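
Computation sharing in a vector-scalar product can be sketched as follows: the products of the input x with a small "alphabet" are computed once, and every coefficient multiplication reuses them through shifts and adds. This is a minimal sketch, assuming non-negative integer coefficients split into 4-bit chunks and a hypothetical alphabet of all odd 4-bit values; the alphabet set and the select/shift/add hardware of the paper may differ.

```python
ALPHABET = (1, 3, 5, 7, 9, 11, 13, 15)  # hypothetical: every odd 4-bit value


def precompute_bank(x: int) -> dict[int, int]:
    """Shared products alpha*x, computed once per input sample x."""
    return {alpha: alpha * x for alpha in ALPHABET}


def odd_part(v: int) -> tuple[int, int]:
    """Write a nonzero value v as alpha << shift with alpha odd."""
    shift = 0
    while v % 2 == 0:
        v //= 2
        shift += 1
    return v, shift


def shared_multiply(coeff: int, bank: dict[int, int]) -> int:
    """coeff * x using only look-ups, shifts and adds on the shared bank.
    coeff is split into 4-bit chunks; each nonzero chunk is alpha << shift."""
    assert coeff >= 0
    result, pos = 0, 0
    while coeff:
        chunk = coeff & 0xF
        if chunk:
            alpha, shift = odd_part(chunk)
            result += bank[alpha] << (shift + pos)
        coeff >>= 4
        pos += 4
    return result


coeffs = [23, 118, 201, 64]          # hypothetical FIR taps
bank = precompute_bank(-37)          # one bank per input sample, shared by all taps
print([shared_multiply(c, bank) for c in coeffs])  # [-851, -4366, -7437, -2368]
```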

6.

Modified Booth Multipliers With a Regular Partial Product Array

《Circuits and Systems II: Express Briefs, IEEE Transactions on》2009,56(5):404-408

The conventional modified Booth encoding (MBE) generates an irregular partial product array because of the extra partial product bit at the least significant bit position of each partial product row. In this brief, a simple approach is proposed to generate a regular partial product array with fewer partial product rows and negligible overhead, thereby lowering the complexity of partial product reduction and reducing the area, delay, and power of MBE multipliers. The proposed approach can also be used to regularize the partial product array of post-truncated MBE multipliers. Implementation results demonstrate that the proposed MBE multipliers with a regular partial product array achieve significant improvements in area, delay, and power consumption compared with conventional MBE multipliers.
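
The radix-4 recoding and the extra least-significant bit per row can be seen in a small bit-level model. It assumes n-bit two's-complement operands with n even and models the conventional MBE array the brief starts from, not the proposed regular array; the function names are hypothetical.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Modified (radix-4) Booth recoding of an n-bit two's-complement y, n even.
    Digit j = -2*y[2j+1] + y[2j] + y[2j-1] with y[-1] = 0, so each digit is in
    {-2, -1, 0, 1, 2} and sum(d_j * 4**j) == y."""
    assert n % 2 == 0

    def bit(i: int) -> int:
        return (y >> i) & 1 if i >= 0 else 0

    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1) for j in range(n // 2)]


def booth_multiply(x: int, y: int, n: int) -> int:
    """Add the n/2 Booth partial-product rows modulo 2**(2n).
    A negative digit is applied as the one's complement of |d|*x plus a
    separate 'neg' bit at the row's least significant position; that extra
    bit per row is what makes the conventional MBE array irregular."""
    width = 2 * n
    mask = (1 << width) - 1
    acc = 0
    for j, d in enumerate(booth_radix4_digits(y, n)):
        row = (abs(d) * x) & mask       # |d|*x, sign-extended to 2n bits
        neg = 1 if d < 0 else 0
        if neg:
            row = ~row & mask           # one's complement of the row
        acc += (row << (2 * j)) + (neg << (2 * j))
    acc &= mask
    # Interpret the 2n-bit sum as a signed result.
    return acc - (1 << width) if acc >> (width - 1) else acc


print(booth_radix4_digits(-30, 8))   # [-2, 1, -2, 0]
print(booth_multiply(101, -30, 8))   # -3030
```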

7.

Digit-Serial Complex-Number Multipliers on FPGAs

T. Sansaloni, J. Valls, K.K. Parhi 《The Journal of VLSI Signal Processing》2003,33(1-2):105-115

This paper presents an optimized FPGA implementation of digit-serial complex-number multipliers (CMs) using Booth recoding techniques and tree adders based on carry-save (CS) and ripple-carry adders (RCA). Complex-number multipliers of this kind can be pipelined at the same level independently of the digit size. Both variable- and fixed-coefficient CMs are considered. In the first case, an efficient mapping of the modified Booth recoding and the partial product generation is presented, which reduces logic depth. The combination of 5:3 and 4:3 converters in the CS structure and the use of RCA trees lead to a minimum area requirement. In the case of fixed-coefficient CMs, the partial-product generator is based on look-up tables, and multi-bit Booth recoding is used to reduce the area and increase the performance of the circuit. The study reveals that an efficient mapping of the 5-bit Booth recoding to generate the partial products is the optimum multi-bit recoding when Xilinx FPGA devices are used.
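
For the fixed-coefficient case, multi-bit Booth recoding with look-up tables can be modeled arithmetically: the variable input is recoded, and each digit selects a stored multiple of the fixed coefficient. A minimal sketch, assuming 8-bit two's-complement inputs, integer coefficients, and k = 4 (5-bit windows, digits in [-8, 8]); it covers only the arithmetic, not the digit-serial timing or the mapping onto Xilinx LUTs, and the coefficient in the usage line is hypothetical.

```python
def booth_digits(y: int, n: int, k: int) -> list[int]:
    """Radix-2**k Booth recoding of an n-bit two's-complement y (k divides n).
    Each digit reads a (k+1)-bit window (k bits plus the bit below them);
    k = 4 gives the 5-bit recoding with digits in [-8, 8]."""
    assert n % k == 0
    digits, prev = [], 0                     # prev holds y[j-1]; y[-1] = 0
    for j in range(0, n, k):
        window = (y >> j) & ((1 << k) - 1)   # bits y[j+k-1] .. y[j]
        top = window >> (k - 1)              # y[j+k-1]
        digits.append(window - (top << k) + prev)
        prev = top
    return digits


def make_lut(coeff: int, k: int) -> dict[int, int]:
    """Partial product d*coeff for every possible digit d of a fixed coefficient."""
    half = 1 << (k - 1)
    return {d: d * coeff for d in range(-half, half + 1)}


def lut_multiply(x: int, lut: dict[int, int], n: int, k: int) -> int:
    """x * coeff as a sum of table entries, each shifted by its digit weight."""
    return sum(lut[d] << (k * g) for g, d in enumerate(booth_digits(x, n, k)))


def complex_multiply_fixed(xr: int, xi: int, lut_r: dict[int, int],
                           lut_i: dict[int, int], n: int = 8,
                           k: int = 4) -> tuple[int, int]:
    """(xr + j*xi) * (cr + j*ci), with lut_r / lut_i built once from cr / ci."""
    real = lut_multiply(xr, lut_r, n, k) - lut_multiply(xi, lut_i, n, k)
    imag = lut_multiply(xr, lut_i, n, k) + lut_multiply(xi, lut_r, n, k)
    return real, imag


# Usage with a hypothetical fixed coefficient -77 + 45j:
LUT_R, LUT_I = make_lut(-77, 4), make_lut(45, 4)
print(complex_multiply_fixed(100, -30, LUT_R, LUT_I))  # (-6350, 6810)
```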

8.

Split-channel pipelined packet scheduling for wireless networks

Xue Yang, Vaidya N.H., Ravichandran P. 《Mobile Computing, IEEE Transactions on》2006,5(3):240-257

To reduce medium access control (MAC) overhead and improve channel utilization, there has been extensive research on dynamically adjusting the channel access behavior of a contending station based on channel feedback information. This paper explores an alternative approach, named pipelined packet scheduling, to reduce the MAC overhead. MAC overheads can be divided into bandwidth-dependent and bandwidth-independent components and these overheads can both be reduced by using split-channel pipelining mechanisms, as demonstrated in this paper. In the past, pipelining mechanisms have not been well studied. This paper introduces two total pipelining schemes that attempt to fully pipeline contention resolution with data transmission. Further, the paper identifies shortcomings of total pipelining in the wireless environment and proposes a partial pipelining approach to overcome these shortcomings. Simulation results show that substantial performance improvement in channel utilization, average packet access delay, and access energy cost can be achieved with a properly designed scheme.

9.

A Multiplier Design Based on the Dadda Tree

李路路, 何春, 宗竹林, 章凌宇 《Microelectronics & Computer (微电子学与计算机)》2011,28(5)

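In baseband signal-processing chips, area and speed are the two key metrics. Building on the modified Booth algorithm, this paper adopts the Dadda-tree compression algorithm, improves the basic compressor cell, and optimizes the handling of the sign bits and the trailing zero padding. The design keeps the parallel-computation advantage of the Wallace-tree structure while greatly reducing area, and because its structure is more regular than the Wallace tree, it is also easier to lay out. The compressed result is summed with a multi-level CLA block technique, which further increases the multiplier's speed. Synthesis with DC (Design Compiler) in SMIC's 0.13 μm eight-metal CMOS process gives a chip area of 20633.59 μm² and a maximum delay of only 3.00 ns.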
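
Dadda's reduction schedule fixes, in advance, how tall each column may be after every stage. A minimal sketch of the standard Dadda height sequence, assuming only the maximum column height as input; the paper's improved compressor cells and sign/zero-padding optimizations are not modeled, and the function name is hypothetical.

```python
def dadda_targets(max_height: int) -> list[int]:
    """Column-height limits for Dadda reduction, largest first.
    The limits follow d_1 = 2, d_{j+1} = floor(1.5 * d_j); each stage reduces
    every column to the next limit below the current maximum height, so the
    list length is the number of reduction stages."""
    if max_height <= 2:
        return []                       # already two rows: nothing to reduce
    limits = [2]
    while limits[-1] * 3 // 2 < max_height:
        limits.append(limits[-1] * 3 // 2)
    return limits[::-1]


print(dadda_targets(9))    # [6, 4, 3, 2]
print(dadda_targets(16))   # [13, 9, 6, 4, 3, 2]
```

For example, a maximum column height of 9 needs four stages, while a height of 16 needs six. Reducing the number of partial-product rows (as modified Booth encoding does) therefore shortens the tree directly.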

10.

Minimization of switching activities of partial products for designing low-power multipliers

Chen O.T.-C., Sandy Wang, Yi-Wen Wu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(3):418-433

This work presents low-power 2's complement multipliers that minimize the switching activities of partial products using the radix-4 Booth algorithm. Before the two input operands are multiplied, the one with the smaller effective dynamic range is processed to generate Booth codes, thereby increasing the probability that partial products are zero. By employing a dynamic-range determination unit to control the input data paths, a multiplier with a column-based adder tree of compressors or counters is designed. To further reduce power consumption, two multipliers based on row-based and hybrid-based adder trees are realized, operating on the effective dynamic ranges of the input data. Functional blocks of these two multipliers can preserve their previous input states for non-effective dynamic data ranges and thus reduce the number of their switching operations. To illustrate the low power dissipation of the proposed multipliers, theoretical analyses of the switching activities of partial products are derived. The proposed 16×16-bit multiplier with the column-based adder tree saves more than 31.2%, 19.1%, and 33.0% of the power consumed by the conventional multiplier in the ADPCM audio, G.723.1 speech, and wavelet-based image coders, respectively. Furthermore, the proposed multipliers with row-based and hybrid-based adder trees reduce power consumption by over 35.3%, 25.3%, and 39.6%, and by 33.4%, 24.9%, and 36.9%, respectively. When the product of hardware area, critical delay, and power consumption is considered, the proposed multipliers can outperform the conventional multipliers. Consequently, the multipliers proposed herein can be broadly used in various media-processing applications to achieve low power consumption at limited hardware cost and with little loss of speed.
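
The operand-ordering idea can be modeled in a few lines: the operand needing fewer two's-complement bits is Booth-encoded, so its upper radix-4 digits (inside the sign-extension region) are zero and the matching partial-product rows never switch. A minimal sketch, assuming 16-bit operands; the dynamic-range unit and the row/column gating of the paper are not modeled, and all names are hypothetical.

```python
def booth_digits_r4(v: int, n: int) -> list[int]:
    """Radix-4 Booth digits of an n-bit two's-complement v (n even)."""
    def bit(i: int) -> int:
        return (v >> i) & 1 if i >= 0 else 0
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1) for j in range(n // 2)]


def effective_range(v: int) -> int:
    """Bits needed to hold v in two's complement, sign bit included."""
    return (v if v >= 0 else ~v).bit_length() + 1


def order_operands(a: int, b: int) -> tuple[int, int]:
    """Return (multiplicand, booth_operand). The operand with the smaller
    effective dynamic range is Booth-encoded, so its upper digits are zero
    and the corresponding partial-product rows stay zero."""
    return (a, b) if effective_range(b) <= effective_range(a) else (b, a)


def zero_rows(v: int, n: int = 16) -> int:
    """Number of all-zero partial-product rows when v is Booth-encoded."""
    return sum(1 for d in booth_digits_r4(v, n) if d == 0)


print(order_operands(3, 1000))          # (1000, 3)
print(zero_rows(3), zero_rows(1000))    # 6 5
```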

11.

Integer Linear Programming-Based Bit-Level Optimization for High-Speed FIR Decimation Filter Architectures

Anton Blad, Oscar Gustafsson 《Circuits, Systems, and Signal Processing》2010,29(1):81-101

Analog-to-digital converters based on sigma-delta modulation have shown promising performance, with steadily increasing bandwidth. However, associated with the increasing bandwidth is an increasing modulator sampling rate, which becomes costly to decimate in the digital domain. Several architectures exist for the digital decimation filter, and among the more common and efficient are polyphase decomposed finite-length impulse response (FIR) filter structures. In this paper, we consider such filters implemented with partial product generation for the multiplications, and carry-save adders to merge the partial products. The focus is on the efficient pipelined reduction of the partial products, which is done using a bit-level optimization algorithm for the tree design. However, the method is not limited only to filter design, but may also be used in other applications where high-speed reduction of partial products is required. The presentation of the reduction method is carried out through a comparison between the main architectural choices for FIR filters: the direct-form and transposed direct-form structures. For the direct-form structure, usage of symmetry adders for linear-phase filters is investigated, and a new scheme utilizing partial symmetry adders is introduced. The optimization results are complemented with energy dissipation and cell area estimations for a 90 nm CMOS process.
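
Symmetry adders exploit the coefficient symmetry of a linear-phase filter: the two input samples that share a coefficient are added before the multiplication. A minimal sketch of plain (full) symmetric pre-addition in the direct form, assuming integer coefficients and samples and zero initial state; the partial symmetry adders and the ILP-based tree optimization of the paper are not modeled, and the function names are hypothetical.

```python
def fir_direct(h: list[int], x: list[int]) -> list[int]:
    """Direct form: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_symmetric(h: list[int], x: list[int]) -> list[int]:
    """Direct form for a linear-phase filter (h[k] == h[N-1-k]) with symmetry
    adders: the two samples that share a coefficient are added first, so each
    output needs ceil(N/2) multiplications instead of N."""
    N = len(h)
    assert all(h[k] == h[N - 1 - k] for k in range(N)), "h must be symmetric"

    def sample(i: int) -> int:
        return x[i] if i >= 0 else 0

    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(N // 2):
            acc += h[k] * (sample(n - k) + sample(n - (N - 1 - k)))  # symmetry adder
        if N % 2:
            acc += h[N // 2] * sample(n - N // 2)                    # centre tap
        y.append(acc)
    return y
```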

12.

A 600-MHz 54×54-bit multiplier with rectangular-styled Wallace tree

Itoh N., Naemura Y., Makino H., Nakase Y., Yoshihara T., Horiba Y. 《Solid-State Circuits, IEEE Journal of》2001,36(2):249-257

This paper presents an efficient layout method for a high-speed multiplier. The Wallace-tree method is generally used for high-speed multipliers. In the conventional Wallace tree, however, every partial product is added in a single direction, from top to bottom. Therefore, the number of adders increases as the addition proceeds from stage to stage. As a result, a dead area appears when the multiplier is laid out in a rectangle. To solve this problem, we propose a rectangular Wallace-tree construction method. In our method, the partial products are divided into two groups that are added in opposite directions: the partial products in the first group are added downward, and those in the second group are added upward. This eliminates the dead area. We also optimized the carry propagation between the two groups to achieve high speed and a simple layout. We applied the method to a 54×54-bit multiplier, achieving an area of 980 μm × 1000 μm and a 600 MHz clock speed in 0.18 μm CMOS technology.

13.

Design of power efficient butterflies from Radix-2 DIT FFT using adder compressors with a new XOR gate topology

Mateus Beck Fonseca, Eduardo A. César da Costa, João B. S. Martins 《Analog Integrated Circuits and Signal Processing》2012,73(3):945-954

This paper addresses the design of power-efficient dedicated structures for Radix-2 Decimation-in-Time (DIT) pipelined butterflies, aimed at implementing low-power Fast Fourier Transforms (FFTs), using adder compressors with a new XOR gate topology. In FFT computation the butterflies play a central role, since they compute the complex terms. Because this calculation involves multiplying input data by appropriate coefficients, optimizing the butterfly can help reduce the power consumption of FFT architectures. In this paper, several dedicated structures for the 16-bit-wide pipelined Radix-2 DIT butterfly, running at 100 MHz, are implemented, with the main goal of minimizing both the number of real multipliers and the critical path of the structures. This is done by changing the structure of the complex multipliers and applying them in the butterflies. Logic synthesis of the implemented butterflies used the Cadence Encounter RTL Compiler tool with the XFAB MOSLP 0.18 μm library. Area and power consumption results are presented for the synthesized butterflies. For power consumption, switching activity analysis is performed by applying 10,000 input vectors to the butterfly inputs. The main results show that combining the pipeline approach with efficient adder compressors using the new XOR gate topology significantly reduces the power consumption of the butterflies.
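
One well-known way to cut the number of real multipliers in a butterfly's complex multiplication is the three-multiplication form, sketched here for a Radix-2 DIT butterfly. Whether it matches the complex-multiplier structures used in the paper is an assumption; the sketch uses integer (fixed-point) tuples without rounding or scaling, and the function names are hypothetical.

```python
def complex_mul_3(ar: int, ai: int, wr: int, wi: int) -> tuple[int, int]:
    """(ar + j*ai) * (wr + j*wi) with three real multiplications.
    For a fixed twiddle factor, (wi - wr) and (wr + wi) are constants that
    can be precomputed, leaving only the (ar + ai) pre-addition."""
    k1 = wr * (ar + ai)
    k2 = ar * (wi - wr)
    k3 = ai * (wr + wi)
    return k1 - k3, k1 + k2


def dit_butterfly(a: tuple[int, int], b: tuple[int, int],
                  w: tuple[int, int]) -> tuple[tuple[int, int], tuple[int, int]]:
    """Radix-2 DIT butterfly: A = a + w*b, B = a - w*b."""
    tr, ti = complex_mul_3(b[0], b[1], w[0], w[1])
    return (a[0] + tr, a[1] + ti), (a[0] - tr, a[1] - ti)


print(dit_butterfly((3, -2), (5, 7), (2, -3)))  # ((34, -3), (-28, -1))
```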

14.

Built-in Test with Modified-Booth High-Speed Pipelined Multipliers and Dividers

Hao-Yung Lo, Hsiu-Feng Lin, Chichyang Chen, Jenshiuh Liu, Chia-Cheng Liu 《Journal of Electronic Testing》2003,19(3):245-269

An embedded test pattern generator scheme for large-operand (unlimited bit length) multipliers and dividers is presented, using a simple digital circuit. The scheme is based on generating cyclic code polynomials from a generator polynomial G(X), combined with the Modified-Booth algorithm. Owing to the former, the hardware complexity is low, and the multiplier and divider can share the same hardware with a small change of control lines. Owing to the latter, the number of add/subtract operations needed to form the final product is halved. Therefore, the proposed pipelined multipliers permit very high throughput for arbitrary digit sizes. Only full adders/subtractors and shift registers are used in the proposed multiplier and divider hardware. The input data of the multiplier/divider can be processed in parallel or in a pipelined fashion without considering carry/borrow delays during the operations. The speed of computation is therefore improved by approximately a factor of 2. Since most components can be used for both the multiplier and the divider, with full adders replaced by subtractors when switching from a multiplier to a divider, the hardware is substantially reduced. In addition, because these functional units incorporate cyclic code generators, they can be used for built-in self-test (BIST).
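
A cyclic-code pattern generator is essentially a linear-feedback shift register that repeatedly multiplies its state polynomial by X modulo G(X). The abstract does not give the paper's G(X), so this minimal sketch assumes a hypothetical primitive polynomial G(X) = X^4 + X + 1, which yields a maximal-length sequence visiting all 15 nonzero 4-bit patterns; the coupling to the Booth datapath is not modeled.

```python
GEN_POLY = 0b10011   # hypothetical G(X) = X^4 + X + 1 (a primitive polynomial)
DEGREE = 4


def next_pattern(state: int) -> int:
    """One clock of the generator: multiply the state polynomial by X mod G(X)."""
    state <<= 1
    if (state >> DEGREE) & 1:
        state ^= GEN_POLY          # reduce modulo G(X)
    return state


def pattern_cycle(seed: int = 1) -> list[int]:
    """All states visited from a nonzero seed before the sequence repeats."""
    patterns, state = [], seed
    while True:
        patterns.append(state)
        state = next_pattern(state)
        if state == seed:
            return patterns


print(pattern_cycle()[:6])   # [1, 2, 4, 8, 3, 6]
```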

15.

A pipelined architecture for the multidimensional DFT

Sungwook Yu, Swartzlander E.E. Jr. 《Signal Processing, IEEE Transactions on》2001,49(9):2096-2102

This paper presents an efficient pipelined architecture for the N^m-point m-dimensional discrete Fourier transform (DFT). By using a two-level index mapping scheme that differs from the conventional decimation-in-time (DIT) and decimation-in-frequency (DIF) algorithms, the conventional pipelined architecture for the one-dimensional (1-D) fast Fourier transform (FFT) can be used efficiently to compute higher-dimensional DFTs. Compared with systolic architectures, the proposed scheme is area-efficient, since the computational elements (CEs) use the minimum number of multipliers and the number of CEs grows only linearly with the dimension m. It can easily be extended to N^m-point m-dimensional DFTs with large m and/or N, and it is more flexible, since the throughput can easily be varied to meet various area/throughput requirements.
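
The basic reason a 1-D transform engine can serve higher-dimensional DFTs is separability: a 2-D DFT equals 1-D DFTs over every row followed by 1-D DFTs over every column, and the same holds axis by axis in m dimensions. This sketch shows that standard row-column baseline, not the paper's two-level index mapping; it uses a direct O(N^2) DFT in place of a pipelined FFT for clarity, and the function names are hypothetical.

```python
import cmath


def dft(v: list[complex]) -> list[complex]:
    """Direct 1-D DFT, X[k] = sum_n v[n] * exp(-2*pi*j*k*n/N)."""
    N = len(v)
    return [sum(v[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def dft2_row_column(a: list[list[complex]]) -> list[list[complex]]:
    """2-D DFT computed only with the 1-D transform: rows first, then columns."""
    rows = [dft(r) for r in a]
    cols = [dft([r[c] for r in rows]) for c in range(len(rows[0]))]
    return [[cols[c][r] for c in range(len(cols))] for r in range(len(rows))]


def dft2_direct(a: list[list[complex]]) -> list[list[complex]]:
    """Reference 2-D DFT straight from the definition."""
    N, M = len(a), len(a[0])
    return [[sum(a[n][m] * cmath.exp(-2j * cmath.pi * (k * n / N + l * m / M))
                 for n in range(N) for m in range(M))
             for l in range(M)]
            for k in range(N)]
```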

16.

Reconfigurable parallel inner product processor architectures

Rong Lin 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2001,9(2):261-272

This paper presents a novel approach for low-power high-performance inner product processor design. The processor is dynamically reconfigurable for computing inner products of input arrays with four or more combinations of array dimensions and precision. The processor mainly consists of an array of 8×8 or 4×4 small multipliers plus two or three arrays of adders. It requires very simple reconfigurable components. The whole network may be reconfigured by using a few control bits for the desired computations, and the reconfiguration can be done dynamically. The design is regular, modular, and can easily be pipelined, and most parts of the network are symmetric and repeatable. A set of low-power high-performance parallel counters is also proposed for the implementation of the design, which could lead to a significant reduction in worst-case power dissipation compared with traditional binary-logic-based architectures, while showing superiority in speed, VLSI area, and layout simplicity.
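
Why an array of small multipliers can be reconfigured for different precisions follows from splitting operands into halves: four 4×4 products combine into one 8×8 product, or the same four blocks can serve a short low-precision inner product. This is a plausible illustration only, assuming unsigned operands; the processor's actual network and parallel counters are not modeled, and the function names are hypothetical.

```python
def mul4x4(a: int, b: int) -> int:
    """Stand-in for one small 4x4-bit unsigned multiplier block."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mul8x8(a: int, b: int) -> int:
    """One 8x8 unsigned product from four 4x4 products:
    a*b = aH*bH*2^8 + (aH*bL + aL*bH)*2^4 + aL*bL."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return ((mul4x4(a_hi, b_hi) << 8)
            + ((mul4x4(a_hi, b_lo) + mul4x4(a_lo, b_hi)) << 4)
            + mul4x4(a_lo, b_lo))


def inner_product_4bit(u: list[int], v: list[int]) -> int:
    """The same four 4x4 blocks reconfigured for a 4-element 4-bit inner product."""
    assert len(u) == len(v) == 4
    return sum(mul4x4(p, q) for p, q in zip(u, v))


print(inner_product_4bit([1, 15, 7, 0], [9, 2, 3, 14]))  # 60
```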

17.

Ultra Low-Power Clocking Scheme Using Energy Recovery and Clock Gating

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(1):33-44

A significant fraction of the total power in highly synchronous systems is dissipated in clock networks. Hence, low-power clocking schemes are promising approaches for low-power design. We propose four novel energy recovery clocked flip-flops that enable energy recovery from the clock network, resulting in significant energy savings. The proposed flip-flops operate with a single-phase sinusoidal clock, which can be generated with high efficiency. In TSMC 0.25-μm CMOS technology, we implemented 1024 of the proposed energy recovery clocked flip-flops on an H-tree clock network driven by a resonant clock generator that produces a sinusoidal clock. Simulation results show a power reduction of 90% on the clock tree and total power savings of up to 83% compared with the same implementation using the conventional square-wave clocking scheme and flip-flops. Using a sinusoidal clock signal for energy recovery prevents the application of existing clock gating solutions, so this paper also proposes clock gating solutions for energy recovery clocking. Applying our clock gating to the energy recovery clocked flip-flops reduces their power by more than 1000× in idle mode, with negligible power and delay overhead in active mode. Finally, a test chip containing two pipelined multipliers, one designed with conventional square-wave clocked flip-flops and the other with the proposed energy recovery clocked flip-flops, was fabricated and measured. Based on the measurement results, the energy recovery clocking scheme and flip-flops show a power reduction of 71% on the clock tree and 39% on the flip-flops, resulting in overall power savings of 25% for the multiplier chip.

18.

Design of a Dedicated Reconfigurable Multiplier in FPGAs

余洪敏, 陈陵都, 刘忠立 《Chinese Journal of Semiconductors (半导体学报)》2008,29(11)

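A new reconfigurable pipelined multiplier embedded in an FPGA is proposed. The design uses an improved Booth encoding algorithm and can perform 18×18 signed or 17×17 unsigned multiplication. A new circuit optimization method is also proposed to reduce the number of partial products, together with a new multiplier layout floorplan that accommodates the constraints imposed by tile-based FPGA chip design. The multiplier can be configured in synchronous or asynchronous mode, or in a pipelined mode for high-frequency operation. The design is easily extended to other input and output bit widths. A new carry-lookahead adder circuit is also proposed to produce the final result. The entire multiplier is implemented in transmission-gate logic. Implemented in SMIC 0.13 μm CMOS technology, the multiplier completes an 18×18 multiplication in 4.1 ns. With the two-stage pipeline fully enabled, the clock period can reach 2.5 ns. This is 29.1% faster than a commercial multiplier and 17.5% faster than other multipliers. Compared with a conventional lookup-table-based multiplier, its area is 1/32 that of the conventional design.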
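
The abstract does not describe the new carry-lookahead circuit, so this sketch models only the textbook CLA equations that such adders start from: generate and propagate signals, and carries written in flattened form so that no carry waits on the previous one. It assumes unsigned operands of a small hypothetical width; the function name is hypothetical.

```python
def cla_add(a: int, b: int, width: int = 4, cin: int = 0) -> tuple[int, int]:
    """Carry-lookahead addition of two width-bit unsigned values.
    Generate g_i = a_i AND b_i and propagate p_i = a_i XOR b_i; every carry is
    the flattened (two-level) form
        c_{i+1} = g_i | p_i g_{i-1} | ... | p_i ... p_1 g_0 | p_i ... p_0 c_0
    so it depends only on g, p and cin, not on the previous carry."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    carries = [cin]
    for i in range(width):
        c = 0
        for t in range(-1, i + 1):
            start = cin if t < 0 else g[t]                    # carry origin
            chain = all(p[s] for s in range(t + 1, i + 1))    # propagated to i+1
            c |= start & int(chain)
        carries.append(c)
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, carries[width]


print(cla_add(0b1011, 0b0110))   # (1, 1): 11 + 6 = 17 = 1 + 16
```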

19.

Design and Implementation of Fast Multiplication and Division for a RISC Microprocessor

王江, 黄秀荪, 陈刚, 杨旭光, 仇玉林 《Chinese Journal of Electron Devices (电子器件)》2007,30(1):162-166

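The fixed-point mantissa multiplier/divider is the core component of the corresponding 32-bit floating-point unit. Targeting industrial-control applications, this paper completes its design with a semi-custom methodology and implements it in TSMC 0.18 μm technology. The multiplier uses radix-4 Booth encoding and reduces the number of generated partial products by handling the sign bit and the hidden bit; 4:2 compressors are introduced into the Wallace-tree summation to speed it up. The divider uses an improved SRT algorithm with quotient-digit guessing, parallel computation of partial remainders, and a quotient-digit correction selection circuit. Both the multiplier and the divider use carry-save adders to increase speed. Back-end physical implementation shows that the multiplier and divider reach 227 MHz and 305 MHz, respectively. The overall design is simple, fast, and accurate.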
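
A 4:2 compressor takes four bits of one column plus a lateral carry-in and produces a sum, a carry, and a lateral carry-out. A common construction chains two full adders, sketched here; the paper's specific cell is not described, so this construction is an assumption. A row-level helper shows four partial-product rows reduced to two with the same total; all names are hypothetical.

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) such that x + y + z == sum + 2*carry."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compressor_4_2(x1: int, x2: int, x3: int, x4: int,
                   cin: int) -> tuple[int, int, int]:
    """4:2 compressor from two chained full adders:
    x1 + x2 + x3 + x4 + cin == s + 2*(carry + cout).
    cout depends only on x1..x3, so the lateral cin -> cout chain between
    neighbouring columns never ripples."""
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


def compress_rows(rows: list[int], width: int) -> tuple[int, int]:
    """Reduce four unsigned width-bit rows to a sum row and a carry row."""
    assert len(rows) == 4
    s_row = c_row = 0
    cin = 0
    for i in range(width):
        s, carry, cout = compressor_4_2(*[(r >> i) & 1 for r in rows], cin)
        s_row |= s << i
        c_row += carry << (i + 1)
        cin = cout                     # lateral carry into column i + 1
    c_row += cin << width              # carry out of the top column, if any
    return s_row, c_row


s, c = compress_rows([200, 255, 17, 99], 8)
print(s + c)   # 571
```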

20.

A sub-10-ns 16×16 multiplier using 0.6-μm CMOS technology

《Solid-State Circuits, IEEE Journal of》1987,22(5):762-767

A 16×16-bit parallel multiplier fabricated in a 0.6-μm CMOS technology is described. The chip uses a modified array scheme incorporating Booth's algorithm to reduce the number of partial-product adding stages. The combination of scaled 0.6-μm CMOS technology and an advanced arithmetic architecture achieves a multiplication time of 7.4 ns while dissipating only 400 mW. This multiplication time is shorter than those of previously reported high-speed MOS multipliers and comparable to those of advanced bipolar and GaAs multipliers.