1.

On high-performance parallel decimal fixed-point multiplier designs

《Computers & Electrical Engineering》2014,40(7):2126-2138

High-performance, area- and power-efficient hardware implementation of decimal multiplication is preferred to slow software simulation in various key scientific and financial applications, where errors caused by converting decimal numbers into their approximate binary representations are unacceptable. This paper presents a parallel architecture for fixed-point 8421-BCD-based decimal multiplication. In essence, it applies a hybrid 8421–5421 recoding scheme to generate partial products, and accumulates them with 8421 carry-lookahead adders organized as a tree structure. In addition, we propose a 4221-BCD-based decimal multiplier that is built upon a novel 4221-BCD full adder; operands of this 4221 multiplier are directly represented in 4221 BCD. The proposed 16 × 16 decimal multipliers are compared with other best-known decimal multiplier designs in a TSMC 90-nm technology, and the evaluation results show that the proposed 8421–5421 multiplier achieves the lowest delay and area, as well as the highest power efficiency, among all the existing hardware-based BCD multipliers.
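
One well-known reason for mixing the 8421 and 5421 codes is that shifting a 5421-coded operand left by one bit yields twice its value in 8421 BCD, so the 2X multiple needs no adder. The sketch below illustrates only that property; the helper names are hypothetical, and whether the paper's recoding scheme uses it in exactly this way is an assumption.

```python
def to_5421(digit):
    """Encode one decimal digit with bit weights 5, 4, 2, 1."""
    # Digits 0-4 map to 0000-0100; digits 5-9 set the weight-5 bit.
    return digit if digit < 5 else 0b1000 | (digit - 5)


def encode_5421(value, digits):
    """Pack `digits` decimal digits of `value` into 4-bit 5421 nibbles."""
    word = 0
    for i in range(digits):
        word |= to_5421(value % 10) << (4 * i)
        value //= 10
    return word


def decode_8421(word, digits):
    """Read `digits` nibbles as ordinary 8421 BCD digits."""
    value = 0
    for i in range(digits):
        nibble = (word >> (4 * i)) & 0xF
        assert nibble <= 9, "not a valid 8421 BCD digit"
        value += nibble * 10 ** i
    return value


def double_via_shift(value, digits):
    """Compute 2 * value as a 1-bit left shift of its 5421 encoding."""
    # Weights 5,4,2,1 double to 10,8,4,2: the weight-5 bit lands in the
    # next digit's units position, the rest land on 8421 positions.
    shifted = encode_5421(value, digits) << 1
    return decode_8421(shifted, digits + 1)
```

For example, `double_via_shift(95, 2)` reads the shifted word back as the 8421 BCD number 190.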

2.

Hybrid weighted bit flipping low density parity check decoding

《Digital Signal Processing》2014

Low density parity check (LDPC) codes exhibit near-capacity performance in terms of error correction. Large hardware costs, limited flexibility in terms of code length/code rate and considerable power consumption limit the use of belief-propagation-based LDPC decoders in area- and energy-sensitive mobile environments. Serial bit flipping algorithms offer a trade-off between resource utilization and error correction performance at the expense of an increased number of decoding iterations required for convergence. Parallel weighted bit flipping decoding and its variants aim at reducing the decoding iterations and time by flipping the potentially erroneous bits in parallel. However, in most of the existing parallel decoding methods, the flipping threshold requires complex computations. In this paper, Hybrid Weighted Bit Flipping (HWBF) decoding is proposed to allow multiple bit flips in each decoding iteration. To compute the number of bits that can be flipped in parallel, a criterion for determining the relationship between the erroneous bits in the received code word is proposed. Using this relation, the proposed scheme can detect and correct a maximum of 3 erroneous hard-decision bits in an iteration. The simulation results show that, compared to existing serial bit flipping decoding methods, the proposed HWBF decoding reduces the number of iterations required for convergence by 45% and the decoding time by 40%. Compared to existing parallel bit flipping decoding methods, it achieves a similar bit error rate (BER) with the same number of iterations and lower computational complexity. Owing to the reduced number of decoding iterations, lower computational complexity and reduced decoding time, the proposed HWBF decoding can be useful in energy-sensitive mobile platforms.
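
For orientation, the baseline that such schemes improve on can be sketched as a plain hard-decision serial bit-flipping decoder. This is not the HWBF algorithm from the abstract: it uses no reliability weights and flips one bit per iteration. The parity-check matrix `H` (a 3 × 3 product code with row and column parity checks) and the sample codeword are assumptions chosen to keep the example small.

```python
def syndrome(H, bits):
    """Return one parity value per check row (0 means satisfied)."""
    return [sum(h & b for h, b in zip(row, bits)) % 2 for row in H]


def bit_flip_decode(H, received, max_iters=10):
    """Serial hard-decision bit flipping: flip the most suspicious bit."""
    bits = list(received)
    for _ in range(max_iters):
        s = syndrome(H, bits)
        if not any(s):
            return bits, True
        # For each bit, count the unsatisfied checks it takes part in.
        votes = [sum(sv for sv, row in zip(s, H) if row[j])
                 for j in range(len(bits))]
        bits[votes.index(max(votes))] ^= 1
    return bits, not any(syndrome(H, bits))


# Assumed toy code: 9 bits on a 3x3 grid, 3 row checks and 3 column checks.
H = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
]
CODEWORD = [1, 1, 0, 1, 0, 1, 0, 1, 1]
```

With this code a single flipped bit fails exactly two checks while every other bit shares at most one of them, so the decoder repairs any single error in one iteration.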

3.

Improvement and Implementation of an FPGA-Based Carry-Save Large-Number Multiplier

张晓楠高献伟董秀则《计算机工程与应用》2017,53(21):58-61

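This paper proposes an improved algorithm for an FPGA-based carry-save large-number multiplier. The algorithm uses a hybrid serial-parallel structure that completes several iterations within one clock cycle, which reduces the number of clock cycles per operation and thus effectively raises the multiplier's speed. The hardware design is implemented on an Altera Stratix II EP2S90F1508C3, and performance is analyzed for 192-bit, 256-bit and 384-bit multipliers: the 192-bit multiplier takes 0.18 μs, the 256-bit multiplier 0.27 μs, and the 384-bit multiplier 0.59 μs, in each case a speedup of roughly 3.5 times.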
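
The carry-save idea itself can be sketched in a few lines: partial products are accumulated as a separate sum word and carry word, so no carry has to propagate across the full width until one final addition. This is a behavioral model under simple assumptions (unsigned operands, one partial product per iteration); it does not reproduce the paper's serial-parallel structure or its multi-iteration clocking.

```python
def carry_save_multiply(a, b, width):
    """Multiply unsigned a by a `width`-bit b using 3:2 carry-save steps."""
    sum_word, carry_word = 0, 0
    for i in range(width):
        partial = (a << i) if (b >> i) & 1 else 0
        # 3:2 compressor: x + y + z == (x ^ y ^ z) + 2 * majority(x, y, z)
        sum_word, carry_word = (
            sum_word ^ carry_word ^ partial,
            ((sum_word & carry_word) | (sum_word & partial)
             | (carry_word & partial)) << 1,
        )
    # The only carry-propagating addition happens once, at the end.
    return sum_word + carry_word
```

In hardware each loop step is a row of independent full adders, which is why several steps can plausibly be chained inside one clock period.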

4.

Design of a Radix-4 Scalable Modular Multiplier

麻永新曾晓洋顾叶华孙承绶《计算机工程与应用》2006,42(12):110-113

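Based on the radix-4 Montgomery modular multiplication algorithm and an improved pipeline organization, this paper proposes a structurally optimized, scalable modular multiplier. The design uses a word-wise modular multiplication algorithm, which makes it highly scalable: it can perform modular multiplication on operands of any bit length. In addition, because the datapath is a pipeline of multiple processing-element stages, the design can be configured easily to trade off hardware cost against performance. Analysis shows that the proposed modular multiplier is highly efficient and scalable.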
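
As a reference point, digit-serial Montgomery multiplication in radix 4 can be modeled as follows. This is a textbook formulation, not the paper's word-wise pipelined datapath; the function name and parameters are illustrative, and the modular inverse via `pow(n, -1, r)` assumes Python 3.8 or later.

```python
def montgomery_multiply(a, b, n, num_digits, radix_bits=2):
    """Return a * b * R^-1 mod n, with R = (2**radix_bits) ** num_digits.

    Assumes n is odd, 0 <= a, b < n, and R > n.
    """
    r = 1 << radix_bits
    mask = r - 1
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime == -1 (mod r)
    acc = 0
    for i in range(num_digits):
        digit = (a >> (radix_bits * i)) & mask  # next radix-4 digit of a
        acc += digit * b
        q = ((acc & mask) * n_prime) & mask     # makes acc + q*n divisible by r
        acc = (acc + q * n) >> radix_bits       # exact division by the radix
    # acc stays below b + n < 2n, so one conditional subtraction suffices.
    return acc - n if acc >= n else acc
```

Because every step only inspects the low digit of the accumulator, the same loop works for operands of any length by raising `num_digits`, which is the property a scalable design exploits.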

5.

Partition Algorithm for Parallel Processing of Array Multiplication in GF(2^m) Fields

《国际计算机数学杂志》2012,89(7):805-809

The multiplication operations in GF(2^m) fields are widely used in cryptosystems. However, the multiplication operations for public-key cryptosystems require very large operands of 512 bits or more, so existing multipliers cannot be used directly for such multiplications. In this paper, we present a partition algorithm that divides large operands into small operands of, for example, 32 or 64 bits, so that existing multipliers can be employed. We also present a parallel version of the partition algorithm that exploits an important natural property of the multiplication operations in GF(2^m) fields.
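
The partition idea can be illustrated for the polynomial (carry-less) product that precedes modular reduction. Since addition in GF(2) is XOR and produces no carries, the chunk-by-chunk partial products are independent and can be combined in any order; that this is the "natural property" the abstract refers to is an assumption, and the function names are hypothetical.

```python
def clmul(a, b):
    """Carry-less (GF(2) polynomial) product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def split(value, chunk_bits):
    """Split value into little-endian chunks of chunk_bits bits each."""
    chunks = []
    while value:
        chunks.append(value & ((1 << chunk_bits) - 1))
        value >>= chunk_bits
    return chunks


def clmul_partitioned(a, b, chunk_bits=32):
    """Build a large carry-less product from small chunk-sized products."""
    result = 0
    for i, a_chunk in enumerate(split(a, chunk_bits)):
        for j, b_chunk in enumerate(split(b, chunk_bits)):
            # Each small product could run on its own multiplier in parallel.
            result ^= clmul(a_chunk, b_chunk) << (chunk_bits * (i + j))
    return result
```

A full field multiplication would follow this with reduction modulo the field's irreducible polynomial, which the sketch leaves out.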

6.

An Optimized Deep-Learning-Based Low Power Approximate Multiplier Design

M. Usharani B. Sakthivel S. Gayathri Priya T. Nagalakshmi J. Shirisha 《计算机系统科学与工程》2023,44(2):1647-1657

Approximate computing is a popular approach to low power consumption and is used in applications such as image processing, video processing, multimedia and data mining. It is mostly realized in arithmetic circuits, particularly multipliers. The multiplier is the most essential element in approximate computing, and power consumption depends largely on its performance. Researchers have worked on approximate multipliers for power reduction for a few decades, but designing a low-power approximate multiplier is not easy. Designing an approximate multiplier with low power, a minimal error rate and high accuracy remains a major challenge for the digital industry. To overcome these issues, Deep Learning (DL) approaches are applied to digital circuits for higher accuracy. In recent times, DL has been used for higher learning and prediction accuracy in several fields. Therefore, Long Short-Term Memory (LSTM), a popular time-series DL method, is used in this work for approximate computing. To provide an optimal solution, the LSTM is combined with the meta-heuristic Jellyfish search optimisation technique to design an input-aware deep-learning-based approximate multiplier (DLAM). In this work, the jellyfish-optimised LSTM model is used to improve the error metrics of the approximate multiplier. The optimal hyperparameters of the LSTM model are identified by jellyfish search optimisation, and this fine-tuning yields an LSTM with higher accuracy. The proposed pre-trained LSTM model is used to generate approximate design libraries for the different truncation levels as a function of area, delay, power and error metrics. The experimental results on an 8-bit multiplier with an image processing application show that the proposed approximate computing multiplier achieves superior area and power reduction with very good error rates.

7.

A novel power efficient 0.64-GFlops fused 32-bit reversible floating point arithmetic unit architecture for digital signal processing applications

《Microprocessors and Microsystems》2017

Floating point digital signal processing technology has become the primary method for real time signal processing in most digital systems presently. However, the challenges in implementing floating point arithmetic on FPGA are that the hardware modules are larger, have longer latency and consume more power. In this work, a novel efficient reversible floating point fused arithmetic unit architecture is proposed, conforming to the IEEE 754 standard. By utilizing reversible logic circuits and implementation with adiabatic logic, power efficiency is achieved. The hardware complexity is reduced by employing fused elements, and latency is improved by decomposing the operands in the realization of the floating point multiplier and square root. To validate the design, the proposed unit was used to realize an FFT and an FIR filter, which are important applications of a DSP processor. As detection is one of the core baseband processing operations in digital communication receivers and the detection speed determines the data rates that can be achieved, the proposed unit has been used to implement the detection function. Simulation results and comparative studies with existing works demonstrate that the proposed unit uses gates efficiently, has reduced quantum cost and produces fewer garbage outputs with low latency, thereby making the design computationally and power efficient.

8.

Power and delay efficient FIR filter design using ESSA and VL-CSKA based Booth multiplier

《Microprocessors and Microsystems》2021

FIR filters play a major role in digital image processing applications. The power and delay performance of any FIR filter depends on the switching activities between the filter coefficients (FCs) and on the basic arithmetic operations (i.e., multiplication and addition) performed in the convolution equations. In this paper, a new FIR filter is designed using the Enhanced Squirrel Search Algorithm (ESSA) and a Variable Latency Carry Skip Adder (VL-CSKA) based Booth multiplier. The proposed ESSA algorithm selects an optimal FC by minimizing the switching activities of the FC based on the ripple contents, power and transition width parameter, to meet the required specifications of the FIR filter in the frequency domain. Also, the VL-CSKA based Booth multiplier is proposed to reduce the delay of the FIR filter with parallel addition of partial products (PPs). In this design, the VL-CSKA adders utilize variable size and compound gate-based skip logic to reduce the delay with low power. The proposed FIR filter is written in Verilog and simulated on the Xilinx platform. The simulation result shows that the proposed FIR filter outperforms the state-of-the-art FIR filters, consuming only 0.142 mW of power with a delay of 28.175 ns.

9.

Optimized Design and Implementation of a Scalable Dual-Field Montgomery Multiplier

秦帆戴紫彬《电子技术应用》2009,35(6)

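Modular multiplication is the key operation in public-key cryptographic algorithms. Based on the full-word Montgomery modular multiplication algorithm, this paper designs a modular multiplier with a scalable hardware architecture. The multiplier operates on data of arbitrary length using a fixed datapath width and supports arithmetic over two finite fields. The hardware architecture is coded in Verilog HDL and synthesized with Synopsys Design Compiler using the Artisan SIMC 0.18 μm typical process library. Experimental results show that, compared with other modular multiplier designs, this design achieves a higher clock frequency and, because it greatly reduces the number of clock cycles required, faster modular multiplication.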

10.

Design and analysis of high-speed 8-bit ALU using 18 nm FinFET technology

Shylashree N. Venkatesh B. Saurab T. M. Srinivasan Tarun Nath Vijay 《Microsystem Technologies》2019,25(6):2349-2359

All modern computational devices contain an ALU. With the increasing complexity of software and its consistent shift towards parallelism, high-speed processors with hardware support for time-consuming operations such as multiplication are beneficial. Smaller, compact devices such as IoT devices need to run software such as security software and to offload computation cost from the cloud. In this paper, a high-speed 8-bit ALU using 18 nm FinFET technology is proposed. The arithmetic and logic unit consists of fast compute units such as a Kogge-Stone fast adder and a Dadda multiplier, along with basic logic gates. Each compute unit is optimized for speed while consuming area responsibly. The Dadda multiplier uses an 8 × 8 architecture, as opposed to the conventional 4 × 4 approach, making it a true 8-bit ALU. Simulation and analysis are done using Cadence Virtuoso in the Analog Design Environment. The transistor count of the proposed design is 5298, the power consumption is 219 µW and the maximum delay is 166.8 ps. The design is also expected to take at most one clock cycle for any computation.

11.

A Parallelizable Fast Multiplier Architecture over GF(2^m)

马自堂段斌刘云飞《计算机工程与应用》2009,45(35):59-61

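Building on a reconfigurable most-significant-bit-first serial multiplier, this paper proposes a controllable fast multiplier architecture over GF(2^m). The multiplier adds one control signal and seven 2-to-1 multiplexers and, when the field width is less than half the maximum field width, can compute two multiplications in parallel using the existing hardware resources. The architecture has low circuit complexity, computes in parallel within the existing storage, and can be extended to hybrid serial-parallel structures. It is suited to VLSI design of reconfigurable cryptosystems with small storage and low hardware complexity.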
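
The most-significant-bit-first serial scheme that this architecture starts from can be modeled one bit per loop iteration, as below. This sketch covers only the basic serial multiplier; the control signal, the multiplexers and the two-products-in-parallel mode are not modeled. The test values use the GF(2^8) polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) familiar from AES, which is an assumption of the example rather than something taken from the paper.

```python
def gf_mul_msb_first(a, b, m, poly):
    """Multiply a and b in GF(2^m), scanning b from its top bit down.

    `poly` is the irreducible polynomial including its x^m term.
    Assumes a and b are already reduced (less than 2**m).
    """
    acc = 0
    for i in range(m - 1, -1, -1):
        acc <<= 1                # multiply the running result by x
        if (acc >> m) & 1:
            acc ^= poly          # reduce as soon as degree m appears
        if (b >> i) & 1:
            acc ^= a             # add a (XOR) when the current bit of b is 1
    return acc
```

In hardware each iteration corresponds to one clock cycle of a shift register with feedback taps at the polynomial's nonzero coefficients.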

12.

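Design Method and Implementation of Fault-Tolerant Digital Signal Processing Systems (total citations: 1; self-citations: 1; citations by others: 0)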

马建峰王新梅《计算机学报》1997,20(1):82-86

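Highly reliable signal processing is increasingly important for high-speed real-time applications of parallel processors. Under the state-space model, linear digital signal processing algorithms are known to be expressible as matrix-vector products. Under this model, we assume that the same kind of operation has the same error probability throughout the main computation, that is, the computation process is […], and we use matrix partitioning to propose a fault-tolerant design scheme with error-correction capability.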

13.

FT-SIMD: Design of a High-Performance Multiplier

李国强陈书明万江华杨惠《计算机工程与科学》2012,34(1):53-57

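To improve multimedia data processing, high-performance DSPs have widely adopted SIMD techniques, and the multiplier, an important DSP component, must support them as well. This paper studies the implementation of SIMD multipliers in depth and proposes a new SIMD multiplier architecture. Using two 16×8 multipliers, with sign extension and concatenation applied to their operands and results, it implements a 16-bit FT-SIMD multiplier simply and efficiently. The architecture can also be extended to 32-bit and 64-bit SIMD multipliers.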
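
A rough behavioral model shows how two 16×8 multipliers can serve either as one 16×16 multiplier or as two independent narrower ones. The sketch handles unsigned operands only, so the sign-extension logic described in the abstract is deliberately left out, and all function names are hypothetical.

```python
def mul16x8(a16, b8):
    """Model of one unsigned 16x8 hardware multiplier (24-bit result)."""
    assert 0 <= a16 < (1 << 16) and 0 <= b8 < (1 << 8)
    return a16 * b8


def mul16x16(a, b):
    """One 16x16 product: split b into bytes, shift and add the halves."""
    low = mul16x8(a, b & 0xFF)
    high = mul16x8(a, b >> 8)
    return low + (high << 8)


def simd_mul_pair(a0, b0, a1, b1):
    """SIMD mode: the same two multipliers yield two independent products."""
    return mul16x8(a0, b0), mul16x8(a1, b1)
```

The design choice is that the expensive multiplier arrays are shared between modes, and only the operand routing and the final shift-and-add differ.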

14.

Integrated Design of a Time-Grating Displacement Sensor Signal Processing System Based on the STM32F4

杨继森许强冯济琴《传感器与微系统》2013,(12):113-116

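This paper designs a signal processing system for a time-grating displacement sensor based on a single STM32F4 chip, integrating the drive power supply, signal sampling, data processing and error compensation into one chip. The excitation source is designed with direct digital synthesis (DDS); the measurement signal is acquired by interpolating high-frequency clock pulses in input-capture mode; data computation is performed by the chip's integrated single-cycle DSP instruction unit; and errors are corrected by Fourier-series harmonic correction. Experiments show that with this system the peak-to-peak error of a 72-pole-pair time grating is 3.29″. The system preserves accuracy while making the time-grating signal processing system integrated and compact, and it lowers production cost.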

15.

Efficient FPGA-Based Implementation of a Montgomery Modular Multiplier

高献伟张晓楠董秀则《计算机应用研究》2017,34(11)

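To raise the modular multiplication speed of elliptic curve cryptographic processors, this paper proposes a Montgomery algorithm that is more efficient and better suited to hardware implementation. After analyzing Montgomery modular multiplication based on carry-save adders (CSA), it proposes a multi-step CSA Montgomery algorithm that performs several CSA iterations within one clock cycle, which effectively reduces the number of clock cycles and thus speeds up modular multiplication. Simulation in Modelsim shows that a 256-bit Montgomery modular multiplication completes correctly in only 16 clock cycles. Results on an Altera EP3SL200F1517C2 FPGA show that, at a clock frequency of 71.5 MHz, a 256-bit modular multiplication takes only 0.22 μs.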

16.

Design of a Matrix Multiplication Accelerator Supporting an Optimized Blocking Strategy

沈俊忠肖涛乔寓然杨乾明文梅《计算机工程与科学》2016,38(9):1748-1754

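In many application domains, large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels. Emerging applications frequently involve large matrices in which at least one dimension is very small; we call such matrices non-uniform. Because the on-chip memory available on an FPGA for intermediate results is very limited, large matrix multiplications usually have to be partitioned into fine-grained sub-block tasks. When accelerating non-uniform matrix multiplication, most existing hardware matrix multipliers with a linear array structure suffer a large performance drop because they support only a fixed block size. To solve this problem, an effective optimized blocking strategy is proposed. On this basis, a matrix multiplier supporting variable block sizes is implemented on a Xilinx Zynq XC7Z045 FPGA. By integrating 224 processing elements, the multiplier reaches a measured 48 GFLOPS at a 150 MHz clock frequency on non-uniform matrix multiplications from real applications, while requiring only 4.8 GB/s of bandwidth. Experimental results show that the proposed blocking strategy improves performance by up to 12% over the conventional blocking algorithm.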
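
Blocked multiplication with caller-chosen block sizes can be sketched in software as follows. The block dimensions `bm`, `bn` and `bk` are illustrative parameters; how the accelerator actually selects them for a given matrix shape is the paper's contribution and is not reproduced here.

```python
def blocked_matmul(A, B, bm, bn, bk):
    """Compute C = A x B one (bm x bk) by (bk x bn) sub-block pair at a time."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i0 in range(0, m, bm):
        for j0 in range(0, n, bn):
            for k0 in range(0, k, bk):
                # Only this sub-block's data must sit in fast on-chip memory.
                for i in range(i0, min(i0 + bm, m)):
                    for j in range(j0, min(j0 + bn, n)):
                        acc = C[i][j]
                        for kk in range(k0, min(k0 + bk, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] = acc
    return C
```

For a non-uniform problem such as a 4 × 1000 matrix times a 1000 × 1000 matrix, a fixed square block wastes most of its rows on padding, whereas a block with small `bm` and large `bn` or `bk` keeps the buffer fully used.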

17.

Design of a Cloud-Computing-Based Encoder Signal Error Compensation System

张金波胡俊军李雨倩《自动化与仪器仪表》2020,(1):118-121

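Conventional encoder signal error compensation systems suffer from low compensation accuracy, so a cloud-computing-based encoder signal error compensation system is proposed. The hardware design comprises an encoder analog control unit, a power supply unit, a signal acquisition unit and a communication unit; the software design comprises a communication module, a signal processing module and a signal error compensation module. Together, the hardware and software designs make the system operational. Experiments show that the compensation accuracy of the designed system is 30% higher than that of the conventional system, which demonstrates its effectiveness.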

18.

Integer squarers with overflow detection (total citations: 1; self-citations: 0; citations by others: 1)

Mustafa Gök 《Computers & Electrical Engineering》2008,34(5):378-391

Squaring is commonly used in digital signal processing applications. Significant performance increase can be achieved by supporting squaring in hardware. This paper presents overflow detection methods applicable to integer squarers with unsigned and two’s complement operands. These methods are unified for a combined squarer design. Presented methods can be applied to any squarer independent of size and architecture. The proposed squarer designs have approximately 50% less area and delay compared to the conventional squarer designs with overflow detection.
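
What overflow means for an n-bit squarer can be pinned down with a behavioral reference model. The sketch below computes the full square and then checks whether it fits; it specifies the expected flag only and is not the paper's detection logic, which aims to produce the flag without the full-width result. Function names are hypothetical.

```python
def square_unsigned(x, n):
    """Square an n-bit unsigned value; return (n-bit result, overflow flag)."""
    assert 0 <= x < (1 << n)
    full = x * x
    return full & ((1 << n) - 1), (full >> n) != 0


def square_signed(x, n):
    """Square an n-bit two's complement value; return (n-bit result, flag)."""
    assert -(1 << (n - 1)) <= x < (1 << (n - 1))
    full = x * x  # a square is never negative
    # Overflow when the square exceeds the largest positive n-bit value.
    return full & ((1 << n) - 1), full > (1 << (n - 1)) - 1
```

With n = 8, for instance, 15 squares to 225 without unsigned overflow, while 16 squares to 256 and sets the flag.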

19.

High-Precision Sine and Cosine Evaluation Based on a Lookup Table and SF-CORDIC (total citations: 1; self-citations: 0; citations by others: 1)

牟胜梅李兆刚《计算机与数字工程》2014,(3):359-363

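Sine and cosine evaluation on FPGAs is commonly implemented with the lookup-table method or the CORDIC algorithm. The lookup-table method is simple and has low output latency, but its storage requirement grows exponentially with the required precision; the conventional CORDIC method consumes substantial hardware resources and has long output latency. This paper proposes a new method that combines a lookup table with the SF-CORDIC algorithm: the intermediate vector obtained from the table serves as the initial vector of the iteration, and SF-CORDIC is applied to the residual rotation angle with iteration coefficients of 0 or 1, which reduces the computation cost and rounding error of the x and y paths. The z path uses an alternating add-subtract method to generate the residual rotation angle in advance, reducing the delay of each pipeline stage. The number of lookup-table address bits and the number of iterations are each about half of those needed by the conventional lookup-table method and the CORDIC algorithm, respectively. The algorithm was designed, simulated and analyzed for error on an FPGA, and the results show that the method achieves high-precision, low-latency sine and cosine evaluation with modest hardware and storage resources.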
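
The general shape of a table-plus-CORDIC evaluator can be sketched as below: a coarse table supplies the starting vector, and micro-rotations remove the small residual angle. Two substitutions should be noted plainly. The sketch uses conventional CORDIC with a gain correction rather than the scale-free SF-CORDIC variant, and it uses floating point where hardware would use fixed point. The table size, starting shift and iteration count are assumptions.

```python
import math

LUT_BITS = 4                                   # 16 coarse angles over [0, pi/2]
STEP = (math.pi / 2) / (1 << LUT_BITS)
LUT = [(math.cos(i * STEP), math.sin(i * STEP)) for i in range(1 << LUT_BITS)]

START = 3      # atan(2**-3) exceeds STEP, so every residual angle converges
ITERS = 20
ANGLES = [math.atan(2.0 ** -i) for i in range(START, START + ITERS)]
GAIN = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(START, START + ITERS))


def sincos(theta):
    """Return (sin, cos) of theta for theta in [0, pi/2]."""
    index = min(int(theta / STEP), len(LUT) - 1)
    x, y = LUT[index]                  # table lookup gives the initial vector
    z = theta - index * STEP           # residual angle still to be rotated
    for k, alpha in enumerate(ANGLES):
        shift = 2.0 ** -(START + k)    # a right shift in fixed-point hardware
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * alpha
    return y / GAIN, x / GAIN          # undo the accumulated CORDIC gain
```

Because the table absorbs the large part of the angle, the iterations can begin at a small micro-rotation, which is how a combined scheme trades a modest table for fewer rotation stages.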

20.

Design of a Reconfigurable Multiplier Unit for a VLIW Processor

杨焱张凯《微处理机》2007,28(3):21-23

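In designing a VLIW multimedia chip, a new multiplier model with a forked Wallace tree structure is proposed to address the shortcomings of conventional multipliers and adders. Following a reusable, modular design approach, the multiplier is extended by reusing an array of one-bit full adders, so that the processor can support multiple 32/16/8-bit multiplications simultaneously within one multiplier unit while optimizing both the speed and the area of the multiplication unit. Simulation tests show that the new multiplier structure effectively reduces the execution cycles of algorithms common in signal processing (such as FFT and filtering) and multimedia processing, raises actual running speed, and further strengthens the VLIW processor's capability in multimedia and signal processing computation.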