1.

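Shao Lei, Li Kun, Zhang Shudan, Yu Zongguang, Xu Rui 《微计算机信息》 (Microcomputer Information) 2007, 23(9)

This paper presents a 32-bit floating-point multiplier for high-performance DSPs. A tree of 4-2 compressors fed by modified Booth encoding raises speed and lowers power consumption. The structure is regular and well suited to VLSI implementation, and it completes one 24-bit integer multiplication or one 32-bit floating-point multiplication in a single cycle. The design is described at the structural level in Verilog HDL and synthesized with a 0.25 μm cell library; one multiplication takes 24.30 ns.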

2.

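SIMD compilation optimization with single/double-word mode selection on a clustered VLIW DSP

Huang Shengbing, Zheng Qilong, Guo Lianwei 《计算机应用》 (Journal of Computer Applications) 2015, 35(8): 2371-2374

BWDSP100 is a 32-bit static scalar digital signal processor designed for high-performance computing that combines very long instruction word (VLIW) and single instruction, multiple data (SIMD) architectures. Its instruction-level parallelism (ILP) comes mainly from its clustered architecture and its SIMD instructions, but existing compiler frameworks cannot support these special SIMD instructions. BWDSP100 offers rich SIMD vectorization resources, and the radar signal processing domain it targets places very high demands on program performance. Based on the characteristics of the BWDSP100 architecture, the authors therefore propose and implement, on top of the SIMD optimization framework in the traditional Open64 compiler, a SIMD compilation optimization algorithm that supports single/double-word mode selection. The algorithm can significantly improve the performance of compute-intensive programs that are widely used on DSPs. Experiments show that its implementation in the BWDSP compiler achieves an average speedup of 5.66 over the unoptimized version.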

3.

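Design and implementation of the X-DSP floating-point multiplier

Peng Yuanxi, Yang Hongjie, Xie Gang 《计算机应用》 (Journal of Computer Applications) 2010, 30(11): 3121-3125

To meet the performance, power, and area requirements of a high-performance X-DSP floating-point multiplier, the authors analyze the overall architecture of the X-type DSP and the characteristics of its floating-point multiply instructions. Using Booth-2 encoding, a 4:2 compressor tree, and a four-stage pipeline, they design and implement a high-performance, low-power floating-point multiplier. Synthesized with Design Compiler against a third-party 0.13 μm CMOS process library, the multiplier reaches an operating frequency of 500 MHz with an area of 67529.36 μm² and a power consumption of 22.3424 mW.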
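
The Booth encoding step named in this abstract can be illustrated with a short behavioral sketch. It assumes that "Booth 2" means the usual radix-4 recoding, which halves the number of partial products. The function names are hypothetical and are not taken from the paper, and nothing here models the paper's pipeline or its floating-point handling.

```python
def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Returns bits/2 digits in {-2, -1, 0, 1, 2}, least significant first.
    `bits` must be even and `y` must fit in `bits` signed bits.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    y &= (1 << bits) - 1          # view y as a raw bit pattern
    digits = []
    prev = 0                      # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        # digit = y[i-1] + y[i] - 2*y[i+1]
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def booth_multiply(x: int, y: int, bits: int = 16) -> int:
    """Signed multiply as a sum of Booth partial products (behavioral model)."""
    return sum((d * x) << (2 * k)
               for k, d in enumerate(booth_radix4_digits(y, bits)))
```

A 16-bit multiplier yields 8 digits, so only 8 partial products enter the compressor tree instead of 16.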

4.

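Design of a 32-bit floating-point multiplier based on an improved 4-2 compressor structure

Shao Lei, Li Kun, Zhang Shudan, Yu Zongguang, Xu Rui 《微计算机信息》 (Microcomputer Information) 2007, 23(3X): 224-225, 199

This paper presents a 32-bit floating-point multiplier for high-performance DSPs. A tree of 4-2 compressors fed by modified Booth encoding raises speed and lowers power consumption. The structure is regular and well suited to VLSI implementation, and it completes one 24-bit integer multiplication or one 32-bit floating-point multiplication in a single cycle. The design is described at the structural level in Verilog HDL and synthesized with a 0.25 μm cell library; one multiplication takes 24.30 ns.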
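
As a rough illustration of what a tree of 4-2 compressors does, here is a word-level sketch. It assumes the textbook 4-2 compressor built from two full-adder layers and unsigned partial products. The paper's improved compressor is not described in the abstract and is not reproduced here, and the function names are hypothetical.

```python
def compress_4_2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Reduce four non-negative operands to two with the same total.

    Each bit column behaves like one 4-2 compressor: the first full adder
    produces a horizontal carry (cout) that enters the next column as cin.
    """
    s1 = a ^ b ^ c
    cout = (a & b) | (b & c) | (a & c)
    cin = cout << 1                       # carry moves one column left
    s = s1 ^ d ^ cin
    carry = (s1 & d) | (d & cin) | (s1 & cin)
    return s, carry << 1


def reduce_to_two(rows: list[int]) -> list[int]:
    """Apply layers of 4-2 compressors until at most two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        rows += [0] * (-len(rows) % 4)    # pad each layer to a multiple of 4
        nxt = []
        for i in range(0, len(rows), 4):
            nxt.extend(compress_4_2(*rows[i:i + 4]))
        rows = nxt
    return rows
```

The two surviving rows still need one carry-propagate addition; the tree only postpones carry propagation to that final adder.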

5.

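A formal description of the SIMD instruction set design space

Li Chunjiang, Xu Ying, Huang Juanjuan, Yang Canqun 《计算机科学》 (Computer Science) 2013, 40(6): 32-36

SIMD (single instruction, multiple data) parallel architectures play a very important role in modern processor architecture, and SIMD instruction sets have become an important subset of processor instruction sets. SIMD structures and instruction sets provide short-vector parallel processing, with support for multiple data types and operation modes. The paper uses a formal method to describe the design space of SIMD instruction sets, characterizes their design along several orthogonal dimensions, and on that basis discusses SIMD instruction set design issues in detail. The formal method is useful for analyzing and designing SIMD instruction set architectures.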

6.

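A unified vectorization framework for SIMD extension units

Liu Peng, Zhao Rongcai, Zhao Bo, Gao Wei 《计算机科学》 (Computer Science) 2014, 41(9): 28-31, 44

With the spread of multimedia applications and the demand for high-performance computing, more and more processors integrate SIMD extensions. To generate efficient vectorized code automatically for different SIMD extension units, the authors design a virtual vector instruction set and build on it a unified vectorization framework for SIMD extension units. The input program is converted, through vector recognition and other phases, into an intermediate representation of virtual vector instructions; vector-length devirtualization and instruction-set devirtualization then turn it into the vector instruction set of a specific SIMD unit. Experiments on the Shenwei 1600 (申威1600), a DSP, and Alpha show that the unified framework automatically produces efficient vectorized code for all three platforms, and that the speedup on the DSP is clearly better than on the other two.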

7.

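Design of a reconfigurable multiplier unit for VLIW processors

Yang Yan, Zhang Kai 《微处理机》 (Microprocessors) 2007, 28(3): 21-23

In the design of VLIW multimedia chips, and in response to the shortcomings of traditional multipliers and adders, the paper proposes a new multiplier model with a forked Wallace tree structure. Following a reusable, modular design approach, the multiplier is extended by reusing an array of one-bit full adders, so that a single multiplier unit can support several 32-, 16-, or 8-bit multiplications at once while both the speed and the area of the unit are optimized. Simulation shows that the new structure effectively reduces the execution cycles of algorithms common in signal processing (such as FFT and filtering) and multimedia processing, raising actual running speed and further strengthening VLIW processors in multimedia and signal-processing computation.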

8.

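A fast multiplier with an improved partial-product compression structure

Dong Shihua, Qiao Lufeng 《计算机工程》 (Computer Engineering) 2010, 36(9): 252-254

To address the low speed and high hardware cost of 16-bit multipliers, the design uses a Wallace tree compression structure, an optimized combination of the second-order (radix-4) Booth algorithm, 4-2 compressors, and carry-save adders, and a sensible strategy of adding, padding, and removing sign bits, to produce an optimized fast 16-bit signed multiplier. The multiplier uses the SMIC 0.18 μm standard digital cell library and is synthesized with Synopsys Design Compiler. At 1.8 V and 25 °C, the maximum path delay is 3.16 ns, the core area is 50 452.75 μm², and the power consumption is 5.17 mW.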

9.

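Design and implementation of a high-performance subword-parallel multiplier

Huang Libo, Yue Hong, Lu Hongyi, Dai Kui 《计算机工程与应用》 (Computer Engineering and Applications) 2007, 43(20): 104-106

The paper proposes a multiplier architecture that supports subword parallelism and completes its VLSI design and implementation. Starting from a 16-bit array subword-parallel structure, the multiplier adds mixed signed/unsigned operations, uses a multi-cycle merging technique to reach 32-bit subword parallelism, and supports multiply-accumulate in subword mode. With pipelining, it completes four 8×8, two 16×16, or one 32×16 signed/unsigned multiplications in a single cycle. An implementation in a 0.18 μm standard cell library shows that the multiplier both reduces area and raises clock frequency, a good trade-off between hardware cost and performance that suits multimedia microprocessor designs.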
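
The subword modes listed in this abstract can be pictured with a small behavioral model. This is only a sketch of what the 8×8 and 16×16 modes compute on packed 32-bit words. It does not model the 32×16 mode, the multi-cycle merging, or the hardware structure, and the function names and lane ordering (least significant lane first) are assumptions.

```python
def _to_signed(value: int, width: int) -> int:
    """Interpret a `width`-bit pattern as a two's-complement number."""
    return value - (1 << width) if value & (1 << (width - 1)) else value


def _lanes(word: int, width: int, signed: bool) -> list[int]:
    """Split a 32-bit word into lanes of `width` bits, lowest lane first."""
    mask = (1 << width) - 1
    vals = [(word >> shift) & mask for shift in range(0, 32, width)]
    return [_to_signed(v, width) for v in vals] if signed else vals


def subword_multiply(a: int, b: int, width: int, signed: bool = True) -> list[int]:
    """Lane-wise products of two packed 32-bit words (width 8 or 16)."""
    if width not in (8, 16):
        raise ValueError("this sketch models only the 8- and 16-bit modes")
    return [x * y for x, y in zip(_lanes(a, width, signed),
                                  _lanes(b, width, signed))]
```

For example, `subword_multiply(0x01020304, 0x05060708, 8, signed=False)` multiplies the four byte pairs independently, which is the work the abstract says the hardware finishes in one cycle.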

10.

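An improved embedded SIMD coprocessor design (total citations: 1; self-citations: 0; citations by others: 1)

Zhou Guochang, Wang Zhong, Che Deliang, Feng Guochen 《计算机工程与应用》 (Computer Engineering and Applications) 2004, 40(31): 13-16

The SIMD coprocessor described in the paper is a 16-bit fixed-point embedded array processor for low-level image understanding. It uses a load/store architecture and, beyond the data parallelism inherent in SIMD, offers a three-stage pipeline and the concurrent execution of three groups of instructions. Concurrent execution lets data-exchange operations run alongside other types of operations, so data exchange is effectively hidden and the overhead of communication and I/O operations is greatly reduced.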

11.

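Design and implementation of a low-cost 128-bit high-precision floating-point SIMD multiply-add unit

Huang Libo, Wang Zhiying, Shen Li, Ma Sheng 《计算机工程与科学》 (Computer Engineering & Science) 2012, 34(9): 71-76

Integrating SIMD units has become one of the important ways to improve processor performance. Low-cost hardware-reuse design techniques for fixed-point SIMD units are fairly mature, but most floating-point SIMD hardware designs still rely on simple hardware replication. Addressing the growing demand for 128-bit high-precision floating-point operations, the paper proposes a corresponding low-cost SIMD hardware structure. Synthesis results show that the proposed SIMD floating-point multiply-add unit achieves better performance and area figures than a conventional 128-bit high-precision floating-point multiply-add unit.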

12.

A fast, streaming SIMD Extensions 2, logistic squashing function

Milner JJ, Grandison AJ 《Neural computation》 2008, 20(12): 2967-2972

Schraudolph proposed an excellent exponential approximation providing increased performance particularly suited to the logistic squashing function used within many neural networking applications. This note applies Intel's Streaming SIMD Extensions 2 (SSE2), where SIMD is single instruction multiple data, of the Pentium IV class processor to Schraudolph's technique, further increasing the performance of the logistic squashing function. It was found that the calculation of the new 32-bit SSE2 logistic squashing function described here was up to 38 times faster than the conventional exponential function and up to 16 times faster than a Schraudolph-style 32-bit method on an Intel Pentium D 3.6 GHz CPU.
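
The idea behind a Schraudolph-style exponential can be sketched in scalar Python: scale the argument so it lands in the exponent field of an IEEE 754 float, then reinterpret the integer bits as a float. This is not the SSE2 routine from the note. The 32-bit layout, the correction constant (taken here as 60801 × 8, scaled from the double-precision formulation), the clamp range, and the function names are all assumptions made for illustration.

```python
import math
import struct

_A = (1 << 23) / math.log(2)   # maps x onto the float32 exponent field
_B = 127 << 23                 # float32 exponent bias, shifted into place
_C = 486408                    # assumed error-balancing correction (60801 * 8)


def fast_exp(x: float) -> float:
    """Approximate e**x by building float32 bits directly (|x| < ~80)."""
    i = int(_A * x + (_B - _C))
    return struct.unpack("<f", struct.pack("<i", i))[0]


def fast_logistic(x: float) -> float:
    """Logistic squashing function 1 / (1 + e**-x) using fast_exp."""
    x = max(-80.0, min(80.0, x))   # keep the bit pattern a valid float32
    return 1.0 / (1.0 + fast_exp(-x))
```

With these constants the exponential stays within a few percent of `math.exp`, which is usually acceptable for a squashing function but not for general numerical work.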

13.

The TMS320C30 floating-point digital signal processor

Papamichalis P., Simar R. Jr. 《Micro, IEEE》 1988, 8(6): 13-29

The 320C30 is a fast processor with a large memory space and floating-point-arithmetic capabilities. The authors describe the 320C30 architecture in detail, discussing both the internal organization of the device and the external interfaces. They also explain the pipeline structure, addressing software-related issues and constructs, and examine the development tools and support. Finally, they present examples of applications. Some of the major features of the 320C30 are: a 60-ns cycle time that results in execution of over 16 million instructions per second (MIPS) and over 33 million floating-point operations per second (Mflops); 32-bit data buses and 24-bit address buses for a 16M-word overall memory space; dual-access, 4 K×32-bit on-chip ROM and 2 K×32-bit on-chip RAM; a 64×32-bit program cache; a 32-bit integer/40-bit floating-point multiplier and ALU; eight extended-precision registers, eight auxiliary registers, and 23 control and status registers; generally single-cycle instructions; integer, floating-point, and logical operation; two- and three-operand instructions; an on-chip DMA controller; and fabrication in 1-μm CMOS technology and packaging in a 180-pin package. These facilitate FIR (finite impulse response) and IIR (infinite impulse response) filtering, telecommunications and speech applications, and graphics and image processing applications.
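
The headline figures in this abstract follow from simple arithmetic, sketched below. The two-operations-per-cycle factor is an assumption (one multiply plus one ALU operation in the same cycle), chosen because it reproduces the quoted Mflops figure. The function names are hypothetical.

```python
def instructions_per_second(cycle_time_ns: float) -> float:
    """One single-cycle instruction per cycle."""
    return 1e9 / cycle_time_ns


def peak_flops(cycle_time_ns: float, ops_per_cycle: int = 2) -> float:
    """Peak rate, assuming a multiply and an ALU operation issue together."""
    return ops_per_cycle * instructions_per_second(cycle_time_ns)


def addressable_words(address_bits: int) -> int:
    """Number of words reachable with the given address width."""
    return 1 << address_bits


mips = instructions_per_second(60) / 1e6   # about 16.7, "over 16 MIPS"
mflops = peak_flops(60) / 1e6              # about 33.3, "over 33 Mflops"
words = addressable_words(24)              # 16,777,216 = 16M words
```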

14.

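A low-power split ALU design supporting SIMD instructions

Zheng Wei, Yao Qingdong, Zhang Ming, Jiang Zhidi, Li Dongxiao, Lai Liya, Zhou Li 《计算机工程》 (Computer Engineering) 2004, 30(17): 175-177

In the design of MD32, a high-performance, low-power DSP chip for multimedia computation, a split, low-power ALU that supports SIMD instructions is a key part of meeting the design goals. Following a resource-sharing approach, the paper builds the data-processing unit around a carry look-ahead adder that performs both arithmetic and logic operations, which reduces the area of the ALU module and balances the lengths of the different data paths. The design selects data first and processes it afterwards, which lowers the activity of unused modules and reduces power consumption. According to Design Power analysis of the synthesized gate-level implementation, chip area is reduced by 8% and power consumption by 51%.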
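
What "split" means for a SIMD adder can be shown with a behavioral sketch: one 32-bit adder is reused for 8-, 16-, or 32-bit lanes by stopping carries at lane boundaries. The masking trick below is a common software analogue and is an assumption for illustration. It does not reproduce the gate-level carry look-ahead design in the paper, and the function name is hypothetical.

```python
def simd_add(a: int, b: int, lane_bits: int) -> int:
    """Add two packed 32-bit words lane by lane, with wraparound per lane."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16, or 32")
    # Mask holding the most significant bit of every lane.
    high = 0
    for pos in range(lane_bits - 1, 32, lane_bits):
        high |= 1 << pos
    # Add with lane MSBs cleared so no carry can cross a lane boundary...
    low_sum = (a & ~high) + (b & ~high)
    # ...then restore the MSB sum with XOR, which discards the carry out.
    return (low_sum ^ ((a ^ b) & high)) & 0xFFFFFFFF
```

The same operands give different results per mode: `simd_add(0x01FF01FF, 0x00010001, 8)` wraps each low byte, while the 16-bit mode lets the carry reach the upper byte of each halfword.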

15.

Vector data flow analysis for SIMD optimizations on OpenCL programs

Yu‐Te Lin, Jenq‐Kuen Lee 《Concurrency and Computation》 2016, 28(5): 1629-1654

Multi‐core systems equipped with micro processing units and accelerators such as digital signal processors (DSPs) and graphics processing units (GPUs) have become a major trend in processor design in recent years in attempts to meet ever‐increasing application performance requirements. Open Computing Language (OpenCL) is one of the programming languages that include new extensions proposed to exploit the computing power of these kinds of processors. Among the newly extended language features, the single‐instruction multiple‐data (SIMD) linguistics and vector types are added to OpenCL to exploit hardware features of the accelerators. The addition makes it necessary to consider how traditional compiler data flow analysis can be adopted to meet the optimization requirements of vector linguistics. In this paper, we propose a calculus framework to support the data flow analysis of vector constructs for OpenCL programs that compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions. We then show that the data flow analysis for OpenCL vector linguistics can be performed based on the data access functions. Based on the information gathered from data flow analysis, we illustrate a set of SIMD optimizations on OpenCL programs. The experimental results incorporating our calculus and our proposed compiler optimizations show that the proposed SIMD optimizations can provide average performance improvements of 22% on x86 CPUs and 4% on AMD GPUs. For the selected 15 benchmarks, 11 of them are improved on x86 CPUs, and six of them are improved on AMD GPUs. The proposed framework has the potential to be used to construct other SIMD optimizations on OpenCL programs. Copyright © 2015 John Wiley & Sons, Ltd.

16.

Parallel merged multiplier-accumulator coprocessor optimized for digital filters

H. Parandeh-Afshar, O. Fatemi 《Computers & Electrical Engineering》 2010, 36(5): 864-873

In an attempt to improve the speed of VLSI signal processing systems, a new architecture for a high-speed multiply-accumulate (MAC) unit optimized for digital filters is proposed. This unit is designed as a coprocessor for the LEON2 RISC processor [LEON2 Processor; 2005 [Online]. <http://www.gaisler.com/products/leon2/leon.html>]. In this work, four parallel MAC units with two dual-port coefficient register-files, a three-port general register-file and a control unit are included in the coprocessing block. With the existence of four parallel units, several SIMD format instructions have been added to LEON2 instruction set. Each MAC unit has two 16-bit inputs, 32-bit output register and a programmable round-saturate block. The MAC unit uses a new architecture which embeds the accumulate module within the partial products summation tree of the multiplier with minimum overhead. A central control unit controls inputs of the four MACs and loading of the output registers. Our experimental results demonstrate a high performance in implementation of digital filters at elevated speeds of up to 33 millions of input samples per second in a 0.18 μm technology.
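
A behavioral sketch may help picture four parallel MAC units working on a digital filter. It assumes that each lane computes one FIR output and that the accumulator saturates to 32 bits after every step. The rounding stage, the register files, the merged accumulate tree, and the actual LEON2 instructions are not modeled, and all names are hypothetical.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def saturate32(value: int) -> int:
    """Clamp to the signed 32-bit range instead of wrapping."""
    return max(INT32_MIN, min(INT32_MAX, value))


def mac(acc: int, x: int, c: int) -> int:
    """One multiply-accumulate step: 16-bit inputs, saturating 32-bit result."""
    return saturate32(acc + x * c)


def fir4(samples: list[int], coeffs: list[int], start: int) -> list[int]:
    """Four consecutive FIR outputs y[start..start+3], one per MAC lane."""
    if start < len(coeffs) - 1 or start + 3 >= len(samples):
        raise IndexError("not enough samples around the requested outputs")
    accs = [0, 0, 0, 0]
    for k, c in enumerate(coeffs):
        # All four lanes share coefficient c and read neighboring samples.
        for lane in range(4):
            accs[lane] = mac(accs[lane], samples[start + lane - k], c)
    return accs
```

Each coefficient is fetched once and reused by all four lanes, which is one reason a four-MAC arrangement suits filters.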

17.

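A low-power 32-bit multiplier based on a skip Wallace tree (total citations: 3; self-citations: 1; citations by others: 2)

Li Wei, Dai Zibin, Chen Tao 《计算机工程》 (Computer Engineering) 2008, 34(17): 229-231

To improve the overall performance of the multiplier, the design is optimized in three respects: a modified Booth algorithm generates the partial products, a skip Wallace tree structure compresses them, and a modified Ling adder sums the compressed result. The design was verified and tested on an FPGA, then synthesized, placed, and routed in a 0.18 μm SMIC process. Compared with a multiplier using a conventional Wallace tree, delay is reduced by 29%, area by 17%, and power consumption by 38%, which meets the requirements of high-performance processing.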
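
For orientation, the conventional Wallace tree that this design is compared against can be sketched at word level with carry-save (3:2) adders. The sketch assumes unsigned partial products without Booth recoding. The paper's skip structure and Ling adder are not modeled, and the function names are hypothetical.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: three operands in, sum and carry words out."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def wallace_reduce(rows: list[int]) -> tuple[list[int], int]:
    """Reduce rows to at most two; also return the number of CSA levels."""
    rows = list(rows)
    levels = 0
    while len(rows) > 2:
        full = len(rows) - len(rows) % 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(carry_save_add(*rows[i:i + 3]))
        nxt.extend(rows[full:])       # leftover rows pass to the next level
        rows = nxt
        levels += 1
    return rows, levels


def wallace_multiply(a: int, b: int, bits: int = 32) -> int:
    """Unsigned multiply: AND-array partial products, tree, final add."""
    rows = [a << i for i in range(bits) if (b >> i) & 1]
    reduced, _ = wallace_reduce(rows)
    return sum(reduced)               # stands in for the final adder
```

With 32 partial products this plain tree needs eight carry-save levels, and that depth is what tree restructuring and Booth recoding aim to reduce.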