1.

何婷婷彭元喜雷元武《计算机应用》2015,35(7):1854-1857

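To address the complexity and long latency typical of double-precision floating-point division, this paper proposes a method for designing a high-performance double-precision floating-point divider, based on the Goldschmidt algorithm, that supports the IEEE-754 standard. First, the Goldschmidt division procedure and the error introduced by its iterations are analyzed; then a method for controlling this error is proposed. Next, an area-efficient dual lookup-table method is used to determine the initial value for iteration, and the iteration unit adopts a parallel multiplier structure to speed up iteration. Finally, the pipeline stages are partitioned appropriately and the iteration process is controlled so that floating-point divisions can be executed in a pipelined fashion, further increasing the divider's operating rate. Experimental results show that, in a 40 nm process, the double-precision floating-point divider using a pipelined structure with a 14-bit initial value has a synthesized cell area of 84902.2618 μm² and runs at up to 2.2 GHz; compared with a pipelined structure using an 8-bit initial value, its computation speed is 32.73% higher and its area is 5.05% larger. A single double-precision division has a latency of 12 clock cycles; under pipelined execution, the average latency per division is 3 clock cycles. Compared with SRT-based double-precision floating-point dividers in other processors, data throughput is 3 to 7 times higher; compared with Goldschmidt-based double-precision floating-point dividers in other processors, it is 2 to 3 times higher.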
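The Goldschmidt iteration that this abstract builds on can be sketched in software. The following is only an illustrative Python sketch, not the paper's hardware design: the function name `goldschmidt_divide`, the parameters `seed_bits` and `iterations`, and the way the lookup table is emulated (truncating the normalized divisor to `seed_bits` bits and taking its reciprocal) are all assumptions made for this example. It assumes finite inputs and does not handle overflow or IEEE-754 rounding modes.

```python
import math


def goldschmidt_divide(n: float, d: float, seed_bits: int = 8, iterations: int = 3) -> float:
    """Approximate n / d with Goldschmidt iteration (software sketch only)."""
    if d == 0.0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if (n < 0) != (d < 0) else 1.0
    n, d = abs(n), abs(d)

    # Normalize the divisor: d = m * 2**e with m in [0.5, 1).
    m, e = math.frexp(d)

    # Hypothetical stand-in for a lookup table: the reciprocal of m
    # truncated to seed_bits bits, so den below starts within 2**-(seed_bits-1) of 1.
    m_trunc = math.floor(m * (1 << seed_bits)) / (1 << seed_bits)
    x0 = 1.0 / m_trunc

    num = n * x0   # scaled numerator
    den = m * x0   # scaled denominator, close to 1

    # Each step multiplies both by f = 2 - den, which squares the
    # distance of den from 1 while keeping num / den unchanged.
    for _ in range(iterations):
        f = 2.0 - den
        num *= f
        den *= f

    # Undo the exponent removed during normalization.
    return sign * math.ldexp(num, -e)
```

Because the error in `den` is squared at every step, a more accurate seed needs fewer iterations: in this sketch an 8-bit seed with three iterations and a 14-bit seed with two iterations both reach roughly double-precision accuracy, which is in keeping with the abstract's comparison of 8-bit and 14-bit initial values.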

2.

Xilinx ISE 13.3 Design Suite Supports Fully Custom-Precision Floating Point

《单片机与嵌入式系统应用》2011,11(12):86-86

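Xilinx, Inc. has released the ISE 13.3 design suite, whose new features help DSP designers easily implement bit-accurate single-precision, double-precision, and fully custom-precision floating-point math operations in designs for wireless, medical, high-performance computing, and video applications.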

3.

Design and Implementation of a High-Performance Quadruple-Precision Floating-Point Multiply-Add Unit

何军黄永勤朱英《计算机工程》2014,(2):294-299

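High-precision, high-performance floating-point units are an important part of high-performance microprocessor design. Building on a study of the conventional double-precision floating-point multiply-add algorithm and taking into account the characteristics of the quadruple-precision floating-point data format, this paper designs and implements a high-performance quadruple-precision floating-point multiply-add unit (QPFMA). The unit supports multiple floating-point operations, has a latency of 7 cycles, and is fully pipelined. A dual-path adder is used to improve the algorithmic structure, and the leading-zero anticipation and normalization shift logic are optimized, reducing both latency and hardware overhead. A parameterized design verification method enables efficient correctness verification. Logic synthesis results show that, in a 65 nm process, the QPFMA reaches a frequency of 1.2 GHz, with 3 fewer cycles of latency and a frequency about 11.63% higher than existing QPFMA designs.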

4.

A Parallel Method for Intel Many Integrated Core Based on JNI and C++

桑喆 邓川 苟聪 刘开兴 白明泽 《计算机与现代化》2018,(4):32

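Intel Many Integrated Core (MIC) coprocessors can currently be programmed for parallel computing only in C/C++/Fortran and cannot provide high-performance computing support for existing Java programs. To address this, a hybrid parallel computing method for MIC based on Java Native Interface (JNI) technology and C++ is proposed. The method uses JNI to design a data exchange mechanism between Java code and C++ code, making it possible for the MIC coprocessor's powerful floating-point capability to accelerate Java applications. Experiments measure and analyze the computing performance of Java programs that use MIC multithreaded parallelism; the results show that the method makes effective use of the MIC coprocessor and significantly improves the computing performance of Java programs.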

5.

Pipelined Design of Register File Reads and Writes for VelociTI-Architecture Floating-Point DSPs

胡正伟仲顺安陈禾《计算机工程》2007,33(21):237-239

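This paper studies the pipelined read and write principles of the register file in VelociTI-architecture floating-point digital signal processors and proposes a design method. For single-operand double-precision floating-point instructions, the method uses two 32-bit data paths to read the source operand in one pipeline cycle; for two-operand double-precision floating-point instructions, it locks the decode unit and reads the source operands over several pipeline cycles. A write-control-vector method is used to perform write operations across multiple pipeline cycles. The method correctly transfers IEEE 754 double-precision floating-point data between the register file and the functional units over 32-bit data paths, and simulation results verify its correctness.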

6.

Design of a Matrix Operation Hardware Accelerator for Navigation Computation

《计算机工程》2014,(8)

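In strapdown inertial navigation systems, floating-point matrix multiplication is computationally heavy and serial processing is time-consuming, which limits improvements in real-time performance. To address this, a parallel processing method for floating-point matrix multiplication based on FPGA/SOPC is proposed. Its core, a high-performance matrix multiplication unit, is a highly parallel processing unit built on a systolic array structure and optimized through loop tiling, data-space partitioning, and iteration-space merging; it also exploits the speed advantage of direct memory access for bulk data transfer to further increase computation speed. Experimental results show that the resulting floating-point matrix multiplication accelerator not only completes the computation accurately but also runs markedly faster, reducing the cycle count by more than 71.3% and 78% compared with other serial and parallel methods, respectively, and can effectively improve the real-time performance of the navigation system.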

7.

Research Progress in High-Performance Parallel Computing in the Aerospace Field

龚春叶包为民汤国建王玲孙学功刘杰《计算机工程与科学》2014,36(9):1629-1636

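Numerical simulation of large-scale scientific and engineering problems in aerospace both depends on high-performance parallel computing and drives its development. This paper surveys research progress in high-performance parallel computing in the aerospace field. It briefly introduces high-performance parallel computing environments, then classifies and describes in detail the related research areas, including aerodynamic forces, aerodynamic heating, chemical non-equilibrium, structural strength, thermal protection, Monte Carlo methods, and turbulence. It summarizes two tensions in the field: high parallel efficiency in scientific computing versus low practical value in engineering computing, and the diversity of parallel applications versus the lack of systematic parallel methods. It also points out directions for further research.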

8.

A Floating-Point Exception Detection Method Based on Compile-Time Instrumentation

郭思雨王磊《计算机工程与科学》2022,44(6):979-985

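Floating-point numbers are finite-precision encodings of real numbers, and floating-point computation may produce inexact or exceptional results, so an effective method for detecting floating-point exceptions is important. Existing exception detection methods do not target floating-point math functions, so this paper proposes one that does. The method follows the five exception classes defined in the IEEE-754 standard (overflow, underflow, division by zero, invalid operation, and inexact) and draws on the floating-point control register FPCR used in the Sunway (申威) high-performance math library and on the exception conditions defined by IEEE-754. It maps exception types to the corresponding floating-point arithmetic instructions and instruments the program at compile time to detect exceptions occurring in floating-point math functions, while also recording code coverage. Finally, the method is applied to a math library, testing more than 100 of its floating-point math functions. Experimental results show that the method effectively detects all classes of exceptions.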

9.

Design and Implementation of High-Precision Reduction Functions Based on MPI

何康黄春姜浩谷同祥齐进刘杰《计算机工程与科学》2021,43(4):594-602

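As scientific and engineering computations become increasingly large-scale, high-dimensional, and long-running, the cumulative effect of floating-point rounding errors often makes results untrustworthy, and improving computational precision has become one of the active research topics in parallel computing. Based on the MPICH3 framework, error-free transformation techniques are used to construct a new data format and corresponding operators, and a high-precision reduction function, MPI_ACCU_REDUCE, is designed, implementing three high-precision MPI reduction operations: sum, product, and L2 norm. Numerical experiments show that the three proposed high-precision reduction operations effectively improve the precision of numerical computation.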
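To illustrate what an error-free transformation is, the sketch below shows the classic TwoSum transformation and a compensated sum built on it. This is a sequential Python illustration only, not the MPI_ACCU_REDUCE implementation or its data format; the names `two_sum`, `compensated_sum`, and `naive_sum` are hypothetical and chosen for this example.

```python
def two_sum(a: float, b: float):
    """Error-free transformation: return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    b_virtual = s - a
    err = (a - (s - b_virtual)) + (b - b_virtual)
    return s, err


def compensated_sum(values):
    """Sum values while accumulating the rounding error of every addition separately."""
    total = 0.0
    compensation = 0.0
    for v in values:
        total, err = two_sum(total, v)
        compensation += err
    # Fold the accumulated rounding errors back into the result once at the end.
    return total + compensation


def naive_sum(values):
    """Plain left-to-right floating-point summation, for comparison."""
    total = 0.0
    for v in values:
        total += v
    return total
```

For the input `[1e16, 1.0, -1e16]`, plain summation loses the `1.0` when it is added to `1e16`, whereas the compensated version recovers it from the error term. A reduction operator in this style would carry the pair (sum, compensation) between processes rather than a single float, which is one plausible reading of the "new data format" mentioned in the abstract.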

10.

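Architecture-Level Energy-Efficiency Techniques for Many-Core Processors in High-Performance Computing  Total citations: 1; self-citations: 0; citations by others: 1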

郑方张昆邬贵明高红光唐勇吕晖过锋李宏亮谢向辉《计算机学报》2014,37(10)

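With advances in semiconductor technology, many-core processors have been widely used in high-performance computing. To build future high-performance computing systems, however, processors must overcome the severe "power wall" challenge. Using a prototype of DFMC, an independently developed many-core processor, this paper first analyzes its energy distribution under typical workloads. Then, based on the processor's specific structure, it proposes two architecture-level energy-efficiency optimizations, instruction-window-based instruction buffering and operand latching, and explores an energy-efficiency-first design method for floating-point units. Experiments show that these techniques reduce instruction fetch and decode energy by about 50%, register file energy by 11.2%, and floating-point unit energy by 17.6%, for a total chip-level energy reduction of about 14.7%. In the experimental environment described in the paper, the authors also measured the performance-per-watt of double-precision matrix multiplication (DGEMM) on the DFMC prototype and compared it with NVIDIA's Kepler K20 GPU.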

11.

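Research on the Architecture and Core Algorithms of a Parallel Floating-Point Adder  Total citations: 1; self-citations: 0; citations by others: 1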

陈弦张伟功于伦正《计算机工程与应用》2006,42(17):53-55,75

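Given the important role of floating-point operations in graphics processing, and following the principles of speed and area optimization, this article studies double-precision floating-point addition, the most complex operation in the FAU structure, from two angles. First, structurally, it adopts three mutually parallel main paths and designs a three-stage floating-point pipeline that processes in parallel wherever possible, greatly increasing operation speed and saving chip resources. Second, it presents a new design and implementation of the key operations that limit floating-point addition speed, namely significand addition and shifting, and provides a parameter comparison and overall analysis against conventional designs with respect to the design's advancement and speed.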

12.

Variable-precision,interval arithmetic coprocessors

Michael J. Schulte Earl E. Swartzlander Jr. 《Reliable Computing》1996,2(1):47-62

This paper presents hardware designs, arithmetic algorithms, and numerical applications for variable-precision, interval arithmetic coprocessors. These coprocessors give the programmer the ability to set the initial precision of the computation, determine the accuracy of the results, and recompute inaccurate results with higher precision. Variable-precision, interval arithmetic algorithms are used to reduce the execution times of numerical applications. Three hardware designs with data paths of 16, 32, and 64 bits are examined. These designs are compared based on their estimated chip area, cycle time, and execution times for various numerical applications. Each coprocessor can be implemented on a single chip with a cycle time that is comparable to IEEE double-precision floating point coprocessors. For certain numerical applications, the coprocessors are two to four orders of magnitude faster than a conventional software package for variable-precision, interval arithmetic.

13.

A survey on parallel and distributed multi-agent systems for high performance computing simulations

《Computer Science Review》2016

Simulation has become an indispensable tool for researchers to explore systems without having recourse to real experiments. Depending on the characteristics of the modeled system, methods used to represent the system may vary. Multi-agent systems are often used to model and simulate complex systems. In any case, increasing the size and the precision of the model increases the amount of computation, requiring the use of parallel systems when it becomes too large. In this paper, we focus on parallel platforms that support multi-agent simulations and their execution on high performance resources such as parallel clusters. Our contribution is a survey on existing platforms and their evaluation in the context of high performance computing. We present a qualitative analysis of several multi-agent platforms, their tests in high performance computing execution environments, and the performance results for the only two platforms that fulfill the high performance computing constraints.

14.

The TMS390C602A floating-point coprocessor for Sparc systems

Darley M. Kronlage B. Bural D. Churchill B. Pulling D. Wang P. Iwamoto R. Yang L. 《Micro, IEEE》1990,10(3):36-47

A recent Sparc (scalable processor architecture) processor consists of a two-chip configuration containing the TMS390C601 integer unit (IU) and the TMS390C602A floating-point unit (FPU). The second device, an innovative coprocessor that lets the processor execute single- or double-precision floating-point instructions concurrently with IU operations, is described. Dedicated floating-point hardware in the FPU increases the performance of the system. Running at clock periods as small as 20 ns, the chip should deliver 5.5 million double-precision floating-point operations per second under the Linpack benchmark (50-MHz clock rate). The FPU provides single- and double-precision arithmetic functions: addition, subtraction, multiplication, division, square root, compare, and convert. To minimize its math unit's latency, the FPU uses a highly parallel architecture requiring separate math units to optimize additions and multiplications. Traps stop the execution of a program to jump to a software routine for handling data-dependent errors or to execute instructions not implemented in the hardware. Benchmark results are presented.

15.

Design and Implementation of a High-Performance Subword-Parallel Arithmetic Unit

董冕吴丹饶金理黄威戴葵邹雪城《计算机工程》2012,38(16):249-252

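A high-performance subword-parallel arithmetic unit is implemented through hardware sharing. The unit is pipelined and can perform, in one cycle, one 64-bit, two 32-bit, four 16-bit, or eight 8-bit fixed-point operations, or one double-precision or two single-precision floating-point operations. It is designed in Verilog HDL, implemented with a 0.18 μm standard CMOS process library, and evaluated on the ESCA system using real multimedia applications. Experimental results show that the unit achieves a good balance between hardware overhead and performance.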

16.

Physical Implementation of the Eight-Core Godson-3B Microprocessor

王茹范宝峡杨梁高燕萍刘动肖斌王江嵋张译夫王宏胡伟武《计算机科学技术学报》2011,26(3):520-527

The Godson-3B processor is a powerful processor designed for high performance servers including Dawning Servers. It offers significantly improved performance over previous Godson-3 series CPUs by incorporating eight CPU cores and vector computing units. It contains 582.6 M transistors within 300 mm² area in 65 nm technology and is implemented in parallel with full hierarchical design flows. In Godson-3B, advanced clock distribution mechanisms including GALS (Globally Asynchronous Locally Synchronous) and clock mesh are adopted to obtain an OCV-tolerable clock network. Custom-designed de-skew modules are also implemented to afford further latency balance after fabrication. The power reduction of Godson-3B is maintained by MLMM (Multi Level Multi Mode) clock gating and multi-threshold-voltage cell substitution schemes. The highest frequency of Godson-3B is 1.05 GHz and the peak performance is 128 GFlops (double-precision) or 256 GFlops (single-precision) with 40 W power consumption.

17.

Research on High-Performance Parallel Processing Techniques for Remote Sensing Imagery

赵颖辉; 蒋从锋 《计算机技术与发展》2014,(7):201-205

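With the continuing development of space remote sensing and Earth observation technologies, optical, thermal infrared, microwave, and other techniques can acquire many kinds of remote sensing imagery of the same area (multi-temporal, multi-spectral, multi-sensor, multi-platform, multi-resolution, and so on), and the volume of remote sensing data acquired each day keeps growing. At the same time, many remote sensing applications need to process and analyze these data quickly in order to provide information that supports decision-making. If the data cannot be processed in time, they lose their timeliness and may even lose their value altogether. High-performance computing and parallel processing technologies, such as large-scale multiprocessor systems, grid and cloud computing, and general-purpose graphics processing units (GPGPU), speed up remote sensing image processing and information extraction. This paper surveys the latest progress in applying high-performance computing, parallel processing, and cloud computing to remote sensing, gives some research and application examples, and identifies some of the challenges currently facing high-performance remote sensing image processing.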

18.

Research on Circle Center Algorithms in Image Processing  Total citations: 4; self-citations: 0; citations by others: 4

雷家勇达飞鹏孟广猛《计算机与现代化》2005,(3):25-26,34

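To meet the need for high-precision circle centers in computer processing, this paper studies and compares several algorithms for computing circle centers in images, analyzes the sources of error and how to address them, and proposes a least-squares method for computing the circle center with constraint-based preprocessing. Radial error is used as the constraint for selecting valid image boundary points, which both avoids unnecessary computation and improves precision, making this an accurate and effective algorithm.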
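A least-squares circle fit with a radial-error filter can be sketched as follows. The abstract does not specify the exact formulation, so this example makes several assumptions: it uses the algebraic (Kåsa-style) least-squares fit of x² + y² + Dx + Ey + F = 0, it applies the radial-error constraint as a single filter-and-refit pass, and the names `fit_circle`, `fit_circle_with_radial_filter`, and `max_radial_error` are hypothetical.

```python
import math


def _solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))) / m[i][i]
    return x


def fit_circle(points):
    """Algebraic least-squares fit; returns (center_x, center_y, radius)."""
    sxx = sxy = syy = sx = sy = 0.0
    bx = by = b1 = 0.0
    for x, y in points:
        z = -(x * x + y * y)
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        bx += x * z
        by += y * z
        b1 += z
    # Normal equations for the unknowns (D, E, F).
    d, e, f = _solve3(
        [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(points))]],
        [bx, by, b1],
    )
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)


def fit_circle_with_radial_filter(points, max_radial_error):
    """Fit once, drop points whose radial error exceeds the limit, then refit."""
    cx, cy, r = fit_circle(points)
    kept = [
        (x, y) for x, y in points
        if abs(math.hypot(x - cx, y - cy) - r) <= max_radial_error
    ]
    # Assumption: fall back to the first fit if too few points survive.
    return fit_circle(kept) if len(kept) >= 3 else (cx, cy, r)
```

In this sketch the filter runs once; whether the paper selects boundary points before the fit, after it, or iteratively is not stated in the abstract.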

19.

Application of Large-Scale Parallel Analysis and Optimization to Aeronautical Structures

常亮段世慧王立凯罗利龙《计算机辅助工程》2017,26(3):45-50

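The degrees of freedom in refined numerical analysis models for aircraft design have reached the hundreds of millions, placing ever higher demands on high-performance computing. To address this, and focusing on several key problems of structural analysis and optimization in large-scale parallel computing environments, this paper studies key techniques including domain decomposition parallel algorithms suited to high-performance computing architectures, efficient sensitivity solution for very large numbers of structural variables, and solution of structural nonlinear vibration characteristics. The domestic CAE software HAJIF is parallelized, and integrated aerodynamic-structural optimization based on maximum range and integrated optimization of a composite wing based on a refined model are preliminarily achieved. The computational efficiency and accuracy of HAJIF are markedly improved.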

20.

A new era in scientific computing: Domain decomposition methods in hybrid CPU–GPU architectures

M. Papadrakakis G. Stavroulakis A. Karatarakis 《Computer Methods in Applied Mechanics and Engineering》2011,200(13-16):1490-1508

Recent advances in graphics processing unit (GPU) technology open a new era in high performance computing. Applications of GPUs to scientific computations are attracting a lot of attention due to their low cost in conjunction with their inherently remarkable performance features and the recently enhanced computational precision and improved programming tools. Domain decomposition methods (DDM) constitute today an important category of methods for the solution of highly demanding problems in simulation-based applied science and engineering. Among them, dual domain decomposition methods have been successfully applied in a variety of problems in both sequential and parallel/distributed processing systems. In this work, we demonstrate the implementation of the FETI method in a hybrid CPU–GPU computing environment. Parametric tests on implicit finite element structural mechanics benchmark problems revealed the tremendous potential of this type of hybrid computing environment as a result of the full exploitation of multi-core CPU hardware resources and the intrinsic software and hardware features of the GPUs as well as the numerical properties of the solution method.