1.

A single-chip multiprocessor for multimedia: the MVP (total citations: 2; self-citations: 0; citations by others: 2)

Guttag K., Gove R.J., Van Aken J.R. 《Computer Graphics and Applications, IEEE》1992,12(6):53-64

The multimedia video processor (MVP) architecture, which incorporates a variety of parallel processing techniques to deliver very high performance to a wide range of imaging and graphics applications, is described. The MVP combines, on a single semiconductor chip, multiple fully programmable processors with multiple data streams connected to shared RAMs through a crossbar network. Each of the independent processors can execute many operations in parallel every cycle. The architecture is scalable and supports different numbers of processors to meet the cost and performance requirements of different markets. MVP's target environment and the development of MVP are outlined.

2.

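Architecture design and development trends of digital signal processors (DSPs) (total citations: 4; self-citations: 0; citations by others: 4)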

Shen Ge, Gao Deyuan, Fan Xiaoya 《Computer Engineering and Applications》2003,39(7):4-6,39

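The era of high-speed informatization requires higher-performance digital signal processors (DSPs) to meet the demands of network communications, 3G mobile communications, and similar fields. This paper analyzes the architectural features of early DSPs and of today's most advanced architectures and, in the context of their applications, examines the respective strengths and weaknesses of different DSP architectures. After studying the characteristics of new application domains for digital signal processing, it identifies trends in DSP microarchitecture design based on the expected development of semiconductor fabrication processes and microprocessor architecture design.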

3.

Scalable, vector processors for embedded systems

Kozyrakis C.E., Patterson D.A. 《Micro, IEEE》2003,23(6):36-45

For embedded applications with data-level parallelism, a vector processor offers high performance at low power consumption and low design complexity. Unlike superscalar and VLIW designs, a vector processor is scalable and can optimally match specific application requirements. To demonstrate that vector architectures meet the requirements of embedded media processing, we evaluate the Vector IRAM, or VIRAM (pronounced "V-IRAM"), architecture developed at UC Berkeley, using benchmarks from the Embedded Microprocessor Benchmark Consortium (EEMBC). Our evaluation covers all three components of the VIRAM architecture: the instruction set, the vectorizing compiler, and the processor microarchitecture. We show that a compiler can vectorize embedded tasks automatically without compromising code density. We also describe a prototype vector processor that outperforms high-end superscalar and VLIW designs by 1.5x to 100x for media tasks, without compromising power consumption. Finally, we demonstrate that clustering and modular design techniques let a vector processor scale to tens of arithmetic data paths before wide instruction-issue capabilities become necessary.
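One standard transformation a vectorizing compiler applies to loops of arbitrary length is strip-mining: the loop is cut into chunks no longer than the hardware's maximum vector length (MVL), and each chunk is handled by one pass of vector instructions. The Python sketch below emulates this for a SAXPY loop. It only illustrates the general technique; the MVL value and the function name are hypothetical and not taken from the VIRAM work.

```python
from typing import List, Tuple

MVL = 32  # hypothetical maximum vector length (elements per vector register)


def saxpy_strip_mined(a: float, x: List[float], y: List[float],
                      mvl: int = MVL) -> Tuple[List[float], int]:
    """Compute a*x + y in strips of at most `mvl` elements.

    Returns the result and the number of strips, i.e. how many times a
    vector unit would execute the vectorized loop body.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    out = list(y)
    start, strips = 0, 0
    while start < len(x):
        vl = min(mvl, len(x) - start)  # "set vector length" for this strip
        end = start + vl
        # One strip: emulates a vector multiply-add over elements [start, end).
        out[start:end] = [a * xi + yi for xi, yi in zip(x[start:end], y[start:end])]
        start = end
        strips += 1
    return out, strips
```

With 70 elements and an MVL of 32, the loop runs as three strips of 32, 32, and 6 elements; only the last strip uses a shortened vector length.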

4.

A survey of real-time microprocessor architectures (total citations: 1; self-citations: 0; citations by others: 1)

Shi Wei, Zhang Ming, Guo Yufeng, Gong Rui 《Computer Engineering & Science》2015,37(5):857-864

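Real-time applications have become a rapidly emerging class of typical embedded applications. As the core component of real-time systems, real-time microprocessor architecture is an important research direction in the microprocessor field. Unlike general-purpose processors, which pursue maximum throughput, real-time processors require a tight and computable worst-case execution time (WCET). Traditional real-time processors tend to use relatively simple structures to avoid the execution-time uncertainty that complex structures introduce. As real-time applications demand ever higher processor performance, real-time processors are gradually moving toward multithreaded and multicore structures. In multithreaded and multicore processors, contention for shared resources degrades the determinism of real-time systems, posing greater challenges for real-time processor architecture. This survey first analyzes traditional real-time processors in terms of instruction set, microarchitecture, memory, I/O, and task scheduling; it then analyzes high-performance real-time processors with multithreaded and multicore structures; finally, it compares several commercial real-time processor architectures and summarizes the current state and future trends of real-time processors.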

5.

Optimized memory allocation on FPGAs for image processing

Chen Kaifeng, Liang Jianru 《Computer Engineering & Science》2019,41(11):1924-1929

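Field-programmable gate arrays (FPGAs) have broad prospects in computer vision applications, but their limited on-chip memory resources make it difficult to meet the performance, size, and power requirements of these application scenarios. To address this problem, this work studies the allocation of on-chip memory resources and proposes an easy-to-implement partition balancing algorithm that aims to minimize on-chip resource usage and overall power consumption. Experimental results show that, compared with a commercial FPGA high-level synthesis tool, the proposed algorithm improves utilization by up to 60% and reduces dynamic power by about 70%. In experiments with the advanced MeanShift tracking algorithm, the partitioning algorithm reduced total power consumption by up to 30% without affecting critical performance.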

6.

Heterogeneity-aware scheduling techniques for performance-asymmetric multicore processors

Zhao Shan, Yang Qiusong, Li Mingshu 《Journal of Software》2019,30(4):1164-1190

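To meet the diverse needs of applications, heterogeneous multicore processors have emerged and are gradually entering the market. Their processing cores have different microarchitectures or instruction set architectures (ISAs) and offer applications a variety of features, such as instruction-level parallelism (ILP) and memory-level parallelism (MLP); these cores work together to meet the optimization goals of the whole computing system, such as high performance, low power consumption, or good energy efficiency. However, mainstream scheduling techniques are designed mainly for traditional homogeneous processor architectures and do not account for differences in hardware capability. In a heterogeneous multicore environment, how a scheduler can perceive hardware heterogeneity and give each type of application better-matched hardware resources is a question worth exploring. This paper surveys recent research in this area. Focusing on performance-asymmetric multicore architectures, it analyzes the main problems faced by heterogeneity-aware scheduling, namely optimization goals, analytical models, scheduling decisions, and algorithm evaluation, and systematically summarizes the related techniques for each. Finally, it outlines future research directions from the perspective of hardware/software integration.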
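As a minimal illustration of what "heterogeneity-aware" can mean on a performance-asymmetric (big/little) processor, the sketch below ranks threads by their predicted speedup on a big core relative to a little core and gives the big cores to the threads that benefit most. ILP-heavy threads often gain more from a big out-of-order core than memory-bound ones. This speedup-ranking heuristic, the function name, and the thread names and values are assumptions for illustration, not a method taken from the survey.

```python
from typing import Dict, List, Tuple


def assign_by_speedup(speedup: Dict[str, float], n_big: int) -> Tuple[List[str], List[str]]:
    """Split threads between big and little cores.

    speedup[t] is a hypothetical (e.g. profiled or model-predicted) ratio of
    thread t's performance on a big core to its performance on a little core.
    Returns (threads for big cores, threads for little cores).
    """
    if n_big < 0:
        raise ValueError("n_big must be non-negative")
    # Highest predicted benefit first; sorted() is stable, so ties keep input order.
    ranked = sorted(speedup, key=lambda t: speedup[t], reverse=True)
    return ranked[:n_big], ranked[n_big:]
```

For example, with speedups `{"ilp_heavy": 2.5, "mixed": 1.8, "mem_bound": 1.1}` and one big core, `ilp_heavy` goes to the big core and the other two threads share the little cores. A real scheduler would also re-evaluate the ranking as thread phases change.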

7.

Design and implementation of the "Tengyue-II" embedded asynchronous microprocessor

Su Bo, Shi Wei, Wang Zhiying, Ren Hongguang, Wang Yourui 《Computer Engineering & Science》2012,34(7):65-70

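Embedded systems impose strict limits on processor power consumption, and asynchronous circuit technology is one effective approach to designing low-power processors. Targeting embedded multimedia applications, this paper designs and implements a low-power asynchronous microprocessor, Tengyue-II. The processor contains an asynchronous TTA microprocessor core, a synchronous TTA microprocessor core, two memory controllers, and several external communication interfaces. The asynchronous core is implemented with a macrocell-based asynchronous circuit design method, and the remaining parts with a standard-cell-based semi-custom design flow. The chip is fabricated in a UMC 0.18 μm CMOS process, with a die area of 4.89 × 4.89 mm² and an operating voltage of 1.8 V. Measurements show that the processor reaches an operating frequency of 200 MHz and that the asynchronous core consumes less than 50% of the power of the synchronous core.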

8.

ASIP design guided by a configuration-flow-driven computing architecture (total citations: 1; self-citations: 0; citations by others: 1)

Li Yong, Wang Zhiying, Zhao Xuemi, Yue Hong 《Journal of Computer Research and Development》2007,44(4):714-721

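To balance flexibility and efficiency in embedded processor design, a configuration-flow-driven computing architecture is proposed. The architecture moves the hardware/software interface downward, making the interconnection network between functional units visible to the compiler, which then performs transport routing; this allows more complex but more efficient interconnection networks. Guided by this architecture, a design method is proposed for application-specific instruction-set processors (ASIPs) that support segmented reconfigurable interconnection networks. Applying the method to the design of three classes of ASIPs in the cryptography domain shows that, compared with simple bus interconnection, it saves on average 53% of interconnect power and 38.7% of the buses without affecting performance, thereby reducing both the number of buses and interconnect power consumption.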

9.

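Research on the architecture design of heterogeneous multicore processors (total citations: 2; self-citations: 0; citations by others: 2)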

Chen Fangyuan, Zhang Dongsong, Wang Zhiying 《Computer Engineering & Science》2011,33(12):27-36

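Multicore technology has become an important direction in processor development. Heterogeneous multicore processors, which can assign different types of computing tasks to different types of cores for parallel processing and thus provide more flexible and efficient processing for applications with different requirements, have become a research hotspot. From an architectural perspective, this paper discusses the key issues in heterogeneous multicore processor design, analyzing core structure, interconnect, memory system, operating system support, testing and verification, dynamic voltage scaling, and other aspects...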

10.

New Products

Michalopoulos D.A. 《Computer》1978,11(5):92-98

EAI's new hybrid computer system, called Hyshare, is aimed at the multi-user, multi-task application demands of larger-scale simulation and scientific computation laboratories. The system consists of an EAI 3200 digital computer and up to six EAI analog processors; its analog/digital and digital/analog communications interface employs on-line, dynamic resource allocation techniques that allow analog processors to be assigned to separate tasks or linked together to meet specific application requirements.

11.

A hybrid closed queuing network model for multi-threaded dataflow architecture (total citations: 1; self-citations: 0; citations by others: 1)

Vidhyacharan 《Computers & Electrical Engineering》2005,31(8):556-571

In this paper, a closed queuing network model with both single and multiple servers is proposed to model dataflow in a multi-threaded architecture. Multi-threading helps hide latency by switching among a set of threads, which improves processor utilization. There are two sets of processors: synchronization processors, which handle load/store operations, and execution processors, which handle arithmetic/logic and control operations. A closed queuing network model is suitable for a large number of job arrivals. The normalization constant for the given model is derived using a recursive algorithm. State diagrams are drawn from the hybrid closed queuing network model, and the steady-state balance equations are derived from them. Performance measures such as average response time and average system throughput are derived and plotted against the total number of processors in the closed queuing network model. Other important performance measures, such as processor utilizations, average queue lengths, average waiting times, and relative utilizations, are also derived.
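The abstract derives the normalization constant with a recursive algorithm. A widely used recursion for product-form closed networks is Buzen's convolution algorithm, sketched below. This is a simplified stand-in, not the paper's derivation: it assumes every station is a load-independent single server (the paper's hybrid model also contains multi-server stations, which need a modified recursion), and the function names and demand values are hypothetical.

```python
from typing import List, Sequence


def buzen_g(demands: Sequence[float], n_jobs: int) -> List[float]:
    """Normalization constants G(0..n_jobs) via Buzen's convolution algorithm.

    demands[m] is the service demand (visit ratio x mean service time) of
    station m; every station is assumed to be a load-independent single server.
    """
    g = [1.0] + [0.0] * n_jobs  # G(0) = 1 before any station is folded in
    for d in demands:
        for n in range(1, n_jobs + 1):
            # g[n-1] already includes this station, so this computes
            # g(n, m) = g(n, m-1) + D_m * g(n-1, m).
            g[n] += d * g[n - 1]
    return g


def throughput(demands: Sequence[float], n_jobs: int) -> float:
    """System throughput X(N) = G(N-1) / G(N), for n_jobs >= 1."""
    g = buzen_g(demands, n_jobs)
    return g[n_jobs - 1] / g[n_jobs]


def utilizations(demands: Sequence[float], n_jobs: int) -> List[float]:
    """Per-station utilization U_m = D_m * X(N)."""
    x = throughput(demands, n_jobs)
    return [d * x for d in demands]
```

For two balanced stations (for example, one synchronization processor and one execution processor, each with demand 1.0) and two circulating jobs, `buzen_g` returns `[1.0, 2.0, 3.0]` and the throughput is 2/3.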

12.

A holistic approach for tightly coupled reconfigurable parallel processors

Hritam Dutta, Dmitrij Kissler, Frank Hannig, Alexey Kupriyanov, Jürgen Teich, Bernard Pottier 《Microprocessors and Microsystems》2009,33(1):53-62

New standards in signal, multimedia, and network processing for embedded electronics are characterized by computationally intensive algorithms and by the need for high flexibility due to swiftly changing specifications. To meet the demanding challenges of increasing computational requirements and stringent constraints on area and power consumption in embedded engineering, there is a gradual trend towards coarse-grained parallel embedded processors. Furthermore, such processors offer dynamic reconfiguration features that support time- and space-multiplexed execution of algorithms. However, the formidable problem of efficiently mapping applications (mostly loop algorithms) onto such architectures has hindered their mass acceptance. In this paper we present (a) a highly parameterizable, tightly coupled, and reconfigurable parallel processor architecture, together with the corresponding power breakdown and reconfiguration time analysis of a case study application, (b) a retargetable methodology for mapping loop algorithms, (c) a co-design framework for modeling, simulating, and programming such architectures, and (d) loosely coupled communication with the host processor.

13.

Advanced Arithmetic for the Digital Computer, Design of Arithmetic Units

《Electronic Notes in Theoretical Computer Science》2000

Advances in computer technology are now so profound that the arithmetic capability and repertoire of computers can and should be expanded. Nowadays the elementary floating-point operations +, −, ×, / give computed results that coincide with the rounded exact result for any operands. Advanced computer arithmetic extends this accuracy requirement to all operations in the usual product spaces of computation: the real and complex vector spaces as well as their interval correspondents. This enhances the mathematical power of the digital computer considerably. A new computer operation, the scalar product, is fundamental to the development of advanced computer arithmetic. This paper studies the design of arithmetic units for advanced computer arithmetic. Scalar product units are developed for different kinds of computers, such as personal computers, workstations, mainframes, supercomputers, and digital signal processors. The new expanded computational capability is gained at modest cost. The units bring into modern computer hardware a methodology that was available on old calculators before the electronic computer entered the scene. In general, the new arithmetic units increase both the speed of computation and the accuracy of the computed result. The circuits developed in this paper show that there is no way to compute an approximation of a scalar product faster than the correct result. A collection of constructs in terms of which a source language may accommodate advanced computer arithmetic is described in the paper. The development of programming languages in the context of advanced computer arithmetic is reviewed. The simulation of the accurate scalar product on existing, conventional processors is discussed. Finally, the theoretical foundation of advanced computer arithmetic is reviewed and a comparison with other approaches to achieving higher accuracy in computation is given. Shortcomings of existing processors and standards are discussed.
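The abstract's key requirement is that a scalar product be computed as if exactly and then rounded once, instead of rounding every product and partial sum. The paper develops hardware units for this; the sketch below only emulates the idea in software (in the spirit of simulating the accurate scalar product on conventional processors), using Python's exact rational type `fractions.Fraction` as a stand-in for a long accumulator. The function names are hypothetical, and this is not the paper's circuit.

```python
from fractions import Fraction
from typing import Sequence


def naive_dot(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Conventional dot product: every product and partial sum is rounded."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def exact_dot(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Dot product accumulated exactly, then rounded once to a double."""
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)  # finite doubles convert exactly; no rounding here
    return float(acc)  # the only rounding step
```

With `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, `naive_dot` returns `0.0` because `1e16 + 1.0` rounds back to `1e16`, while `exact_dot` returns `1.0`. A hardware long accumulator achieves the same effect with a wide fixed-point register rather than arbitrary-precision rationals.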

14.

An adaptive command interpretation system for heterogeneous multiprocessor devices

Liu Wenqing, Li Dong, Cui Li 《Journal of Software》2017,28(S1):11-19

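Intelligence gives the Internet of Things (IoT) deeper practical value, but finding the best trade-off between strong computing capability and low power consumption is currently a very hard problem for IoT devices. Compared with single or homogeneous multiprocessors, heterogeneous multiprocessor architectures can combine the strengths of different processors and meet system requirements for both high computing capability and low power consumption. However, the difficulty of programming heterogeneous multiprocessor architectures, and the question of how to optimize the performance of top-level applications running on multiprocessor devices, remain pressing technical problems. To address these problems, an adaptive command interpretation system for heterogeneous multiprocessor devices was designed and implemented. First, the system allows users to install IoT applications on a device, with applications expressed as command scripts. Second, the system includes an algorithm that automatically dispatches commands across the processors of a heterogeneous multiprocessor device; the algorithm considers multidimensional performance and power parameters and minimizes application energy consumption subject to an upper bound on execution time. Finally, to meet the requirements of different users' applications simultaneously under the resource constraints of IoT devices, an adaptation scheme based on each user's usage history is proposed, which lets the command interpretation system automatically perform adaptive deployment and runtime optimization according to individual user habits.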
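The dispatch problem described above (minimize energy subject to a time upper bound) can be made concrete with a small exhaustive search. The abstract does not give the authors' algorithm or cost model, so everything below is an assumption: each command has a hypothetical time and energy cost on each processor, commands on the same processor run back to back, and different processors run in parallel. Exhaustive search is exponential in the number of commands and is shown only to pin down the objective; a practical dispatcher would need a heuristic.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def dispatch(times: Sequence[Sequence[float]], energy: Sequence[Sequence[float]],
             deadline: float) -> Optional[Tuple[Tuple[int, ...], float]]:
    """Find the minimum-energy command-to-processor mapping by brute force.

    times[c][p] and energy[c][p] are the (hypothetical) cost of command c on
    processor p. Completion time is the load of the busiest processor.
    Returns (assignment, total energy), or None if no mapping meets the deadline.
    """
    n_cmds, n_procs = len(times), len(times[0])
    best: Optional[Tuple[Tuple[int, ...], float]] = None
    for assign in product(range(n_procs), repeat=n_cmds):
        busy = [0.0] * n_procs
        for c, p in enumerate(assign):
            busy[p] += times[c][p]
        if max(busy) > deadline:  # violates the execution-time upper bound
            continue
        total = sum(energy[c][p] for c, p in enumerate(assign))
        if best is None or total < best[1]:
            best = (assign, total)
    return best
```

For example, with a hypothetical slow, low-power processor 0 (time 4, energy 1 per command) and a fast, power-hungry processor 1 (time 1, energy 3), three commands all go to processor 0 under a deadline of 12; tightening the deadline to 8 moves one command to processor 1, raising the energy from 3 to 5.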

15.

SIMD-based high-performance implementation and optimization of the Square Root function

Zhao Yonghao, Jia Haipeng, Zhang Yunquan, Zhang Sijia 《Computer Engineering & Science》2021,43(4):662-669

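In application scenarios such as computer graphics, numerical integration, and neural networks, a high-performance square root function plays a very important role in building a processor's basic software ecosystem. As ARM-architecture processors become widely used, research on fast function implementations for the ARM architecture is becoming more important. Since many current processors adopt SIMD architectures, research on high-performance SIMD-based function computation has significant research value and good prospects. This paper therefore presents a high-performance implementation and optimization of the square root function. By analyzing how IEEE 754 floating-point numbers are stored in memory, an efficient square root algorithm is designed; combining the reciprocal square root with a Taylor-series algorithm then improves accuracy; finally, SIMD optimization further improves performance. Experimental results show that, while meeting the accuracy requirements, the implemented square root function is about 7 times faster than the libm library and about 3 times faster than the square root instruction provided by ARMv8.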
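The abstract outlines three steps: an initial estimate derived from the IEEE 754 storage format, refinement through the reciprocal square root, and SIMD vectorization. The scalar Python sketch below illustrates only the first two ideas, under stated assumptions. It uses the well-known single-precision "magic constant" 0x5F3759DF for the bit-level seed and Newton-Raphson iterations for refinement; Newton-Raphson is swapped in here in place of the authors' Taylor-series step. The constant, the iteration count, and the function names are not from the paper, and the SIMD (ARM NEON) part is not shown.

```python
import struct

MAGIC = 0x5F3759DF  # classic float32 seed constant (an assumption, not the paper's)


def rsqrt_seed(x: float) -> float:
    """Rough 1/sqrt(x) from the IEEE 754 single-precision bit pattern of x.

    Valid for positive x within the float32 normal range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Shifting right halves the biased exponent; subtracting from MAGIC
    # negates it, which approximates x ** -0.5 (relative error of a few %).
    bits = MAGIC - (bits >> 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fast_sqrt(x: float, iterations: int = 3) -> float:
    """sqrt(x) = x * (1/sqrt(x)), with the reciprocal refined by Newton-Raphson."""
    if x == 0.0:
        return 0.0
    if x < 0.0:
        raise ValueError("fast_sqrt is undefined for negative inputs")
    y = rsqrt_seed(x)
    for _ in range(iterations):
        # Newton step for f(y) = 1/y**2 - x; roughly squares the relative error.
        y = y * (1.5 - 0.5 * x * y * y)
    return x * y
```

Each Newton step roughly squares the relative error of the few-percent seed, so three steps bring it well below 1e-9 in double precision. A SIMD version would apply the same bit manipulation and arithmetic to several lanes at once.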

16.

Mobile satellite reception with a virtual satellite dish based on a reconfigurable multi-processor architecture

M.D. van de Burgwal, K.C. Rovers, K.C.H. Blom, A.B.J. Kokkeler, G.J.M. Smit 《Microprocessors and Microsystems》2011,35(8):716-728

Traditionally, mechanically steered dishes or analog phased array beamforming systems have been used for radio frequency receivers, where strong directivity and high performance were much more important than low cost. Real-time controlled digital phased array beamforming could not be realized because of the high computational requirements and implementation costs. Today, digital hardware has become powerful enough to perform the massive number of operations required for real-time digital beamforming. With the continuously decreasing price per transistor, high-performance signal processing has become available through multi-processor architectures. More and more applications use beamforming to improve the spatial utilization of communication channels, resulting in many dedicated digital architectures for specific applications. With a reconfigurable architecture, a single hardware platform can serve different applications with different processing needs. In this article, we show how a reconfigurable multi-processor system-on-chip based architecture can be used for phased array processing, including an advanced tracking mechanism to continuously receive signals with a mobile satellite receiver. An adaptive beamformer for DVB-S satellite reception is presented that uses an Extended Constant Modulus Algorithm to track satellites. The receiver consists of 8 antennas and is mapped on three reconfigurable Montium TP processors. Using a scenario with a phased array antenna mounted on the roof of a car, we show that the adaptive steering algorithm is robust in dynamic scenarios and correctly demodulates the received signal.
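The abstract's tracker relies on an Extended Constant Modulus Algorithm whose details are not given here. For orientation, the sketch below shows the classic CMA(2,2) update that such variants build on; it is swapped in as a simpler stand-in, not the paper's algorithm. CMA suits DVB-S because its QPSK symbols have constant modulus, so the beamformer can adapt its weights blindly by driving the output modulus toward a target. The steering vector, step size, two-element array, and noise-free single-source scenario are all hypothetical.

```python
import cmath
import math
import random
from typing import List, Sequence


def cma_step(w: List[complex], x: Sequence[complex], mu: float, r2: float = 1.0) -> complex:
    """One classic CMA(2,2) update. Updates w in place and returns y = w^H x."""
    y = sum(wi.conjugate() * xi for wi, xi in zip(w, x))
    err = y * (abs(y) ** 2 - r2)  # zero when |y|^2 equals the target modulus r2
    for i, xi in enumerate(x):
        # Stochastic-gradient step on (|y|^2 - r2)^2 with respect to conj(w).
        w[i] -= mu * xi * err.conjugate()
    return y


def demo(steps: int = 500, mu: float = 0.01, seed: int = 0) -> List[complex]:
    """Noise-free, single-source run with a hypothetical 2-element steering vector."""
    rng = random.Random(seed)
    steering = [2.0 + 0j, 2j]  # hypothetical array response (gain 2 per element)
    qpsk = [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in range(4)]
    w: List[complex] = [1.0 + 0j, 0j]
    outputs = []
    for _ in range(steps):
        s = rng.choice(qpsk)  # unit-modulus transmitted symbol
        x = [a * s for a in steering]
        outputs.append(cma_step(w, x, mu))
    return outputs
```

In this idealized setting the output modulus starts at 2 (the initial weights pass the array gain straight through) and converges toward the target modulus 1. A mobile receiver would run the update continuously so that the weights follow the satellite as the vehicle moves.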

17.

Research on real-time multicore architectures for WCET analysis

Chen Fangyuan, Ding Yajun, Zhang Dongsong, Wu Fei, Ren Xiujiang 《Computer Engineering & Science》2014,36(3):393-398

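With advances in process technology and the continuing expansion and increasing demands of embedded real-time applications, multicore processors are bound to enter the embedded real-time domain thanks to their high performance and low power consumption. However, multicore architectures find it difficult, or even impossible, to satisfy the timing constraints of real-time systems and their requirement for predictable worst-case execution time (WCET). Starting from the shared resources in multicore processors, this paper analyzes how on-chip shared resources (shared caches and on-chip interconnect) and off-chip shared resources (off-chip memory) affect WCET analysis, and discusses WCET analysis methods under various kinds of interference. It introduces two multicore WCET analysis models, a static multicore WCET analysis model and a hybrid multicore WCET analysis model, and proposes multicore design principles for embedded real-time applications.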

18.

SCMP: A Single-Chip Message-Passing Parallel Computer

Baker James M., Gold Brian, Bucciero Mark, Bennett Sidney, Mahajan Rajneesh, Ramachandran Priyadarshini, Shah Jignesh 《The Journal of Supercomputing》2004,30(2):133-149

As technology improves and transistor feature sizes continue to shrink, the effects of on-chip interconnect wire latencies on processor clock speeds will become more important. In addition, as we reach the limits of instruction-level parallelism that can be extracted from application programs, there will be an increased emphasis on thread-level parallelism. To continue to improve performance, computer architects will need to focus on architectures that can efficiently support thread-level parallelism while minimizing the length of on-chip interconnect wires. The SCMP (Single-Chip Message-Passing) parallel computer system is one such architecture. The SCMP system includes up to 64 processors on a single chip, connected in a 2-D mesh with nearest neighbor connections. Memory is included on-chip with the processors and the architecture includes hardware support for communication and the execution of parallel threads. Since there are no global signals or shared resources between the processors, the length of the interconnect wires will be determined by the size of the individual processors, not the size of the entire chip. Avoiding long interconnect wires will allow the use of very high clock frequencies, which, when coupled with the use of multiple processors, will offer tremendous computational power.
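SCMP connects each processor only to its nearest neighbors in a 2-D mesh (up to 64 processors, for example an 8 × 8 grid). The abstract does not say how messages are routed across the mesh; dimension-order (XY) routing, sketched below, is one common choice for such meshes and is an assumption here, as are the function name and default mesh size. The point it illustrates is that a message only ever crosses short neighbor-to-neighbor links, with a hop count equal to the Manhattan distance between source and destination.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) position of a processor in the mesh


def xy_route(src: Node, dst: Node, width: int = 8, height: int = 8) -> List[Node]:
    """Dimension-order (XY) route: travel along x first, then along y.

    Returns the list of nodes visited, including src and dst. Every step
    moves to an adjacent node, so only nearest-neighbor wires are used.
    """
    for x, y in (src, dst):
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"node {(x, y)} is outside the {width}x{height} mesh")
    x, y = src
    path = [src]
    while x != dst[0]:  # first resolve the x offset
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then resolve the y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

For instance, a message from (0, 0) to (3, 2) takes five one-hop steps: three along x, then two along y. Because each hop spans only one processor's width, wire length does not grow with chip size, which matches the abstract's argument.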

19.

Design and implementation of processing elements in a network processor

Li Cheng, Li Huawei 《Computer Engineering》2007,33(2):252-254

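With the rapid growth of network bandwidth and the constant emergence of new network applications, the existing Internet architecture based on general-purpose processors and ASICs can no longer meet new demands. Network processors, which combine strong processing power with flexible programmability and configurability, are increasingly widely used. High-performance network processors usually employ multiple concurrent processing elements for fast data-plane processing, and these processing elements are at the core of the network processor. This paper discusses the factors to consider when designing processing elements for network processors, designs a relatively flexible and efficient processing-element architecture, and validates it on an FPGA prototype, confirming the feasibility of the architecture.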

20.

A verification framework for the network-on-chip of a brain-inspired processor

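Chen Xiaofan, Yang Zhijie, Peng Linghui, Wang Shiying, Zhou Gan, Li Shiming, Kang Ziyang, Wang Yao, Shi Wei, Wang Lei 《Computer Engineering & Science》2022,44(5):769-778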

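In recent years, as Moore's law has slowed, traditional architectures have increasingly run up against the "memory wall" and the "power wall". Many new computing paradigms and architectures have emerged, among them brain-inspired computing. Because it integrates storage and computation, brain-inspired computing has begun to break through the memory-wall and power-wall limits imposed by the von Neumann architecture, and brain-inspired algorithms have been applied efficiently on brain-inspired processors. For large-scale biological neural network applications, the scalability of multicore brain-inspired processors must be improved while maintaining high data throughput and low transmission latency. Most multicore brain-inspired processors today use a network-on-chip (NoC) as their interconnect, yet there has been relatively little research on verifying such NoCs. Given the importance of the NoC to multicore brain-inspired processors, building a complete and robust NoC functional verification framework is highly significant. This work uses a randomized method to generate the stimulus files needed for behavioral-level and FPGA hardware-level testing, and achieves fairly comprehensive functional verification by efficiently processing the resulting log files.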
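The abstract describes a two-part flow: generate randomized stimulus for the NoC, then process the logs to check delivery. The sketch below illustrates that flow under assumptions; the packet fields, the text formats of the stimulus and receive-log lines, and all function names are hypothetical, not the paper's. Seeding the generator makes every failing run reproducible, and the checker flags lost, duplicated, misrouted, or unknown packets.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Packet:
    pid: int    # unique packet id
    src: int    # injecting core
    dst: int    # destination core
    cycle: int  # injection cycle


def gen_stimulus(n_packets: int, n_cores: int, max_cycle: int, seed: int) -> List[Packet]:
    """Seeded random traffic: the same seed always yields the same stimulus."""
    if n_cores < 2:
        raise ValueError("need at least two cores")
    rng = random.Random(seed)
    packets = []
    for pid in range(n_packets):
        src = rng.randrange(n_cores)
        dst = rng.randrange(n_cores - 1)
        if dst >= src:
            dst += 1  # uniform over all cores except src
        packets.append(Packet(pid, src, dst, rng.randrange(max_cycle)))
    return packets


def to_stimulus_lines(packets: Sequence[Packet]) -> List[str]:
    """Hypothetical stimulus-file format: 'cycle pid src dst', ordered by cycle."""
    ordered = sorted(packets, key=lambda p: (p.cycle, p.pid))
    return [f"{p.cycle} {p.pid} {p.src} {p.dst}" for p in ordered]


def check_log(packets: Sequence[Packet], log_lines: Sequence[str]) -> List[str]:
    """Compare a receive log ('pid dst' per line, hypothetical) with the stimulus."""
    expected: Dict[int, int] = {p.pid: p.dst for p in packets}
    seen: Dict[int, int] = {}
    errors: List[str] = []
    for line in log_lines:
        pid, dst = (int(field) for field in line.split())
        if pid not in expected:
            errors.append(f"unknown packet {pid}")
        elif dst != expected[pid]:
            errors.append(f"packet {pid} arrived at {dst}, expected {expected[pid]}")
        seen[pid] = seen.get(pid, 0) + 1
    for pid in expected:
        count = seen.get(pid, 0)
        if count == 0:
            errors.append(f"packet {pid} lost")
        elif count > 1:
            errors.append(f"packet {pid} delivered {count} times")
    return errors
```

In a real flow the stimulus lines would drive a behavioral simulation or the FPGA prototype, and the receive log would come from the design's monitors rather than being constructed directly from the stimulus.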