1.

Design of the LS SIMD Coprocessor Controller (total citations: 1; self-citations: 1; citations by others: 0)

Zhou Guochang, Wang Zhong, Che Deliang, Feng Guochen 《计算机应用研究》 (Application Research of Computers) 2005,22(7):99-100

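The LS SIMD coprocessor is a 16-bit fixed-point embedded array processor for low-level image understanding. Beyond the data parallelism inherent in SIMD, it offers parallelism from a three-stage pipeline and from the concurrent execution of three instruction groups. This paper describes the design of a largely reusable main controller for the coprocessor's three-stage pipeline and three-group concurrent instruction execution.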

2.

Efficient multimedia coprocessor with enhanced SIMD engines for exploiting ILP and DLP

《Parallel Computing》2013,39(10):586-602

Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code mixed with data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor integrates single-instruction multiple-data (SIMD) engines into architectures exploiting ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.

3.

An Improved Embedded SIMD Coprocessor Design (total citations: 1; self-citations: 0; citations by others: 1)

Zhou Guochang, Wang Zhong, Che Deliang, Feng Guochen 《计算机工程与应用》 (Computer Engineering and Applications) 2004,40(31):13-16

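The SIMD coprocessor presented in this paper is a 16-bit fixed-point embedded array processor for low-level image understanding. It adopts a load/store architecture and, beyond the data parallelism inherent in SIMD, offers parallelism from a three-stage pipeline and from the concurrent execution of three instruction groups. Executing the three instruction groups concurrently lets data-exchange operations run alongside other kinds of operations, so data exchange is effectively hidden, which greatly reduces communication and I/O overhead.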

4.

Research on an FPGA-Based Dynamically Reconfigurable Architecture (total citations: 1; self-citations: 0; citations by others: 1)

Cai Qixian, Cai Hongbo, Huang Xiaolu, Cai Qizhong 《计算机应用》 (Journal of Computer Applications) 2006,26(7):1741-1743

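This paper proposes a design for an FPGA-based dynamically reconfigurable system. The system acts as a coprocessor to the LEON2 general-purpose processor, forming a main/coprocessor structure, and uses registers and a network to store and transfer the data stream and the configuration stream, so that the strengths of the two complement each other. The design is validated by concrete experiments.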

5.

Processing in memory: the Terasys massively parallel PIM array

Gokhale M. Holmes B. Iobst K. 《Computer》1995,28(4):23-31

SRC researchers have designed and fabricated a processor-in-memory (PIM) chip, a standard 4-bit memory augmented with a single-bit ALU controlling each column of memory. In principle, PIM chips can replace the memory of any processor, including a supercomputer. To validate the notion of integrating SIMD computing into conventional processors on a more modest scale, we have built a half dozen Terasys workstations, which are Sun Microsystems Sparcstation-2 workstations in which 8 megabytes of address space consist of PIM memory holding 32K single-bit ALUs. We have designed and implemented a high-level parallel language, called data parallel bit C (dbC), for Terasys and demonstrated that dbC applications using the PIM memory as a SIMD array run at the speed of multiple Cray-YMP processors. Thus, we can deliver supercomputer performance for a small fraction of supercomputer cost. Since the successful creation of the Terasys research prototype, we have begun work on processing in memory in a supercomputer setting. In a collaborative research project, we are working with Cray Computer to incorporate a new Cray-designed implementation of the PIM chips into two octants of Cray-3 memory.

6.

Efficient Index Set Generation for Compiling HPF Array Statements on Distributed-Memory Machines

S.D. Kaushik C.-H. Huang P. Sadayappan 《Journal of Parallel and Distributed Computing》1996,38(2):237

In languages such as High Performance Fortran (HPF), array statements are used to express data parallelism. In compiling array statements for distributed-memory machines, efficient enumeration of local index sets and communication sets is important. A method based on a virtual processor approach has been proposed for efficient index set enumeration for array statements involving arrays distributed using block-cyclic distributions. The virtual processor approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to the physical processors. The key idea of the method is to first develop closed forms in terms of simple regular sections for the index sets for arrays distributed using block or cyclic distributions. These closed forms are then used with the virtual processor approach to give an efficient solution for arrays with the block-cyclic distribution. HPF supports a two-level mapping of arrays to processors. Arrays are first aligned with a template at an offset and a stride and the template is then distributed among the processors using a regular data distribution. The introduction of a nonunit stride in the alignment creates “holes” in the distributed arrays, which leads to memory wastage. In this paper, using simple mathematical properties of regular sections, we extend the virtual processor approach to address the memory allocation and index set enumeration problems for array statements involving arrays mapped using the two-level mapping. We develop a methodology for translating the closed forms for block and cyclically distributed arrays mapped using a one-level mapping to closed forms for arrays mapped using the two-level mapping. Using these closed forms, the virtual processor approach is extended to handle array statements involving arrays mapped using two-level mappings. Performance results on the Cray T3D are presented to demonstrate the efficacy of the extensions and identify various trade-offs associated with the proposed method.
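
The virtual processor view in the abstract above can be illustrated with a minimal Python sketch. It covers only the simplest case (a one-level block-cyclic mapping with unit stride, no alignment offset) and is not the paper's algorithm; the function names `owner` and `local_indices` and the parameters `n`, `b`, `p` are assumptions introduced here for illustration.

```python
def owner(i, b, p):
    """Physical processor owning global index i under block-cyclic(b) on p processors."""
    return (i // b) % p


def local_indices(proc, n, b, p):
    """Global indices of an n-element array that land on physical processor `proc`.

    The block-cyclic(b) layout is viewed as a block distribution over virtual
    processors: virtual processor v holds the contiguous block [v*b, (v+1)*b),
    and virtual processors are mapped cyclically (v % p) to physical ones.
    """
    num_virtual = -(-n // b)  # ceil(n / b) virtual processors (hypothetical name)
    indices = []
    # Virtual processors proc, proc + p, proc + 2p, ... belong to `proc`.
    for v in range(proc, num_virtual, p):
        lo = v * b
        hi = min(lo + b, n)  # the last block may be partial
        indices.extend(range(lo, hi))  # one regular section lo:hi-1:1 per block
    return indices
```

For example, with `n=10`, `b=2`, `p=3`, processor 1 is assigned virtual processors 1 and 4 and therefore the blocks starting at global indices 2 and 8. Each block is a simple regular section, which is what makes a closed-form description possible in this simple case.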

7.

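A Parallel Configurable Application-Specific Instruction Coprocessor for ECC (total citations: 2; self-citations: 1; citations by others: 1)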

Zhong Xianhai, Xu Jinfu, Yan Yingjian 《计算机工程》 (Computer Engineering) 2009,35(5):153-155

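Using a combined hardware/software approach, this paper presents a VLIW-based parallel, configurable application-specific instruction coprocessor architecture for elliptic curve cryptography (ECC). The coprocessor uses a parallel scheduling algorithm for point addition and point doubling, and its functional-unit microarchitecture follows a reconfigurable design, giving it high flexibility and relatively high speed. It supports variable-parameter Weierstrass curves over the finite fields GF(p) and GF(2^n) with scalable field width, and its signature and authentication algorithms can be upgraded. Experimental results show that a 192-bit ECC point multiplication over GF(p) takes only 0.32 ms, which is 116% to 350% faster than comparable chips.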

8.

A SAT Solver Using Reconfigurable Hardware and Virtual Logic (total citations: 1; self-citations: 0; citations by others: 1)

Miron Abramovici Jose T. De Sousa 《Journal of Automated Reasoning》2000,24(1-2):5-36

In this paper, we present the architecture of a new SAT solver using reconfigurable logic and a virtual logic scheme. Our main contributions include new forms of massive fine-grain parallelism, structured design techniques based on iterative logic arrays that reduce compilation times from hours to minutes, and a decomposition technique that creates independent subproblems that may be concurrently solved by unconnected FPGAs. The decomposition technique is the basis of the virtual logic scheme, since it allows solving problems that exceed the hardware capacity. Our architecture is easily scalable. Our results show several orders of magnitude speedup compared with a state-of-the-art software implementation, and also with respect to prior SAT solvers using reconfigurable hardware.

9.

A VPN Security Gateway Based on a Reconfigurable Cryptographic Module

Chu Yourui, Wang Zhiyuan, Ouyang Dan 《计算机工程》 (Computer Engineering) 2011,37(5):152-154

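Combining system-on-a-programmable-chip and IPSec technology, this paper designs a virtual private network security gateway based on a reconfigurable cryptographic processing module. The gateway uses a dual-processor structure: the main processor handles initial configuration of the system chip, system control, management, and packet preprocessing; the coprocessor performs IPSec processing; and the reconfigurable cryptographic processing module accelerates encryption and decryption, improving algorithm execution efficiency while extending the security of the IPSec protocol. Experimental results show that the gateway...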

10.

Automatic Synthesis of FPGA Processor Arrays from Loop Algorithms

Marcus Bednara Jürgen Teich 《The Journal of supercomputing》2003,26(2):149-165

We consider the problem of automatic mapping of computation-intensive loop nests onto FPGA hardware. The regular cell array structure of these chips reflects the parallelism in regular loop-like computations. Furthermore, the flexibility of FPGAs allows the cost-effective implementation of reconfigurable high performance processor arrays. So far, there exists no continuous design flow that allows automated generation of FPGA configuration data from a loop nest specified in a high level language. Here, we present a methodology for automatic generation of synthesizable VHDL code specifying a processor array and optimized for FPGA implementation.

11.

The UCSC Kestrel parallel processor

Di Blas A. Dahle D.M. Diekhans M. Grate L. Hirschberg J. Karplus K. Keller H. Kendrick M. Mesa-Martinez F.J. Pease D. Rice E. Schultz A. Speck D. Hughey R. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(1):80-92

The architectural landscape of high-performance computing stretches from superscalar uniprocessor to explicitly parallel systems, to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.

12.

The multi-dataflow composer tool: generation of on-the-fly reconfigurable platforms

Francesca Palumbo Nicola Carta Danilo Pani Paolo Meloni Luigi Raffo 《Journal of Real-Time Image Processing》2014,9(1):233-249

Dataflow specifications are suitable for describing both signal processing applications and the related specialized hardware architectures, helping to close the hardware–software development gap. They can be exploited for the development of automatic tools aimed at the integration of multiple applications on the same coarse-grained computational substrate. In this paper, the multi-dataflow composer (MDC) tool, a novel automatic platform builder exploiting dataflow specifications for the creation of run-time reconfigurable multi-application systems, is presented and evaluated. In order to prove the effectiveness of the adopted approach, a coprocessor for still image and video processing acceleration has been assembled and implemented on both FPGA and 90 nm ASIC technology. Savings of 60% in both area occupancy and power consumption can be achieved with the MDC-generated coprocessor compared to an equivalent non-reconfigurable design, without performance losses. Thanks to the generality of the high-level dataflow specification approach, this tool can be successfully applied in different application domains.

13.

System-level performance evaluation of reconfigurable processors

《Microprocessors and Microsystems》2005,29(2-3):63-73

Reconfigurable architectures that tightly integrate a standard CPU core with a field-programmable hardware structure have recently been receiving increased attention. The design of such a hybrid reconfigurable processor involves a multitude of design decisions regarding the field-programmable structure as well as its system integration with the CPU core. Determining the impact of these design decisions on the overall system performance is a challenging task. In this paper, we first present a framework for the cycle-accurate performance evaluation of hybrid reconfigurable processors on the system level. Then, we discuss a reconfigurable processor for data-streaming applications, which attaches a coarse-grained reconfigurable unit to the coprocessor interface of a standard embedded CPU core. By means of a case study we evaluate the system-level impact of certain design features for the reconfigurable unit, such as multiple contexts, register replication, and hardware context scheduling. The results illustrate that a system-level evaluation framework is of paramount importance for studying the architectural trade-offs and optimizing design parameters for reconfigurable processors.

14.

Top-Performance Tokenization and Small-Ruleset Regular Expression Matching

Daniele Paolo Scarpazza 《International journal of parallel programming》2011,39(1):3-32

In the last decade, the volume of unstructured data that Internet and enterprise applications create and consume has been growing at impressive rates. The tools we use to process these data are search engines, business analytics suites, natural-language processors and XML processors. These tools rely on tokenization, a form of regular expression matching aimed at extracting words and keywords in a character stream. The further growth of unstructured data-processing paradigms depends critically on the availability of high-performance tokenizers. Despite the impressive amount of parallelism that the multi-core revolution has made available (in terms of multiple threads and wider SIMD units), most applications employ tokenizers that do not exploit this parallelism. I present a technique to design tokenizers that exploit multiple threads and wide SIMD units to process multiple independent streams of data at a high throughput. The technique benefits indefinitely from any future scaling in the number of threads or SIMD width. I show the approach’s viability by presenting a family of tokenizer kernels optimized for the Cell/B.E. processor that deliver a performance seen, so far, only on dedicated hardware. These kernels deliver a peak throughput of 14.30 Gbps per chip, and a typical throughput of 9.76 Gbps on Wikipedia input. Also, they achieve almost-ideal resource utilization (99.2%). The approach is applicable to any SIMD enabled processor and matches well the trend toward wider SIMD units in contemporary architecture design.
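
The idea of processing multiple independent streams side by side can be sketched in plain Python. This is only an illustration under stated assumptions, not the paper's Cell/B.E. kernels: the abstract does not describe the kernel internals, so the two-state automaton, the character classes, and the names `char_class`, `NEXT_STATE`, and `tokenize_lockstep` are all hypothetical, and the inner loop over lanes merely models what a SIMD unit would do in one step.

```python
WORD, SPACE = 0, 1  # hypothetical two-class alphabet


def char_class(c):
    """Classify a character: alphanumerics form tokens, everything else separates them."""
    return WORD if c.isalnum() else SPACE


# States: 0 = between tokens, 1 = inside a token.
NEXT_STATE = {(0, WORD): 1, (0, SPACE): 0, (1, WORD): 1, (1, SPACE): 0}


def tokenize_lockstep(streams):
    """Tokenize several independent streams, advancing all of them one position per step."""
    n = len(streams)
    states = [0] * n   # one automaton state per lane
    starts = [0] * n   # start offset of the token currently open in each lane
    tokens = [[] for _ in range(n)]
    longest = max((len(s) for s in streams), default=0)
    for pos in range(longest + 1):
        for lane in range(n):  # models the SIMD lanes; each lane is independent
            s = streams[lane]
            if pos > len(s):
                continue  # this lane is already finished
            # One virtual separator past the end closes any open token.
            cls = char_class(s[pos]) if pos < len(s) else SPACE
            new_state = NEXT_STATE[(states[lane], cls)]
            if states[lane] == 0 and new_state == 1:
                starts[lane] = pos
            elif states[lane] == 1 and new_state == 0:
                tokens[lane].append(s[starts[lane]:pos])
            states[lane] = new_state
    return tokens
```

Because every lane performs the same table lookup at each step and no lane depends on another, the per-position work is uniform across streams, which is the property that lets such a loop map onto wider SIMD units or more threads.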

15.

A sliding memory plane array processor

Sunwoo M.H. Aggarwal J.K. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(6):601-612

A mesh-connected single-instruction multiple-data (SIMD) architecture called a sliding memory plane (SliM) array processor is proposed. Differing from existing mesh-connected SIMD architectures, SliM has several salient features such as a sliding memory plane that provides inter-PE communication during computation. Two I/O planes provide an I/O overlapping capability. Thus, inter-PE communication and I/O overhead can be overlapped with computation. Inter-PE communication time is invisible in most image processing tasks because the computation time is larger than the communication time on SliM. The ability to overlap inter-PE communication with computation, regardless of window size and shape and without using a coprocessor or an on-chip DMA controller, is unique to SliM.

16.

Research on Reconfigurable Programming Technology Based on Impulse-C

Mao Xingquan 《数字社区&智能家居》 (Digital Community & Smart Home) 2009,5(2):991-993

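Reconfigurable computing research uses highly flexible computing fabrics for high-performance computing. In recent years, much research has used FPGA devices to build reconfigurable computing platforms. FPGA programming based on high-level languages lets software engineers set hardware concerns aside and concentrate on implementing algorithms. The Impulse C toolset is a relatively simple, C-based approach to hardware/software partitioning and hardware/software co-design; combined with an efficient FPGA hardware compiler, it forms a complete method for mixed processor-and-FPGA implementations. These tools greatly simplify the design of reconfigurable components, but they still fall short of hand-crafted designs in efficiency and circuit optimization.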

17.

Design of an Optimizing Compiler for SIMD Computers (total citations: 1; self-citations: 1; citations by others: 0)

Zhao Hui, Huang Shi 《计算机工程》 (Computer Engineering) 2009,35(1):201-203

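Making use of the processor's resources to improve the compiler's optimization performance and the adaptability of the code is key to optimizing compilation for SIMD processors. Based on the M language and the LS SIMD architecture, and drawing on modern compiler techniques, this paper proposes optimization and implementation methods for a SIMD coprocessor compiler, including register allocation, single-value merging, and code compaction. Experimental results show that the generated target code is correct and efficient.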

18.

The Splash 2 software environment

Jeffrey M. Arnold 《The Journal of supercomputing》1995,9(3):277-290

19.

Global Automatic Data Partitioning for SIMD Machines

Lin Jin, Zhu Ningning, Zhang Zhaoqing, Qiao Ruliang 《计算机学报》 (Chinese Journal of Computers) 1999,22(6):596-602

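This paper proposes a global automatic data partitioning algorithm for SIMD machines. The algorithm handles multiple imperfectly nested loop nests in which array subscripts are linear expressions of the loop variables. It first abstracts the communication patterns in the computation through data and iteration mappings, then gives formal conditions for recognizing regular communication patterns, and then builds a data-iteration graph that records alignment information and the associated communication costs. Based on this graph, a heuristic algorithm computes a good data distribution and iteration distribution so as to reduce the communication overhead among processing elements. By analyzing the multiple array mappings involved in multiple loop nests...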

20.

The Concept and Design Principles of a Reconfigurable Cryptographic Coprocessor

Qu Yingjie 《计算机工程与应用》 (Computer Engineering and Applications) 2003,39(12):7-9,19

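This paper introduces the concept of a reconfigurable cryptographic coprocessor and discusses its design principles. A reconfigurable cryptographic coprocessor is a cryptographic processing unit whose internal logic circuit structure and function can be changed flexibly; under the control and drive of a main processor, it can implement many different cryptographic operations flexibly and quickly to meet the needs of different cryptographic algorithms. A reconfigurable cryptographic system built on such a coprocessor is flexible, fast, and secure, and has good application prospects in areas such as secure communications and network security.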