Found 20 similar documents; search took 31 ms.
1.
2.
《Parallel Computing》2013,39(10):586-602
Multimedia applications have become increasingly important in daily computing. These applications are composed of heterogeneous regions of code that mix data-level parallelism (DLP) and instruction-level parallelism (ILP). A standard solution for a multimedia coprocessor assembles single-instruction multiple-data (SIMD) engines into architectures that exploit ILP at compile time, such as very long instruction word (VLIW) and transport triggered architecture (TTA). However, the ILP regions fail to scale with the increased vector length needed to achieve high performance in the DLP regions. Furthermore, the register-to-register nature of SIMD instructions causes current SIMD engines to have limitations in handling memory alignment, data reorganization, and control flow. Many supporting instructions, such as data permutations, address generations, and loop branches, are required to aid in the execution of the real SIMD computation instructions. To mitigate these problems, we propose optimized SIMD engines that combine VLIW or TTA processing with unified scalar and long-vector computation as well as efficient SIMD hardware for real computation. Our new architecture is based on TTA and is called multimedia coprocessor (MCP). This architecture includes the following features: (1) a simple coprocessor structure with 8-way TTA, (2) cost-effective SIMD hardware capable of performing floating-point operations, (3) long vector capabilities built upon existing SIMD hardware and a single register file and processor data path for both scalar operands and vector elements, and (4) an optimized SIMD architecture that addresses the SIMD limitations. Our experimental evaluations show that MCP can outperform conventional SIMD techniques by an average of 39% and 12% in performance for multimedia kernels and applications, respectively.
3.
An Improved Embedded SIMD Coprocessor Design  Total citations: 1, self-citations: 0, citations by others: 1
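The SIMD coprocessor presented in this paper is a 16-bit fixed-point embedded array processor for low-level image understanding. The coprocessor uses a load/store architecture and, beyond the data parallelism inherent to SIMD, provides parallelism through a three-stage pipeline and the concurrent execution of three groups of instructions. Because the three instruction groups execute concurrently, data-exchange operations run alongside other types of operations, so data exchange is effectively hidden, which greatly reduces communication and I/O overhead.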
4.
5.
SRC researchers have designed and fabricated a processor-in-memory (PIM) chip, a standard 4-bit memory augmented with a single-bit ALU controlling each column of memory. In principle, PIM chips can replace the memory of any processor, including a supercomputer. To validate the notion of integrating SIMD computing into conventional processors on a more modest scale, we have built a half dozen Terasys workstations, which are Sun Microsystems Sparcstation-2 workstations in which 8 megabytes of address space consist of PIM memory holding 32K single-bit ALUs. We have designed and implemented a high-level parallel language, called data parallel bit C (dbC), for Terasys and demonstrated that dbC applications using the PIM memory as a SIMD array run at the speed of multiple Cray-YMP processors. Thus, we can deliver supercomputer performance for a small fraction of supercomputer cost. Since the successful creation of the Terasys research prototype, we have begun work on processing in memory in a supercomputer setting. In a collaborative research project, we are working with Cray Computer to incorporate a new Cray-designed implementation of the PIM chips into two octants of Cray-3 memory.
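As a rough illustration of how an array of single-bit ALUs can operate on multi-bit data, the sketch below performs bit-serial addition across many memory columns at once. It is not taken from the Terasys or dbC implementation: the bit-plane layout and the helper names (`to_planes`, `from_planes`, `bit_serial_add`) are assumptions made for the example, and a Python integer stands in for one row of bits, one bit per column.

```python
def to_planes(values, width):
    """Pack per-column integers into `width` bit-planes (hypothetical layout).

    Plane k is an int whose bit `col` holds bit k of values[col].
    """
    planes = []
    for k in range(width):
        plane = 0
        for col, v in enumerate(values):
            plane |= ((v >> k) & 1) << col
        planes.append(plane)
    return planes


def from_planes(planes, ncols):
    """Unpack bit-planes back into one integer per column."""
    return [
        sum(((plane >> col) & 1) << k for k, plane in enumerate(planes))
        for col in range(ncols)
    ]


def bit_serial_add(a_planes, b_planes):
    """Add two operands one bit position per step, all columns in parallel.

    Each step applies a full adder to every column simultaneously,
    which is what a row of single-bit ALUs would do.
    """
    carry = 0
    out = []
    for a, b in zip(a_planes, b_planes):
        out.append(a ^ b ^ carry)                 # sum bit for every column
        carry = (a & b) | (carry & (a ^ b))       # carry bit for every column
    out.append(carry)                             # final carry-out plane
    return out


if __name__ == "__main__":
    a = to_planes([3, 5, 7], width=3)
    b = to_planes([1, 2, 3], width=3)
    print(from_planes(bit_serial_add(a, b), ncols=3))
```

The number of steps grows with the operand width rather than with the number of columns, which is why a very wide array of simple ALUs can still deliver high aggregate throughput.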
6.
S.D. Kaushik C.-H. Huang P. Sadayappan 《Journal of Parallel and Distributed Computing》1996,38(2):237
In languages such as High Performance Fortran (HPF), array statements are used to express data parallelism. In compiling array statements for distributed-memory machines, efficient enumeration of local index sets and communication sets is important. A method based on a virtual processor approach has been proposed for efficient index set enumeration for array statements involving arrays distributed using block-cyclic distributions. The virtual processor approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to the physical processors. The key idea of the method is to first develop closed forms in terms of simple regular sections for the index sets for arrays distributed using block or cyclic distributions. These closed forms are then used with the virtual processor approach to give an efficient solution for arrays with the block-cyclic distribution. HPF supports a two-level mapping of arrays to processors. Arrays are first aligned with a template at an offset and a stride and the template is then distributed among the processors using a regular data distribution. The introduction of a nonunit stride in the alignment creates “holes” in the distributed arrays which leads to memory wastage. In this paper, using simple mathematical properties of regular sections, we extend the virtual processor approach to address the memory allocation and index set enumeration problems for array statements involving arrays mapped using the two-level mapping. We develop a methodology for translating the closed forms for block and cyclically distributed arrays mapped using a one-level mapping to closed forms for arrays mapped using the two-level mapping. Using these closed forms, the virtual processor approach is extended to handle array statements involving arrays mapped using two-level mappings.
Performance results on the Cray T3D are presented to demonstrate the efficacy of the extensions and identify various trade-offs associated with the proposed method.
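The virtual-processor view of a block-cyclic distribution can be sketched in a few lines. The example below is only an illustration under simplifying assumptions (a one-dimensional array, one-level mapping, no alignment stride); the function names `owner` and `local_indices` are hypothetical and do not come from the paper. Each block of size `block` is treated as one virtual processor, and virtual processors are dealt out cyclically to the physical processors.

```python
def owner(i, block, nprocs):
    """Physical processor that owns global index i under a block-cyclic layout."""
    return (i // block) % nprocs


def local_indices(p, n, block, nprocs):
    """Global indices owned by physical processor p, enumerated block by block.

    Virtual processor v holds the contiguous block [v*block, (v+1)*block);
    physical processor p hosts virtual processors p, p+nprocs, p+2*nprocs, ...
    """
    nvirtual = -(-n // block)  # ceil(n / block) virtual processors
    indices = []
    for v in range(p, nvirtual, nprocs):
        lo = v * block
        hi = min(lo + block, n)  # the last block may be partial
        indices.extend(range(lo, hi))
    return indices


if __name__ == "__main__":
    # 10 elements, blocks of 2, dealt out over 3 processors
    for p in range(3):
        print(p, local_indices(p, n=10, block=2, nprocs=3))
```

Enumerating by virtual processor avoids testing every global index for ownership: each processor visits only the blocks it holds.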
7.
8.
A SAT Solver Using Reconfigurable Hardware and Virtual Logic  Total citations: 1, self-citations: 0, citations by others: 1
In this paper, we present the architecture of a new SAT solver using reconfigurable logic and a virtual logic scheme. Our main contributions include new forms of massive fine-grain parallelism, structured design techniques based on iterative logic arrays that reduce compilation times from hours to minutes, and a decomposition technique that creates independent subproblems that may be concurrently solved by unconnected FPGAs. The decomposition technique is the basis of the virtual logic scheme, since it allows solving problems that exceed the hardware capacity. Our architecture is easily scalable. Our results show several orders of magnitude speedup compared with a state-of-the-art software implementation, and also with respect to prior SAT solvers using reconfigurable hardware.
9.
10.
We consider the problem of automatic mapping of computation-intensive loop nests onto FPGA hardware. The regular cell array structure of these chips reflects the parallelism in regular loop-like computations. Furthermore, the flexibility of FPGAs allows the cost-effective implementation of reconfigurable high performance processor arrays. So far, there exists no continuous design flow that allows automated generation of FPGA configuration data from a loop nest specified in a high level language. Here, we present a methodology for automatic generation of synthesizable VHDL code specifying a processor array and optimized for FPGA implementation.
11.
Di Blas A. Dahle D.M. Diekhans M. Grate L. Hirschberg J. Karplus K. Keller H. Kendrick M. Mesa-Martinez F.J. Pease D. Rice E. Schultz A. Speck D. Hughey R. 《Parallel and Distributed Systems, IEEE Transactions on》2005,16(1):80-92
The architectural landscape of high-performance computing stretches from superscalar uniprocessor to explicitly parallel systems, to dedicated hardware implementations of algorithms. Single-purpose hardware can achieve the highest performance and uniprocessors can be the most programmable. Between these extremes, programmable and reconfigurable architectures provide a wide range of choice in flexibility, programmability, computational density, and performance. The UCSC Kestrel parallel processor strives to attain single-purpose performance while maintaining user programmability. Kestrel is a single-instruction stream, multiple-data stream (SIMD) parallel processor with a 512-element linear array of 8-bit processing elements. The system design focuses on efficient high-throughput DNA and protein sequence analysis, but its programmability enables high performance on computational chemistry, image processing, machine learning, and other applications. The Kestrel system has had unexpected longevity in its utility due to a careful design and analysis process. Experience with the system leads to the conclusion that programmable SIMD architectures can excel in both programmability and performance. This work presents the architecture, implementation, applications, and observations of the Kestrel project at the University of California at Santa Cruz.
12.
Francesca Palumbo Nicola Carta Danilo Pani Paolo Meloni Luigi Raffo 《Journal of Real-Time Image Processing》2014,9(1):233-249
Dataflow specifications are suitable to describe both signal processing applications and the related specialized hardware architectures, helping to close the hardware–software development gap. They can be exploited for the development of automatic tools aimed at the integration of multiple applications on the same coarse-grained computational substrate. In this paper, the multi-dataflow composer (MDC) tool, a novel automatic platform builder exploiting dataflow specifications for the creation of run-time reconfigurable multi-application systems, is presented and evaluated. In order to prove the effectiveness of the adopted approach, a coprocessor for still image and video processing acceleration has been assembled and implemented on both FPGA and 90 nm ASIC technology. Savings of 60% in both area occupancy and power consumption can be achieved with the MDC-generated coprocessor compared to an equivalent non-reconfigurable design, without performance losses. Thanks to the generality of the high-level dataflow specification approach, this tool can be successfully applied in different application domains.
13.
《Microprocessors and Microsystems》2005,29(2-3):63-73
Reconfigurable architectures that tightly integrate a standard CPU core with a field-programmable hardware structure have recently been receiving increased attention. The design of such a hybrid reconfigurable processor involves a multitude of design decisions regarding the field-programmable structure as well as its system integration with the CPU core. Determining the impact of these design decisions on the overall system performance is a challenging task. In this paper, we first present a framework for the cycle-accurate performance evaluation of hybrid reconfigurable processors on the system level. Then, we discuss a reconfigurable processor for data-streaming applications, which attaches a coarse-grained reconfigurable unit to the coprocessor interface of a standard embedded CPU core. By means of a case study we evaluate the system-level impact of certain design features for the reconfigurable unit, such as multiple contexts, register replication, and hardware context scheduling. The results illustrate that a system-level evaluation framework is of paramount importance for studying the architectural trade-offs and optimizing design parameters for reconfigurable processors.
14.
In the last decade, the volume of unstructured data that Internet and enterprise applications create and consume has been
growing at impressive rates. The tools we use to process these data are search engines, business analytics suites, natural-language
processors and XML processors. These tools rely on tokenization, a form of regular expression matching aimed at extracting
words and keywords in a character stream. The further growth of unstructured data-processing paradigms depends critically
on the availability of high-performance tokenizers. Despite the impressive amount of parallelism that the multi-core revolution
has made available (in terms of multiple threads and wider SIMD units), most applications employ tokenizers that do not exploit
this parallelism. I present a technique to design tokenizers that exploit multiple threads and wide SIMD units to process
multiple independent streams of data at a high throughput. The technique benefits indefinitely from any future scaling in
the number of threads or SIMD width. I show the approach’s viability by presenting a family of tokenizer kernels optimized
for the Cell/B.E. processor that deliver a performance seen, so far, only on dedicated hardware. These kernels deliver a peak
throughput of 14.30 Gbps per chip, and a typical throughput of 9.76 Gbps on Wikipedia input. Also, they achieve almost-ideal
resource utilization (99.2%). The approach is applicable to any SIMD enabled processor and matches well the trend toward wider
SIMD units in contemporary architecture design.
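The idea of advancing several independent streams through the same automaton in lockstep can be sketched as follows. This is a scalar Python emulation written for illustration, not the Cell/B.E. kernels described in the abstract: the two-state automaton, the character classes, and the name `tokenize_lockstep` are all assumptions. Each list index plays the role of one SIMD lane, and every lane performs the same table lookup at each step.

```python
def char_class(c):
    """Map a character to a class: 0 = word character, 1 = separator."""
    return 0 if c.isalnum() else 1


# States: 0 = between tokens, 1 = inside a token.
# TRANSITION[state][char_class] -> next state
TRANSITION = [[1, 0], [1, 0]]


def tokenize_lockstep(streams):
    """Tokenize several independent strings, one input position per step.

    All lanes share the same transition table; only the per-lane state,
    token start offset, and output list differ.
    """
    n = len(streams)
    state = [0] * n
    start = [0] * n
    tokens = [[] for _ in range(n)]
    longest = max((len(s) for s in streams), default=0)
    for pos in range(longest + 1):
        for lane in range(n):
            s = streams[lane]
            if pos > len(s):
                continue  # this lane has already been flushed
            # End of input is treated as a separator so the last token is emitted.
            cls = char_class(s[pos]) if pos < len(s) else 1
            nxt = TRANSITION[state[lane]][cls]
            if state[lane] == 0 and nxt == 1:
                start[lane] = pos                      # a token begins here
            elif state[lane] == 1 and nxt == 0:
                tokens[lane].append(s[start[lane]:pos])  # a token just ended
            state[lane] = nxt
    return tokens


if __name__ == "__main__":
    print(tokenize_lockstep(["ab cd", "x,yz!"]))
```

Because the lanes never exchange data, the inner loop over lanes maps naturally onto SIMD lanes or threads, which is the property the abstract relies on for scaling.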
15.
A mesh-connected single-instruction multiple-data (SIMD) architecture called a sliding memory plane (SliM) array processor is proposed. Differing from existing mesh-connected SIMD architectures, SliM has several salient features such as a sliding memory plane that provides inter-PE communication during computation. Two I/O planes provide an I/O overlapping capability. Thus, inter-PE communication and I/O overhead can be overlapped with computation. Inter-PE communication time is invisible in most image processing tasks because the computation time is larger than the communication time on SliM. The ability to overlap inter-PE communication with computation, regardless of window size and shape and without using a coprocessor or an on-chip DMA controller, is unique to SliM.
16.
毛兴权 《数字社区&智能家居》2009,5(2):991-993
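Reconfigurable computing research uses highly flexible computing structures for high-performance computing. In recent years, many studies have used FPGA devices to build reconfigurable computing platforms. FPGA programming based on high-level languages frees software engineers from hardware concerns so that they can concentrate on implementing algorithms. The Impulse C toolset is a relatively simple, C-based approach to hardware/software partitioning and hardware/software co-design; combined with an efficient FPGA hardware compiler, it forms a complete method for implementing hybrid processor and FPGA designs. These tools greatly simplify the design of reconfigurable components, but they still lag behind hand-crafted designs in efficiency and circuit optimization.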
17.
18.
19.
20.
曲英杰 《计算机工程与应用》2003,39(12):7-9,19
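This paper proposes the concept of a reconfigurable cryptographic coprocessor and discusses its design principles. A reconfigurable cryptographic coprocessor is a cryptographic processing unit whose internal logic circuit structure and function can be changed flexibly. Under the control and drive of the host processor, it can implement many different cryptographic operations flexibly and quickly, adapting to the needs of different cryptographic algorithms. Reconfigurable cryptographic systems built on such a coprocessor are flexible, fast, and secure, and have good application prospects in fields such as secure communication and network security.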