Similar Documents
20 similar documents found
1.
As process technology scales into the nanometer regime, more and more functional components can be embedded on a single silicon die, enabling the highly pipelined operation required by multimedia applications. In recent years, system-on-chip designs have migrated from fairly simple single-processor, single-memory designs to relatively complicated systems with multiple processors, on-chip memories, standard peripherals, and other functional blocks. The communication between these IP blocks is becoming the dominant critical path and performance bottleneck of system-on-chip designs. Network-on-chip architectures, such as Virtual Channel (2004), Black-bus (2004), Pirate (2004), AEthereal (2005), and VICHAR (2006), have emerged as promising solutions for future system-on-chip communication architectures. However, these existing architectures all suffer from certain problems, including high area cost, high communication latency, and/or low network throughput. This paper presents a novel network-on-chip architecture, Pipelining Multi-channel Central Caching, to address the shortcomings of the existing architectures. By embedding a central cache into every switch of the network, blocked head packets can be removed from the input buffers and stored in the caches temporarily, thus alleviating head-of-line blocking and deadlock and achieving higher network throughput and lower communication latency without paying the price of higher area cost. Experimental results show that the proposed architecture offers both hardware simplicity and improved system performance compared to existing network-on-chip architectures.

2.
We develop new algorithms and architectures for matrix multiplication on configurable devices. These have reduced energy dissipation and latency compared with state-of-the-art field-programmable gate array (FPGA)-based designs. By profiling well-known designs, we identify "energy hot spots", which are responsible for most of the energy dissipation. Based on this, we develop algorithms and architectures that offer tradeoffs among the number of I/O ports, the number of registers, and the number of PEs. To avoid time-consuming low-level simulations for energy profiling and performance prediction of many alternate designs, we derive functions to represent the impact of algorithm design choices on the system-wide energy dissipation, area, and latency. These functions are used either to optimize the energy performance or to provide tradeoffs for a family of candidate algorithms and architectures. For selected designs, we perform extensive low-level simulations using state-of-the-art tools and target FPGA devices. We show a design space for matrix multiplication on FPGAs that results in tradeoffs among energy, area, and latency. For example, our designs improve the energy performance of state-of-the-art FPGA-based designs by 29%-51% without any increase in the area-latency product. The latency of our designs is reduced to between one-third and one-fifteenth of the state-of-the-art, while area increases by a factor of 1.9-9.4. In terms of comprehensive metrics such as Energy-Area-Time, our designs outperform the state-of-the-art by 50%-79%.
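
As a hedged illustration of the kind of comprehensive metric mentioned above, the following Python sketch ranks hypothetical candidate designs by their Energy-Area-Time (EAT) product; the design names and numbers are invented for illustration and are not taken from the paper.

    def energy_area_time(designs):
        """Rank candidate designs by the Energy-Area-Time (EAT) product; lower is better.
        `designs` maps a design name to an (energy, area, latency) tuple."""
        return sorted(designs.items(), key=lambda kv: kv[1][0] * kv[1][1] * kv[1][2])

    # Hypothetical example comparing two matrix-multiply design points.
    candidates = {
        "baseline":  (1.00, 5000, 1.0e-6),   # energy (J), area (slices), latency (s)
        "optimized": (0.60, 9500, 0.3e-6),   # less energy and latency, more area
    }
    for name, (e, a, t) in energy_area_time(candidates):
        print(f"{name}: EAT = {e * a * t:.3e}")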

3.
The variable block-size motion estimation (VBSME) process occupies a major part of the computation of an H.264 encoder and is usually accelerated by bit-parallel hardware architectures with large I/O bit widths to meet real-time constraints. However, such architectures increase the area overhead and pin count, and are therefore unsuitable for area-constrained consumer electronics such as small portable multimedia devices. This paper addresses this problem by proposing two area-efficient least significant bit (LSB) bit-serial architectures with small pin counts. Both designs exploit data reuse, in different ways, for the sum of absolute differences (SAD) computation and for reading reference pixels, leading to a considerable reduction in memory bandwidth. The first architecture propagates the partial SAD and sum results and broadcasts the reference pixel rows, whereas the second design reuses the SADs of small blocks and has a reconfigurable reference buffer, leading to better memory bandwidth when hardware parallelism is used. The proposed designs benefit from several optimization techniques, including an efficient serial absolute-difference architecture, word-length reduction by parallelism, bit truncation, mode filtering, and macroblock (MB) level subsampling, which significantly enhance their performance in terms of silicon area, throughput, latency, and power consumption. The first and second designs can support full-search VBSME of 720 × 480 video at 30 frames per second (fps), with two reference frames and a [−16, 15] search range, at a clock frequency of 414 MHz with 29.28 k and 31.5 k gates, respectively.
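
To illustrate the operation the above architectures accelerate, here is a minimal Python sketch of SAD-based full-search block matching; it models the arithmetic only (not the bit-serial hardware), and the function names and block/search-range parameters are illustrative assumptions.

    import numpy as np

    def sad(block, ref_block):
        """Sum of absolute differences between a current block and a reference block."""
        return int(np.abs(block.astype(int) - ref_block.astype(int)).sum())

    def full_search(cur, ref, bx, by, block=16, search=16):
        """Exhaustive full-search motion estimation for one block: evaluate the SAD of
        every candidate displacement in [-search, search) and keep the best one."""
        h, w = ref.shape
        best_mv, best_sad = (0, 0), float("inf")
        c = cur[by:by + block, bx:bx + block]
        for dy in range(-search, search):
            for dx in range(-search, search):
                y, x = by + dy, bx + dx
                if 0 <= y and 0 <= x and y + block <= h and x + block <= w:
                    s = sad(c, ref[y:y + block, x:x + block])
                    if s < best_sad:
                        best_mv, best_sad = (dx, dy), s
        return best_mv, best_sad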

4.
A 64-kb subnanosecond Josephson–CMOS hybrid random-access memory (RAM) has been developed with ultrafast hybrid interface circuits. The hybrid memory is designed and fabricated using a commercial 0.18-µm CMOS process and NEC-SRL's 2.5-kA/cm² Nb process for Josephson circuits. The millivolt-level Josephson signals are amplified to volt-level CMOS digital signals by a hybrid interface amplifier, which is the most challenging part of the memory system. The performance of this amplifier is optimized by minimizing its parasitic capacitance loading. The 4-K operation of short-channel CMOS devices and circuits is reviewed, and a complete 4-K CMOS BSIM3 model, which has been verified by experiments, is discussed. The memory bit-line output currents are detected by ultralow-power high-speed Josephson devices. Here, we report the first high-frequency access-time measurements on the full critical path showing 600 ps for a single bit. We discuss future designs made to reduce the crosstalk and improve margins, as well as plans to reduce power dissipation and latency.

5.
While hardware/software partitioning has been shown to provide significant performance gains, most hardware/software partitioning approaches are limited to partitioning computational kernels that use integer or fixed-point implementations. Software developers often initially develop an application using the floating-point representations built into most programming languages and later convert the application to a fixed-point representation—a potentially time-consuming process. In this paper, we present the Arizona Float Fixed Hardware Library (AFFHL), consisting of efficient, configurable floating-point-to-fixed-point and fixed-point-to-floating-point hardware converters. By utilizing these converters, a system's hardware/software implementation can be separated into a floating-point domain, consisting of the microprocessor and memory subsystem, and a fixed-point domain, consisting of one or more partitioned hardware coprocessors. This separation enables a rapid hardware/software partitioning approach in which floating-point software kernels can be implemented using fixed-point hardware coprocessors without the need for application developers to first rewrite software applications as fixed-point implementations. We further present an overview of a basic hardware/software partitioning methodology for rapidly partitioning computational kernels within floating-point software applications to either statically determined fixed-point hardware coprocessors or dynamically adaptable fixed-point hardware coprocessors, in which the required fixed-point representation can be determined and adjusted at runtime.
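
As a simple software analogue of the converters described above, the following Python sketch shows float-to-fixed and fixed-to-float conversion for a two's-complement Q format with a configurable number of fractional bits; the parameter names and the saturation policy are our assumptions, not the AFFHL interface.

    def float_to_fixed(value, frac_bits=16, total_bits=32):
        """Convert a float to a two's-complement fixed-point word, saturating on overflow."""
        scaled = int(round(value * (1 << frac_bits)))
        lo = -(1 << (total_bits - 1))
        hi = (1 << (total_bits - 1)) - 1
        return max(lo, min(hi, scaled))

    def fixed_to_float(word, frac_bits=16):
        """Convert a two's-complement fixed-point word back to a float."""
        return word / float(1 << frac_bits)

    q = float_to_fixed(3.14159, frac_bits=16)
    print(q, fixed_to_float(q))   # recovers ~3.14159 to within 2**-16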

6.
The storage requirements of array-dominated, loop-organized algorithmic specifications running on embedded systems can be significant. Employing a data memory space much larger than needed has negative consequences for energy consumption, latency, and chip area. Finding an optimized storage scheme for the usually large arrays in these algorithmic specifications is an essential task of memory allocation. This paper proposes an efficient algorithm for mapping multidimensional arrays to the data memory. Like earlier approaches, it computes bounding windows for the live elements in the index space of the arrays, but it is several times faster. More importantly, since the algorithm works not only for entire arrays but also for parts of arrays—such as array references or, more generally, sets of array elements represented by lattices—this signal-to-memory mapping technique can also be applied in hierarchical memory architectures.
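
The general idea behind bounding-window mapping can be sketched as follows: each array index is wrapped modulo the window size of its dimension, so the allocated memory equals the window volume rather than the full array size. This is only a hedged illustration of the concept; the paper's actual mapping and window computation are more involved.

    def window_map(index, lows, windows):
        """Map a multidimensional array index to a linear address inside a bounding
        window: each coordinate is wrapped modulo its window size."""
        addr = 0
        for idx, lo, w in zip(index, lows, windows):
            addr = addr * w + ((idx - lo) % w)
        return addr

    # Hypothetical 1024x1024 array whose simultaneously live elements always fit in a
    # 4x1024 window: only 4 * 1024 memory words are needed.
    print(window_map((517, 33), lows=(0, 0), windows=(4, 1024)))   # -> 1057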

7.
This paper describes the design of a soft-decision Viterbi decoder for orthogonal frequency division multiplexing-based wireless local area networks and evaluates different architectural options by means of their field-programmable gate array (FPGA) implementations. A finite-precision analysis has been performed to reduce the data-path widths under the specifications of the IEEE 802.11a and Hiperlan/2 standards. Four implementation strategies for the survivor management unit (register exchange, trace back, trace back with double-rate memory read, and pointer trace back) have been evaluated, together with two different normalization methods for the add–compare–select unit. The FPGA implementation results show that the register exchange and pointer trace back architectures with pre-normalization in the add–compare–select unit achieve the best performance. Both architectures can decode 200 Mbps in a Virtex-4 device with lower latency than the conventional trace back architecture, and pointer trace back exhibits the lowest power consumption; these characteristics make them suitable for future multiple-input multiple-output (MIMO) WLAN systems.
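
For readers unfamiliar with the survivor-management strategies compared above, the Python sketch below shows a minimal soft-decision Viterbi decoder with trace-back survivor memory. For brevity it uses a rate-1/2, K = 3 code (generators 7, 5 octal) rather than the K = 7 code of IEEE 802.11a, and it omits add–compare–select normalization; all names are ours.

    import numpy as np

    G = (0b111, 0b101)        # rate-1/2 generator polynomials (7, 5 octal)
    K = 3                     # constraint length (illustrative, not 802.11a's K = 7)
    NSTATES = 1 << (K - 1)

    def branch_bits(state, bit):
        """Expected encoder output for input `bit` when the register holds `state`."""
        sr = (bit << (K - 1)) | state
        return [bin(sr & g).count("1") & 1 for g in G]

    def viterbi_decode(llr_pairs):
        """llr_pairs: one (l0, l1) soft value per trellis step, positive meaning bit 0."""
        n = len(llr_pairs)
        metric = np.full(NSTATES, np.inf)
        metric[0] = 0.0
        surv_bit = np.zeros((n, NSTATES), dtype=np.uint8)    # survivor memory unit
        surv_prev = np.zeros((n, NSTATES), dtype=np.uint8)
        for t, (l0, l1) in enumerate(llr_pairs):
            new = np.full(NSTATES, np.inf)
            for s in range(NSTATES):
                for b in (0, 1):
                    ns = ((b << (K - 1)) | s) >> 1
                    e0, e1 = branch_bits(s, b)
                    bm = (l0 if e0 else -l0) + (l1 if e1 else -l1)   # add
                    m = metric[s] + bm                               # compare
                    if m < new[ns]:                                  # select
                        new[ns] = m
                        surv_bit[t, ns] = b
                        surv_prev[t, ns] = s
            metric = new
        s = int(np.argmin(metric))           # trace back from the best final state
        bits = []
        for t in range(n - 1, -1, -1):
            bits.append(int(surv_bit[t, s]))
            s = int(surv_prev[t, s])
        return bits[::-1]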

8.
Many different video processor architectures exist, and a processor's architecture determines its strengths for particular applications. Hardwired logic yields the best performance per cost, but a programmable processor is important for applications that must support multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor architectures achieve the best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor levels, and through optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth rather than by processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and peripheral support.

9.
In orthogonal frequency division multiplexing (OFDM) based wireless systems, the Fast Fourier Transform (FFT) is a critical block, as it occupies a large area and consumes considerable power. In this paper, we present area-efficient and low-power 16-bit word-width 64-point radix-2^2 and radix-2^3 pipelined FFT architectures for an OFDM-based IEEE 802.11a wireless LAN baseband. The designs are derived from the radix-2^k algorithm and adopt a Single-Path Delay Feedback (SDF) architecture for hardware implementation. To eliminate the complex multipliers and the read-only memory (ROM) used for internal storage of the twiddle factor coefficients, the proposed 64-point FFT employs a Canonical Signed Digit (CSD) complex constant multiplier built from adders, multiplexers, and shifters. The complex constant multiplier (CCM) is further modified using a common sub-expression sharing block that reduces the area of the design. The proposed radix-2^2 and radix-2^3 pipelined FFT architectures are modeled and implemented in TSMC 180 nm CMOS technology with a supply voltage of 1.8 V. The implementation results show that the proposed architectures significantly reduce the hardware cost and power consumption in comparison with existing 64-point FFT architectures.
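
To make the CSD constant multiplier concrete, here is a small Python sketch of canonical signed-digit recoding and of multiplying by a constant using only shifts and additions/subtractions; it shows the recoding idea only, not the paper's complex constant multiplier or its sub-expression sharing.

    def csd(n):
        """Canonical signed-digit (non-adjacent form) recoding of a non-negative
        integer; returns digits in {-1, 0, 1}, least-significant first."""
        digits = []
        while n:
            if n & 1:
                d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
                n -= d
            else:
                d = 0
            digits.append(d)
            n >>= 1
        return digits

    def csd_multiply(x, c):
        """Multiply x by the constant c using only shifted adds and subtracts,
        mirroring what a CSD constant multiplier does in hardware."""
        return sum(d * (x << i) for i, d in enumerate(csd(c)))

    print(csd(7), csd_multiply(3, 7))   # [-1, 0, 0, 1] (i.e. 8 - 1), 21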

10.
Pipelined systolic architectures for DLMS adaptive filtering (total citations: 6; self-citations: 0; cited by others: 6)
This work reports two new pipelined, systolic architectures for delayed least mean squares (DLMS) adaptive filtering. In contrast to existing systolic architectures, which introduce a tracking delay that increases linearly with filter order, those presented here do not. They support the same sampling rate as the fastest such architecture reported so far, even when unpipelined. Our designs use significantly less hardware (i.e., multiply-accumulate modules and registers) with a minimal control-logic requirement, on account of the algebraic projection techniques that we employ, implying a net gain in terms of the silicon area utilized and the dynamic power dissipated. Further, one of these architectures introduces only half the adaptation delay that is conventionally used for systolization; the other requires the normal adaptation delay, but compensates by using considerably reduced control logic. The sampling rates supported by our architectures are further increased by pipelining the processor modules down to the level of a 4:2 compressor. This requires only small adaptation and tracking delays, which are independent of filter order, and is possible without modifying the basic algorithm (i.e., without introducing a lookahead in the adaptation), in contrast with the only pipelined DLMS architecture reported so far. We also propose and implement a scheme for computing a normalized step size for delayed adaptation in the general context of a nonstationary real-time environment. Simulation studies performed with our architectures indicate remarkably improved convergence properties over those of previously reported architectures.
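
The delayed adaptation at the heart of DLMS can be summarized in a few lines of Python: the coefficient update uses an error and input window that are `delay` samples old, which is what makes a deeply pipelined update path possible. This is a sketch of the algorithm only (with an optional normalized step size), not of the systolic architectures above.

    import numpy as np

    def dlms_filter(x, d, taps=16, mu=0.01, delay=4, normalized=False):
        """Delayed LMS adaptive filter: y is the filter output, e the error signal."""
        w = np.zeros(taps)
        y = np.zeros(len(x))
        e = np.zeros(len(x))
        for n in range(taps - 1, len(x)):
            u = x[n - taps + 1:n + 1][::-1]       # current input window, newest first
            y[n] = w @ u
            e[n] = d[n] - y[n]
            nd = n - delay                        # adaptation uses delayed quantities
            if nd >= taps - 1:
                ud = x[nd - taps + 1:nd + 1][::-1]
                step = mu / (1e-9 + ud @ ud) if normalized else mu
                w = w + step * e[nd] * ud         # delayed coefficient update
        return y, e, w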

11.
We improve a carry-select technique for decimal adders in which pairs of corrective carry-out bits for all decimal positions are computed in parallel. Selection is based on the corresponding positional carry-in bits, which are produced by a quaternary parallel-prefix carry network. The carry-out bits select pairs of corrected or intact sum digits, which are later selected by the actual carry-in bits at the end of the addition process. Analytical evaluation and synthesis results for various hardware-sharing architectures of binary and decimal adders and subtractors show lower area consumption and less power dissipation for the proposed designs at no additional latency, compared to previous works.
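
A digit-serial software model of the carry-select idea is given below: both possible results of every decimal position (for carry-in 0 and carry-in 1) are precomputed, and the actual carry-in merely selects one of them. This is a behavioural sketch, not the parallel-prefix hardware described above.

    def decimal_carry_select_add(a_digits, b_digits):
        """Add two decimal digit vectors (least-significant digit first) using the
        carry-select principle; returns (sum_digits, carry_out)."""
        carry = 0
        out = []
        for a, b in zip(a_digits, b_digits):
            s0 = a + b                       # result assuming carry-in = 0
            s1 = a + b + 1                   # result assuming carry-in = 1
            pair0 = (s0 % 10, s0 // 10)
            pair1 = (s1 % 10, s1 // 10)
            digit, carry = pair1 if carry else pair0   # select by the actual carry-in
            out.append(digit)
        return out, carry

    print(decimal_carry_select_add([5, 7, 9], [8, 4, 0]))   # 975 + 48 -> ([3, 2, 0], 1)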

12.
This paper presents a novel hardware interleaver architecture for unified parallel turbo decoding. The architecture is fully re-configurable across multiple standards such as HSPA Evolution, DVB-SH, 3GPP-LTE and WiMAX. Turbo codes, widely used for error correction in today's consumer electronics, are prone to higher latency due to larger block sizes and multiple iterations. Many parallel turbo decoding architectures have recently been proposed to increase channel throughput, but the interleaving algorithms used in the different standards do not readily permit their use because of the high percentage of memory conflicts. The architecture presented in this paper provides a re-configurable platform for implementing the parallel interleavers of the different standards by managing the conflicts involved in each. The memory conflicts are managed by applying different approaches, such as stream misalignment, memory division, and the use of a small FIFO buffer. The proposed flexible architecture is low cost and occupies 0.085 mm² in a 65 nm CMOS process. It can implement up to 8 parallel interleavers and can operate at a frequency of 200 MHz, thus providing significant support to higher-throughput systems based on parallel SISO processors.
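
The memory-conflict problem that motivates the architecture can be illustrated with a short Python sketch that counts, for a given interleaver permutation, how often two parallel SISO units would hit the same memory bank in the same cycle. The windowing and the address-modulo banking are simplifying assumptions for illustration, not the paper's conflict-management scheme.

    import random

    def bank_conflicts(perm, num_parallel, num_banks=None):
        """Count accesses in which two or more parallel SISO decoders would target the
        same memory bank while reading through the interleaver permutation `perm`."""
        n = len(perm)
        num_banks = num_banks or num_parallel
        window = n // num_parallel           # each SISO works on its own window
        conflicts = 0
        for step in range(window):
            banks = [perm[p * window + step] % num_banks for p in range(num_parallel)]
            conflicts += len(banks) - len(set(banks))
        return conflicts

    perm = list(range(64))                   # hypothetical 64-symbol block
    random.shuffle(perm)
    print(bank_conflicts(perm, num_parallel=4))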

13.
Dynamically reconfigurable hardware has already been deployed for accelerating computationally demanding applications. Some of these hardware architectures allow run time reconfiguration but this usually leads to a large reconfiguration overhead. The advantage of run time reconfiguration is that it allows new algorithmic solutions for many applications. To study the potential of frequent run time reconfiguration it is interesting to investigate its costs and benefits from an abstract point of view and to develop new architectural concepts. Multi-level reconfigurable architectures are one such concept that introduces several levels of reconfiguration. This paper deals with new types of multi-level reconfigurable architectures. The corresponding problem of finding the best granularity for different reconfiguration levels is formulated and investigated. Although this problem is shown to be NP-complete, an interesting restricted subcase is solved optimally in polynomial time. For the general case, a good heuristic is proposed that is based on solutions for the restricted case. Results on three example applications show that the reconfiguration cost can be reduced with the new architectures. Based on a proposed measure of relative efficiency it is also shown that the new architectures are more efficient so that they obtain a larger reconfiguration cost reduction with less additional hardware.

14.
Global motion estimation and compensation (GME/GMC) is an important video processing technique that has been applied to many applications, including video segmentation, sprite/mosaic generation, and video coding. In MPEG-4 Advanced Simple Profile (ASP), GME/GMC is adopted to compensate for camera motion. Because GME is important, many GME algorithms have been proposed. These algorithms have two common characteristics: huge computational complexity and very large memory bandwidth. Hence, for real-time applications, a hardware accelerator for GME is required. However, GME poses many hardware design challenges, such as irregular memory access and huge memory bandwidth, and only a few hardware architectures have been proposed. In this paper, we first analyze three typical GME algorithms, and a fast GME algorithm is proposed. By using temporal prediction and skipping redundant computation, 91% of the memory bandwidth and 80% of the iterations are saved while performance is maintained, compared to the gradient-descent GME in the MPEG-4 Verification Model. Based on our proposed algorithm, a hardware architecture for GME is also presented. A new scheduling scheme, Reference-Based Scheduling, is developed to solve the irregular memory access problem. An interleaved memory arrangement is applied to satisfy the memory access requirements of interpolation. The total gate count of the hardware implementation is 131 K with the Artisan 0.18 µm cell library, and the internal memory size is about 7.9 Kb. Its processing capability is MPEG-4 ASP@L3, i.e., 352×288 at 30 fps, at 30 MHz.

15.
We present low-area and low-power semi-systolic array architectures for polynomial basis multiplication over GF(2^m) using the Progressive Multiplier Reduction (PMR) technique. These architectures are explored using linear and nonlinear techniques applied to the polynomial multiplication algorithm. The nonlinear techniques allow the designer to control the processor workload and reduce the inter-processor communication. The resulting semi-systolic architectures have a simple structure with local communication. ASIC implementations of our designs and of comparable published designs show that the proposed scalable semi-systolic structures have lower area complexity (by 56.8–94.6%) and power consumption (by 55.2–84.2%), except relative to a scalable design published by the same authors; however, one of the proposed scalable designs outperforms that design in throughput by 73.8%. This makes the proposed designs well suited to embedded applications that require low power consumption and moderate speed.
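
For reference, the underlying field operation implemented by these architectures, polynomial-basis multiplication in GF(2^m), can be sketched in a few lines of Python using a bit-serial, MSB-first shift-and-reduce loop; this shows the arithmetic only, not the PMR scheduling or the semi-systolic mapping.

    def gf2m_mult(a, b, f, m):
        """Bit-serial (MSB-first) polynomial-basis multiplication in GF(2^m):
        returns a*b mod f. Field elements are packed as integers (bit i is the
        coefficient of x^i) and f is the full irreducible polynomial, x^m term included."""
        c = 0
        for i in range(m - 1, -1, -1):
            c <<= 1
            if (c >> m) & 1:     # reduce as soon as the degree reaches m
                c ^= f
            if (b >> i) & 1:
                c ^= a
        return c

    # GF(2^4) with f(x) = x^4 + x + 1 (0b10011): x * x^3 = x^4 = x + 1.
    print(bin(gf2m_mult(0b0010, 0b1000, 0b10011, 4)))   # 0b11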

16.
To meet the demands of mobile multimedia applications, a stream processor core is designed with an area of 8.91 mm² in 0.18-µm CMOS technology at 50 MHz. Several techniques and architectures are proposed to achieve high performance with low power consumption. First of all, an optimized core pipeline is designed with a 2-issue VLIW architecture to achieve a processing capability of 400 MFLOPS or 800 MOPS. In addition, an adaptive multi-thread scheme can increase performance by increasing hardware utilization, and the proposed configurable memory array architecture can reduce the off-chip memory access frequency by caching both input data and output results. Furthermore, for graphics applications, a geometry-content-aware technique called early-rejection-after-transformation is proposed to remove redundant operations for invisible triangles. As for video applications, the proposed video-accelerating instruction set can support motion estimation for video coding. Experimental results show that an 86% power reduction and a more-than-tenfold speedup of the VLIW architecture can be achieved with the proposed techniques, providing a processing speed of 25 Mvertices/s at a power consumption of 8.6 mW. Moreover, CIF (352 × 288) 30 fps video encoding with a search range of {H[−24, 24), V[−16, 16]} is also supported by the proposed stream processor. By supporting both video and graphics functions, this highly efficient, high-performance, low-power processor core is applicable to multimedia mobile devices.

17.
Many sequential multipliers for polynomial basis GF(2^k) fields have been proposed using the LSbit-first and MSbit-first multiplication algorithms. However, all of those designs are defined over fixed-size GF(2^k) fields, and sometimes over irreducible polynomials of fixed special form (all-one polynomials, trinomials, pentanomials). When such architectures are redesigned for arbitrary GF(2^k) fields and generic irreducible polynomials, i.e., made versatile, they result in designs with high space complexity (gate–latch count), low frequency (long critical path), and high latency. In this paper, a Montgomery multiplication element (MME) architecture specifically designed for arbitrary GF(2^k) fields defined over general irreducible polynomials is proposed, based on an optimized version of the Montgomery multiplication (MM) algorithm for GF(2^k) fields. To evaluate the proposed MME and demonstrate the efficiency of the MM algorithm in versatile designs, three distinct versatile Montgomery multiplier architectures are presented using the proposed MME. They achieve a small gate–latch count and a high clock frequency compared to other sequential versatile designs.
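
A minimal Python model of the bit-serial Montgomery multiplication step for GF(2^k) is shown below; it computes a*b*x^(-k) mod f, and the point of the method is that each per-bit reduction only inspects the least significant coefficient. This is a sketch of the textbook algorithm, not of the proposed MME data path.

    def gf2k_montgomery_mult(a, b, f, k):
        """Bit-serial Montgomery multiplication in GF(2^k): returns a*b*x^(-k) mod f,
        with f the irreducible polynomial including its x^k term."""
        c = 0
        for i in range(k):
            if (a >> i) & 1:
                c ^= b           # conditionally accumulate b
            if c & 1:
                c ^= f           # cheap reduction: only the LSB is examined
            c >>= 1              # divide by x
        return c

To recover an ordinary product a*b mod f, the result can be passed through a second Montgomery multiplication by x^(2k) mod f, which cancels the x^(-k) factors.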

18.
A methodology for rapid silicon design of biorthogonal wavelet transform systems has been developed. It is based on generic, scalable architectures for the forward and inverse wavelet filters. These architectures offer efficient hardware utilisation by combining the linear-phase property of biorthogonal filters with decimation and interpolation. The resulting designs have been parameterised in terms of the type of wavelet and the wordlengths for data and coefficients. Control circuitry is embedded within these cores that allows them to be cascaded to any desired level of decomposition without any interface logic. The time to produce a silicon design for a biorthogonal wavelet system is therefore just the time required to run the synthesis and layout tools, with no further design effort. The resulting silicon cores are comparable in area and performance to hand-crafted designs. These designs are also portable across a range of foundries and are suitable for FPGA and PLD implementations.

19.
The suitability of the 2D Discrete Wavelet Transform (DWT) as a tool in image and video compression is nowadays indisputable. For the execution of the multilevel 2D DWT, several computation schedules based on different input traversal patterns have been proposed. Among these, the most commonly used in practical designs are the row–column, the line-based, and the block-based schedules. In this work, these schedules are implemented on FPGA-based platforms for the forward 2D DWT using a lifting-based filter-bank implementation. Our designs were realized in VHDL and optimized in terms of throughput and memory requirements, in accordance with the principles of both the schedules and the lifting decomposition. The implementations are fully parameterized with respect to the size of the input image and the number of decomposition levels. We provide detailed experimental results concerning the throughput, the area, the memory requirements, and the energy dissipation associated with every point of the parameter space. These results demonstrate that the choice of the most suitable schedule should depend on the given algorithmic specifications.
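
For concreteness, a small Python sketch of one level of the lifting-based 5/3 (reversible) DWT with a row–column schedule follows; it uses whole-sample symmetric boundary extension and integer arithmetic, assumes each dimension has at least two samples, and is an illustration rather than the paper's FPGA designs.

    import numpy as np

    def _mirror(i, n):
        """Whole-sample symmetric boundary extension of index i for a length-n signal."""
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    def dwt53_1d(x):
        """One level of the reversible 5/3 lifting transform: a predict step yields the
        detail (high-pass) samples d, an update step yields the approximation s."""
        x = [int(v) for v in x]
        n = len(x)
        d = [x[2 * k + 1] - ((x[2 * k] + x[_mirror(2 * k + 2, n)]) >> 1)
             for k in range(n // 2)]
        s = [x[2 * k] + ((d[max(k - 1, 0)] + d[min(k, len(d) - 1)] + 2) >> 2)
             for k in range((n + 1) // 2)]
        return s, d

    def dwt53_2d(img):
        """Row-column schedule for one 2D decomposition level: rows first, then columns.
        Low-pass coefficients end up in the top/left half of each dimension."""
        img = np.asarray(img, dtype=int)
        tmp = np.stack([np.concatenate(dwt53_1d(r)) for r in img])
        return np.stack([np.concatenate(dwt53_1d(c)) for c in tmp.T]).T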

20.
A finite-precision analysis, as required for hardware implementation, is performed for the 5/3 integer filter and the 9/7 real-valued filter recommended in JPEG2000; the optimal data widths of the various parameters in the wavelet transform process are determined, as is the data width of the data path of the overall transform system. Based on the characteristics of the lifting-based wavelet transform, combined with an embedded boundary-extension algorithm, two wavelet transform architectures are proposed: a folded architecture and a long-pipeline architecture, and the two architectures are analyzed and compared. Finally, the folded architecture and other related architectures are compared in terms of the number of storage units required, the number of memory accesses, processing capability, and power consumption; the results show that the proposed architectures have clear performance advantages.
