20 similar documents found (search time: 15 ms)
1.
PMCNOC: A Pipelining Multi-channel Central Caching Network-on-chip Communication Architecture Design
N. Wang A. Sanusi P. Y. Zhao M. Elgamel M. A. Bayoumi 《Journal of Signal Processing Systems》2010,60(3):315-331
With the de facto transformation of technology into nano-technology, more and more functional components can be embedded on
a single silicon die, thus enabling high degree pipelining operations such as those required for multimedia applications.
In recent years, system-on-chip designs have migrated from fairly simple single processor and memory designs to relatively
complicated systems with multiple processors, on-chip memories, standard peripherals, and other functional blocks. The communication
between these IP blocks is becoming the dominant critical system path and performance bottleneck of system-on-chip designs.
Network-on-chip architectures, such as Virtual Channel (2004), Black-bus (2004), Pirate (2004), AEthereal (2005), and VICHAR (2006) architectures, emerged as promising solutions for future system-on-chip communication architecture
designs. However, these existing architectures all suffer from certain problems, including high area cost and communication
latency and/or low network throughput. This paper presents a novel network-on-chip architecture, Pipelining Multi-channel
Central Caching, to address the shortcomings of the existing architectures. By embedding a central cache into every switch
of the network, blocked head packets can be removed from the input buffers and stored in the caches temporarily, thus alleviating
the effect of head-of-line and deadlock problems and achieving higher network throughput and lower communication latency without
paying the price of higher area cost. Experimental results showed that the proposed architecture exhibits both hardware simplicity
and system performance improvement compared to the existing network-on-chip architectures.
2.
Ju-Wook Jang Choi S.B. Prasanna V.K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(11):1305-1319
We develop new algorithms and architectures for matrix multiplication on configurable devices. These have reduced energy dissipation and latency compared with the state-of-the-art field-programmable gate array (FPGA)-based designs. By profiling well-known designs, we identify "energy hot spots", which are responsible for most of the energy dissipation. Based on this, we develop algorithms and architectures that offer tradeoffs among the number of I/O ports, the number of registers, and the number of PEs. To avoid time-consuming low-level simulations for energy profiling and performance prediction of many alternate designs, we derive functions to represent the impact of algorithm design choices on the system-wide energy dissipation, area, and latency. These functions are used to either optimize the energy performance or provide tradeoffs for a family of candidate algorithms and architectures. For selected designs, we perform extensive low-level simulations using state-of-the-art tools and target FPGA devices. We show a design space for matrix multiplication on FPGAs that results in tradeoffs among energy, area, and latency. For example, our designs improve the energy performance of state-of-the-art FPGA-based designs by 29%-51% without any increase in the area-latency product. The latency of our designs is reduced one-third to one-fifteenth while area is increased 1.9-9.4 times. In terms of comprehensive metrics such as Energy-Area-Time, our designs exhibit superior performance compared with the state-of-the-art by 50%-79%.
3.
Mohammad R. H. Fatemi Hasan Ates Rosli Salleh 《Journal of Signal Processing Systems》2013,71(2):111-121
The variable block-size motion estimation (VBSME) process occupies a major part of the computation of an H.264 encoder and is usually accelerated by bit-parallel hardware architectures with large I/O bit width to meet real-time constraints. However, such architectures increase the area overhead and pin count, and are therefore not suitable for area-constrained consumer electronics designs such as small portable multimedia devices. This paper addresses this problem by proposing two area-efficient least significant bit (LSB) first bit-serial architectures with small pin counts. Both designs take advantage of data reuse in different ways for sum of absolute differences (SAD) computation and for reading reference pixels, leading to a considerable reduction of memory bandwidth. The first architecture propagates the partial SAD and sum results and broadcasts the reference pixel rows, whereas the second design reuses the SADs of small blocks and has a reconfigurable reference buffer, leading to better memory bandwidth when using hardware parallelism. The proposed designs benefit from several optimization techniques, including an efficient serial absolute difference architecture, word length reduction by parallelism, bit truncation, mode filtering, and macroblock (MB) level subsampling, which significantly enhance their performance in terms of silicon area, throughput, latency, and power consumption. The first and second designs can support full search VBSME of 720 × 480 video at 30 frames per second (fps), with two reference frames and a [−16, 15] search range, at a clock frequency of 414 MHz with 29.28 k and 31.5 k gates, respectively.
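The SAD reuse idea mentioned in this abstract can be illustrated in software. The following is only a minimal sketch, assuming a 16 × 16 macroblock stored as nested lists; the function names `block_sad` and `sads_from_4x4` are hypothetical, and the code does not model the paper's bit-serial hardware. It shows that the SADs of larger blocks can be obtained by adding the SADs of the 4 × 4 blocks they contain, rather than revisiting the pixels.

```python
def block_sad(cur, ref, top, left, size):
    """Sum of absolute differences over a size x size block (direct computation)."""
    return sum(
        abs(cur[top + y][left + x] - ref[top + y][left + x])
        for y in range(size)
        for x in range(size)
    )


def sads_from_4x4(cur, ref):
    """Build 8x8 and 16x16 SADs of one 16x16 macroblock by reusing 4x4 SADs.

    Returns (s4, s8, s16): a 4x4 grid of 4x4-block SADs, a 2x2 grid of
    8x8-block SADs, and the single 16x16 SAD.
    """
    # Pixels are touched only here, once per 4x4 block.
    s4 = [[block_sad(cur, ref, 4 * by, 4 * bx, 4) for bx in range(4)]
          for by in range(4)]
    # Each 8x8 SAD is the sum of the four 4x4 SADs it covers.
    s8 = [[s4[2 * by][2 * bx] + s4[2 * by][2 * bx + 1]
           + s4[2 * by + 1][2 * bx] + s4[2 * by + 1][2 * bx + 1]
           for bx in range(2)]
          for by in range(2)]
    # The 16x16 SAD is the sum of the four 8x8 SADs.
    s16 = sum(sum(row) for row in s8)
    return s4, s8, s16
```

The rectangular H.264 partitions (16 × 8, 8 × 16, 8 × 4, 4 × 8) could be formed the same way by adding neighbouring entries of `s4` or `s8`.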
4.
Fujiwara K. Liu Q. Van Duzer T. Meng X. Yoshikawa N. 《Applied Superconductivity, IEEE Transactions on》2010,20(1):14-20
5.
While hardware/software partitioning has been shown to provide significant performance gains, most hardware/software partitioning
approaches are limited to partitioning computational kernels utilizing integers or fixed point implementations. Software developers
often initially develop an application using floating point representations built into most programming languages and later
convert the application to a fixed point representation—a potentially time consuming process. In this paper, we present the
Arizona Float ⇔ Fixed Hardware Library (AFFHL) consisting of efficient, configurable floating point to fixed point and fixed point to floating
point hardware converters. By utilizing these converters, a system’s hardware/software implementation can be separated into
a floating point domain consisting of the microprocessor and memory subsystem and a fixed point domain consisting of one or
more partitioned hardware coprocessors. This separation enables a rapid hardware/software partitioning approach in which floating
point software kernels can be implemented using fixed point hardware coprocessors without the need for application developers
to first rewrite software applications as fixed point implementations. We further present an overview of a basic hardware/software
partitioning methodology for rapidly partitioning computational kernels within a floating point software application to either
statically determined fixed point hardware coprocessors or dynamically adaptable fixed point hardware coprocessors in which
the required fixed point representation can be dynamically determined and adjusted at runtime.
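What a float-to-fixed and fixed-to-float converter pair computes can be sketched in a few lines. This is an illustrative sketch only: the signed two's-complement format, the saturation on overflow, and the names `float_to_fixed` and `fixed_to_float` are assumptions, and nothing here reflects the actual interface of the AFFHL hardware converters.

```python
def float_to_fixed(x, frac_bits, total_bits=16):
    """Convert a float to a signed fixed-point integer with frac_bits fractional bits.

    Assumed behaviour: round to nearest (Python's round, ties to even)
    and saturate to the representable range instead of wrapping.
    """
    scaled = round(x * (1 << frac_bits))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, scaled))


def fixed_to_float(q, frac_bits):
    """Convert a signed fixed-point integer back to a float."""
    return q / (1 << frac_bits)
```

With 8 fractional bits, `float_to_fixed(1.5, 8)` gives 384 and `fixed_to_float(384, 8)` gives 1.5. Changing `frac_bits` at run time corresponds loosely to the dynamically adjustable representation the abstract mentions.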
6.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(9):1304-1317
7.
F. Angarita M. J. Canet T. Sansaloni J. Valls V. Almenar 《Journal of Signal Processing Systems》2008,52(1):35-44
This paper describes the design of a soft decision Viterbi Decoder for orthogonal frequency division multiplexing-based wireless
local area networks and evaluates different architectural options by means of their field programmable gate-array (FPGA) implementation.
A finite precision analysis has been performed to reduce the data-path widths under the specifications of IEEE 802.11a and
Hiperlan/2 standards. Four implementation strategies (register exchange, trace back, trace back with double rate memory read
and pointer trace back) for the survivor management unit have been evaluated together with two different normalization methods
for the add–compare–select unit. The results of the implementation in FPGA have been given and it is shown that register exchange
and pointer trace back architectures with pre-normalization in the add–compare–select unit achieve the best performance. Both
architectures can decode 200 Mbps in a Virtex-4 device with lower latency than the conventional trace back one, and pointer
trace back exhibits the lowest power consumption; these characteristics make them suitable for future multiple-output multiple-input
WLAN systems.
8.
Jonah Probell 《Journal of Signal Processing Systems》2008,50(1):33-39
Many different video processor architectures exist. Its architecture gives a processor strength for a particular application.
Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support
multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor
architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor
level, and optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized
for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth
rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video
processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and
peripheral support.
9.
In orthogonal frequency division multiplexing (OFDM) based wireless systems, the Fast Fourier Transform (FFT) is a critical block, as it occupies a large area and consumes considerable power. In this paper, we present area-efficient and low-power 16-bit word-width 64-point radix-2^2 and radix-2^3 pipelined FFT architectures for an OFDM-based IEEE 802.11a wireless LAN baseband. The designs are derived from the radix-2^k algorithm and adopt a Single-Path Delay Feedback (SDF) architecture for hardware implementation. To eliminate the complex multipliers and the read-only memory (ROM) used for internal storage of twiddle factor coefficients, the proposed 64-point FFT employs a Canonical Signed Digit (CSD) complex constant multiplier built from adders, multiplexers and shifters. The complex constant multiplier (CCM) is modified using a common sub-expression sharing block that reduces the area of the design. The proposed radix-2^2 and radix-2^3 pipelined FFT architectures are modeled and implemented using TSMC 180 nm CMOS technology with a supply voltage of 1.8 V. The implementation results show that the proposed architectures significantly reduce the hardware cost and power consumption in comparison to existing 64-point FFT architectures.
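The reason a CSD constant multiplier needs only adders and shifters can be shown with a small sketch. It is a generic illustration of canonical signed digit recoding for a non-negative integer constant, not the paper's complex constant multiplier; the names `to_csd` and `csd_multiply` are hypothetical.

```python
def to_csd(n):
    """Return the canonical signed digit form of a non-negative integer.

    Digits are in {-1, 0, 1}, least significant first, and no two
    adjacent digits are both non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # +1 if n % 4 == 1, -1 if n % 4 == 3, so the next digit is 0.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(x, digits):
    """Multiply x by the constant encoded in digits using shifts, adds and subtracts."""
    acc = 0
    for shift, d in enumerate(digits):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc
```

For example, 7 is recoded as 8 − 1, so multiplying by 7 takes one shift and one subtraction instead of two additions. Fewer non-zero digits mean fewer adders in a hardware realization.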
10.
Pipelined systolic architectures for DLMS adaptive filtering (total citations: 6, self-citations: 0, citations by others: 6)
Joseph Thomas 《The Journal of VLSI Signal Processing》1996,12(3):223-246
This work reports two new pipelined, systolic architectures for delayed least mean squares (DLMS) adaptive filtering. In contrast to existing systolic architectures, which introduce a tracking delay that increases linearly with filter order, those presented here do not. They support the same sampling rate as the fastest such architecture reported so far, even when unpipelined. Our designs use significantly less hardware (i.e., multiply-accumulate modules and registers) with minimal control logic requirement on account of the algebraic projection techniques that we employ, implying a net gain in terms of the silicon area utilized and the dynamic power dissipated. Further, one of these architectures introduces only half the adaptation delay that is conventionally used for systolization; the other requires the normal adaptation delay, but compensates by using considerably reduced control logic. The sampling rates supported by our architectures are further increased by pipelining the processor modules to the level of a 4:2 compressor. This requires only small adaptation and tracking delays, which are independent of filter order, and is possible without requiring a modification of the basic algorithm (in terms of introducing a lookahead in the adaptation), all in contrast with the only pipelined DLMS architecture reported so far. We propose and implement a scheme in our architectures for computing a normalized step size for delayed adaptation, in the general context of a nonstationary real-time environment. The simulation studies performed with our architectures indicate remarkably improved convergence properties over those of previously reported architectures.
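For readers unfamiliar with DLMS, the delayed weight update that these architectures implement can be sketched in software. This is a minimal behavioural sketch of the textbook DLMS recursion, w(n+1) = w(n) + mu * e(n − D) * x(n − D), under the assumption of a fixed step size; the function name `dlms` and its parameters are hypothetical, and the sketch says nothing about the systolic mapping or the normalized step size proposed in the paper.

```python
def dlms(x, d, order, mu, delay):
    """Delayed LMS adaptive FIR filter.

    x: input samples, d: desired samples, order: number of taps,
    mu: step size, delay: adaptation delay D in samples.
    Returns (final weights, list of a-priori errors).
    """
    w = [0.0] * order
    buf = [0.0] * order          # most recent input first
    pending = []                 # (error, input vector) pairs awaiting use
    errors = []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = dn - y
        errors.append(e)
        pending.append((e, buf))
        if len(pending) > delay:
            # Update with the error and input vector from `delay` samples ago.
            e_old, buf_old = pending.pop(0)
            w = [wi + mu * e_old * bi for wi, bi in zip(w, buf_old)]
    return w, errors
```

Setting `delay=0` reduces the sketch to the ordinary LMS algorithm. The adaptation delay is what makes pipelining of the update path possible, at the cost of slower convergence for a given step size.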
11.
We improve a carry-select technique for decimal adders, where pairs of corrective carry-out bits for all decimal positions are computed in parallel. Selection is based on the corresponding positional carry-in bits, which are produced by a quaternary parallel prefix carry network. Carry-out bits select pairs of corrected or intact sum-digits to be later selected by actual carry-in bits at the end of the addition process. Analytical evaluation and synthesis results for various hardware-sharing architectures of binary and decimal adders and subtractors show lower area consumption and less power dissipation of the proposed designs at no additional latency, compared to previous works.
12.
This paper presents a novel hardware interleaver architecture for unified parallel turbo decoding. The architecture is fully
re-configurable among multiple standards like HSPA Evolution, DVB-SH, 3GPP-LTE and WiMAX. Turbo codes being widely used for
error correction in today’s consumer electronics are prone to introduce higher latency due to bigger block sizes and multiple
iterations. Many parallel turbo decoding architectures have recently been proposed to enhance the channel throughput but the
interleaving algorithms used in different standards do not freely allow using them due to higher percentage of memory conflicts.
The architecture presented in this paper provides a re-configurable platform for implementing the parallel interleavers for
different standards by managing the conflicts involved in each. The memory conflicts are managed by applying different approaches
like stream misalignment, memory division and use of small FIFO buffer. The proposed flexible architecture is low cost and
occupies 0.085 mm² of area in a 65 nm CMOS process. It can implement up to 8 parallel interleavers and can operate at a frequency of 200 MHz, thus
providing significant support to higher throughput systems based on parallel SISO processors.
13.
Dynamically reconfigurable hardware has already been deployed for accelerating computationally demanding applications. Some
of these hardware architectures allow run time reconfiguration but this usually leads to a large reconfiguration overhead.
The advantage of run time reconfiguration is that it allows new algorithmic solutions for many applications. To study the
potential of frequent run time reconfiguration it is interesting to investigate its costs and benefits from an abstract point
of view and to develop new architectural concepts. Multi-level reconfigurable architectures are one such concept that introduces
several levels of reconfiguration. This paper deals with new types of multi-level reconfigurable architectures. The corresponding
problem of finding the best granularity for different reconfiguration levels is formulated and investigated. Although this
problem is shown to be NP-complete, an interesting restricted subcase is solved optimally in polynomial time. For the general
case, a good heuristic is proposed that is based on solutions for the restricted case. Results on three example applications
show that the reconfiguration cost can be reduced with the new architectures. Based on a proposed measure of relative efficiency
it is also shown that the new architectures are more efficient so that they obtain a larger reconfiguration cost reduction
with less additional hardware.
14.
Yi-Hau Chen Shao-Yi Chien Ching-Yeh Chen Yu-Wen Huang Liang-Gee Chen 《Journal of Signal Processing Systems》2008,53(3):285-300
Global motion estimation and compensation (GME/GMC) is an important video processing technique and has been applied to many
applications including video segmentation, sprite/mosaic generation, and video coding. In MPEG-4 Advanced Simple Profile (ASP),
GME/GMC is adopted to compensate camera motions. Since GME is important, many GME algorithms have been proposed. These algorithms
have two common characteristics, huge computation complexity and ultra large memory bandwidth. Hence for realtime applications,
a hardware accelerator of GME is required. However, there are many hardware design challenges of GME like irregular memory
access and huge memory bandwidth, and only a few hardware architectures have been proposed. In this paper, we first analyzed
three typical algorithms of GME, and a fast GME algorithm is proposed. By using temporal prediction and skipping the redundant
computation, 91% memory bandwidth and 80% iterations are saved, while the performance is kept, compared to Gradient Descent
in MPEG-4 Verification Model. Based on our proposed algorithm, a hardware architecture of GME is also presented. A new scheduling,
Reference-Based Scheduling, is developed to solve the irregular memory access problem. An interleaved memory arrangement is
applied to satisfy the memory access requirement of interpolation. The total gate count of hardware implementation is 131 K
with Artisan 0.18 µm cell library, and the internal memory size is about 7.9 Kb. Its processing ability is MPEG-4 ASP@L3, which is 352×288 with
30 fps, at 30 MHz.
15.
We present low area and low power semi-systolic array architectures for polynomial basis multiplication over GF(2^m) using the Progressive Multiplier Reduction Technique (PMR). These architectures are explored using linear and nonlinear techniques applied to the polynomial multiplication algorithm. The nonlinear techniques allow the designer to control the processor workload and reduce the inter-processor communications. The semi-systolic architectures obtained have a simple structure with local communication. ASIC implementations of our designs and comparable published designs show that the proposed scalable semi-systolic structures have less area complexity (56.8–94.6 %) and power consumption (55.2–84.2 %) except for a scalable design published by the same authors. However, one of the proposed scalable designs outperforms this design in terms of throughput by 73.8 %. This makes the proposed designs suited to embedded applications that require low power consumption and moderate speed.
16.
《Solid-State Circuits, IEEE Journal of》2008,43(9):2025-2035
17.
Many sequential multipliers for polynomial basis GF(2^k) fields have been proposed using the LSbit and MSbit multiplication algorithms. However, all those designs are defined over fixed size GF(2^k) fields and sometimes over fixed special form irreducible polynomials (AOL, trinomials, pentanomials). When such architectures are redesigned for arbitrary GF(2^k) fields and generic irreducible polynomials, and thus made versatile, they result in high space complexity (gate–latch number), low frequency (high critical path) and high latency designs. In this paper a Montgomery multiplication element (MME) architecture specially designed for arbitrary GF(2^k) fields defined over general irreducible polynomials is proposed, based on an optimized version of the Montgomery multiplication (MM) algorithm for GF(2^k) fields. To evaluate the proposed MME and prove the efficiency of the MM algorithm in versatile designing, three distinct versatile Montgomery multiplier architectures are presented using this proposed MME. They achieve small gate–latch number and high clock frequency compared to other sequential versatile designs.
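The two multiplication styles named in this abstract can be contrasted with a short bit-level sketch. Field elements are held as Python integers whose bit i is the coefficient of x^i, and the irreducible polynomial `f` includes its leading x^k term. The function names are hypothetical, and the Montgomery routine is the standard bit-serial form that returns a·b·x^(−k) mod f; it is not the optimized variant or the MME architecture proposed in the paper.

```python
def gf_mul(a, b, f, k):
    """LSbit-first polynomial basis multiplication in GF(2^k): a * b mod f."""
    r = 0
    for i in range(k):
        if (b >> i) & 1:
            r ^= a               # add a * x^i (a has been pre-multiplied)
        a <<= 1                  # a = a * x
        if (a >> k) & 1:
            a ^= f               # reduce modulo f
    return r


def gf_mont_mul(a, b, f, k):
    """Bit-serial Montgomery multiplication in GF(2^k): a * b * x^(-k) mod f.

    Requires the constant term of f to be 1, which holds for any
    irreducible polynomial of degree greater than 1.
    """
    c = 0
    for i in range(k):
        if (a >> i) & 1:
            c ^= b
        if c & 1:
            c ^= f               # make c divisible by x
        c >>= 1                  # exact division by x
    return c
```

With the AES polynomial `f = 0x11B` and `k = 8`, `gf_mul(0x57, 0x83, 0x11B, 8)` returns `0xC1`, the worked example from FIPS-197. Multiplying a Montgomery result by x^k mod f recovers the ordinary product, which is how the two routines can be cross-checked.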
18.
A methodology for rapid silicon design of biorthogonal wavelet transform systems has been developed. This is based on generic, scalable architectures for the forward and inverse wavelet filters. These architectures offer efficient hardware utilisation by combining the linear phase property of biorthogonal filters with decimation and interpolation. The resulting designs have been parameterised in terms of types of wavelet and wordlengths for data and coefficients. Control circuitry is embedded within these cores that allows them to be cascaded for any desired level of decomposition without any interface logic. The time to produce silicon designs for a biorthogonal wavelet system is only the time required to run synthesis and layout tools with no further design effort required. The resulting silicon cores produced are comparable in area and performance to hand-crafted designs. These designs are also portable across a range of foundries and are suitable for FPGA and PLD implementations.
19.
Maria E. Angelopoulou Konstantinos Masselos Peter Y. K. Cheung Yiannis Andreopoulos 《Journal of Signal Processing Systems》2008,51(1):3-21
The suitability of the 2D Discrete Wavelet Transform (DWT) as a tool in image and video compression is nowadays indisputable.
For the execution of the multilevel 2D DWT, several computation schedules based on different input traversal patterns have
been proposed. Among these, the most commonly used in practical designs are: the row–column, the line-based and the block-based.
In this work, these schedules are implemented on FPGA-based platforms for the forward 2D DWT by using a lifting-based filter-bank
implementation. Our designs were realized in VHDL and optimized in terms of throughput and memory requirements, in accordance
with the principles of both the schedules and the lifting decomposition. The implementations are fully parameterized with
respect to the size of the input image and the number of decomposition levels. We provide detailed experimental results concerning
the throughput, the area, the memory requirements and the energy dissipation, associated with every point of the parameter
space. These results demonstrate that the choice of the suitable schedule is a decision that should be dependent on the given
algorithmic specifications.
20.
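A finite-precision analysis for hardware implementation is carried out for the 5/3 integer filter and the 9/7 real filter recommended in JPEG2000. The optimal data width of each parameter in the wavelet transform is determined, along with the data-path width of the whole transform system. Based on the characteristics of the lifting-based wavelet transform, and combined with an embedded extension algorithm, two wavelet transform architectures are proposed: a folded architecture and a long-pipeline architecture. The two architectures are analyzed and compared. Finally, the folded architecture is compared with other related architectures in terms of the number of memory cells required, the number of memory accesses, processing capability, and power consumption, showing that the proposed architecture has clear performance advantages.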
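The lifting form of the JPEG2000 5/3 integer filter referred to above can be sketched as follows. This is a minimal one-level, one-dimensional sketch for even-length integer signals; the function names `fwd53` and `inv53` are hypothetical, and folding the symmetric boundary extension directly into the lifting steps is an assumption made here for illustration, not necessarily the embedded extension scheme of the paper.

```python
def fwd53(x):
    """One level of the reversible 5/3 lifting transform.

    Returns (s, d): low-pass and high-pass coefficients.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2
    # Predict step: detail = odd sample minus floor of the mean of its even neighbours.
    d = []
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]  # mirror at the right edge
        d.append(x[2 * i + 1] - ((x[2 * i] + right) >> 1))
    # Update step: smooth = even sample plus a rounded quarter of neighbouring details.
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]                   # mirror at the left edge
        s.append(x[2 * i] + ((left + d[i] + 2) >> 2))
    return s, d


def inv53(s, d):
    """Invert fwd53 by undoing the lifting steps in reverse order."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        x[2 * i] = s[i] - ((left + d[i] + 2) >> 2)
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]
        x[2 * i + 1] = d[i] + ((x[2 * i] + right) >> 1)
    return x
```

Because each lifting step only adds an integer function of the other channel, the inverse reconstructs the input exactly, which is why the 5/3 filter suits lossless coding and why its data-path width can be fixed by a finite-precision analysis.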