20 similar documents found; search time: 46 ms
1.
Rafael A. Arce-Nazario, Manuel Jiménez, Domingo Rodríguez 《Journal of Signal Processing Systems》2008,53(3):367-382
We present an algorithmically-aware, high-level partitioning methodology for discrete cosine transforms (DCT) targeted to
distributed hardware architectures. The methodology relies on the exploration of alternate DCT formulations as part of the
partition optimization process. To the best of our knowledge, no previously proposed DCT algorithm exists that is capable
of consistently producing alternate regular formulations for an n-size DCT. Hence, a new Cooley-Tukey-like DCT factorization algorithm was developed to allow exploration of alternate formulations
as part of the partitioning optimization process. The use of our factorization mechanism along with a greedy strategy to explore
the space of equivalent DCT formulations yielded partitioning solutions with as much as 18% reduction in latency and 83% reduction
in run-time as compared to previously proposed regular DCT formulations.
2.
We present an efficient approach for the partitioning of algorithms implementing long convolutions. The dependence graph (DG) of a convolution algorithm is locally sequential globally parallel (LSGP) partitioned into smaller, less complex convolution algorithms. The LSGP-partitioned DG is mapped onto a signal flow graph (SFG), in which each processor element (PE) performs a small convolution algorithm. The key is then to reduce the complexity of the SFG in two steps: 1. local reduction of complexity: the short Fast Fourier Transform (FFT) is used to perform the small convolution within the PE; and 2. global reduction of complexity: the short FFTs within the PEs are relocated to the global level, where redundant short FFT operations are eliminated. The remaining operation within the PEs is now a simple element-wise multiply-add. After a graph transform, the structure of the SFG kernel is recognized as a set of parallel small convolutions. If we use the short FFT to perform these short convolutions, we come to our final realization of the long convolution algorithm. The computational complexity of this realization is close to the optimum for convolutions, that is, O(N log N). Our approach thus achieves this N log N complexity without having to implement large-size FFTs. We use, instead, small FFT blocks. The advantage is that small FFT transforms are commercially available, and that they can even be implemented in single-chip VLSI architectures. Our final SFG is three-dimensional and can be mapped efficiently onto prototype architectures or dedicated VLSI processors. We demonstrate the procedure in the paper by a design example: the implementation of a prototype convolution architecture that we designed for a real-time radar imaging system.
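The idea of building a long convolution out of short FFTs can be illustrated with a minimal sketch. It is not the LSGP partitioning or the SFG mapping from the abstract; it is a plain block decomposition in which each block is transformed once by a short FFT, and the per-block work is an element-wise multiply-add in the frequency domain. The function names (`fft`, `block_conv`, `direct_conv`) and the block size are assumptions made for illustration.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def block_conv(x, h, block=4):
    """Long linear convolution assembled from short FFTs of length >= 2*block - 1."""
    size = 1
    while size < 2 * block - 1:
        size *= 2

    def pad(b):
        return list(b) + [0] * (size - len(b))

    # Each input block is transformed exactly once (no redundant short FFTs).
    fx = [fft(pad(x[i:i + block])) for i in range(0, len(x), block)]
    fh = [fft(pad(h[j:j + block])) for j in range(0, len(h), block)]

    y = [0.0] * (len(x) + len(h) - 1)
    for s in range(len(fx) + len(fh) - 1):
        # Element-wise multiply-add over all block pairs with i + j == s.
        acc = [0j] * size
        for i in range(len(fx)):
            j = s - i
            if 0 <= j < len(fh):
                acc = [a + u * v for a, u, v in zip(acc, fx[i], fh[j])]
        seg = fft(acc, inverse=True)
        for k, v in enumerate(seg):
            pos = s * block + k
            if pos < len(y):
                y[pos] += v.real / size
    return y

def direct_conv(x, h):
    """Reference O(N*M) convolution used only to check the block version."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y
```

Calling `block_conv(x, h, block=4)` should agree with `direct_conv(x, h)` up to floating-point rounding. Only FFTs of length 8 are ever computed, regardless of how long `x` and `h` are.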
3.
4.
Journal of Signal Processing Systems - The running discrete Fourier transform (running DFT) is used to overcome the drawbacks of the ping-pong buffer technique by employing fast Fourier transform...
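As background for the term, a running (sliding) DFT updates a frequency bin from its previous value each time the window advances by one sample, instead of recomputing the whole transform. The sketch below uses the textbook recurrence; it is not taken from the truncated abstract above, and the names `dft_bin` and `sliding_dft_bin` are hypothetical.

```python
import cmath

def dft_bin(window, k):
    """Bin k of the DFT of one window, computed directly."""
    n = len(window)
    return sum(v * cmath.exp(-2j * cmath.pi * k * m / n)
               for m, v in enumerate(window))

def sliding_dft_bin(samples, n, k):
    """Bin k of the n-point DFT for every full window, one update per new sample."""
    twiddle = cmath.exp(2j * cmath.pi * k / n)
    acc = dft_bin(samples[:n], k)  # initialise from the first full window
    out = [acc]
    for t in range(n, len(samples)):
        # Drop the oldest sample, add the newest, then rotate the phase.
        acc = (acc - samples[t - n] + samples[t]) * twiddle
        out.append(acc)
    return out
```

Each update costs one complex multiply and two additions per bin, which is the property that makes a running formulation attractive for streaming input.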
5.
《Solid-State Circuits, IEEE Journal of》1986,21(1):140-149
Architecture elements suitable for VLSI implementation and real-time operation in movement-compensated video (MCV) processors are presented. The algorithm used in the video processor is based on motion estimation and compensation techniques. An overview of the algorithm is given with emphasis placed on one of the key functions used in the prediction, the two-dimensional interpolator. A VLSI implementation is presented which incorporates design techniques of pipelining, parallelism, and module replication. Furthermore, it is shown that modifications to the algorithm can be made based on the use of a high degree of parallelism yielding an efficient structure which relieves constraints for high-speed execution. The operations then rely on a simpler one-dimensional interpolator to form one of the building blocks of the two-dimensional interpolator. It is indicated that the parallel structure which is formed with these building blocks can be implemented on two circuits and that it can operate at speeds meeting real-time requirements.
6.
《Circuits and Systems II: Express Briefs, IEEE Transactions on》2006,53(9):911-915
This brief addresses the design of a decision feedback equalizer (DFE) for gigabit throughput rates. It is well known that the feedback loop in a DFE imposes an upper bound on the achievable speed. For an $L$-tap feedbackward filter (FBF) and $M$-pulse amplitude modulation, Parhi (1991) and Kasturia and Winters (1991) reformulated the FBF as an $M^L$-to-1 multiplexer. Due to the reformulation, the overhead of extra adders and extra multiplexers is as large as $M^L$. The required hardware overhead becomes more severe when the DFE is implemented in parallel. In this brief, we propose two new approaches to implement the DFE when a gigabit throughput rate is desired. The first approach is a partial pre-computation scheme, which can trade off hardware complexity against computational speed. The second approach is a two-stage pre-computation scheme, which can be applied to higher-speed applications. In the latter case, we can reduce the hardware overhead to about $2M^{-L/2}$ times of , , and the iteration bound is $(\log_2 W + 2)/(L/2 + 1) + \log_2 M$ multiplexer-delays, where $W$ is the wordlength of the weight coefficients of the FBF. We demonstrate the proposed architectures by applying them to 10GBASE-LX4 optical communication systems.
7.
《IEEE transactions on circuits and systems. I, Regular papers》2008,55(7):1939-1952
8.
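This paper proposes a merge sorting algorithm, the insertion-merge algorithm, and, through the mapping of this algorithm onto a systolic array, focuses on how regular mapping is used to generate VLSI arrays. Finally, it points out ways to improve the generality and flexibility of systolic arrays.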
9.
A general purpose rail-to-rail input stage suitable for analogue and mixed signal applications and compatible with modern submicron CMOS technologies, is introduced. The circuit provides, simultaneously, a constant small- and large-signal behaviour over the entire input common-mode voltage range, whilst imposing no appreciable constraint for high-frequency operation. Experimental results are given.
10.
In this paper, we present a solution to the problem of joint tiling and scheduling a given loop nest with uniform data dependencies symbolically. This challenge arises when the size and number of available processors for parallel loop execution are not known at compile time. But still, in order to avoid any overhead of dynamic (run-time) recompilation, a schedule of loop iterations shall be computed and optimized statically. In this paper, it will be shown that it is possible to derive parameterized latency-optimal schedules statically by proposing a two-step approach: First, the iteration space of a loop program is tiled symbolically into orthotopes of parameterized extensions. Subsequently, the resulting tiled program is also scheduled symbolically, resulting in a set of latency-optimal parameterized schedule candidates. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions finally steer which of these schedules will be dynamically selected and the corresponding program configuration executed on the resulting processor array so as to avoid any further run-time optimization or expensive recompilation. Our theory of symbolic loop parallelization is applied to a number of loop programs from the domains of signal processing and linear algebra. Finally, as a proof of concept, we demonstrate our proposed methodology for a massively parallel processor array architecture called tightly coupled processor array (TCPA) on which applications may dynamically claim regions of processors in the context of invasive computing.
11.
Xuejun Liang, Jean J.S.-N. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(3):485-498
Image processing algorithms for template matching, two-dimensional (2-D) digital filtering, morphologic operations, and motion estimation share some common properties. They can all benefit from using reconfigurable computers that use coprocessor boards based on field-programmable gate array (FPGA) chips. This paper characterizes those applications as generalized template matching (GTM) operations and describes the mapping of the GTM operations onto reconfigurable computers. A three-step approach is described. The first two steps enumerate and prune the design space of basic GTM building blocks, which consist of FPGA buffers and GTM computation cores. The last step is to achieve a solution through an optimal combination of these building blocks where the cost function is the FPGA computation time and the constraints are FPGA coprocessor board resources. Various FPGA buffers are presented so as to introduce design options of basic GTM building blocks. Algorithms used for the mapping are described. Experimental results are summarized to reveal the relationship between the GTM mapping results and FPGA board resource parameters.
12.
The design of Fast Fourier Transform (FFT) integrated architectures for System-on-Chip (SoC) telecom applications is addressed
in this paper. After reviewing the FFT processing requirements of wireless and wired Orthogonal Frequency Division Multiplexing
(OFDM) standards, including the emerging Multiple Input Multiple Output (MIMO) and OFDM Access (OFDMA) schemes, three FFT
architectures are proposed: a fully parallel, a pipelined cascade and an in-place variable-size architecture, which offer
different trade-offs among flexibility, processing speed and complexity. Silicon implementation results and comparisons with
the state-of-the-art prove that each macrocell outperforms the known works for a target application. The fully parallel is
optimized for throughput requirements up to several GSamples/s enabling Ultra-wideband (UWB) communications by using all channels
foreseen in the standard. The pipelined cascade macrocell minimizes complexity for large size FFTs sustaining throughput up
to 100 MSamples/s. The in-place variable-size FFT macrocell stands out for its flexibility by allowing run-time reconfigurability
required in OFDMA schemes while attaining the required throughput to support MIMO communications. The three architectures
are also compared with common case-studies and target technology.
13.
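Design of a general-purpose neural network processor and discussion of its VLSI integration  Total citations: 6, self-citations: 2, citations by others: 4
This paper discusses the performance requirements of a general-purpose neural network processor and the advantages and disadvantages of various processing approaches, including all-analog, all-digital, and mixed digital-analog processing, and designs a general-purpose neural network processor architecture based on mixed digital-analog processing. Under current VLSI integration technology, this architecture offers a high performance-to-price ratio.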
14.
Qiang Liu, Tim Todman, Wayne Luk, George A. Constantinides 《Journal of Signal Processing Systems》2012,67(1):65-78
The MapReduce pattern can be found in many important applications, and can be exploited to significantly improve system parallelism. Unlike
previous work, in which designers explicitly specify how to exploit the pattern, we develop a compilation approach for mapping
applications with the MapReduce pattern automatically onto Field-Programmable Gate Array (FPGA) based parallel computing platforms.
We formulate the problem of mapping the MapReduce pattern to hardware as a geometric programming model; this model exploits
loop-level parallelism and pipelining to give an optimal implementation on given hardware resources. The approach is capable
of handling single and multiple nested MapReduce patterns. Furthermore, we explore important variations of MapReduce, such
as using a linear structure rather than a tree structure for merging intermediate results generated in parallel. Results for
six benchmarks show that our approach can find performance-optimal designs in the design space, improving system performance
by up to 170 times compared to the initial designs on the target platform.
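The two merging structures mentioned in the abstract, a linear chain and a tree, can be contrasted with a small software sketch. This is only an illustration of the dependency structure, not the geometric programming model or the FPGA mapping from the paper; `reduce_linear` and `reduce_tree` are hypothetical names.

```python
def reduce_linear(values, op):
    """Chain merge: every step depends on the previous one (n - 1 sequential steps)."""
    acc = values[0]
    steps = 0
    for v in values[1:]:
        acc = op(acc, v)
        steps += 1
    return acc, steps

def reduce_tree(values, op):
    """Pairwise merge: all merges within one level are independent of each other."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired element moves up unchanged
        level = nxt
        depth += 1
    return level[0], depth

# Map phase: independent per-element work; reduce phase: merge the partial results.
partials = [x * x for x in range(1, 9)]
total_linear, linear_steps = reduce_linear(partials, lambda a, b: a + b)
total_tree, tree_depth = reduce_tree(partials, lambda a, b: a + b)
```

Both structures produce the same result for an associative operator. The tree has fewer dependent levels, while the chain uses a single merge unit at a time, so which one fits better depends on a resource model that this sketch does not capture.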
15.
A VLSI design synthesis approach with testability, area, and delay constraints is presented. This research differs from other synthesizers by implementing testability as part of the VLSI design solution. A binary-tree data structure is used throughout the testable design search. Its bottom-up and top-down algorithms provide data-path allocation, constraint estimation, and feedback for design exploration. The partitioning and two-dimensional characteristics of the binary-tree structure provide VLSI design floorplans and global information for test incorporation. A differential equation and an elliptical wave filter example were used to illustrate the design synthesis methodology with testability constraints. Test methodologies such as multiple-chain scan paths and BIST (built-in self-test) with different test schedules were explored. Design scores comprising area, delay, fault coverage, and test time were computed and graphed.
16.
17.
《Solid-State Circuits, IEEE Journal of》1982,17(2):220-226
A fabrication process for the Lightly Doped Drain/Source Field-Effect Transistor, LDDFET, that utilizes RIE-produced SiO$_2$ sidewall spacers is described. The process is compatible with most conventional polysilicon-gated FET processes and needs no additional photo-masking steps. Excellent control and reproducibility of the n$^-$ region of the LDD device are obtained. Measurements from dynamic clock generators have shown that LDDFETs have as much as a 1.9x performance advantage over conventional devices.
18.
《Electron Devices, IEEE Transactions on》1979,26(4):284-291
This paper presents the interaction of VLSI technology progress with future minicomputer product development over the next five years. The beginning portion of the paper demonstrates how VLSI trends in both cost and performance constitute a technology push generating secondary trends that constrain the direction in which minicomputer product development can proceed. Since a minicomputer system consists basically of a CPU, a primary memory, and an I/O subsystem, the interaction of VLSI technology with each of these subsystems is also discussed.
19.
《Solid-State Circuits, IEEE Journal of》1979,14(2):206-213
The author discusses the interaction of VLSI technology progress with future minicomputer product development over the next five years. He demonstrates how VLSI trends in both cost and performance constitute a technology push generating secondary trends that constrain the direction in which minicomputer product development can proceed. Since a minicomputer system consists basically of a CPU, a primary memory, and an I/O subsystem, the interaction of VLSI technology with each of these subsystems is also discussed.
20.
Two schemes for power-efficient gain-programmable V-I conversion based on class AB CMOS mirrors are introduced. The proposed topologies also allow for high-speed gain-programmable precision rectification. Experimental results from a test chip prototype in 0.5-μm CMOS technology with ±1 V supplies are shown that validate the proposed circuits.