Similar literature
 Found 20 similar documents (search time: 46 ms)
1.
We present an algorithmically-aware, high-level partitioning methodology for discrete cosine transforms (DCT) targeted to distributed hardware architectures. The methodology relies on the exploration of alternate DCT formulations as part of the partition optimization process. To the best of our knowledge, no previously proposed DCT algorithm exists that is capable of consistently producing alternate regular formulations for an n-size DCT. Hence, a new Cooley-Tukey-like DCT factorization algorithm was developed to allow exploration of alternate formulations as part of the partitioning optimization process. The use of our factorization mechanism along with a greedy strategy to explore the space of equivalent DCT formulations yielded partitioning solutions with as much as 18% reduction in latency and 83% reduction in run-time as compared to previously proposed regular DCT formulations.
Domingo Rodríguez

2.
We present an efficient approach for partitioning algorithms that implement long convolutions. The dependence graph (DG) of a convolution algorithm is partitioned, using a locally sequential globally parallel (LSGP) scheme, into smaller, less complex convolution algorithms. The LSGP-partitioned DG is mapped onto a signal flow graph (SFG), in which each processor element (PE) performs a small convolution algorithm. The key is then to reduce the complexity of the SFG in two steps: (1) local reduction of complexity: the short Fast Fourier Transform (FFT) is used to perform the small convolution within the PE; and (2) global reduction of complexity: the short FFTs within the PEs are relocated to the global level, where redundant short FFT operations are eliminated. The remaining operation within the PEs is now a simple element-wise multiply-add. After a graph transform, the structure of the SFG kernel is recognized as a set of parallel small convolutions. If we use the short FFT to perform these short convolutions, we arrive at our final realization of the long convolution algorithm. The computational complexity of this realization is close to the optimum for convolutions, that is, O(N log N). Our approach thus achieves this N log N behavior without having to implement large-size FFTs; we use small FFT blocks instead. The advantage is that small FFT transforms are commercially available, and that they can even be implemented in single-chip VLSI architectures. Our final SFG is three-dimensional and can be mapped efficiently onto prototype architectures or dedicated VLSI processors. We demonstrate the procedure in the paper with a design example: the implementation of a prototype convolution architecture that we designed for a real-time radar imaging system.
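The two reduction steps in this abstract can be illustrated with a small software sketch. This is not the paper's LSGP mapping or its SFG; it is a plain block-convolution analogue written under stated assumptions: the block length is a power of two, every block is transformed exactly once by a short FFT (the "global" step), and the inner loop is only an element-wise multiply-add. All function names are hypothetical.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(X):
    """Inverse FFT via the conjugate trick."""
    n = len(X)
    return [v.conjugate() / n for v in fft([c.conjugate() for c in X])]

def short_ffts(signal, block, size):
    """Transform each length-`block` piece once, zero-padded to `size`."""
    pieces = [list(signal[i:i + block]) for i in range(0, len(signal), block)]
    return [fft(p + [0.0] * (size - len(p))) for p in pieces]

def block_convolve(x, h, block=4):
    """Long linear convolution built only from short FFTs (block = power of two)."""
    size = 2 * block                      # holds a 2*block-1 point short convolution
    X, H = short_ffts(x, block, size), short_ffts(h, block, size)
    y = [0.0] * (len(x) + len(h) - 1)
    for d in range(len(X) + len(H) - 1):  # all block pairs with i + j == d
        acc = [0j] * size
        for i in range(len(X)):
            j = d - i
            if 0 <= j < len(H):
                for k in range(size):     # element-wise multiply-add only
                    acc[k] += X[i][k] * H[j][k]
        for k, v in enumerate(ifft(acc)):
            if d * block + k < len(y):
                y[d * block + k] += v.real
    return y

def direct_convolve(x, h):
    """Reference O(N^2) convolution used to check the result."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(h):
            y[i + j] += a * b
    return y
```

Comparing `block_convolve(x, h)` against `direct_convolve(x, h)` on short test vectors is a quick way to confirm that the block decomposition reproduces the long convolution up to rounding error.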

3.
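1 Introduction. In current VLSI production, pattern replication still relies mainly on ultraviolet (UV) lithography systems. With the successful development of excimer stepper lithography, the resolution of optical lithography can reach below 0.5 μm, so VLSI device feature sizes can shrink further and integration density can increase further. At the same...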

4.
Journal of Signal Processing Systems - The running discrete Fourier transform (running DFT) is being used to overcome the drawbacks of the ping-pong buffer technique by employing the fast Fourier transform...
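The abstract above is truncated, so the paper's own architecture is not visible here. As background only, the following is a minimal sketch of the standard sliding (running) DFT recurrence for a single bin, in which each new window is obtained from the previous one by removing the oldest sample, adding the newest, and applying one twiddle-factor rotation. The function names are hypothetical.

```python
import cmath

def dft_bin(window, k):
    """Bin k of the DFT of `window`, computed directly from the definition."""
    n = len(window)
    return sum(v * cmath.exp(-2j * cmath.pi * k * m / n)
               for m, v in enumerate(window))

def sliding_dft(samples, n, k):
    """Bin k of the n-point DFT for every full window position in `samples`."""
    twiddle = cmath.exp(2j * cmath.pi * k / n)
    value = dft_bin(samples[:n], k)       # first window: computed directly
    results = [value]
    for t in range(n, len(samples)):
        # Drop the oldest sample, add the newest, rotate by one twiddle factor.
        value = (value - samples[t - n] + samples[t]) * twiddle
        results.append(value)
    return results
```

Each update costs a constant number of operations per bin, whereas recomputing the window from scratch costs work proportional to the window length.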

5.
Architecture elements suitable for VLSI implementation and real-time operation in movement-compensated video (MCV) processors are presented. The algorithm used in the video processor is based on motion estimation and compensation techniques. An overview of the algorithm is given with emphasis placed on one of the key functions used in the prediction, the two-dimensional interpolator. A VLSI implementation is presented which incorporates design techniques of pipelining, parallelism, and module replication. Furthermore, it is shown that modifications to the algorithm can be made based on the use of a high degree of parallelism yielding an efficient structure which relieves constraints for high-speed execution. The operations then rely on a simpler one-dimensional interpolator to form one of the building blocks of the two-dimensional interpolator. It is indicated that the parallel structure which is formed with these building blocks can be implemented on two circuits and that it can operate at speeds meeting real-time requirements.

6.
This brief addresses the design of a decision feedback equalizer (DFE) for gigabit throughput rates. It is well known that the feedback loop in a DFE imposes an upper bound on the achievable speed. For an $L$-tap feedbackward filter (FBF) and $M$-pulse amplitude modulation, Parhi (1991) and Kasturia and Winters (1991) reformulated the FBF as an $M^L$-to-1 multiplexer. Due to the reformulation, the overhead of extra adders and extra multiplexers is as large as $M^L$. The required hardware overhead becomes more severe when the DFE is implemented in parallel. In this brief, we propose two new approaches to implement the DFE when a gigabit throughput rate is desired. The first approach is a partial pre-computation scheme, which can trade off hardware complexity against computational speed. The second approach is a two-stage pre-computation scheme, which can be applied to higher-speed applications. In the latter case, we can reduce the hardware overhead to about $2M^{-L/2}$ times that of the prior designs, and the iteration bound is $(\log_2 W + 2)/(L/2 + 1) + \log_2 M$ multiplexer delays, where $W$ is the wordlength of the weight coefficients of the FBF. We demonstrate the proposed architectures by applying them to 10GBASE-LX4 optical communication systems.
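The $M^L$-to-1 multiplexer reformulation mentioned in this abstract can be sketched in software. The code below is an illustration of that full pre-computation idea only, not of the partial or two-stage schemes the brief proposes. The 4-PAM levels, the initial history, and all function names are assumptions made for the example.

```python
from itertools import product

LEVELS = (-3, -1, 1, 3)   # assumed 4-PAM alphabet, so M = 4

def slicer(value, levels=LEVELS):
    """Decide on the nearest symbol level."""
    return min(levels, key=lambda s: abs(value - s))

def dfe_direct(received, weights, levels=LEVELS):
    """Conventional DFE: each decision waits for the feedback sum."""
    past = (levels[0],) * len(weights)    # assumed initial decision history
    out = []
    for r in received:
        feedback = sum(w * p for w, p in zip(weights, past))
        d = slicer(r - feedback, levels)
        out.append(d)
        past = (d,) + past[:-1]
    return out

def dfe_precomputed(received, weights, levels=LEVELS):
    """Pre-compute all M**L feedback sums; select one with a multiplexer."""
    table = {hist: sum(w * p for w, p in zip(weights, hist))
             for hist in product(levels, repeat=len(weights))}
    past = (levels[0],) * len(weights)
    out = []
    for r in received:
        # These candidates do not depend on past decisions, so in hardware
        # they can be evaluated outside the feedback loop.
        candidates = {hist: slicer(r - fb, levels) for hist, fb in table.items()}
        d = candidates[past]              # the M**L-to-1 multiplexer
        out.append(d)
        past = (d,) + past[:-1]
    return out
```

The table has $M^L$ entries, which is the hardware overhead the abstract refers to; only the final selection remains inside the decision loop.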

7.
The lifting scheme has become an important tool for designing filter banks and transforms in digital signal processing. Recently, the conventional lifting scheme that concerns the construction of 2-channel filter banks has been extended to $M$-channel filter banks $(M>2)$, bringing the desirable properties of the lifting scheme to a broader range of applications. Many hand-crafted lifting-based VLSI architectures exist, which mostly concentrate on a single and specific target application having fixed data throughput and resource consumption. However, the reusability of such architectures is limited due to the lack of scalability. To overcome this issue, we present a design methodology for automatic synthesis of VLSI architectures suitable for arbitrary lifting-based $M$-channel filter banks and transforms. The proposed methodology enables high parameterizability in terms of data throughput, resource consumption, and arithmetic precision for the generated architectures. The concept of parameterizing design elements is important for modern system-on-chip design, since it features design space exploration and increases reusability. The proposed methodology is implemented as a high-level compilation tool that generates VLSI architectures at the register transfer level. We present results on the implementation of different architectures that were generated by our tool.
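For readers unfamiliar with lifting, the sketch below shows the conventional 2-channel case only, not the $M$-channel extension or the synthesis tool described in the abstract. It uses the well-known integer Haar (S-transform) lifting steps: split into even and odd samples, predict, then update. The function names are hypothetical, and the input is assumed to have even length.

```python
def lifting_forward(signal):
    """Integer Haar transform by lifting; len(signal) must be even."""
    even, odd = signal[0::2], signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]            # predict step
    approx = [e + d // 2 for e, d in zip(even, detail)]    # update step
    return approx, detail

def lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order with opposite signs."""
    even = [s - d // 2 for s, d in zip(approx, detail)]    # undo update
    odd = [d + e for e, d in zip(even, detail)]            # undo predict
    signal = []
    for e, o in zip(even, odd):
        signal += [e, o]
    return signal
```

Because each step is inverted by simply subtracting what was added, reconstruction is exact even with integer rounding, which is one of the properties that makes lifting attractive for hardware.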

8.
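This paper proposes a merge-sort algorithm, the insertion-merge algorithm, and, by mapping the algorithm onto a systolic array, focuses on how regular mapping can be used to generate VLSI arrays. Finally, it points out ways to improve the generality and flexibility of systolic arrays.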

9.
A general-purpose rail-to-rail input stage suitable for analogue and mixed-signal applications and compatible with modern submicron CMOS technologies is introduced. The circuit provides, simultaneously, a constant small- and large-signal behaviour over the entire input common-mode voltage range, whilst imposing no appreciable constraint for high-frequency operation. Experimental results are given.

10.
In this paper, we present a solution to the problem of symbolically tiling and scheduling, in a joint manner, a given loop nest with uniform data dependencies. This challenge arises when the size and number of available processors for parallel loop execution are not known at compile time. Still, in order to avoid any overhead of dynamic (run-time) recompilation, a schedule of loop iterations shall be computed and optimized statically. In this paper, it will be shown that it is possible to derive parameterized latency-optimal schedules statically by proposing a two-step approach: First, the iteration space of a loop program is tiled symbolically into orthotopes of parameterized extensions. Subsequently, the resulting tiled program is also scheduled symbolically, resulting in a set of latency-optimal parameterized schedule candidates. At run time, once the size of the processor array becomes known, simple comparisons of latency-determining expressions finally steer which of these schedules will be dynamically selected and the corresponding program configuration executed on the resulting processor array, so as to avoid any further run-time optimization or expensive recompilation. Our theory of symbolic loop parallelization is applied to a number of loop programs from the domains of signal processing and linear algebra. Finally, as a proof of concept, we demonstrate our proposed methodology for a massively parallel processor array architecture called tightly coupled processor array (TCPA), on which applications may dynamically claim regions of processors in the context of invasive computing.

11.
Image processing algorithms for template matching, two-dimensional (2-D) digital filtering, morphologic operations, and motion estimation share some common properties. They can all benefit from using reconfigurable computers that use coprocessor boards based on field-programmable gate array (FPGA) chips. This paper characterizes those applications as generalized template matching (GTM) operations and describes the mapping of the GTM operations onto reconfigurable computers. A three-step approach is described. The first two steps enumerate and prune the design space of basic GTM building blocks, which consist of FPGA buffers and GTM computation cores. The last step is to achieve a solution through an optimal combination of these building blocks where the cost function is the FPGA computation time and the constraints are FPGA coprocessor board resources. Various FPGA buffers are presented so as to introduce design options of basic GTM building blocks. Algorithms used for the mapping are described. Experimental results are summarized to reveal the relationship between the GTM mapping results and FPGA board resource parameters.

12.
The design of Fast Fourier Transform (FFT) integrated architectures for System-on-Chip (SoC) telecom applications is addressed in this paper. After reviewing the FFT processing requirements of wireless and wired Orthogonal Frequency Division Multiplexing (OFDM) standards, including the emerging Multiple Input Multiple Output (MIMO) and OFDM Access (OFDMA) schemes, three FFT architectures are proposed: a fully parallel, a pipelined cascade and an in-place variable-size architecture, which offer different trade-offs among flexibility, processing speed and complexity. Silicon implementation results and comparisons with the state of the art show that each macrocell outperforms the known works for a target application. The fully parallel macrocell is optimized for throughput requirements up to several GSamples/s, enabling Ultra-wideband (UWB) communications by using all channels foreseen in the standard. The pipelined cascade macrocell minimizes complexity for large-size FFTs, sustaining throughput up to 100 MSamples/s. The in-place variable-size FFT macrocell stands out for its flexibility by allowing the run-time reconfigurability required in OFDMA schemes while attaining the required throughput to support MIMO communications. The three architectures are also compared with common case studies and target technology.

13.
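Design of a General-Purpose Neural Network Processor and a Discussion of Its VLSI Integration   (Total citations: 6; self-citations: 2; citations by others: 4)
魏允 (Wei Yun), 王守觉 (Wang Shoujue). 《电子学报》 (Acta Electronica Sinica), 1995, 23(5): 7-11
This paper discusses the performance requirements of a general-purpose neural network processor and the advantages and disadvantages of the various processing approaches, including fully analog, fully digital, and mixed digital-analog processing. It then presents the design of a general-purpose neural network processor architecture based on mixed digital-analog processing. Under current VLSI integration technology, this architecture offers a relatively high performance-to-price ratio.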

14.
The MapReduce pattern can be found in many important applications, and can be exploited to significantly improve system parallelism. Unlike previous work, in which designers explicitly specify how to exploit the pattern, we develop a compilation approach for mapping applications with the MapReduce pattern automatically onto Field-Programmable Gate Array (FPGA) based parallel computing platforms. We formulate the problem of mapping the MapReduce pattern to hardware as a geometric programming model; this model exploits loop-level parallelism and pipelining to give an optimal implementation on given hardware resources. The approach is capable of handling single and multiple nested MapReduce patterns. Furthermore, we explore important variations of MapReduce, such as using a linear structure rather than a tree structure for merging intermediate results generated in parallel. Results for six benchmarks show that our approach can find performance-optimal designs in the design space, improving system performance by up to 170 times compared to the initial designs on the target platform.
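The difference between the linear and tree merging structures mentioned in this abstract can be shown with a small software sketch. It is not the paper's geometric programming model or its FPGA mapping; it only counts how many sequential merge levels each structure needs, assuming an associative merge operator and unlimited parallel mergers per level. The function names are hypothetical.

```python
def map_stage(func, items):
    """Apply `func` to every item; the calls are independent of each other."""
    return [func(x) for x in items]

def reduce_linear(op, values):
    """Merge in a chain; returns (result, number of sequential merge steps)."""
    acc, depth = values[0], 0
    for v in values[1:]:
        acc = op(acc, v)
        depth += 1
    return acc, depth

def reduce_tree(op, values):
    """Merge pairwise, level by level; returns (result, number of levels)."""
    level, depth = list(values), 0
    while len(level) > 1:
        merged = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                 # an unpaired value moves up unchanged
            merged.append(level[-1])
        level, depth = merged, depth + 1
    return level[0], depth
```

For eight mapped values the chain needs seven sequential merges while the tree needs three levels, although the tree uses more merge units at once; this resource-versus-latency tension is the kind of trade-off the abstract describes exploring.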

15.
A VLSI design synthesis approach with testability, area, and delay constraints is presented. This research differs from other synthesizers by implementing testability as part of the VLSI design solution. A binary-tree data structure is used throughout the testable design search. Its bottom-up and top-down algorithms provide data-path allocation, constraint estimation, and feedback for design exploration. The partitioning and two-dimensional characteristics of the binary-tree structure provide VLSI design floorplans and global information for test incorporation. A differential equation example and an elliptical wave filter example were used to illustrate the methodology of design synthesis with testability constraints. Test methodologies such as multiple-chain scan paths and BIST (built-in self-test) with different test schedules were explored. Design scores comprising area, delay, fault coverage, and test time were computed and graphed.

16.
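The AES algorithm is analyzed and optimized in terms of both structure and algorithm. Encryption and decryption are integrated in a single module, and all five modes of operation of AES are implemented, so that the design can meet the needs of a variety of confidentiality applications. Simulation and synthesis results show that the design achieves a good trade-off between area and speed.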

17.
A fabrication process for the Lightly Doped Drain/Source Field-Effect Transistor, LDDFET, that utilizes RIE-produced SiO$_2$ sidewall spacers is described. The process is compatible with most conventional polysilicon-gated FET processes and needs no additional photo-masking steps. Excellent control and reproducibility of the $n^-$ region of the LDD device are obtained. Measurements from dynamic clock generators have shown that LDDFETs have as much as a 1.9× performance advantage over conventional devices.

18.
This paper presents the interaction of VLSI technology progress with future minicomputer product development over the next five years. The beginning portion of the paper demonstrates how VLSI trends in both cost and performance constitute a technology push generating secondary trends that constrain the direction in which minicomputer product development can proceed. Since a minicomputer system consists basically of a CPU, a primary memory, and an I/O subsystem, the interaction of VLSI technology with each of these subsystems is also discussed.

19.
The author discusses the interaction of VLSI technology progress with future minicomputer product development over the next five years. He demonstrates how VLSI trends in both cost and performance constitute a technology push generating secondary trends that constrain the direction in which minicomputer product development can proceed. Since a minicomputer system consists basically of a CPU, a primary memory, and an I/O subsystem, the interaction of VLSI technology with each of these subsystems is also discussed.

20.
Two schemes for power-efficient gain-programmable V-I conversion based on class AB CMOS mirrors are introduced. The proposed topologies also allow for high-speed gain-programmable precision rectification. Experimental results from a test chip prototype in 0.5-μm CMOS technology with ±1 V supplies are shown that validate the proposed circuits.
