首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 560 毫秒
1.
An Algorithm-Hardware-System Approach to VLIW Multimedia Processors   总被引:2,自引:0,他引:2  
Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware and system based point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads.  相似文献   

2.
3.
媒体处理器通常采用汇编语言编程以满足代码大小、性能和能耗方面的要求。本文提出了媒体处理器高级语言的设计原则,并针对Leadtek公司的媒体处理器设计,实现了VP6-C语言及其编译系统。VP6-C语言用于编写多媒体处理中的核心程序,为程序员提供一种C风格的、自然的编程方式。试验结果表明编译生成的目标代码有较高的质量。  相似文献   

4.
In order to develop a low-power and high-performance SoC platform for multimedia applications, the Parallel Architecture Core (PAC) project was initiated in Taiwan in 2003. A VLIW digital signal processor (PACDSP) has been developed from a proprietary instruction set with multimedia-rich instructions, a complexity-effective microarchitecture with an innovative distributed & ping-pong register organization and variable-length VLIW encoding, to a highly-configurable soft IP with several successful silicon implementations. A complete toolchain with an optimizing C compiler has also been developed for PACDSP. A dual-core PAC SoC has been designed and fabricated, which consists of a PACDSP core, an ARM9 core, scratchpad memories, and various on-chip peripherals, to demonstrate the outstanding performance and energy efficiency for multimedia processing such as the real-time H.264 codec. The first part of the two introductory papers of PAC describes the hardware architecture of the PACDSP core, its software development tools, and the PAC SoC with dynamic voltage and frequency scaling (DVFS).  相似文献   

5.
多核处理器已经成为处理器的主流,并发展成为各种通信与媒体应用的主流处理平台。通讯结构是多核系统中的核心技术之一,核间通信的效率是影响多核处理器性能的重要指标。目前有3种主要的通讯架构:总线系统结构、交叉开关网络和片上网络。总线结构设计相对方便、硬件消耗较少、成本较低;交叉开关是适合用于构建大容量系统的交换网络结构;而片上网络是更高层次、更大规模的片上网络系统,目前可以解决多核体系结构问题,是多核系统最有前途的解决方案之一。文中在分析了NoC结构的基本原理、系统结构和功能的同时,也提供了部分单元的设计实现。  相似文献   

6.
This paper describes the design and implementation of the massively parallel processor based on the matrix architecture which is suitable for portable multimedia applications. The proposed architecture in this paper achieves the high performance of 40 GOPS in the case of consecutive fixed-point 16-bit additions at 200MHz clock frequency and the small power dissipation of 250mW. In addition, 1Mbit SRAM for data registers and 2048 2-bit-grained processing elements connected by a flexible switching network are integrated in the small area of 3.1 mm 2 in 90nm CMOS low standby technology. These design techniques and architectures described in this paper are attractive for realizing area-efficient, energy-efficient, and high-performance multimedia processors  相似文献   

7.
High-performance, reliable, and robust products with a short development schedule are general design aims. FACE was developed to achieve these goals, including the organization of a design flow, a frequency-driven information analyzer, compiler techniques (code generator and instruction optimization), and a hierarchical object design library. This paper explores the design space of a retargetable compiler and a reconfigurable hardware, which combine both software and hardware reprogrammability. The environment, FACE, we have developed allows us to quickly move the functions between software and hardware in a state of flux. Finally, it generates the application specific integrated processor (ASIP) and a compiler for the new ASIP architecture. The case study is considered which demonstrates the efficiency in ASIP design of FACE.  相似文献   

8.
PLX is a concise instruction set architecture (ISA) that combines the most useful features from previous generations of multimedia instruction sets with newer ISA features for high-performance, low-cost multimedia information processing. Unlike previous multimedia instruction sets, PLX is not added onto a base processor ISA, but designed from the beginning as a standalone processor architecture optimized for media processing. Its design goals are high performance multimedia processing, general-purpose programmability to support an ever-growing range of applications, simplicity for constrained environments where low power and low cost are paramount, and scalability for higher performance in less constrained multimedia systems. Another design goal of PLX is to facilitate exploration and evaluation of novel techniques in instruction set architecture, microarchitecture, arithmetic, VLSI implementations, compiler optimizations, and parallel algorithm design for new computing paradigms.Key characteristics of PLX are a fully subword-parallel architecture with novel features like wordsize scalability from 32-bit to 128-bit words, a new definition of predication, and an innovative set of subword permutation instructions. We demonstrate the use and high performance of PLX on some frequently-used code kernels selected from image, video, and graphics processing applications: discrete cosine transform, pixel padding, clip test, and median filter. Our results show that a 64-bit PLX processor achieves significant speedups over a basic 64-bit RISC processor and over IA-32 processors with MMX and SSE multimedia extensions. Using PLXs wordsize scalability feature, PLX-128 often provides an additional 2× speedup over PLX-64 in a cost-effective way. Superscalar or VLIW (Very Long Instruction Word) PLX implementations can also add additional performance through inter-instruction, rather than intra-instruction parallelism. We also describe the PLX testbed and its software tools for architecture and related research.Ruby B. Lee is the Forrest G. Hamrick Professor of Engineering and Professor of Electrical Engineering at Princeton University, with an affiliated appointment in the Computer Science department. She is the founder and director of the Princeton Architecture Laboratory for Multimedia and Security (PALMS). Her current research is in rethinking computer architecture for high-performance but low-cost security and multimedia processing. Prior to joining the Princeton faculty in 1998, Dr. Lee served as chief architect at Hewlett-Packard, responsible at different times for processor architecture, multimedia architecture, and security architecture for e-commerce and extended enterprises. She was a key architect in the initial definition and the evolution of the PA-RISC processor architecture used in HP servers and workstations. As chief architect for HPs multimedia architecture team, Dr. Lee led an inter-disciplinary team focused on architecture to facilitate pervasive multimedia information processing using general-purpose computers. She introduced innovative multimedia instruction set architecture (MAX and MAX-2) in microprocessors, resulting in the industrys first real-time, high-fidelity MPEG video and audio player implemented in software on low-end desktop computers. Dr. Lee also co-led an HP-Intel multimedia architecture team for IA-64, released in Intels Itanium microprocessors. Concurrent with full-time employment at HP, Dr. Lee also served as Consulting Professor of Electrical Engineering at Stanford University. Dr. Lee has a Ph.D. in Electrical Engineering and a M.S. in Computer Science, both from Stanford University, and an A.B. from Cornell University, where she was a College Scholar. She is a Fellow of ACM, a Fellow of IEEE, and a member of IS&T, Phi Beta Kappa, and Alpha Lambda Delta. She has been granted 115 U.S. and international patents, with several patent applications pending.A. Murat Fiskiran is a Ph. D. student at the Department of Electrical Engineering at Princeton University. He is a member of the Princeton Architecture Laboratory for Multimedia and Security (PALMS) and a Kodak Fellow. His research interests include computer architecture and computer security.  相似文献   

9.
Novel circuits and design methodology of the massively parallel processor based on the matrix architecture are introduced. A fine-grained processing elements (PE) circuit for high-throughput MAC operations based on the Booth's algorithm enhances the performance of a 16-bit fixed-point signed MAC, which operates up to 30.0GOPS/W. The dedicated I/O interface circuits are designed for converting the direction of data access and supporting the interleaved memory architecture, and they are implemented for maximizing the processor core efficiency. Power management techniques for suppressing current peaks and reducing average power consumption are proposed to enhance the robustness of the macro. The circuits and the design methodology proposal in this paper are attractive for achieving a high performance and robust massively parallel SIMD processor core employed in multimedia SoCs  相似文献   

10.
11.
Many embedded multimedia systems employ special hardware blocks to co‐process with the main processor. Even though an efficient handling of such hardware blocks is critical on the overall performance of real‐time multimedia systems, traditional real‐time scheduling techniques cannot afford to guarantee a high quality of multimedia playbacks with neither delay nor jerking. This paper presents a hardware‐aware rate monotonic scheduling (HA‐RMS) algorithm to manage hardware tasks efficiently and handle special hardware blocks in the embedded multimedia system. The HA‐RMS prioritizes the hardware tasks over software tasks not only to increase the hardware utilization of the system but also to reduce the output jitter of multimedia applications, which results in reducing the overall response time.  相似文献   

12.
Simplifying the programming models is paramount to the success of reconfigurable computing with field programmable gate arrays (FPGAs). This paper presents a methodology to combine true object-oriented design of the compiler/CAD tool with an object-oriented hardware design methodology in C++. The resulting system provides all the benefits of object-oriented design to the compiler/CAD tool designer and to the hardware designer/programmer. The two examples for domain-specific compilers presented are BSAT and StReAm. Each domain-specific compiler is targeted at a very specific application domain, such as applications that accelerate Boolean satisfiability problems with BSAT, and applications which lend themselves for implementation as a stream architecture with StReAm. The key benefit of the presented domain specific compilers is a reduction of design time by orders of magnitude while keeping the optimal performance of hand-designed circuits  相似文献   

13.
Many-core processors are good candidates for speeding up video coding because the parallelism of these applications can be exploited more efficiently by the many-core architecture. Lock methods are important for many-core architecture to ensure correct execution of the program and communication between threads on chip. The efficiency of lock method is critical to overall performance of chipped many-core processor. In this paper, we propose two types of hardware locks for on-chip many-core architecture, a centralized lock and a distributed lock. First, we design the architectures of centralized lock and distributed lock to implement the two hardware lock methods. Then, we evaluate the performance of the two hardware locks and a software lock by quantitative evaluation micro-benchmarks on a many-core processor simulator Godson-T. The experimental results show that the locks with dedicated hardware support have higher performance than the software lock, and the distributed hardware lock is more scalable than the centralized hardware lock.  相似文献   

14.
Embedded system architectures comprising of software programmable components (e.g. DSP, ASIP, and micro-controller cores) and customized hardware co-processors, integrated into a single cost-efficient VLSI chip, are emerging as a key solution to todays microelectronics design problems. This trend is being driven by new emerging applications in the areas of wireless communication, high-speed optical networking, and multimedia computing, fueled by increasing levels of integration. These applications are often subject to stringent requirements in terms of processing performance, power dissipation, and flexibility. A key problem confronted by embedded system designers today is the rapid prototyping of an application-specific embedded system architecture where different combinations of programmable processor components, library hardware components, and customized hardware components must be integrated together, while ensuring that the hardware and software parts communicate correctly. Designers often spend an enormous time on this highly error proned task. In this paper, we present a solution to this embedded architecture co-synthesis and system integration problem based on an orchestrated combination of architectural strategies, parameterized libraries, and software CAD tools.  相似文献   

15.
The single instruction multiple data (SIMD) architecture is very efficient for executing arithmetic intensive programs, but frequently suffers from data-alignment problems. The data-alignment problem not only induces extra time overhead but also hinders automatic vectorization of the SIMD compiler. In this paper, we compare three on-chip memory systems, which are single-bank, multi-bank, and multi-port, for the SIMD architecture to resolve the data-alignment problems. The single-bank memory is the simplest, but supports only the aligned accesses. The multi-bank memory requires a little higher complexity, but enables the unaligned accesses and the stride accesses with a bank-conflict limitation. The multi-port memory is capable of both the unaligned and stride accesses without any restriction, but needs quite much expensive hardware. We also developed a vectorizing compiler that can conduct dynamic memory allocation and SIMD code generation. The performances of the three memory systems with our SIMD compiler are evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that the multi-bank memory can carry out MPEG2 encoding 5.8 times faster, whereas the single-bank memory only achieves 2.9 times speed-up when employed in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor. The multi-port memory obviously shows the best performance, which is however an impractical improvement over the multi-bank memory when the hardware cost is considered.  相似文献   

16.
A high performance digital architecture for the implementation of a nonlinear image enhancement technique is proposed in this paper. The image enhancement is based on an illuminance-reflectance model which improves the visual quality of digital images and video captured under insufficient or non-uniform lighting conditions. The algorithm shows robust performance with appropriate dynamic range compression, good contrast, accurate and consistent color rendition. The algorithm contains a large number of complex computations and thus it requires specialized hardware implementation for real-time applications. Systolic, pipelined and parallel design techniques are utilized effectively in the proposed FPGA-based architectural design to achieve real-time performance. Approximation techniques are used in the hardware algorithmic design to achieve high throughput. The video enhancement system is implemented using Xilinx's multimedia development board that contains a VirtexII-X2000 FPGA and it is capable of processing approximately 63 Mega-pixels (Mpixels) per second.  相似文献   

17.
基于Xscale架构的高端处理器PXA270具有优异的多媒体处理性能.针对PXA270的特点开发出嵌入式应用程序,具有较好的应用前景.研究了PXA270微处理器和Xscale架构原理,介绍了基于PXA270开发板的嵌入式环境搭建过程,并通过交叉编译方式实现了嵌入式Linux操作系统内的应用程序移植.  相似文献   

18.
Low power and high performance are the two most important criteria for many signal-processing system designs, particularly in real-time multimedia applications. There have been many approaches to achieve these two design goals at many different implementation levels ranging from very-large-scale-integration fabrication technology to system design. We review the works that have been done at various levels and focus on the algorithm-based approaches for low-power and high-performance design of signal processing systems. We present the concept of multirate computing that originates from filterbank design, then show how to employ it along with the other algorithmic methods to develop low-power and high-performance signal processing systems. The proposed multirate design methodology is systematic and applicable to many problems. We demonstrate that multirate computing is a powerful tool at the algorithmic level that enables designers to achieve either significant power reduction or high throughput depending on their choice. Design examples on basic multimedia processing blocks such as filtering, source coding, and channel coding are given. A digital signal-processing engine that is an adaptive reconfigurable architecture is also derived from the common features of our approach. Such an architecture forms a new generation of high-performance embedded signal processor based on the adaptive computing model. The goal of this paper is to demonstrate the flexibility and effectiveness of algorithm-based approaches and to show that the multirate approach is an effective and systematic design methodology to achieve low-power and high throughput signal processing at the algorithmic and architectural level  相似文献   

19.
针对现代电子设计低成本、高效率、高灵活性的特点,研究了一种新型的处理器:事件驱动多核心处理器。通过对这种处理器基本构架的研究,以及采用新型处理器与采用传统控制器设计差异的对比,分析出该处理器具有性能高、实时性强、易编程等优点。最后,提出了一种新的设计方法:硬件设计软件化,给众多电子系统设计提供新的思路和参考。  相似文献   

20.
This paper presents a novel unified and programmable 2-D Discrete Wavelet Transform (DWT) system architecture, which was implemented using a Field Programmable Gate Array (FPGA)-based Nios II soft-core processor working in combination with custom hardware accelerators generated through high-level synthesis. The proposed system architecture, synthesized on an Altera DE3 Stratix III FPGA board, was developed through an iterative design space exploration methodology using Altera’s C2H compiler. Experimental results show that the proposed system architecture is capable of real-time video processing performance for grayscale image resolutions of up to 1920?×?1080 (1080p) when ran on the Altera DE3 board, and it outperforms the existing 2-D DWT architecture implementations known in literature by a considerable margin in terms of throughput. While the proposed 2-D DWT system architecture satisfies real-time performance constraints, it can also perform both forward and inverse DWT, support a number of popular DWT filters used for image and video compression and provide architecture programmability in terms of number of levels of decomposition as well as image width and height. Based from the design principles used to implement the proposed 2-D DWT system architecture, a system design guideline can be formulated for SOC designs which plan to incorporate dedicated 2-D DWT hardware acceleration.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号