Similar literature
20 similar documents found (search time: 31 ms)
1.
The performance of the W-CDMA cell search algorithm can be significantly improved using homogeneous general-purpose Multi-Processor System-on-Chip (MPSoC) architectures. The application also scales well as the number of processing nodes increases, allowing practical accelerations to come close to the theoretical maximum. In this work we describe a template MPSoC architecture based on multiprocessor computational clusters, called Ninesilica. Each Ninesilica consists of nine processing nodes based on the COFFEE RISC architecture. MPSoC inter- and intra-cluster communication is enabled by a hierarchical Network-on-Chip with dedicated point-to-point and broadcast communication services for better performance. The proposed template has been used to instantiate complete systems with one and four Ninesilica clusters, resulting in MPSoCs with 9 and 36 computational nodes, respectively. The MPSoCs have been physically prototyped on an FPGA device, and the W-CDMA cell search algorithm has been mapped onto both MPSoC platforms. The four-Ninesilica MPSoC can execute W-CDMA in 20.5 ms (at 115 MHz, slow mode implementation) with a total speed-up of 24.3X and 3.3X when compared to a single processing core system and to a single Ninesilica cluster, respectively.
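The speed-up figures in this abstract can be turned into a few derived quantities. The sketch below is only an illustration: it assumes that both baselines (the single core and the single cluster) were measured under the same conditions as the 20.5 ms run, which the abstract does not state, and every name in it is hypothetical.

```python
# Figures quoted in the abstract; everything derived from them is an estimate.
T_FOUR_CLUSTERS_MS = 20.5        # execution time on the 36-node MPSoC
SPEEDUP_VS_SINGLE_CORE = 24.3    # 36 nodes versus one processing core
SPEEDUP_VS_ONE_CLUSTER = 3.3     # 36 nodes versus one 9-node cluster
NODES_PER_CLUSTER = 9
CLUSTERS = 4


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of the ideal (linear) speed-up that is actually achieved."""
    return speedup / nodes


def derived_figures() -> dict:
    """Quantities implied by the quoted numbers (assumes comparable baselines)."""
    total_nodes = NODES_PER_CLUSTER * CLUSTERS
    one_cluster_speedup = SPEEDUP_VS_SINGLE_CORE / SPEEDUP_VS_ONE_CLUSTER
    return {
        "single_core_ms": T_FOUR_CLUSTERS_MS * SPEEDUP_VS_SINGLE_CORE,
        "one_cluster_ms": T_FOUR_CLUSTERS_MS * SPEEDUP_VS_ONE_CLUSTER,
        "one_cluster_speedup": one_cluster_speedup,
        "efficiency_36_nodes": parallel_efficiency(SPEEDUP_VS_SINGLE_CORE, total_nodes),
        "efficiency_9_nodes": parallel_efficiency(one_cluster_speedup, NODES_PER_CLUSTER),
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the single cluster reaches roughly 82% of linear speed-up and the four-cluster system roughly 68%, which is consistent with the claim that the application scales well.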

2.
In this paper, we study the impact of application task mapping on the reliability of multiprocessor system-on-chip (MPSoC) applications in the presence of soft errors. Based on this study, we propose a novel system-level design optimization of an MPSoC application through joint power minimization and reliability improvement. The power minimization is carried out using a voltage scaling technique, while reliability improvement is achieved through careful choice of application task mapping on the homogeneous MPSoC processing cores. The overall aim is to minimize the number of single-event upsets (SEUs) experienced by the MPSoC application for suitably identified voltage scaling of the system processing cores, such that the power is reduced and the specified real-time constraint is met. We evaluate the effectiveness of the proposed design optimization using a number of different applications, including an MPEG-2 video decoder and synthetic applications. We show that for an MPEG-2 decoder with four processing cores, the proposed soft error-aware optimization produces a design with 38% fewer SEUs than soft error-unaware design optimization for an arbitrary soft error rate of 10⁻⁹, while consuming 9% less power and meeting a given real-time constraint. Furthermore, we investigate the impact of architecture allocation (allocation of processing cores) and show that for an MPSoC with six processing cores and a given real-time constraint, the proposed optimization produces a design with up to 7% fewer SEUs compared to soft error-unaware designs, at the cost of 5.5% higher power.

3.
Dilation and erosion are two fundamental operations of mathematical morphology for image processing. This paper presents three hybrid wave-pipeline (HWP) architectures for a real-time binary dilation operator. With minor changes to the number and/or the type of the basic gates, they can be employed as an erosion operator. In the first HWP architecture, each single cell uses the wave technique along with delay units for balancing the data paths. By minimizing the number of delay units, the second HWP architecture is obtained, with reduced power consumption and hardware complexity. The third HWP architecture employs the wave technique across every three cascaded cells. This architecture improves the above performance further, at the cost of a slight reduction in maximum clock frequency and clock frequency range. Simulation results, using a 0.18 μm CMOS technology, indicate that the HWP architectures have higher speed, less hardware complexity, and lower power consumption compared to the pipeline (P) architecture. They are also faster than the wave-pipeline (WP) architecture, without the difficulty of balancing the delay of long signal paths. Simulation shows that the third HWP architecture dilates a 1024 × 1024 image by a 21 × 21 structuring element (SE) in 214.64 μs. The maximum frequency of operation is 5 GHz for a power supply of 1.8 V. The power dissipation is 410 mW, and the chip area is 0.075 mm².
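For readers unfamiliar with the two operators, the following is a minimal software sketch of binary dilation and erosion. It is a textbook formulation, not the HWP hardware described in the abstract. It assumes a symmetric structuring element (so no reflection is needed) and treats pixels outside the image as background; the function names are hypothetical. It also shows why the two operators differ only in the combining gate: dilation ORs the covered pixels, erosion ANDs them.

```python
def _morph(image, se, use_or):
    """Slide the structuring element (SE) over a binary image.

    use_or=True  -> dilation: output is 1 if ANY SE position covers a 1.
    use_or=False -> erosion:  output is 1 only if ALL SE positions cover a 1.
    Pixels outside the image are treated as 0 (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    se_rows, se_cols = len(se), len(se[0])
    centre_r, centre_c = se_rows // 2, se_cols // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            covered = []
            for i in range(se_rows):
                for j in range(se_cols):
                    if not se[i][j]:
                        continue  # SE position not part of the element
                    rr, cc = r + i - centre_r, c + j - centre_c
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    covered.append(bool(inside and image[rr][cc]))
            out[r][c] = int(any(covered) if use_or else all(covered))
    return out


def dilate(image, se):
    return _morph(image, se, use_or=True)


def erode(image, se):
    return _morph(image, se, use_or=False)


if __name__ == "__main__":
    square = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    dot = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    print(dilate(dot, square))    # the single pixel grows to fill the 3 x 3 image
    print(erode(square, square))  # only the centre pixel survives
```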

4.
Fractional Motion Estimation (FME) in high-definition H.264 presents a significant design challenge in terms of memory bandwidth, latency, and area cost, because its many modes and complex mode decision flow account for over 45% of the computational complexity of the H.264 encoding process. In this paper, a new high-performance VLSI architecture for FME in H.264/AVC based on the full-search algorithm is presented. The architecture is made up of three different pipeline processors to establish a trade-off between processing time and hardware utilization. The computing scheme, based on a 4-pixel interpolation unit with a 10-pixel input bandwidth, is capable of processing a macroblock (MB) in 870 clock cycles. The final VLSI implementation requires only 11.4 k gates and 4.4 kBytes of RAM in a standard 180 nm CMOS technology operating at 290 MHz. Our design generates the residual image and the best MVs and mode in a high-throughput, low-area-cost architecture while achieving enough processing capacity for 1080 HD (1920 × 1088 at 30 fps) real-time video streams.
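The real-time claim can be sanity-checked from the numbers in the abstract. The sketch below assumes standard 16 × 16 H.264 macroblocks and assumes that the FME unit sustains one macroblock every 870 cycles with no other overhead; the function names are hypothetical.

```python
# Figures quoted in the abstract.
CYCLES_PER_MB = 870              # clock cycles to process one macroblock
CLOCK_HZ = 290e6                 # operating frequency
WIDTH, HEIGHT, FPS = 1920, 1088, 30
MB_SIZE = 16                     # H.264 macroblocks cover 16 x 16 luma pixels


def macroblocks_per_frame(width=WIDTH, height=HEIGHT):
    """Number of 16 x 16 macroblocks in one frame."""
    return (width // MB_SIZE) * (height // MB_SIZE)


def required_cycles_per_second(fps=FPS):
    """Clock cycles per second needed to keep up with the video stream."""
    return macroblocks_per_frame() * fps * CYCLES_PER_MB


def max_frame_rate():
    """Highest frame rate sustainable at CLOCK_HZ under the stated assumptions."""
    return CLOCK_HZ / (macroblocks_per_frame() * CYCLES_PER_MB)


if __name__ == "__main__":
    print(macroblocks_per_frame())        # 8160
    print(required_cycles_per_second())   # 212976000
    print(round(max_frame_rate(), 2))     # 40.85
```

About 213 million cycles per second are needed against 290 million available, so under these assumptions roughly a quarter of the cycle budget remains as headroom.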

5.
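A digital signal processor (DSP) is a microprocessor chip designed specifically for digital signal processing operations. After introducing the characteristics of DSP algorithms, this paper describes the basic structural components of a DSP and the two typical architectures of current mainstream DSPs. It analyzes the respective advantages and disadvantages of these two architectures and, finally, offers some views on the future development of DSP architecture in light of new developments in DSP application areas and the evolution of microprocessor architecture.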

6.
In this paper, we propose a low-power VLSI implementation of an H.264/AVC baseline decoder. A systematic methodology for power reduction is proposed and applied at various design abstraction levels. At the algorithm level, the computational complexity is optimized. At the architecture level, pipelining and parallelism are widely adopted to reduce the operating frequency; hierarchical memory organization optimizes power-hungry memory accesses; hardware sharing reduces the total switching capacitance. At the circuit level, knowledge of signal statistics is exploited to reduce the number of transitions; data-dependent signal gating and clock gating are introduced as dynamic techniques for power reduction; multiplications are reduced and optimized, while complex dividers are eliminated entirely. At the physical level, cell sizing and layout are optimized for power efficiency. The VLSI implementation shows that with UMC 0.18 μm technology, the proposed design is able to decode real-time QCIF at 30 fps at 1.5 MHz. The decoder contains 169 k logic gates and 2.5 KB of on-chip SRAM. The total chip area is 4.4 × 4.4 mm² in a CQFP 208 package. The measured power consumption is 973 μW at 1.8 V and 293 μW at 1.0 V. The low-power and real-time features make our design ideal for portable or mobile applications.

7.
Optical flow (OF) is an integral part of many vision systems, especially in embedded and mobile applications, which face ever-increasing challenges in achieving higher speed, minimal resource usage, and lower power consumption. This work introduces a Dense High Throughput Optical Flow (DHTOF) architecture based on a novel fast-converging Red-Black Successive Over-Relaxation (RBSOR) solver architecture for computing dense and accurate OF using the Horn and Schunck Optical Flow (HSOF) algorithm from Full High Definition (FHD) frames in real time. The DHTOF architecture can capture dense OF from Ultra High Definition (UHD) frames at 48 Frames Per Second (FPS) with a throughput of 406 Megapixels/sec, achieving a Throughput Per Watt (TPW) of 43 Giga Operations Per Second per Watt (GOPS/Watt). The superscalar and deeply pipelined DHTOF architecture achieves the same or lower Average Angular Error (AAE) with approximately 4× fewer RBSOR solver iterations than prior HSOF implementations based on the Jacobi solver. It consumes 12.5× fewer resources and 29.3% less power at FHD resolution when compared to prior architectures. The proposed DHTOF architecture achieves the highest area-delay normalized speedup (by at least 28.2×) among state-of-the-art HSOF architectures. The successful evaluation of the proposed architecture as a real-time OF sensor is demonstrated on a Xilinx Virtex-VC707 Field Programmable Gate Array (FPGA) evaluation board.

8.
Otsu’s global automatic image thresholding operation is used in various image processing applications. It requires computing the normalized cumulative histogram, mean, and cumulative moments, all of which are compute-intensive operations. In this paper, a custom architecture is presented for efficient computation of Otsu’s algorithm, along with its use as an intellectual property (IP) core in a field programmable gate array (FPGA) based system-on-chip (SoC) environment for connected component analysis (CCA). A self-normalization technique is employed, in which single-cycle read–modify–write operations are performed with block random access memories (BRAMs) and digital signal processing (DSP) slices. The architecture is designed for 640 × 480 images that are captured by a high-resolution analog camera and buffered in the DDR2 SDRAM of a Xilinx ML-507 platform at a 25.175 MHz clock frequency. The embedded PowerPC processor core is used to control the frame acquisition process. Experimental results on a Virtex-5 xc5vfx70t FPGA device show that the architecture uses 1.4% of the slices, 2.7% of the BRAMs, and 3.9% of the DSP48E slices. The total power consumption of the design is 1440.59 mW. As an IP core, the proposed architecture is able to work in real time with standard VGA resolution video and requires few computational resources.
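As background, the quantities named in this abstract (cumulative histogram and cumulative moments) appear directly in the textbook form of Otsu's method, sketched below. This is not the paper's architecture or its self-normalization technique: the sketch works on raw, unnormalized counts, which yields the same threshold because the between-class variance is only scaled by a constant. The function name is hypothetical.

```python
def otsu_threshold(hist):
    """Return the threshold t that maximizes between-class variance.

    hist[i] is the number of pixels with grey level i. Pixels with level <= t
    form class 0 and the rest form class 1. Ties keep the lowest threshold.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    weight0 = 0          # cumulative histogram: pixel count in class 0
    sum0 = 0             # cumulative first moment of class 0
    best_t, best_score = 0, -1.0
    for t, count in enumerate(hist):
        weight0 += count
        sum0 += t * count
        weight1 = total - weight0
        if weight0 == 0:
            continue     # class 0 is still empty
        if weight1 == 0:
            break        # class 1 is empty from here on
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        # Proportional to the between-class variance (unnormalized weights).
        score = weight0 * weight1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


if __name__ == "__main__":
    hist = [0] * 256
    hist[50] = 100       # dark population
    hist[200] = 100      # bright population
    print(otsu_threshold(hist))  # 50: every level above 50 goes to class 1
```

A single pass over the histogram suffices because the class means follow from the two running sums, which is the property that makes the method suitable for streaming hardware.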

9.
We present the development and use of a real-time digital signal processing (DSP)-based optical coherence tomography (OCT) and Doppler OCT system. Images of microstructure and transient fluid-flow profiles are acquired using the DSP architecture for real-time processing of computationally intensive calculations. This acquisition system is readily configurable for a wide range of real-time signal processing and image processing applications in OCT.

10.
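Microprocessors that perform digital signal processing operations are used to implement digital signal processing algorithms quickly and in real time. Based on the characteristics of the ADSP-BF533, this paper presents the circuit design of an audio processing system built around this chip, mainly comprising the reset circuit, power supply unit, memory unit, and analog-to-digital/digital-to-analog conversion unit. The design can serve as a general-purpose system for audio signal processing.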

11.
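Beamforming is the principal technique for target detection in sonar detection systems, and in existing equipment it is mainly implemented with DSPs. When a beamforming algorithm is implemented on a DSP, the DSP's inherently sequential execution architecture means that processing on a single DSP chip introduces a very large delay between the input signal and the output result, while processing on five DSP chips increases power consumption fivefold, with a latency of 200 ms. Implementing the beamforming algorithm on an FPGA, with a program structure designed for parallel operation, can greatly shorten the latency of the algorithm, and power consumption can also be reduced to 1/10 of that of the DSP approach. The designed beamformer uses a 100 MHz clock; compared with using five DSP chips, the computation time is shortened from 200 ms to about 10 ms, and power consumption is reduced to 1/5 of the latter.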

12.
There is increasing research and commercial interest in miniature on-body and implantable devices for continuous real-time biosignal monitoring. A key challenge in realizing this vision is implementing biosignal processing algorithms with acceptably low energy consumption. In this article, we investigate the implementation of the REACT algorithm for real-time epileptic seizure detection on a Coarse Grained Reconfigurable Array (CGRA) based architecture. Computationally expensive biosignal processing tasks are offloaded from a conventional Digital Signal Processor (DSP) to the CGRA. The CGRA is designed to support low-power biosignal processing by means of a systolic architecture, flexible interconnect, and low resource usage. The CGRA architecture is shown to provide 38% and 60% improvements in energy consumption and in performance, respectively, for the REACT system, without the use of voltage scaling or increased clock frequency.

13.
The use of digital signal processing (DSP) devices for real-time communication applications is discussed. The authors comment on distinguishing aspects of DSP architecture, describing not so much individual processors as those features common to DSPs and distinct from modern general-purpose processors. They describe three DSP32xx-based machines that support DSP algorithm implementation: SURF-board, HoBo, and DSP3. They also describe rtpi, a source-code debugger for workstations and for the AT&T DSP32C signal-processor integrated circuit, and dspx, a collection of subroutines and host programs that provides an execution environment for DSPs akin to the UNIX environment. These tools facilitate the transfer of algorithms from mainframes or workstations to DSP hardware. Included are case studies of two real-time implementations: the low-delay CELP (LD-CELP) speech coder and the decoder side of the perceptual audio coder (PAC), an algorithm that compresses CD-quality audio into a 128-kb/s stream without perceptible distortion.

14.
This paper presents the design and implementation of a novel VLIW digital signal processor (DSP) for multimedia applications. The DSP core embodies a distributed and ping-pong register file which, by restricting its access patterns, saves 76.8% of the silicon area and improves access time by 46.9% relative to the centralized register files found in most VLIW processors. It nevertheless has performance (estimated in cycles) comparable to state-of-the-art DSPs for multimedia applications. A hierarchical instruction encoding scheme is also adopted to reduce program sizes to 24.1–26.0%. The DSP has been fabricated in the UMC 0.13 μm 1P8M Copper Logic Process, and it can operate at 333 MHz while consuming 189 mW. The core size is 3.2 × 3.15 mm², including 160 KB of on-chip SRAM.
Chih-Wei Liu

15.
A methodology for the hierarchical partitioning and mapping of digital signal processing (DSP) tasks to a heterogeneous, local-cluster-based network of very large scale integration (VLSI) processors is presented. The goal is to achieve rapid prototyping of VLSI DSP systems. The high-level partitioning issues of DSP task graphs and the proposed metrics to guide the partitioning process are described in this paper. Partitioning to minimize power inefficiency in the DSP system is one important metric addressed by this work, since low-power signal processing is paramount to new portable and high-density multi-chip module (MCM) DSP systems. The application of the Ratio Cut Partitioning approach to DSP graphs is explained. We illustrate our results with examples and show how the final partitions vary depending upon the target architecture to meet rapid prototyping requirements. We compare our approach with known techniques and show that it works much better for our target applications.

16.
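Guo Rui, Zhang Yue, Sun Gang, Chen Zengping. 《信号处理》 (Signal Processing), 2013, 29(9): 1238-1243
To improve the performance of real-time inverse synthetic aperture radar (ISAR) imaging, this paper first designs a high-speed real-time signal processing platform based on the TMS320C6678 multi-core digital signal processor (DSP), which optimizes power consumption while increasing signal processing capability. Second, the paper proposes a real-time imaging flow that uses narrowband measurement information to assess imaging conditions, select imaging data, and guide high-speed motion compensation. By dividing this flow into several independent tasks and analyzing the real-time requirements of the tasks and the communication between them, the tasks are allocated across the cores of the multi-core DSP. Measured data are processed on the proposed platform, and the imaging performance and real-time behavior are compared with those of a single-core DSP signal processing platform, further verifying the processing advantages of the multi-core platform and the soundness of the algorithm design.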

17.
Real-time streaming signal processing systems typically require high throughput and low latency. Many such systems can be modeled as synchronous data flow (SDF) graphs. In this paper, we address the problem of multi-objective mapping of SDF graphs onto heterogeneous multiprocessor platforms, where we account for the overhead of bus-based inter-processor communication. The primary contributions include (1) an integer linear programming (ILP) model that globally optimizes throughput, latency, and cost; and (2) low-complexity two-stage heuristics based on a combination of an evolutionary algorithm with an ILP to generate either a single sub-optimal mapping solution or a Pareto front for design space optimization. In our simulations, the proposed heuristic shows up to 12× run-time efficiency compared to the global ILP while maintaining a 10⁻⁶ optimality gap in throughput.

18.
A 300-MHz 16-b fixed-point digital signal processor (DSP) core LSI has been developed for video signal processing. In order to achieve high performance, the DSP core LSI employs a parallel processing architecture, 300-MHz redundant binary arithmetic units, and a sophisticated high-performance electrical design. The DSP core LSI, which was fabricated with 0.5-μm BiCMOS and triple-level-metallization technology, has a 3.9 mm × 4.6 mm area and contains about 57K transistors. It consumes 2 W at a 300-MHz clock frequency with a 3.3-V power supply. Measured clock skew and critical path delay are less than 80 ps and 2.6 ns, respectively.

19.
New circuit design techniques for implementing very high-valued resistors are presented, significantly improving the power and area efficiency of analog front-end signal processing in ultra-low-power biomedical systems. Ranging in value from a few hundred MΩ to a few hundred GΩ, the proposed floating resistors occupy a very small area and produce accurately tunable characteristics. Using this approach, a low-pass MOSFET-C filter with tunable cutoff frequency (f_C = 20 Hz–184 kHz) has been implemented in a conventional 0.18 μm CMOS technology. Occupying 0.045 mm²/pole, the filter consumes 540 pW/Hz/pole with a measured IMFDR of 70 dB.

20.
This paper presents a reduced-complexity, fixed-point algorithm and efficient real-time VLSI architectures for multiuser channel estimation, one of the core baseband processing operations in wireless base-station receivers for CDMA. Future wireless base-station receivers will need to use sophisticated algorithms to support extremely high data rates and multimedia. Current DSP implementations of these algorithms are unable to meet real-time requirements. However, these algorithms contain massive parallelism and bit-level arithmetic that can be revealed and efficiently implemented in a VLSI architecture. We redesign an existing channel estimation algorithm from an implementation perspective for a reduced-complexity, fixed-point hardware implementation. Fixed-point simulations are presented to evaluate the precision requirements of the algorithm. A dependence graph of the algorithm is presented and area-time trade-offs are developed. An area-constrained architecture achieves low data rates with minimum hardware, which may be used in pico-cell base stations. A time-constrained solution exploits the entire available parallelism and determines the maximum theoretical data processing rates. An area-time efficient architecture meets real-time requirements with minimum area overhead.
