
1.

Enhancing Microkernel Performance on VLIW DSP Processors via Multiset Context Switch

Kun-Yuan Hsieh Yung-Chia Lin Chien-Chin Huang Jenq-Kuen Lee 《Journal of Signal Processing Systems》2008,51(3):257-268


2.

An Efficient Code Generation Algorithm for Non-orthogonal DSP Architecture

Yi-Hsuan Lee Cheng Chen 《The Journal of VLSI Signal Processing》2007,47(3):281-296

To meet strict speed and power requirements for embedded applications, many high-end digital signal processors (DSPs) employ non-orthogonal architectures, typically characterized by irregular data paths, heterogeneous registers, and multiple memory banks. Harvesting the benefits of such an architecture requires sufficient compiler support. However, the complexity of these architectures presents a great challenge to compiler design, and the usual compilation techniques for general-purpose CPUs do not adapt well to the irregularity of DSPs. The entire code generation process must include the following phases: intermediate representation, code compaction, instruction scheduling, memory bank assignment (or variable partition), and register/accumulator assignment. Much related research considers only some of these phases, which is inadequate. In this paper, we present an effective code generation algorithm named Rotation Scheduling with Spill Codes Predicting (RSSP) to maximally exploit the benefits of non-orthogonal architectures. It contains six parts that cover almost all phases of the code generation process. As well as introducing the detailed principles and algorithms of the proposed RSSP, we use an analytic model to evaluate its preliminary performance. Evaluation results clearly demonstrate the effectiveness of the proposed method. Furthermore, we present some preliminary ideas for generalizing RSSP, which can make it more practical and suitable for various DSPs with similar architectural features.

Corresponding author: Cheng Chen

3.

Design and Implementation of a High-Performance and Complexity-Effective VLIW DSP for Multimedia Applications

Tay-Jyi Lin Shin-Kai Chen Yu-Ting Kuo Chih-Wei Liu Pi-Chen Hsiao 《Journal of Signal Processing Systems》2008,51(3):209-223

This paper presents the design and implementation of a novel VLIW digital signal processor (DSP) for multimedia applications. The DSP core embodies a distributed and ping-pong register file which, by restricting its access patterns, saves 76.8% of the silicon area and improves access time by 46.9% compared with the centralized register files found in most VLIW processors. Nevertheless, its performance (estimated in cycles) is comparable with that of state-of-the-art DSPs for multimedia applications. A hierarchical instruction encoding scheme is also adopted to reduce the program sizes to 24.1∼26.0%. The DSP has been fabricated in the UMC 0.13 μm 1P8M Copper Logic Process, and it can operate at 333 MHz while consuming 189 mW of power. The core size is 3.2 × 3.15 mm², including 160 KB of on-chip SRAM.

Contact: Chih-Wei Liu

4.

A Scalable Configurable Architecture for Advanced Wireless Communication Algorithms

Konstantinos Sarrigeorgidis Jan Rabaey 《The Journal of VLSI Signal Processing》2006,45(3):127-151


5.

A Low-Latency Multi-layer Prefix Grouping Technique for Parallel Huffman Decoding of Multimedia Standards

Tsung-Han Tsai Chun-Nan Liu 《Journal of Signal Processing Systems》2008,53(3):323-333

Huffman coding is a popular and important lossless compression scheme for various multimedia applications. This paper presents a low-latency parallel Huffman decoding technique with efficient memory usage for multimedia standards. First, a multi-layer prefix grouping technique is proposed for sub-group partition. It exploits the prefix characteristic of Huffman codewords to solve the problem of table size explosion. Second, a two-level table lookup approach is introduced, which promptly branches to the correct sub-group by a level-1 table lookup and decodes the symbols by a level-2 table lookup. Third, two optimization approaches are developed: one reduces the branch cycles, and the other processes the two-level table lookup and direct table lookup approaches in parallel to fully exploit VLIW parallel processing. An AAC Huffman decoding example is realized on the Parallel Architecture Core DSP (PAC DSP) processor. The simulation results show that the proposed method improves decoding cycles by about 89% and table size by about 33% compared with the linear search method.
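
The two-level lookup idea mentioned in the abstract can be sketched generically as follows. This is a minimal illustration of plain two-level table decoding for a prefix-free code, not the paper's multi-layer prefix grouping technique, whose details the abstract does not give. The function names `build_tables` and `decode`, the first-level width `k`, and the toy code table are all assumptions made for illustration.

```python
def build_tables(codes, k):
    """Build lookup tables for a prefix-free code.

    codes: dict mapping symbol -> bitstring such as "110".
    k: number of bits used to index the level-1 table (illustrative choice).
    """
    level1 = [None] * (1 << k)
    groups = {}  # k-bit prefix -> list of (symbol, remaining bits)
    for sym, bits in codes.items():
        if len(bits) <= k:
            # Short codeword: fill every level-1 slot that starts with it.
            pad = k - len(bits)
            base = int(bits, 2) << pad
            for i in range(1 << pad):
                level1[base + i] = ("sym", sym, len(bits))
        else:
            groups.setdefault(bits[:k], []).append((sym, bits[k:]))

    subtables = {}
    for prefix, members in groups.items():
        # Each sub-group gets a level-2 table sized for its longest remainder.
        width = max(len(rest) for _, rest in members)
        table = [None] * (1 << width)
        for sym, rest in members:
            pad = width - len(rest)
            base = int(rest, 2) << pad
            for i in range(1 << pad):
                table[base + i] = (sym, len(rest))
        idx = int(prefix, 2)
        level1[idx] = ("sub", idx, width)
        subtables[idx] = table
    return level1, subtables


def decode(bitstring, level1, subtables, k):
    """Decode a string of '0'/'1' characters into a list of symbols."""
    out = []
    pos = 0
    n = len(bitstring)
    while pos < n:
        # Zero-pad the tail so a full k-bit index can always be formed.
        window = bitstring[pos:pos + k].ljust(k, "0")
        entry = level1[int(window, 2)]
        if entry is None:
            raise ValueError("invalid codeword at bit %d" % pos)
        if entry[0] == "sym":
            _, sym, length = entry
        else:
            # Level-1 hit selects a sub-group; level-2 resolves the symbol.
            _, idx, width = entry
            start = pos + k
            window2 = bitstring[start:start + width].ljust(width, "0")
            sub = subtables[idx][int(window2, 2)]
            if sub is None:
                raise ValueError("invalid codeword at bit %d" % pos)
            sym, rest_len = sub
            length = k + rest_len
        if pos + length > n:
            raise ValueError("truncated bitstream")
        out.append(sym)
        pos += length
    return out


# Toy code table (hypothetical, not taken from AAC).
toy_codes = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}
toy_level1, toy_subtables = build_tables(toy_codes, 2)
print(decode("01011011101111", toy_level1, toy_subtables, 2))
```

Splitting the table this way keeps the level-1 table at 2^k entries and allocates level-2 tables only for prefixes that actually lead to longer codewords, which is the general reason two-level schemes avoid the table size explosion of a single direct lookup.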

Contact: Chun-Nan Liu

6.

Architecture Considerations for Multi-Format Programmable Video Processors

Jonah Probell 《Journal of Signal Processing Systems》2008,50(1):33-39

Many different video processor architectures exist. A processor's architecture determines its strengths for a particular application. Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor level, and through optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video processor, it is important to consider many factors, particularly performance, cost, power consumption, programmability, and peripheral support.

Contact: Jonah Probell

7.

An effective buffering architecture for optical packet switching networks

Ru-yan Wang Jie Zhang Fang Guo Ke-ping Long 《Photonic Network Communications》2008,16(3):239-243

A novel optical buffering architecture for Optical Packet Switching (OPS) networks is proposed in this article. The architecture, which adopts a fiber-sharing mechanism, aims to reduce the large number of fiber delay lines used to resolve resource contention in the core nodes of OPS networks. The new architecture employs fewer fiber delay lines than other simple architectures but can achieve the same performance. Simulation results and analysis show that the new architecture can decrease packet loss probability effectively and achieve reasonable performance in average packet delay.

Contact: Fang Guo

8.

High-speed & Low Area Hardware Architectures of the Whirlpool Hash Function

Máire McLoone Ciaran McIvor 《The Journal of VLSI Signal Processing》2007,47(1):47-57

High-speed and low-area hardware architectures of the Whirlpool hash function are presented in this paper. A full Look-up Table (LUT) based design is shown to be the fastest method by which to implement the non-linear layer of the algorithm in terms of logic. An unrolled Whirlpool architecture implemented on the Virtex XC4VLX100 device achieves a throughput of 4.9 Gbps. This is faster than a SHA-512 design implemented on the same device and other previously reported hash function architectures. A low-area iterative architecture, which utilises 64-bit operations as opposed to full 512-bit operations, is also described. It runs at 430 Mbps and occupies 709 slices on a Virtex XC4VLX15. This proves to be one of the smallest 512-bit hash function architectures currently available.

Contact: Ciaran McIvor

9.

Address Generation Optimization for Embedded High-Performance Processors: A Survey

Guillermo Talavera Murali Jayapala Jordi Carrabina Francky Catthoor 《Journal of Signal Processing Systems》2008,53(3):271-284

Embedded systems are growing at an impressive rate and provide increasingly sophisticated applications characterized by complex array index manipulation and a large number of data accesses. Those applications require high-performance, application-specific computation that general-purpose processors cannot deliver at a reasonable energy consumption. Very long instruction word architectures seem a good solution, providing enough computational performance at low power with the programmability required to shorten the time to market. Those architectures rely on compiler effort to exploit the available instruction and data parallelism and keep the data path busy all the time. With the density of transistors doubling every 18 months, more and more sophisticated architectures with a high number of computational resources running in parallel are emerging. With this increasing parallel computation, access to data is becoming the main bottleneck that limits the available parallelism. To alleviate this problem, in current embedded architectures a special unit works in parallel with the main computing elements to ensure efficient feeding and storage of the data: the address generator unit, which comes in many flavors. Future architectures will have to deal with enormous memory bandwidth in distributed memories, and the development of address generator units will be crucial for an effective next generation of embedded processors, where global trade-offs between reaction time, bandwidth, energy and area must be achieved. This paper provides a survey of methods and techniques that optimize the address generation process for embedded systems, explaining current research trends and future needs.

Contact: Francky Catthoor

10.

Algorithm and Software Optimization of Variable Block Size Motion Estimation for H.264/AVC on a VLIW–SIMD DSP

Wonchul Lee Hyojin Choi Wonyong Sung 《Journal of Signal Processing Systems》2008,51(3):289-302

We implemented H.264/AVC variable block size motion estimation (VBSME) on a very long instruction word (VLIW)–single instruction multiple data (SIMD) digital signal processor (DSP). The SAD_Reuse method, which has a regular structure, is chosen for VBSME not only to remove redundant sum of absolute difference (SAD) operations but also to utilize the instruction level parallelism (ILP) and data level parallelism (DLP) of the architecture. A fast mode decision algorithm is developed to reduce the number of ‘compare and update’ operations and simplify the rate distortion optimization (RDO). The developed fast mode decision uses the difference of motion vectors and the maximum a posteriori (MAP) estimation of the rate-distortion costs. Several advanced software techniques, including software pipelining and packed-data processing, are employed. In particular, memory access overhead reduction schemes, including multi-block processing and inter-procedural scheduling, are used for the software optimization. To reduce ‘write buffer full’ stalls in the quarter-pixel ME, a 4-bit quantization scheme is developed, which increases the number of arithmetic operations but greatly decreases the stall cycles. The implemented variable block size ME for H.264/AVC requires an average of 9 Mcycles and 78 Mcycles per frame for QCIF and CIF size video sequences, respectively, on the TMS320C64x DSP architecture.
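
The general idea behind reusing SADs across block sizes can be sketched as follows. The abstract does not describe the internals of SAD_Reuse, so this is only an illustration under the assumption that the sixteen 4×4 SADs of a 16×16 macroblock are computed once for a candidate position and then summed to obtain the SADs of the larger H.264 partitions. The function names and the plain list-of-lists pixel layout are hypothetical.

```python
# The seven H.264 partition sizes, as (width, height) in pixels.
PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block.

    cur, ref: 16x16 lists of pixel values (current block and the
    candidate reference block at one motion vector).
    """
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def region_sum(grid, row0, col0, rows, cols):
    """Sum the 4x4 SADs covering one larger partition."""
    return sum(grid[r][c]
               for r in range(row0, row0 + rows)
               for c in range(col0, col0 + cols))


def all_partition_sads(cur, ref):
    """SADs of every partition of every size, derived from the 4x4 grid.

    Returns a dict mapping (width, height) to a list of SADs in raster order.
    """
    grid = sad_4x4_grid(cur, ref)
    result = {}
    for w, h in PARTITIONS:
        gw, gh = w // 4, h // 4  # partition size measured in 4x4 blocks
        result[(w, h)] = [region_sum(grid, r, c, gh, gw)
                          for r in range(0, 4, gh)
                          for c in range(0, 4, gw)]
    return result


# Synthetic input: a patterned current block against an all-zero reference.
cur_block = [[(y * 16 + x) % 7 for x in range(16)] for y in range(16)]
ref_block = [[0] * 16 for _ in range(16)]
sads = all_partition_sads(cur_block, ref_block)
print({size: len(values) for size, values in sads.items()})
```

With this arrangement the absolute differences are computed only once per candidate position; every larger partition costs only a few additions, and the regular loop structure is the kind of code that maps naturally onto packed-data instructions.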

Contact: Wonyong Sung

11.

OVSF Code Sharing and Reducing the Code Wastage Capacity in WCDMA

Davinder S. Saini Sunil V. Bhooshan 《Wireless Personal Communications》2009,48(4):521-529

The non-quantized nature of user rates wastes code capacity in Orthogonal Variable Spreading Factor (OVSF) code based Code Division Multiple Access (CDMA) systems. A code sharing scheme for multi-code CDMA is proposed to minimize the code rate wastage. The scheme combines the unused (wasted) capacity of already occupied codes to reduce the code blocking problem. Simulation results are presented to show the superiority of the proposed code assignment scheme over existing schemes.
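
The wastage the abstract refers to can be illustrated with a short calculation. OVSF codes carry rates that are power-of-two multiples of a base rate R, so a request that is not a power of two must be rounded up when served by a single code. The sketch below shows only that rounding arithmetic; it does not implement the paper's code sharing scheme, and the function name is hypothetical.

```python
def single_code_wastage(rate_units):
    """Capacity assigned and wasted when one OVSF code serves a request.

    rate_units: requested rate as an integer multiple of the base rate R.
    Returns (assigned_units, wasted_units).
    """
    if rate_units < 1:
        raise ValueError("rate must be at least 1 x R")
    # Smallest power of two that is >= the requested rate.
    assigned = 1 << (rate_units - 1).bit_length()
    return assigned, assigned - rate_units


# A 5R request needs an 8R code, leaving 3R of its capacity unused.
print(single_code_wastage(5))  # (8, 3)
# A 4R request fits a 4R code exactly.
print(single_code_wastage(4))  # (4, 0)
```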

Contact: Sunil V. Bhooshan

12.

Architectures for the Implementation of a OFDM-WLAN Viterbi Decoder

F. Angarita M. J. Canet T. Sansaloni J. Valls V. Almenar 《Journal of Signal Processing Systems》2008,52(1):35-44

This paper describes the design of a soft decision Viterbi decoder for orthogonal frequency division multiplexing-based wireless local area networks and evaluates different architectural options by means of their field programmable gate array (FPGA) implementation. A finite precision analysis has been performed to reduce the data-path widths under the specifications of the IEEE 802.11a and Hiperlan/2 standards. Four implementation strategies (register exchange, trace back, trace back with double rate memory read, and pointer trace back) for the survivor management unit have been evaluated together with two different normalization methods for the add–compare–select unit. The results of the FPGA implementation are given, and it is shown that the register exchange and pointer trace back architectures with pre-normalization in the add–compare–select unit achieve the best performance. Both architectures can decode 200 Mbps in a Virtex-4 device with lower latency than the conventional trace back one, and pointer trace back exhibits the lowest power consumption; these characteristics make them suitable for future multiple-output multiple-input WLAN systems.

Contact: V. Almenar

13.

A Distributed, Simultaneously Multi-Threaded (SMT) Processor with Clustered Scheduling Windows for Scalable DSP Performance

Mladen Berekovic Tim Niggemeier 《Journal of Signal Processing Systems》2008,50(2):201-229

A scalable, distributed processor architecture is presented that emphasizes high-performance computing for digital signal processing applications by combining high-frequency design techniques with a very high degree of parallel processing on a chip. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads [simultaneously multi-threaded (SMT)]. Consistent application of fine clustering reduces the cycle time for wire-sensitive building blocks of the processor, such as the register file and the scheduling window, and leads to a distributed architecture model in which independent thread processing units, arithmetic logic units, register files and memories are distributed across the chip and communicate with each other via a special network. A special communication protocol replaces broadcasting and associative comparison of destination tags in a centralised instruction scheduler with explicit operand transfer instructions, thus decentralizing the control of the data flow to the greatest extent. As a result, the processor cycle time depends neither on the issue bandwidth of a single thread nor on the execution bandwidth of the SMT processor. This makes the performance of the architecture scalable with both the number of function units and the number of thread units without any impact on the processor's cycle time. The performance and scalability of the proposed microarchitecture are demonstrated with critical signal processing kernels from the MPEG-4 video coding standard on a cycle-true simulator.

Contact: Tim Niggemeier

14.

A Systolic Architecture and Implementation of Feedback Network for Blind Source Separation

H. Jeong Y. Kim 《The Journal of VLSI Signal Processing》2007,47(2):117-126

Blind source separation (BSS) of independent sources from their convolutive mixtures is a problem in many real-world multi-sensor applications. However, existing BSS architectures are more often than not based on software and thus not suitable for direct implementation in hardware. Existing software implementations of the feedback network algorithm are not suitable for real-time use. In this paper, we present a parallel algorithm and architecture for hardware implementation of blind source separation. The algorithm is based on a feedback network and is highly suited to parallel processing. The implementation is designed to operate in real time for speech signal sequences. It is systolic and easily scalable by simply adding and connecting chips or modules. To verify the proposed architecture, we have also designed and implemented it as a hardware prototype with Xilinx FPGAs running at 33 MHz.

Contact: H. Jeong

15.

Generic Architecture and Mechanisms for Protocol Reconfiguration

Nancy Alonistioti Eleni Patouni Vangelis Gazis 《Mobile Networks and Applications》2006,11(6):917-934

The next generation of wireless mobile communications, termed beyond 3G (or 4G), will be based on a heterogeneous infrastructure that comprises different wireless networks in a complementary manner. Beyond 3G will introduce reconfiguration capabilities to flexibly and dynamically (i.e., during operation) adapt the wireless protocol stacks to better meet ever-changing service requirements. For the dynamic reconfiguration of protocol stacks during runtime operation to become a practical capability of mobile communication systems, it is necessary to establish a software architecture that functionally supports reconfiguration. In the present paper, a generic architecture and respective mechanisms to achieve protocol stack and component-based protocol layer reconfiguration are proposed.

Contact: Vangelis Gazis

16.

Implementation of a Coarse-Grained Reconfigurable Media Processor for AVC Decoder

B. Mei B. De Sutter T. Vander Aa M. Wouters A. Kanstein S. Dupont 《Journal of Signal Processing Systems》2008,51(3):225-243

Architecture for Dynamically Reconfigurable Embedded Systems (ADRES) is a templatized coarse-grained reconfigurable processor architecture. It targets embedded applications that demand high performance, low power and high-level language programmability. Compared with a typical very long instruction word-based digital signal processor, ADRES can exploit higher parallelism by using more scalable hardware with the support of novel compilation techniques. We developed a complete tool-chain, including compiler, simulator and HDL generator. This paper describes the design case of a media processor targeting an H.264 decoder and other video tasks based on the ADRES template. The whole processor design, hardware implementation and application mapping were done in a relatively short period. Yet we obtain C-programmed real-time H.264/AVC CIF decoding at 50 MHz. The die size, clock speed and power consumption are also very competitive compared with other processors.

Contact: S. Dupont

17.

Integration of TETRA with Satellite Networks: A Contribution to the IMT-A Vision

Emiliano Re Marina Ruggieri Giovanni Guidotti 《Wireless Personal Communications》2008,45(4):559-568

The paper addresses the integration architecture (I-concept) between a terrestrial technology, TETRA (TErrestrial Trunked Radio), and satellite systems. This approach, which enhances and harmonises the features of both technologies, could provide an interesting contribution to the effectiveness of International Mobile Telecommunications-Advanced (IMT-A) and, hence, to the 4G vision. TETRA can represent an interesting building block of an integrated network devoted to both civil and military scenarios; it meets the “suitable technological capability” requirement for integration because it represents a consolidated terrestrial technology that can be trusted, hence focusing the integration effort on the definition, design and implementation of proper interfaces. System architectures are proposed here for short-, medium- and long-term scenarios.

Contact: Giovanni Guidotti

18.

ASIP Approach for Implementation of H.264/AVC

Sung Dae Kim Myung H. Sunwoo 《Journal of Signal Processing Systems》2008,50(1):53-67

This paper presents an Application Specific Instruction Set Processor (ASIP) for implementation of H.264/AVC, called Video Specific Instruction-set Processor (VSIP). The proposed VSIP has novel instructions and optimized hardware architectures for specific applications, such as intra prediction, in-loop deblocking filter, integer transform, etc. Moreover, VSIP has coprocessors for computation-intensive parts of video signal processing, such as inter prediction and entropy coding. The proposed VSIP has a much smaller area and can dramatically reduce the number of memory accesses compared with commercial DSP chips, which results in low power consumption. Moreover, the proposed hardware accelerators are small and consume little power, and thus they can support real-time video processing. VSIP has been thoroughly verified using an FPGA board with a Xilinx™ Virtex II. VSIP can implement a real-time H.264/AVC decoder. The proposed VSIP is one of the promising solutions for video signal processing.

Contact: Sung Dae Kim

19.

Cognitive Radio with Single Carrier TDCS and Multicarrier OFDM Approach with V-BLAST Receiver in Rayleigh Fading Channel [Cited by: 1; self-citations: 0; citations by others: 1]

I. Budiarjo H. Nikookar L. P. Ligthart 《Mobile Networks and Applications》2008,13(5):416-423

This article presents a performance comparison of TDCS- and OFDM-based cognitive radio for a MIMO system using the V-BLAST receiver architecture to reconstruct the transmitted data. The interference avoidance performance in terms of BER and bit rate is improved by adding multiple antennas to the system and using the V-BLAST technique at the receiver. The results show the most promising interference avoidance technique combined with the MIMO V-BLAST architecture to be applied in the CR system.

Contact: L. P. Ligthart

20.

Performance improvement of bandwidth-limited coherent OCDMA system

Xiaogang Chen Deyi Chen Zonglong Wang 《Photonic Network Communications》2008,16(2):149-154

The performance of an optical code division multiplexing access (OCDMA) system employing the differential phase shift keying (DPSK) data format and turbo code is investigated and simulated. A bandwidth-limited coherent time spreading (TS) OCDMA system is considered. Theoretical results show that performance degradation due to bandwidth limitation could be effectively restrained by the application of the DPSK data format in a coherent OCDMA system, and further performance improvement could be achieved by incorporating turbo coding into the OCDMA system.

Contact: Xiaogang Chen