20 similar articles found (search time: 31 ms)
1.
2.
To meet strict speed and power requirements for embedded applications, many high-end Digital Signal Processors (DSPs) commonly
employ non-orthogonal architectures that are typically characterized by irregular data paths, heterogeneous registers, and multiple memory banks.
To harvest the benefits provided by such non-orthogonal architectures, sufficient compiler support is necessary.
However, the complexity of such architectures presents a great challenge to compiler design, and the usual compilation
techniques for general-purpose CPUs do not adapt well to the irregularity of DSPs. The entire code generation process must
include the following phases: intermediate representation, code compaction, instruction scheduling, memory bank assignment
(or variable partition), and register/accumulator assignment. Much related research considers only some of these phases, which is inadequate.
In this paper, we present an effective code generation algorithm named Rotation Scheduling with Spill Codes Predicting (RSSP) to maximally exploit the benefits of non-orthogonal architectures. It contains six parts that cover almost all phases
of the code generation process. As well as introducing the detailed principles and algorithms of the proposed RSSP, we use
an analytic model to evaluate its preliminary performance. Evaluation results clearly demonstrate the effectiveness of the
proposed method. Furthermore, we also present some preliminary ideas to generalize RSSP, which can make it more practicable
and suit various DSPs with similar architectural features.
Cheng Chen (corresponding author)
3.
Tay-Jyi Lin Shin-Kai Chen Yu-Ting Kuo Chih-Wei Liu Pi-Chen Hsiao 《Journal of Signal Processing Systems》2008,51(3):209-223
This paper presents the design and implementation of a novel VLIW digital signal processor (DSP) for multimedia applications.
The DSP core embodies a distributed & ping-pong register file which, by restricting its access patterns, saves 76.8% of the silicon area and
improves access time by 46.9% relative to the centralized register files found in most VLIW processors. Nevertheless, it achieves performance
(estimated in cycles) comparable to state-of-the-art DSPs for multimedia applications. A hierarchical instruction encoding scheme is
also adopted to reduce the program sizes to 24.1∼26.0%. The DSP has been fabricated in the UMC 0.13 μm 1P8M Copper Logic Process,
and it can operate at 333 MHz while consuming 189 mW power. The core size is 3.2 × 3.15 mm² including 160 KB on-chip SRAM.
Chih-Wei Liu
4.
5.
Huffman coding is a popular and important lossless compression scheme for various multimedia applications. This paper presents
a low-latency parallel Huffman decoding technique with efficient memory usage for multimedia standards. First, the multi-layer
prefix grouping technique is proposed for sub-group partition. It exploits the prefix characteristic in Huffman codewords
to solve the problem of table size explosion. Second, a two-level table lookup approach is introduced which can promptly branch
to the correct sub-group by level-1 table lookup and decode the symbols by level-2 table lookup. Third, two optimization approaches
are developed; one is to reduce the branch cycles and the other is parallel processing between two-level table lookup and
direct table lookup approaches to fully utilize the advantage of VLIW parallel processing. An AAC Huffman decoding example
is realized on the Parallel Architecture Core DSP (PAC DSP) processor. The simulation results show that the proposed method
can further improve decoding cycles by about 89% and table size by about 33% compared to the linear search method.
Chun-Nan Liu
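The two-level table lookup described in this abstract can be illustrated with a minimal Python sketch. It is a generic two-level decoder, not the paper's multi-layer prefix grouping: the toy code table, the level-1 width `k`, and the function names `build_tables` and `decode` are all assumptions made for illustration, and the code is assumed to be a complete prefix code.

```python
def build_tables(codes, k):
    """Build a level-1 table indexed by the first k bits and level-2 sub-tables.

    codes maps symbol -> bit string (hypothetical toy code, assumed to be a
    complete prefix code so that every table slot gets filled).
    """
    max_len = max(len(c) for c in codes.values())
    level1 = [None] * (1 << k)
    level2 = {}  # k-bit prefix -> sub-table indexed by the remaining bits
    for sym, code in codes.items():
        n = len(code)
        if n <= k:
            # Short code: it owns every level-1 slot that starts with its bits.
            base = int(code, 2) << (k - n)
            for i in range(1 << (k - n)):
                level1[base + i] = ("sym", sym, n)
        else:
            # Long code: level 1 only branches to the sub-group for its prefix.
            prefix = int(code[:k], 2)
            width = max_len - k
            sub = level2.setdefault(prefix, [None] * (1 << width))
            level1[prefix] = ("sub", prefix, width)
            rest = code[k:]
            base = int(rest, 2) << (width - len(rest))
            for i in range(1 << (width - len(rest))):
                sub[base + i] = (sym, n)
    return level1, level2


def decode(bits, codes, k):
    """Decode a bit string with one level-1 lookup and at most one level-2 lookup per symbol."""
    level1, level2 = build_tables(codes, k)
    span = max(k, max(len(c) for c in codes.values()))
    out, pos = [], 0
    while pos < len(bits):
        # Pad the tail with zeros so the lookup window always has `span` bits.
        window = bits[pos:pos + span].ljust(span, "0")
        entry = level1[int(window[:k], 2)]
        if entry[0] == "sym":
            _, sym, n = entry
        else:
            _, prefix, width = entry
            sym, n = level2[prefix][int(window[k:k + width], 2)]
        out.append(sym)
        pos += n
    return out


toy_codes = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}
decoded = decode("01011011101111", toy_codes, 2)
```

With `k = 2`, the level-1 table has only 4 entries and the single sub-table has 4 more, whereas a flat table indexed by the longest code (4 bits) would need 16. That gap is the table-size saving a two-level scheme is after.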
6.
Jonah Probell 《Journal of Signal Processing Systems》2008,50(1):33-39
Many different video processor architectures exist. Its architecture gives a processor strength for a particular application.
Hardwired logic yields the best performance/cost, but a programmable processor is important for applications that support
multiple coding standards, proprietary functions, or future changes to application requirements. Programmable video processor
architectures achieve best performance through the use of parallelism at the data (SIMD), instruction (VLIW), and multiprocessor
level, and optimally sized ALU, multiplier, and load/store datapaths. Because low-cost memory architectures are not optimized
for the random access patterns of video processing, the performance of video processors is often limited by memory bandwidth
rather than processing resources. Careful data organization alleviates memory bandwidth limitations. When choosing a video
processor it is important to consider many factors, particularly performance, cost, power consumption, programmability, and
peripheral support.
Jonah Probell
7.
A novel optical buffering architecture for Optical Packet Switching (OPS) networks is proposed in this article. The architecture,
which adopts a fiber-sharing mechanism, aims to avoid the large number of fiber delay lines otherwise used
to resolve resource contention in the core node of OPS networks. The new architecture employs fewer fiber delay lines than
other simple architectures, but can achieve the same performance. Simulation results and analysis show that the new architecture
can decrease packet loss probability effectively and achieve reasonable performance in average packet delay.
Fang Guo
8.
High-speed and low area hardware architectures of the Whirlpool hash function are presented in this paper. A full Look-up
Table (LUT) based design is shown to be the fastest method by which to implement the non-linear layer of the algorithm in
terms of logic. An unrolled Whirlpool architecture implemented on the Virtex XC4VLX100 device achieves a throughput of 4.9 Gbps.
This is faster than a SHA-512 design implemented on the same device and other previously reported hash function architectures.
A low area iterative architecture, which utilises 64-bit operations as opposed to full 512-bit operations, is also described.
It runs at 430 Mbps and occupies 709 slices on a Virtex X4VLX15. This proves to be one of the smallest 512-bit hash function
architectures currently available.
Ciaran McIvor
9.
Guillermo Talavera Murali Jayapala Jordi Carrabina Francky Catthoor 《Journal of Signal Processing Systems》2008,53(3):271-284
Nowadays embedded systems are growing at an impressive rate and provide more and more sophisticated applications characterized
by having a complex array index manipulation and a large number of data accesses. Those applications require high-performance
specific computation that general-purpose processors cannot deliver at a reasonable energy consumption. Very long instruction
word architectures seem a good solution providing enough computational performance at low power with the required programmability
to speed up the time to market. Those architectures rely on compiler effort to exploit the available instruction and data
parallelism to keep the data path busy all the time. With the density of transistors doubling every 18 months, more and more
sophisticated architectures with a high number of computational resources running in parallel are emerging. With this increasing
parallel computation, the access to data is becoming the main bottleneck that limits the available parallelism. To alleviate
this problem, in current embedded architectures, a special unit works in parallel with the main computing elements to ensure
efficient feed and storage of the data: the address generator unit, which comes in many flavors. Future architectures will
have to deal with enormous memory bandwidth in distributed memories and the development of address generator units will be
crucial for an effective next generation of embedded processors where global trade-offs between reaction-time, bandwidth, energy
and area must be achieved. This paper provides a survey of methods and techniques that optimize the address generation process
for embedded systems, explaining current research trends and future needs.
Francky Catthoor
10.
We implemented the H.264/AVC variable block size motion estimation (VBSME) using a very long instruction word (VLIW)–single
instruction multiple data (SIMD) digital signal processor (DSP). The SAD_Reuse method which has a regular structure is chosen
for VBSME not only to remove redundant sum of absolute difference (SAD) operations but also to utilize the instruction level
parallelism (ILP) and data level parallelism (DLP) of the architecture. A fast mode decision algorithm is developed to reduce
the number of ‘compare and update’ operations and simplify the rate distortion optimization (RDO). The developed fast mode
decision uses the difference of motion vectors and the maximum a posteriori (MAP) estimation of the rate-distortion costs.
Several advanced software techniques, including software pipelining and packed-data processing, are employed. In particular,
memory access overhead reduction schemes, including multi-block processing and inter-procedural scheduling, are used
for the software optimization. In order to reduce ‘write buffer full’ stalls in the quarter pixel ME, a 4-bit quantization scheme
is developed, which increases the number of arithmetic operations but greatly decreases the stall cycles. The implemented
variable block size ME for H.264/AVC requires an average of 9 Mcycles and 78 Mcycles per frame for QCIF and CIF size video sequences,
respectively, in the TMS320C64x DSP architecture.
Wonyong Sung
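The idea behind SAD reuse in variable block size motion estimation can be sketched as follows: the sixteen 4x4 SADs of a 16x16 macroblock are computed once for a candidate motion vector, and the SADs of all larger H.264/AVC partitions are then obtained by summation only. This is a plain Python illustration rather than the paper's VLIW-SIMD implementation; the function names and the list-of-lists block layout are assumptions.

```python
# Partition shapes as (rows, cols) counted in 4x4 units; names are width x height.
SHAPES = {
    "4x4": (1, 1), "8x4": (1, 2), "4x8": (2, 1), "8x8": (2, 2),
    "16x8": (2, 4), "8x16": (4, 2), "16x16": (4, 4),
}


def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block of a 16x16 block."""
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid


def all_partition_sads(cur, ref):
    """Derive the SADs of every partition by reusing the sixteen 4x4 SADs."""
    grid = sad_4x4_grid(cur, ref)
    result = {}
    for name, (rows, cols) in SHAPES.items():
        sads = []
        # Walk the partitions of this shape in raster order.
        for r0 in range(0, 4, rows):
            for c0 in range(0, 4, cols):
                sads.append(sum(grid[r][c]
                                for r in range(r0, r0 + rows)
                                for c in range(c0, c0 + cols)))
        result[name] = sads
    return result
```

Pixel differences are evaluated only once per candidate position; the 41 partition SADs (16 + 8 + 8 + 4 + 2 + 2 + 1) come from additions over the 4x4 grid, which is the redundancy the abstract says is removed.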
11.
The non-quantized nature of user rates wastes code capacity in Orthogonal Variable Spreading Factor (OVSF) code based
Code Division Multiple Access (CDMA) systems. A code sharing scheme for multi-code CDMA is proposed to minimize the code
rate wastage. The scheme combines the unused (wasted) capacity of already occupied codes to reduce the code blocking problem.
Simulation results are presented to show the superiority of the proposed code assignment scheme as compared to existing schemes.
Sunil V. Bhooshan
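As background for the wastage problem mentioned in this abstract, the following Python sketch generates OVSF codes with the standard tree construction and shows how much capacity is lost when a single code must serve a rate that is not a power of two. It does not implement the proposed code sharing scheme; the function names and the convention of measuring rates in multiples of a base rate R are assumptions.

```python
def ovsf_codes(sf):
    """Return all OVSF codes of spreading factor sf (a power of two).

    Each code c of length n spawns two children of length 2n: (c, c) and (c, -c).
    """
    if sf < 1 or sf & (sf - 1):
        raise ValueError("spreading factor must be a power of two")
    codes = [[1]]
    while len(codes[0]) < sf:
        codes = [child
                 for c in codes
                 for child in (c + c, c + [-x for x in c])]
    return codes


def single_code_waste(rate):
    """Return (assigned_rate, wasted_rate) in multiples of R for one OVSF code.

    A single code can only carry a power-of-two multiple of R, so the request
    is rounded up and the difference is unused capacity.
    """
    assigned = 1
    while assigned < rate:
        assigned *= 2
    return assigned, assigned - rate
```

For example, `single_code_waste(5)` returns `(8, 3)`: a 5R request occupies an 8R code and leaves 3R unused, which is the kind of wastage the abstract says a sharing scheme would combine and reuse.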
12.
F. Angarita M. J. Canet T. Sansaloni J. Valls V. Almenar 《Journal of Signal Processing Systems》2008,52(1):35-44
This paper describes the design of a soft decision Viterbi Decoder for orthogonal frequency division multiplexing-based wireless
local area networks and evaluates different architectural options by means of their field programmable gate-array (FPGA) implementation.
A finite precision analysis has been performed to reduce the data-path widths under the specifications of IEEE 802.11a and
Hiperlan/2 standards. Four implementation strategies (register exchange, trace back, trace back with double rate memory read
and pointer trace back) for the survivor management unit have been evaluated together with two different normalization methods
for the add–compare–select unit. The results of the implementation in FPGA have been given and it is shown that register exchange
and pointer trace back architectures with pre-normalization in the add–compare–select unit achieve the best performance. Both
architectures can decode 200 Mbps in a Virtex-4 device with lower latency than the conventional trace back one, and pointer
trace back exhibits the lowest power consumption. These characteristics make them suitable for future multiple-input multiple-output
WLAN systems.
V. Almenar
13.
Mladen Berekovic Tim Niggemeier 《Journal of Signal Processing Systems》2008,50(2):201-229
A scalable, distributed processor architecture is presented that emphasizes high performance computing for digital signal
processing applications by combining high frequency design techniques with a very high degree of parallel processing on a
chip. The architecture is based on a superscalar processor model with a modified Tomasulo scheme that was extended to eliminate
all central control structures for the data flow and to support simultaneous instruction issue from multiple independent threads
[simultaneously multi-threaded (SMT)]. Consequent application of fine clustering reduces the cycle-time for wire-sensitive
building blocks of the processor like the register file and the scheduling window and leads to a distributed architecture
model, where independent thread processing units, arithmetic logic units, register files and memories are distributed across
the chip and communicate with each other via a special network. A special communication protocol replaces broadcasting and associative
compare of destination tags in a centralised instruction scheduler with explicit operand transfer instructions, thus decentralizing
the control of the data flow to the greatest extent. As a result, the processor cycle time depends neither on the issue
bandwidth of a single thread nor on the execution bandwidth of the SMT processor. This makes the performance of the architecture
scalable with both the number of function units and the number of thread units without any impact on the processor's cycle time.
Performance and scalability of the proposed microarchitecture are demonstrated with critical signal processing kernels from
the MPEG-4 video coding standard on a cycle-true simulator.
Tim Niggemeier
14.
Blind source separation of independent sources from their convolutive mixtures is a problem in many real-world multi-sensor
applications. However, the existing BSS architectures are more often than not based upon software and thus not suitable for
direct implementation on hardware. Existing software implementations of the feedback network algorithm are not suitable for real-time use.
In this paper, we present a parallel algorithm and architecture for hardware implementation of blind source separation. The
algorithm is based on feedback network and is highly suited for parallel processing. The implementation is designed to operate
in real time for speech signal sequences. It is systolic and easily scalable by simply adding and connecting chips or modules.
In order to verify the proposed architecture, we have also designed and implemented it in a hardware prototype with Xilinx
FPGAs running at 33 MHz.
H. Jeong
15.
The next generation of wireless mobile communications, termed beyond 3G (or 4G), will be based on a heterogeneous infrastructure
that comprises different wireless networks in a complementary manner. Beyond 3G will introduce reconfiguration capabilities
to flexibly and dynamically (i.e., during operation) adapt the wireless protocol stacks to better meet the ever-changing service
requirements. For the dynamic reconfiguration of protocol stacks during runtime operation to become a practical capability
of mobile communication systems, it is necessary to establish a software architecture that functionally supports reconfiguration.
In the present paper, a generic architecture and respective mechanisms to achieve protocol stack and component based protocol
layer reconfiguration are proposed.
Vangelis Gazis
16.
B. Mei B. De Sutter T. Vander Aa M. Wouters A. Kanstein S. Dupont 《Journal of Signal Processing Systems》2008,51(3):225-243
Architecture for Dynamically Reconfigurable Embedded Systems (ADRES) is a templatized coarse-grained reconfigurable processor
architecture. It targets embedded applications which demand high performance, low power and high-level language programmability.
Compared with typical very long instruction word-based digital signal processors, ADRES can exploit higher parallelism by using
more scalable hardware with support of novel compilation techniques. We developed a complete tool-chain, including compiler,
simulator and HDL generator. This paper describes the design case of a media processor targeting H.264 decoding and other
video tasks based on the ADRES template. The whole processor design, hardware implementation and application mapping are done
in a relatively short period. Yet we obtain C-programmed real-time H.264/AVC CIF decoding at 50 MHz. The die size, clock speed
and the power consumption are also very competitive compared with other processors.
S. Dupont
17.
The paper addresses the integration architecture (I-concept) between a terrestrial technology—TETRA (TErrestrial Trunked Radio)—and
satellite systems. This approach, which enhances and harmonises the features of both technologies, could provide an interesting
contribution to the effectiveness of the International Mobile Telecommunications-Advanced (IMT-A) and, hence, to the 4G vision.
TETRA can represent an interesting building block of an integrated network devoted to both civil and military scenarios; it
meets the “suitable technological capability” requirement for integration, because it represents a consolidated terrestrial
technology that can be trusted, hence focusing the integration effort on the definition, design and implementation of proper
interfaces. System architectures are proposed here for short-, medium- and long-term scenarios.
Giovanni Guidotti
18.
This paper presents an Application Specific Instruction Set Processor (ASIP) for implementation of H.264/AVC, called Video
Specific Instruction-set Processor (VSIP). The proposed VSIP has novel instructions and optimized hardware architectures for
specific applications, such as intra prediction, in-loop deblocking filter, integer transform, etc. Moreover, VSIP has coprocessors
for computation intensive parts in video signal processing, such as inter prediction and entropy coding. The proposed VSIP
has much smaller area and can dramatically reduce the number of memory accesses compared with commercial DSP chips, which results
in low power consumption. Moreover, the proposed hardware accelerators are small and consume little power, and
thus, they can support real-time video processing. VSIP has been thoroughly verified using an FPGA board having the Xilinx™
Virtex II. VSIP can implement a real-time H.264/AVC decoder. The proposed VSIP is one of the promising solutions for video signal
processing.
Sung Dae Kim
19.
Cognitive Radio with Single Carrier TDCS and Multicarrier OFDM Approach with V-BLAST Receiver in Rayleigh Fading Channel
This article presents the performance comparison of TDCS and OFDM based cognitive radio for MIMO system using V-BLAST receiver
architecture to reconstruct the transmitted data. The interference avoidance performance in terms of BER and bitrate is improved
by adding multiple antennas to the system and the use of V-BLAST technique at the receiver. The results show the most promising
interference avoidance technique combined with MIMO V-BLAST architecture to be applied in the CR system.
L. P. Ligthart
20.
The performance of an optical code division multiplexing access (OCDMA) system employing the differential phase shift keying
(DPSK) data format and turbo code is investigated and simulated. A bandwidth-limited coherent time spreading (TS) OCDMA system
is considered. Theoretical results show that performance degradation due to bandwidth limitation could be effectively restrained
by the application of the DPSK data format in a coherent OCDMA system, and further performance improvement could be achieved
by incorporating turbo coding into the OCDMA system.
Xiaogang Chen