Similar literature
1.
This paper presents an architecture for high-throughput decoding of high-rate Low-Density Parity-Check (LDPC) codes. The proposed architecture is a modification of the sliced message passing (SMP) decoding architecture, which overlaps the check-node and variable-node update stages, achieving a good tradeoff between area and throughput as well as high hardware utilization efficiency (HUE). The proposed modification does not affect the performance of the SMP algorithm and yields an area reduction of 33%. As an example, the SMP architecture and the proposed modification were synthesized in a 90 nm CMOS process for the 2048-bit LDPC code of the IEEE 802.3an standard with 16 iterations, achieving a throughput of 5.9 Gbps with 15.3 mm² and 6.2 Gbps with 10.2 mm², respectively.

2.
Packet classification is a fundamental process in provisioning security and quality of service for many intelligent network-embedded systems running in the Internet of Things (IoT). In recent years, researchers have tried to develop hardware-based solutions for the classification of Internet packets. Because of their higher throughput and shorter delays, these solutions are considered key to improving quality of service. Most of these efforts have attempted to implement a software algorithm on an FPGA to reduce the processing time and enhance the throughput. The proposed architectures, however, cannot reach a compromise among power consumption, memory usage, and throughput. In view of this, the architecture proposed in this paper contains a pipeline-based micro-core that is used in network processors to classify packets. To this end, three architectures have been implemented using the proposed micro-core. The first architecture performs parallel classification based on header fields. The second one classifies packets in a serial manner. The last architecture is the pipeline-based classifier, which can increase performance by nine times. The proposed architectures have been implemented on an FPGA chip. The results indicate a reduction in memory usage as well as an increase in speedup and throughput. The architecture has a power consumption of 1.294 W, and its throughput at a frequency of 233 MHz exceeds 147 Gbps.

3.
This paper presents a novel type of high-speed and area-efficient register-based transpose memory architecture enabled by reporting on both edges of the clock. The proposed new architecture, by using double-edge triggered registers, doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuitry used in prior work. The proposed design is evaluated with both FPGA and ASIC flows in 28/32 nm technology. The experimental results show that the proposed memory achieves almost 4X improvement in throughput while consuming 46% less area with the FPGA implementations compared to prior work. For ASIC implementations, it achieves more than 60% area reduction and at least 2X performance improvement while burning 60% less power compared to other register-based designs implemented with the same flow. As an example, a proposed 8×8 transpose memory with 12-bit input/output resolution is able to achieve a throughput of 107.83 Gbps at 647 MHz by taking only 140 slices on a Xilinx Virtex-7 FPGA platform, and a throughput of 88.2 Gbps at 529 MHz by taking 0.024 mm² of silicon area for ASIC. The proposed transpose memory is integrated in both 2D-DCT and 2D-IDCT blocks for signal processing applications on the same FPGA platform. The new architecture allows a 3.5X speed-up in performance for the 2D-DCT algorithm, compared to the previous work, while consuming 28% less area, and 2D-IDCT achieves a 3X speed-up while consuming 20% less area.

4.
5.
We present a novel 4096 complex-point, fully systolic VLSI FFT architecture based on the combination of three consecutive radix-4 stages resulting in a 64-point FFT engine. The outcome of cascading these 64-point FFT engines is an improved architecture that efficiently processes large input data sets in real time. Using 64-point FFT engines reduces the buffering and the latency to one third of those of a fully unfolded radix-4 architecture, while the radix-4 schema simplifies the calculations within each engine. The proposed 4096 complex-point architecture has been implemented on an FPGA, achieving a post-route clock frequency of 200 MHz and a sustained throughput of 4096 points/20.48 μs. It has also been implemented in a high-performance 0.13 μm, 1P8M CMOS process, achieving a worst-case (0.9 V, 125 °C) post-route clock frequency of 604.5 MHz and a sustained throughput of 4096 points/3.89 μs while consuming 4.4 W. The architecture is extended to accomplish FFT computations of 16K, 64K and 256K complex points with 352, 256 and 188 MHz operating frequencies, respectively.

6.
Achieving high image quality is an important aspect in an increasing number of wireless multimedia applications. These applications require resource-efficient error correction hardware to detect and correct errors introduced by the communication channel. This paper presents an innovative flexible architecture for error correction using Low-Density Parity-Check (LDPC) codes. The proposed partially-parallel decoder architecture utilizes a novel code construction technique based on a multi-level Hierarchical Quasi-Cyclic (HQC) matrix. The proposed architecture is resource efficient, provides scalable throughput and requires substantially less power compared to other decoders reported to date. The proposed decoder has been implemented on a Xilinx FPGA suitable for WiMAX applications and achieves a throughput of 548 Mbps. Performance evaluation of the decoder has been carried out by transmitting JPEG images over a noisy wireless channel and comparing the quality of the reconstructed images with those from other similar decoders.

7.
This paper presents a link adaptation algorithm dedicated to 100 Gbps wireless transmission. Interleaved Reed-Solomon codes are selected as forward error correction (FEC) algorithms. The redundancy of the codes is selected according to the channel bit error rate (BER). The uncomplicated FEC scheme allows implementing a complete data link layer processor in an FPGA (field programmable gate array). In our case, we use the Virtex-7 FPGA to validate the functionality of our implementation. The proposed FPGA processor achieves 169 Gbps throughput. Moreover, the implementation is synthesized into 40 nm CMOS technology, and the described link adaptation algorithm allows reducing the consumed energy per bit to values below 1 pJ/bit at BER < 1e−4. With higher BER, the energy increases up to ∼13 pJ/bit.

8.
A Split decoding algorithm is proposed which divides each row of the parity check matrix into two or more nearly independent simplified partitions. The proposed method significantly reduces the wire interconnect and decoder complexity and therefore results in fast, small, and energy-efficient circuits. Three fully parallel decoder chips for a (2048, 1723) LDPC code compliant with the 10GBASE-T standard using MinSum normalized, MinSum Split-2, and MinSum Split-4 methods are designed in 65 nm, seven-metal-layer CMOS. The Split-4 decoder occupies 6.1 mm², operates at 146 MHz, and delivers 19.9 Gbps throughput with 15 decoding iterations. At 0.79 V, it operates at 47 MHz, delivers 6.4 Gbps and dissipates 226 mW. Compared to MinSum normalized, the Split-4 decoder chip is 3.3 times smaller, has a clock rate and throughput 2.5 times higher, is 2.5 times more energy efficient, and has an error performance degradation of 0.55 dB with 15 iterations.
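The figures reported for the Split-4 decoder are consistent with a simple throughput model in which a fully parallel decoder spends one clock cycle per iteration, so that throughput is roughly block length × clock frequency / iterations. The one-cycle-per-iteration assumption and the function names below are illustrative and are not stated in the abstract; this is a minimal Python sketch, not the authors' method.

```python
def decoder_throughput_gbps(block_bits, clock_mhz, iterations, cycles_per_iteration=1):
    """Throughput in Gbps, assuming one codeword completes every
    iterations * cycles_per_iteration clock cycles (an assumed model)."""
    cycles_per_block = iterations * cycles_per_iteration
    return block_bits * clock_mhz * 1e6 / cycles_per_block / 1e9


def energy_per_bit_pj(power_mw, throughput_gbps):
    """Energy per decoded bit in picojoules: power divided by bit rate."""
    return power_mw * 1e-3 / (throughput_gbps * 1e9) * 1e12


# Values taken from the abstract: 2048-bit code, 15 iterations.
nominal = decoder_throughput_gbps(2048, 146, 15)      # 146 MHz operating point
low_voltage = decoder_throughput_gbps(2048, 47, 15)   # 47 MHz at 0.79 V
print(round(nominal, 1), round(low_voltage, 1))       # 19.9 6.4
print(round(energy_per_bit_pj(226, 6.4), 1))          # 35.3 (pJ/bit at 0.79 V)
```

Under this model both reported operating points (19.9 Gbps and 6.4 Gbps) are reproduced, and the 0.79 V point works out to about 35 pJ per bit.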

9.
This study presents a design of a two-dimensional (2D) discrete cosine transform (DCT) hardware architecture dedicated to High Efficiency Video Coding (HEVC) on field programmable gate array (FPGA) platforms. The proposed methodology organizes the 2D-DCT computation to fit the internal components and characteristics of FPGA resources. A four-stage circuit architecture is developed to implement the proposed methodology. This architecture supports variable sizes of DCT computation, including 4 × 4, 8 × 8, 16 × 16, and 32 × 32. The proposed architecture has been implemented in SystemVerilog and synthesized on various FPGA platforms. Compared with existing related works in the literature, the proposed architecture demonstrates significant advantages in hardware cost and performance. It is able to sustain 4K@30 fps ultra high definition (UHD) TV real-time encoding applications with a reduction of 31–64% in hardware cost.

10.
The massive computation required by the reconstruction algorithm for compressive sensing (CS) has been a major concern for its real-time application. In this paper, we propose a novel high-speed architecture for the orthogonal matching pursuit (OMP) algorithm, which is the algorithm most frequently used to reconstruct compressively sensed signals. The proposed design offers a very high throughput and includes an innovative pipeline architecture and scheduling algorithm. Least-squares problem solving, which requires a huge amount of computation in the OMP, is implemented by using systolic arrays with four new processing elements. In addition, a distributed-arithmetic-based circuit for matrix multiplication is proposed to counterbalance the area overhead caused by the multi-stage pipelining. The results of logic synthesis show that the proposed design reconstructs signals nearly 19 times faster while occupying only a 1.06 times larger area than the existing designs for N = 256, M = 64, and m = 16, where N is the number of original samples, M is the length of the measurement vector, and m is the sparsity level of the signal.

11.
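Targeting the LDPC codes of the DVB-S2 standard, this paper aims to realize an efficient FPGA-based encoder structure and proposes a fast pipelined parallel recursive encoding algorithm that significantly increases the encoded information throughput. The intermediate encoding variables and the parity bits are computed by a processing structure built on parallel shift and parallel XOR operations, which raises the encoding parallelism while effectively reducing the consumption of storage resources. In addition, the LDPC encoder memory structure is optimized for dynamic adaptive coding: data storage units and RAM address generators are effectively reused, further improving the utilization of FPGA hardware logic resources. For DVB-S2 LDPC codes, verification on a Stratix IV series FPGA shows that the proposed encoder structure achieves an encoded information throughput above 20 Gbps at a system clock of 126.17 MHz.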

12.
In July 2004, a new amendment called Fidelity Range Extensions (FRExt) was added to the H.264/AVC as a standardization initiative motivated by the rapidly growing demands when coding higher-fidelity video material. One improvement present in the FRExt is the inclusion of a new 8×8 integer transform that only makes use of additions and shifters to avoid mismatches between encoders and decoders. This paper presents a processor with pipeline architecture for real-time implementation of the complete process for the 8×8 Transform Coding in H.264: forward 8×8 integer transform, quantization and scaling, re-scaling, inverse 8×8 integer transform and reconstruction of the image block. This architecture has been conceived with the aim of achieving a high operation frequency and high throughput without increasing the hardware complexity. In order to achieve an efficient implementation, hardware solutions have been developed for the different circuit modules. 8×8 forward and inverse transforms are calculated using the separability property with architecture more suitable for pipeline schemes made up of two 1D processors and a transpose register array. New expressions for forward quantization and scaling are presented allowing efficient hardware implementation by avoiding the sign conversion. The inverse quantization has also been optimized in terms of hardware complexity by minimizing the involved arithmetic operations. Furthermore, an exhaustive analysis in the dynamic range of the datapath is made to fix the optimum bus widths with the aim of reducing the size of the circuit while avoiding overflow. Finally, the critical paths of the various computing units have been carefully analyzed and balanced using a pipeline scheme in order to maximize the operation frequency without introducing an excessive latency. 
A prototype with the proposed architecture has been synthesized in a 130 nm HCMOS technology process, which achieves a maximum speed of 330 MHz with a throughput of 2640 Mpixels/s.
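The separability property mentioned in this abstract means that a 2D transform Y = C·X·Cᵀ can be computed with 1D row passes and a transpose between them, which is what the two 1D processors and the transpose register array provide in hardware. The following Python sketch illustrates only that dataflow; the 4×4 integer matrix is a small stand-in chosen for illustration (it is not the 8×8 FRExt matrix), and all function names are hypothetical.

```python
def transform_1d_rows(matrix, block):
    """Apply the 1D transform to every row of `block` (computes X * C^T)."""
    n = len(matrix)
    return [[sum(matrix[k][j] * row[j] for j in range(n)) for k in range(n)]
            for row in block]


def transpose(block):
    """Swap rows and columns (the role of the transpose register array)."""
    return [list(col) for col in zip(*block)]


def transform_2d_separable(matrix, block):
    """Y = C X C^T as: row pass, transpose, row pass, transpose."""
    stage1 = transform_1d_rows(matrix, block)              # X C^T
    stage2 = transform_1d_rows(matrix, transpose(stage1))  # C X^T C^T
    return transpose(stage2)                               # C X C^T


def transform_2d_direct(matrix, block):
    """Reference: Y[u][v] = sum_i sum_j C[u][i] * X[i][j] * C[v][j]."""
    n = len(matrix)
    return [[sum(matrix[u][i] * block[i][j] * matrix[v][j]
                 for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]


# Illustrative integer matrix and input block (assumptions, not from the paper).
C = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]
X = [[5, 11, 8, 10],
     [9, 8, 4, 12],
     [1, 10, 11, 4],
     [19, 6, 15, 7]]
print(transform_2d_separable(C, X) == transform_2d_direct(C, X))  # True
```

The reported throughput is consistent with processing 8 pixels per clock cycle, since 330 MHz × 8 = 2640 Mpixels/s.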

13.
Multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) technology is regarded as a promising solution for offering ultra-high data rates in wireless communications. This paper presents a field-programmable gate array (FPGA) implementation of an early-pruned K-Best detection algorithm applicable to ultra-high data throughput MIMO-OFDM communication systems. The algorithm simplifies the computation significantly compared to the conventional K-Best algorithm with negligible bit error ratio (BER) degradation. A fully parallel structure is implemented on an FPGA platform, which achieves a detection throughput of 1.9 Gb/s, about three times that of the previous implementation. Moreover, a pre-processing method is realized that reduces the number of multipliers inside the detector and shrinks the critical path delay down to 8.32 ns. Together with a candidate sharing and early-pruning architecture that further saves hardware cost, a high-speed, compact MIMO signal detector is demonstrated.

14.
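Design and implementation of FPGA-based hardware encryption   Total citations: 1, self-citations: 1, citations by others: 0
An FPGA hardware platform is built around a Cyclone II series FPGA chip, and a resource-first design scheme for DES and AES encryption and decryption is proposed. By analyzing the nonlinear characteristics of the S-box, a new composite-field transformation is constructed that avoids the resource cost incurred by isomorphic transformations. During encryption and decryption, the round-function hardware structure is reused to minimize hardware resource usage. The overall design adopts an embedded pipeline structure, which reduces logic complexity while increasing processing speed. The experimental results confirm that the resource utilization of the FPGA hardware encryption is far lower than that of ASIC hardware encryption, that the execution speed reaches the Gbit/s range, and that encryption performance is greatly improved.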

15.
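Design of a semi-parallel decoder for quasi-cyclic LDPC codes   Total citations: 2, self-citations: 2, citations by others: 0
By exploiting the structural properties of quasi-cyclic LDPC codes, a decoder with a semi-parallel structure can achieve an effective tradeoff between complexity and decoding rate. An implementation method for the semi-parallel structure is proposed, and its performance is verified through an FPGA implementation.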
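The structural property in question is that a quasi-cyclic parity-check matrix is built from square blocks that are either all-zero or cyclically shifted identity matrices, so the whole matrix is described by a small table of shift values. The Python sketch below shows this expansion under assumed conventions (a shift of −1 denotes an all-zero block; the shift table and block size are made up for illustration and do not come from the paper).

```python
def circulant_permutation(size, shift):
    """size x size identity matrix with its columns cyclically shifted by
    `shift`; a negative shift denotes an all-zero block (assumed convention)."""
    if shift < 0:
        return [[0] * size for _ in range(size)]
    return [[1 if col == (row + shift) % size else 0 for col in range(size)]
            for row in range(size)]


def expand_base_matrix(shifts, size):
    """Expand a table of shift values into the full binary parity-check matrix."""
    h = []
    for shift_row in shifts:
        blocks = [circulant_permutation(size, s) for s in shift_row]
        for r in range(size):
            # Concatenate row r of every block in this block row.
            h.append([bit for block in blocks for bit in block[r]])
    return h


# Hypothetical 2 x 4 shift table expanded with 3 x 3 blocks -> 6 x 12 matrix.
SHIFTS = [[0, 1, -1, 2],
          [2, -1, 0, 1]]
H = expand_base_matrix(SHIFTS, 3)
for row in H:
    print("".join(str(bit) for bit in row))
```

Because every row inside a block row has the same pattern up to a cyclic shift, one common way to organize a semi-parallel decoder is to handle one block row (or one block) per step with a bank of identical processing units and simple shifted addressing; whether the paper uses this exact organization is not stated in the abstract.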

16.
Adding a reconfigurable processing unit (RPU) to a general-purpose processor is a promising way to improve performance significantly. In this paper, a Reconfigurable Multi-Core (RMC) architecture combining general-purpose multi-core and reconfigurable logic is proposed. The reconfigurable logic is logically separated into RPUs, which are coupled with general-purpose cores as co-processors via a full crossbar switch. An RPU Manager (RPU-M) is also designed to manage the RPUs. To verify RMC, a simulation method based on Simics and a Virtex 5 FPGA is adopted, which simplifies the simulation and assures the evaluation accuracy of hardware function cores. Five workloads are selected to test RMC: 3-DES, AES, SHA2, IDCT and JPEG_ENC. The experimental results show a 3.10 times average speedup over the software implementation on the original multi-core, and the data and control communication overhead on RMC is acceptable.

17.
The Block Decoder (BD), which is an indispensable component of the JPEG 2000 image compression standard, has the highest computational complexity and determines the speed of the overall decoder system. This paper proposes a high-throughput pass-parallel BD architecture, which can decode more than one bit per clock cycle. In the BD, the dependency between the context generation and arithmetic decoding units introduces stalling and reduces the throughput of the decoding process. The proposed selective byte input and synchronous sample skipping techniques are used to prevent stalling in the decoding process. The proposed architecture achieves 86% more throughput with a 50% increase in hardware cost compared with the best available serial BD architecture. In comparison with the best available pass-parallel architecture, throughput improves almost 8.2 times with a 61% increase in hardware cost. The incorporation of the speed-up techniques in the design is the main reason for the higher hardware consumption. The Figure of Merit of the proposed design, which is the ratio of throughput to hardware cost, is higher than that of the available BD architectures for a typical code block (CB) size of 32 × 32. The ASIC implementation of the proposed design consumes 66 mW of power at the maximum operating frequency.

18.
In this article, a novel block-based visible image watermark VLSI architecture design and its hardware implementation in a field programmable gate array (FPGA) are proposed. In this watermarking process, the 1D-DCT is introduced to facilitate hardware implementation. A mathematical model is developed to reduce the computational complexity of calculating the embedding and scaling factors, which are used to give the resultant image the best quality with uniform watermark visibility. The proposed architecture has a 12-stage pipeline. Parallelism techniques are employed at the block level in order to achieve high performance. A single 8-point fast 1D-DCT is used to calculate the DCT coefficient values of the host image and the watermark image to minimize resource utilization and power consumption. The hardware implementation of this algorithm leads to numerous advantages, including reduced power and area and higher pipeline throughput. The performance of the architecture is studied by implementing it on a Xilinx Virtex V technology based FPGA with DSP48E. The throughput achieved with this VLSI architecture is 5.21 Gbits/s with a total resource utilization of 4058 BELs.

19.
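Design analysis and implementation of a 10 Gbps line interface   Total citations: 5, self-citations: 0, citations by others: 5
In connection with the development of a terabit high-performance router under the National 863 Program, this paper proposes an FPGA-based implementation scheme for a 10 Gbps line interface supporting IPv4, IPv6 and MPLS, and designs a pipelined lookup structure that supports table lookup at 10G line rate with a capacity of 9M. Hardware implementation and performance test results based on the proposed scheme show that the design meets the design requirements of a 10 Gbps line interface.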

20.
Low-Density Parity-Check (LDPC) codes achieve the best performance when they are decoded with the sum-product (SP) algorithm. This is a two-phase iterative algorithm in which two types of messages are interchanged and updated in each iteration. The group-shuffled or layered decoding schemes applied to the SP algorithm speed up its convergence by modifying its schedule, so they reduce the number of iterations required to achieve a given performance. However, the two-phase processing is still maintained. In this paper, a modification of the group-shuffled scheme suitable for high-rate LDPC codes is proposed. The modification allows the two phases of the computation to overlap, achieving a convergence speed-up close to that of the group-shuffled scheme with higher throughput. In addition, high-throughput architectures are presented for the modified algorithm. As an example, the proposed architecture has been implemented for the 2048-bit LDPC code of the IEEE 802.3an standard and synthesized in a 90 nm CMOS process, achieving a throughput of 22.40 Gbps at 14 iterations with a clock frequency of 306 MHz and a total area of 10.5 mm². Furthermore, the decoder performs within 0.5 dB of the floating-point 100-iteration sum-product algorithm at a PER of 10⁻⁵.
