Similar Documents
1.
Xu Qidi, Liu Zhenghong, Zheng Lin. Journal of Computer Applications (《计算机应用》), 2022, 42(12): 3841-3846
As communication technology has developed, communication terminals have increasingly turned to software to support multiple communication standards and protocols. Traditional software-defined radio architectures that use the computer's central processing unit (CPU) as the computing element cannot meet the wideband data-throughput requirements of high-speed wireless communication systems such as multiple-input multiple-output (MIMO). To address this, an acceleration method for a low-density parity-check (LDPC) decoder based on the graphics processing unit (GPU) was proposed. First, guided by a theoretical analysis of how GPU-parallel heterogeneous computing accelerates the 4G/5G physical-layer signal-processing modules of GNU Radio, the layered normalized min-sum (LNMS) algorithm was adopted for its higher parallel efficiency. Second, decoding latency was reduced through a global synchronization strategy, careful allocation of GPU memory, and a stream-parallel mechanism, while GPU multi-threading was used to parallelize the LDPC decoding flow. Finally, the proposed GPU-accelerated decoder was implemented and verified on a software-defined radio platform, and the bit-error-rate performance and acceleration bottlenecks of the parallel decoder were analyzed. Experimental results show that, compared with traditional serial decoding on a CPU, the CPU+GPU heterogeneous platform raises the LDPC decoding rate by a factor of roughly 200, with decoder throughput exceeding 1 Gb/s; the improvement over traditional decoders is especially pronounced on large-scale data.
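A minimal sketch of the check-node half of a layered normalized min-sum update, the kind of kernel such a decoder parallelizes with one thread per check node; the storage layout (row_ptr/col_idx) and all names are illustrative assumptions, not the authors' implementation:

```cuda
// Normalized min-sum check-node update: one thread per check node.
// Edges of check node m are v2c[row_ptr[m] .. row_ptr[m+1]).
__global__ void cnp_normalized_min_sum(const int* row_ptr, const float* v2c,
                                       float* c2v, int num_checks, float alpha)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= num_checks) return;

    int begin = row_ptr[m], end = row_ptr[m + 1];
    float min1 = 1e30f, min2 = 1e30f;   // smallest and second-smallest |v2c|
    int   min_pos = -1;
    float sign_prod = 1.0f;

    for (int e = begin; e < end; ++e) {           // pass 1: two minima + sign
        float v = v2c[e];
        sign_prod *= (v < 0.0f) ? -1.0f : 1.0f;
        float mag = fabsf(v);
        if (mag < min1) { min2 = min1; min1 = mag; min_pos = e; }
        else if (mag < min2) { min2 = mag; }
    }
    for (int e = begin; e < end; ++e) {           // pass 2: extrinsic outputs
        float v = v2c[e];
        float self_sign = (v < 0.0f) ? -1.0f : 1.0f;
        float mag = (e == min_pos) ? min2 : min1; // exclude the edge's own input
        c2v[e] = alpha * sign_prod * self_sign * mag;  // alpha = normalization
    }
}
```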

2.
Parallel GPU implementation of an enhanced quasi-maximum-likelihood LDPC decoder (cited by 1: 0 self-citations, 1 external)
The enhanced quasi-maximum-likelihood (EQML) decoder outperforms the conventional belief propagation (BP) decoder for short low-density parity-check (LDPC) codes and can better meet the high-reliability requirements of 5G mobile communications, but its complex computational structure greatly reduces decoding speed. To speed up the EQML decoder, a GPU-based parallel acceleration scheme is proposed: the parity-check matrix of the irregular LDPC code is compressed and stored, the conventional BP decoding algorithm is reordered to maximize thread utilization within each kernel, and multiple codewords are decoded in parallel at every stage of the reprocessing procedure, achieving optimized memory access and stream-parallel decoding. Experimental results show that the GPU-based EQML decoder preserves the error-correction performance while improving decoding speed by about two orders of magnitude over its CPU-based counterpart.
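As a rough illustration of the compressed parity-check-matrix storage and multi-codeword parallelism described above, here is a CSR-style compression sketch with a hypothetical edge-major message layout; the paper's actual data structures are not specified:

```cuda
// CSR-style compression of an irregular parity-check matrix H, plus an
// edge-major message layout so that decoding C codewords in parallel gives
// coalesced accesses (thread t of a warp handles codeword t on the same edge).
#include <vector>

struct CompressedH {
    std::vector<int> row_ptr;   // size M+1: edge range of each check node
    std::vector<int> col_idx;   // size E: variable node of each edge
};

CompressedH compress(const std::vector<std::vector<int>>& H_rows) {
    CompressedH h;
    h.row_ptr.push_back(0);
    for (const auto& row : H_rows) {                 // one entry per check node
        for (int v : row) h.col_idx.push_back(v);    // store only the 1s of H
        h.row_ptr.push_back((int)h.col_idx.size());
    }
    return h;
}
// Message for (edge e, codeword c) lives at msg[e * C + c]: consecutive
// threads read consecutive codewords of the same edge -> one coalesced
// memory transaction per warp.
```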

3.
An LDPC decoder based on the parallel layered decoding algorithm can achieve a high decoding rate with a small chip area. A hardware design method for such a decoder is proposed that further reduces the chip area by using shift-register chains. The design is implemented in a TSMC 65 nm process using the rate-1/2 LDPC code from IEEE 802.16e. With the iteration count set to 10, the decoder achieves a decoding rate of 1.2 Gb/s with a chip area of 1.1 mm². Puncturing allows the decoder to support continuous code rates between 1/2 and 1.
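A small sketch of how puncturing yields continuous rates between 1/2 and 1: punctured parity positions are simply not transmitted, and the receiver re-inserts them as erasures (LLR = 0) before decoding with the mother code. Function and parameter names here are hypothetical:

```cuda
// Receiver-side view of puncturing: the first K info positions plus
// keep_parity parity positions carry channel LLRs; the rest are erasures.
#include <vector>

std::vector<float> puncture_llrs(const std::vector<float>& llr,
                                 int K, int keep_parity) {
    std::vector<float> out(llr);                 // llr.size() == N (mother code)
    for (size_t i = (size_t)(K + keep_parity); i < out.size(); ++i)
        out[i] = 0.0f;                           // erased parity: no information
    return out;                                  // effective rate = K / (K + keep_parity)
}
```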

4.
Design of a high-speed layered LDPC decoder (cited by 2: 0 self-citations, 2 external)
A novel quasi-parallel layered LDPC decoder is designed that performs real-time decoding of a rate-0.5, length-4608 (3,6) regular quasi-cyclic LDPC code. Place-and-route was completed on an Altera Stratix II EP2S60 device. The maximum operating frequency is 94.47 MHz, and with the maximum iteration count set to 25 the decoding throughput reaches 58.70 Mbps. Compared with the traditional TPMP decoding scheme, the average number of decoding iterations is nearly halved and the number of RAM blocks used is significantly reduced. The design is highly extensible and general: by pre-storing the parity-check matrix pattern and row-weight information, it can decode quasi-cyclic LDPC codes of any rate, regular or irregular.
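The claimed generality (any rate, regular or irregular, from a pre-stored matrix pattern) rests on the quasi-cyclic structure: only the base matrix of circulant shift values needs storing. Below is a toy, runnable expansion under the common convention that shift s maps circulant row r to column (r + s) mod Z; the convention and the example matrix are assumptions, since the paper does not spell them out:

```cuda
// Expand a base matrix of circulant shifts into individual check/variable edges.
#include <cstdio>

int main() {
    const int Z = 4;                         // expansion (circulant) size
    const int B[2][3] = { {1, -1, 0},        // base matrix: shift value, -1 = zero block
                          {2,  0, -1} };
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 3; ++j) {
            if (B[i][j] < 0) continue;       // all-zero block: no edges
            for (int r = 0; r < Z; ++r)      // each circulant row is one edge
                printf("check %d -- var %d\n",
                       i * Z + r, j * Z + (r + B[i][j]) % Z);
        }
    return 0;
}
```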

5.
To address the parameter-selection problem in the design of efficient LDPC decoders, a discrete density evolution algorithm is proposed for the Turbo decoding message passing (TDMP) decoding algorithm. This discrete density evolution is used to optimize the decoding algorithm's correction factor and quantization precision. Compared with the traditional approach of optimizing through numerical simulation, the proposed algorithm is far more efficient and markedly effective: test results show that the optimized fixed-point decoder performs within about 0.1 dB of a pure floating-point simulation. For the decoder architecture, a distributed-RAM-based circular storage structure for P messages is proposed; compared with the traditional memory structure based on registers and a Benes network, resource consumption drops significantly. Hardware implementation and testing on a Xilinx FPGA platform show advantages over comparable decoders in both resource consumption and throughput, making this an efficient LDPC hardware decoder.
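A runnable sketch of the two quantities such a discrete density evolution would tune: the fixed-point message quantization (word length and fractional bits) and the correction factor applied before the check-node update. The numeric values below are placeholders, not the paper's optimized parameters:

```cuda
// Uniform saturating quantizer for LLR messages, plus a correction factor.
#include <algorithm>
#include <cmath>
#include <cstdio>

int quantize(float llr, int total_bits = 6, int frac_bits = 2) {
    int max_q = (1 << (total_bits - 1)) - 1;            // 6-bit -> [-31, 31]
    int q = (int)std::lround(llr * (1 << frac_bits));   // scale to fixed point
    return std::max(-max_q, std::min(max_q, q));        // saturate
}

int main() {
    const float alpha = 0.8125f;   // correction factor, e.g. 13/16 in hardware
    for (float l : {0.3f, -2.7f, 11.0f})
        printf("LLR %+5.2f -> q = %+3d (corrected: %+3d)\n",
               l, quantize(l), quantize(alpha * l));
    return 0;
}
```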

6.
Huai Yu, Dai Yimin. Computer Simulation (《计算机仿真》), 2010, 27(5): 309-313
The min-sum layered decoding algorithm is analyzed with a view to using a pipelined structure in decoders for structured LDPC codes. To further improve decoder performance, a modified layered min-sum algorithm is proposed that allows the decoder for structured LDPC codes to use a pipeline to increase system throughput. Based on the modified algorithm, a low-complexity decoder architecture is designed, with detailed descriptions of two modules: the serial check-node processor and the flexible permutation network. The improvement in processing latency achieved by the pipelined decoder is analyzed, and the performance of different decoding algorithms is simulated at the same code length. Simulation results show that the modified algorithm loses almost nothing in performance relative to the min-sum decoding algorithm, while the pipelined structure increases throughput by a factor of 2 to 3 and flexibly supports structured LDPC codes of various code lengths and rates.

7.
Building on an optimization of the layered decoding algorithm, a multi-rate QC-LDPC decoder is proposed. An improved layered message-passing algorithm achieves fast convergence, cutting the number of decoding iterations to below 50% of the classical approach. The architecture needs only four memories to store intermediate belief information, reducing chip area and power consumption. Check-node belief updates use a corrected, integer-quantized layered algorithm, lowering computational complexity, and the chosen correction factor reduces the decoder's error rate. A QC-LDPC decoder implemented on this architecture supports three code rates with a chip size of 600,000 gates and a 110 MHz clock frequency; at rate 1/2 the decoding rate reaches 134 Mb/s.

8.
The subset‐sum problem is a well‐known non‐deterministic polynomial‐time complete (NP‐complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two‐list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide‐and‐conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector‐based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd.
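A runnable sketch of the generation stage's recursion-free idea: the subset sums of one half of the input are built level by level with a vector that doubles at each step, and the inner loop maps directly to a data-parallel kernel. This illustrates the strategy only, not the paper's code:

```cuda
// Vector-based iterative generation of all 2^k subset sums of one half-set.
#include <cstdio>
#include <vector>

int main() {
    const std::vector<long long> half = {7, 11, 19, 23};   // one half of the set
    std::vector<long long> sums = {0};                     // level 0: empty subset
    for (long long w : half) {                             // each level doubles the list
        size_t n = sums.size();
        sums.resize(2 * n);
        for (size_t i = 0; i < n; ++i)
            sums[n + i] = sums[i] + w;   // data-parallel: one thread per i on a GPU
    }
    for (long long s : sums) printf("%lld ", s);           // 2^4 = 16 subset sums
    printf("\n");
    return 0;
}
```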

9.
In wireless communication, the Viterbi decoding algorithm (VDA) is one of the most popular channel decoding algorithms and is widely used in WLAN, WiMAX, and 3G communications. However, the throughput of the Viterbi decoder is constrained by its convolutional characteristic. Recently, the three‐point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases: the forward, trace‐back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on multi‐core CPU, graphics processing unit (GPU), and field-programmable gate array (FPGA) platforms. We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we apply two optimization methods, single instruction multiple data and multithreading, to gain over 145x speedup over the naive CPU version on a quad‐core CPU platform. For GPU platforms, we propose a combination of cached-memory optimization, coalesced global memory accesses, a codeword-packing scheme, and asynchronous data transfer, achieving a throughput of 404.65 Mbps, a 12x speedup over the initial GPU version on an NVIDIA GeForce GTX580 card, and a 7x speedup over an Intel quad‐core CPU i5‐2300 of the same manufacturing year, both with fully optimized schemes. In addition, for FPGA platforms, we customize a radix‐4 pipelined architecture for the TVDA in a 45‐nm FPGA chip from Xilinx (XC6VLX760); under a 209.15‐MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we discuss the performance evaluation and efficiency comparison of the different flexible architectures for real‐time Viterbi decoding in terms of decoding throughput, power consumption, optimization schemes, programming costs, and price. Copyright © 2013 John Wiley & Sons, Ltd.
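A sketch of the add-compare-select (ACS) core of the forward phase, with one thread per trellis state; the branch-metric layout (two metrics per state) and the shift-register butterfly convention are assumptions for illustration, not the paper's design:

```cuda
// One ACS trellis step: each state selects the better of its two predecessors.
__global__ void acs_step(const float* pm_in, float* pm_out, int* decision,
                         const float* bm,      // branch metrics, 2 per state
                         int num_states)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= num_states) return;
    // Shift-register convention: predecessors of state s are p0 = s >> 1 and
    // p1 = p0 + num_states/2 (the bit shifted out of the register differs).
    int p0 = s >> 1;
    int p1 = p0 + (num_states >> 1);
    float m0 = pm_in[p0] + bm[2 * s];      // path metric via predecessor 0
    float m1 = pm_in[p1] + bm[2 * s + 1];  // path metric via predecessor 1
    pm_out[s]   = fminf(m0, m1);           // select survivor (min metric)
    decision[s] = (m1 < m0);               // record choice for trace-back phase
}
```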

10.
The one‐step leapfrog alternating‐direction‐implicit finite‐difference time‐domain (ADI‐FDTD) method, free from the Courant‐Friedrichs‐Lewy (CFL) stability condition and sub‐step computations, is efficient when dealing with fine-grid problems. However, solving the numerous tridiagonal systems still imposes a great computational burden and makes the method hard to execute in parallel. In this paper, we propose an efficient graphics processing unit (GPU)‐based parallel implementation of the one‐step leapfrog ADI‐FDTD for far‐field EM scattering simulation of objects, in which we present and analyze how the calculation area is divided and threads are allocated, and we propose a data-layout transformation of the z components to achieve a better memory-access pattern, a key factor affecting GPU execution efficiency. A simulation experiment is carried out to verify the accuracy and efficiency of the GPU‐based implementation. The results show good agreement between the proposed one‐step leapfrog ADI‐FDTD method and Yee's FDTD in solving the far‐field scattering problem, and large performance gains when the method is accelerated with GPU technology.
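The z-component data-layout transformation is an instance of the classic array-of-structures to structure-of-arrays change; a sketch of why it matters for coalescing (the field struct and the placeholder update are illustrative, not the paper's update equations):

```cuda
// AoS vs SoA: same logical update, very different memory-access patterns.
struct FieldAoS { float ex, ey, ez, hx, hy, hz; };   // fields interleaved per cell

__global__ void update_ez_aos(FieldAoS* f, int n) {  // strided, uncoalesced
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f[i].ez *= 1.01f;                     // placeholder update
}

__global__ void update_ez_soa(float* ez, int n) {    // unit-stride, coalesced
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) ez[i] *= 1.01f;                       // same placeholder update
}
// In the SoA layout each field (ex[], ey[], ez[], ...) is a separate array, so
// thread i of a warp reads a 4-byte element at stride 1 instead of stride
// sizeof(FieldAoS) = 24 bytes, letting the hardware coalesce the accesses.
```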

11.
Information reconciliation is a key step in quantum key distribution (QKD), and LDPC-based reconciliation is a current research focus worldwide. Existing LDPC decoders in QKD systems are generally designed around a single-codeword sequential decoding scheme and use quasi-cyclic LDPC codes with relatively poor performance, so their throughput and error-correction limits are low and cannot meet the security and performance requirements of high-rate channels with high error rates. A novel adaptive dual-codeword parallel LDPC mechanism for QKD, ADCPM, is designed: it uses random LDPC codes and corrects errors in two key strings in parallel during decoding, nearly doubling throughput over traditional methods. Experiments on a real platform show that ADCPM tolerates error rates as high as 10% with throughput above 140 Mbps, effectively supporting high-speed, secure quantum information reconciliation under high error rates.
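One plausible way to realize dual-codeword parallelism is with two CUDA streams whose transfers and decoding overlap; the sketch below is written under that assumption, with decode_kernel standing in for the actual LDPC kernel (ADCPM's real pipeline is not described at this level of detail):

```cuda
// Decode two key strings concurrently, one CUDA stream per codeword.
#include <cuda_runtime.h>

__global__ void decode_kernel(const float* llr, float* out) {
    // stand-in for the LDPC decoding kernel (iterations, syndrome check, ...)
}

void decode_two(const float* h_llr[2], float* h_out[2], size_t n_bytes) {
    dim3 grid(64), block(256);                 // placeholder launch geometry
    cudaStream_t s[2];
    float *d_llr[2], *d_out[2];
    for (int c = 0; c < 2; ++c) {              // issue both pipelines back-to-back
        cudaStreamCreate(&s[c]);
        cudaMalloc(&d_llr[c], n_bytes);
        cudaMalloc(&d_out[c], n_bytes);
        cudaMemcpyAsync(d_llr[c], h_llr[c], n_bytes, cudaMemcpyHostToDevice, s[c]);
        decode_kernel<<<grid, block, 0, s[c]>>>(d_llr[c], d_out[c]);
        cudaMemcpyAsync(h_out[c], d_out[c], n_bytes, cudaMemcpyDeviceToHost, s[c]);
    }
    for (int c = 0; c < 2; ++c) {              // wait, then release resources
        cudaStreamSynchronize(s[c]);
        cudaFree(d_llr[c]); cudaFree(d_out[c]);
        cudaStreamDestroy(s[c]);
    }
}
```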

12.
Sorting is a very important task in computer science and becomes a critical operation for programs making heavy use of sorting algorithms. General‐purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize some sorting algorithms. Two GPU‐based implementations of quicksort have been presented in the literature: GPU‐quicksort, a compute‐unified device architecture (CUDA) iterative implementation, and the CUDA dynamic parallel (CDP) quicksort, a recursive implementation provided by NVIDIA Corporation. We propose CUDA‐quicksort, an iterative GPU‐based implementation of the sorting algorithm. CUDA‐quicksort was designed starting from GPU‐quicksort. Unlike GPU‐quicksort, it uses atomic primitives to perform inter‐block communications while ensuring optimized access to the GPU memory. Experiments performed on six sorting benchmark distributions show that CUDA‐quicksort is up to four times faster than GPU‐quicksort and up to three times faster than CDP‐quicksort. An in‐depth performance analysis of CUDA‐quicksort and GPU‐quicksort shows that the main improvement comes from the optimized GPU memory access rather than from the use of atomic primitives. Moreover, to assess the advantages of CUDA dynamic parallelism, we implemented a recursive version of CUDA‐quicksort. Experimental results show that CUDA‐quicksort is faster than the CDP‐quicksort provided by NVIDIA, with the best performance achieved by the iterative implementation. Copyright © 2015 John Wiley & Sons, Ltd.
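A minimal sketch of the atomic-primitive idea: during partitioning, blocks coordinate by reserving output slots with atomicAdd instead of running a separate prefix-sum pass. Simplified to one element per thread; this is not the CUDA-quicksort source:

```cuda
// Partition around a pivot; left and right counters are shared by all blocks.
__global__ void partition_atomic(const int* in, int* out, int n, int pivot,
                                 int* left_count,    // atomically grown from 0
                                 int* right_count)   // atomically grown from 0
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int v = in[i];
    if (v < pivot) {
        int pos = atomicAdd(left_count, 1);          // claim next left slot
        out[pos] = v;
    } else {
        int pos = atomicAdd(right_count, 1);         // claim next right slot
        out[n - 1 - pos] = v;                        // fill right side backwards
    }
}
// Note: the in-kernel ordering is not stable; quicksort only needs the two
// sides separated before recursing on each half.
```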

13.
This paper presents an effective scheme for clustering a huge data set using a PC cluster system in which each PC is equipped with a commodity programmable graphics processing unit (GPU). The proposed scheme is devised to achieve three-level hierarchical parallel processing of massive data clustering. A divide-and-conquer approach to parallel data clustering performs the coarse-grain parallel processing across multiple PCs with a message-passing mechanism. By taking advantage of the GPU's parallel processing capability, the scheme further exploits two types of fine-grain data parallelism at different levels in the nearest-neighbor search, the most computationally intensive part of the data-clustering process. The performance of the scheme is discussed in comparison with an implementation running entirely on the CPU. Experimental results clearly show that the proposed hierarchical parallel processing remarkably accelerates the data-clustering task. In particular, GPU co-processing is quite effective at improving the computational efficiency of parallel data clustering on a PC cluster; although data transfer from GPU to CPU is generally costly, GPU co-processing still significantly reduces the total execution time of data clustering.
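A sketch of the fine-grain parallelism in the nearest-neighbor step, with one thread scanning all centroids for one point; the paper's scheme exploits a second level of parallelism inside the distance computation itself, which is omitted here:

```cuda
// Assign each point to its nearest centroid (squared Euclidean distance).
__global__ void assign_nearest(const float* pts, const float* ctr, int* label,
                               int n, int k, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = 3.4e38f; int best_c = 0;
    for (int c = 0; c < k; ++c) {
        float d = 0.0f;
        for (int j = 0; j < dim; ++j) {          // squared Euclidean distance
            float diff = pts[i * dim + j] - ctr[c * dim + j];
            d += diff * diff;
        }
        if (d < best) { best = d; best_c = c; }  // keep the closest centroid
    }
    label[i] = best_c;
}
```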

14.
Low-Density Parity-Check (LDPC) codes with excellent error-correction capabilities have been widely used in both data communication and storage, to build reliable cyber-physical systems that are resilient to real-world noise. A rapidly prototyped field-programmable gate array (FPGA)-based decoder is essential for achieving high decoding performance while accelerating the development process. This paper proposes a three-level parallel architecture, TLP-LDPC, that achieves high throughput by fully exploiting the characteristics of both LDPC decoding and the underlying hardware while scaling effectively to large FPGA platforms. The three levels are a low-level decoding unit, a mid-level multi-unit decoding core, and a high-level multi-core decoder. The low-level decoding unit is a basic LDPC computation component that combines the features of the LDPC algorithm with specific FPGA structures (e.g., Look-Up Tables, LUTs) and eliminates potential data conflicts. The mid-level decoding core integrates input/output and multiple decoding units in a well-balanced pipelined fashion. The top-level multi-core architecture makes full use of board-level resources to improve overall throughput. We develop LDPC C++ code with dedicated pragmas and leverage HLS tools to implement the TLP-LDPC architecture. Experimental results show that TLP-LDPC achieves 9.63 Gbps end-to-end decoding throughput on a Xilinx Alveo U50 platform, 3.9x higher than existing HLS-based FPGA implementations.

15.
Modeling a message-passing-based hard-decision decoding algorithm for LDPC codes (cited by 1: 0 self-citations, 1 external)
A hard-decision LDPC decoding scheme is proposed in which parity-check sums are passed as messages. Whether each parity-check equation satisfies its constraint is used to identify erroneous bits in the received block, and those bits are then flipped. The iterative message-flow mechanism and the iterative decoding process are analyzed, and a concrete, implementable decoding algorithm model is presented.
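A runnable toy version of the scheme's core loop, with parity-check sums as the passed messages: count the unsatisfied checks each bit participates in and flip the bits with the highest count. The matrix and received word are illustrative only:

```cuda
// Hard-decision bit flipping driven by parity-check sums.
#include <cstdio>

int main() {
    const int M = 3, N = 6;
    const int H[M][N] = { {1,1,0,1,0,0},    // toy parity-check matrix
                          {0,1,1,0,1,0},
                          {1,0,1,0,0,1} };
    int r[N] = {1,0,1,0,1,1};               // received hard decisions (with errors)
    for (int iter = 0; iter < 10; ++iter) {
        int syn[M], fail[N] = {0}, worst = 0;
        for (int m = 0; m < M; ++m) {       // message: each check's parity sum
            syn[m] = 0;
            for (int n = 0; n < N; ++n) syn[m] ^= H[m][n] & r[n];
        }
        bool all_ok = true;
        for (int m = 0; m < M; ++m) if (syn[m]) {
            all_ok = false;
            for (int n = 0; n < N; ++n) fail[n] += H[m][n];  // votes against bit n
        }
        if (all_ok) break;                  // valid codeword reached
        for (int n = 0; n < N; ++n) worst = fail[n] > worst ? fail[n] : worst;
        for (int n = 0; n < N; ++n) if (fail[n] == worst) r[n] ^= 1;  // flip
    }
    for (int n = 0; n < N; ++n) printf("%d", r[n]);
    printf("\n");
    return 0;
}
```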

16.
Turbo product codes (TPC) are a class of forward error-correction codes with good bit-error-rate performance at high code rates. The TPC encoder is relatively simple to implement, and the decoder's decoding complexity is also reasonable, so TPC is widely used in scenarios such as satellite communication systems and data storage systems. A GPU-based parallel TPC decoder is proposed that decodes all rows or columns of the two-dimensional product-code matrix simultaneously. A parallel elementary decoder is designed to simplify the decoding of TPCs constructed from extended Hamming codes, and the computation of test patterns and valid codewords is parallelized to reduce decoding latency. To further increase decoding throughput, a multi-channel TPC decoder is proposed. The parallel decoder's performance was measured on different GPUs. Experimental results show that, compared with a CPU-based TPC decoder, the GPU-based parallel TPC decoder reduces decoding latency significantly; its throughput reaches 30 Mbps on an NVIDIA RTX 2080 Ti and 38 Mbps on an NVIDIA GTX Titan V, 44x and 54x the performance of the CPU-based TPC decoder.
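Test-pattern computation in Chase-style decoding of the extended Hamming component codes parallelizes naturally, since the 2^p patterns are independent (one thread or block per pattern). A host-side sketch of pattern generation; p and the LLR values are made up for illustration:

```cuda
// Enumerate Chase test patterns over the p least-reliable bit positions.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> llr = {1.9f, -0.2f, 0.4f, -2.5f, 0.1f, 3.0f};
    const int p = 2;                                  // least-reliable positions
    std::vector<int> idx(llr.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
    std::partial_sort(idx.begin(), idx.begin() + p, idx.end(),
        [&](int a, int b){ return std::fabs(llr[a]) < std::fabs(llr[b]); });
    for (int mask = 0; mask < (1 << p); ++mask) {     // 2^p test patterns
        printf("pattern %d flips:", mask);
        for (int b = 0; b < p; ++b)
            if (mask & (1 << b)) printf(" bit %d", idx[b]);
        printf("\n");          // each pattern is decoded independently in parallel
    }
    return 0;
}
```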

17.
Using a combined software/hardware approach, this paper proposes an efficient, general-purpose QC-LDPC decoder architecture. The architecture can decode regular or irregular QC-LDPC codes of different code lengths, code rates, and parity-check-matrix structures; it supports the Min-Sum approximation and its improved decoding algorithms, and it can realize multiple message-passing scheduling strategies. Offloading part of the complex message updates to a hardware accelerator improves decoding throughput. Exploiting the quasi-cyclic structure of the QC-LDPC parity-check matrix, messages are stored and processed in blocks. The architecture can also process messages in parallel, with only a slight increase in decoder complexity.

18.
Building on a study of several weighted bit-flipping algorithms, a new improved weighted bit-flipping algorithm for LDPC codes is proposed. The error metric of the weighted bit-flipping (WBF) algorithm takes the reliability information of the check nodes into account; building on it, the improved WBF (IWBF) algorithm also considers the influence of the received symbol itself on the decision, further improving performance. In IWBF, however, the symbol-reliability weighting parameter that yields good decoding performance can only be found through simulation. An algorithm is proposed that considers both symbol reliability and check reliability and achieves good performance without tuning a weighting parameter. Simulations show that the proposed weighted bit-flipping algorithm is feasible and effective.
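For orientation, a sketch of a generic WBF flipping metric in which each check contributes its reliability with a sign that depends on whether it is satisfied; this shows the family of metrics involved, not the exact parameter-free variant proposed in the paper:

```cuda
// Pick the bit with the largest flipping metric under a generic WBF rule.
#include <vector>

int pick_flip_bit(const std::vector<std::vector<int>>& checks_of_bit,
                  const std::vector<int>& syndrome,       // 0 = satisfied
                  const std::vector<float>& check_rel)    // min |LLR| per check
{
    int best = 0; float best_e = -1e30f;
    for (size_t n = 0; n < checks_of_bit.size(); ++n) {
        float e = 0.0f;
        for (int m : checks_of_bit[n])                    // all checks on bit n
            e += (2 * syndrome[m] - 1) * check_rel[m];    // +rel if failed, -rel if ok
        if (e > best_e) { best_e = e; best = (int)n; }
    }
    return best;   // caller flips this hard decision and updates the syndrome
}
```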

19.
Branch‐and‐bound (B&B) algorithms are attractive methods for solving combinatorial optimization problems to optimality using an implicit enumeration of a dynamically built tree‐based search space. Nevertheless, they are time‐consuming when dealing with large problem instances. Therefore, pruning tree nodes (subproblems) is traditionally used as a powerful mechanism to reduce the size of the explored search space. Pruning requires the bounding operation, which applies a lower-bound function to the subproblems generated during the exploration process. Preliminary experiments performed on the flow-shop scheduling problem (FSP) show that the bounding operation consumes over 98% of the execution time of the B&B algorithm. In this paper, we investigate the use of graphics processing unit (GPU) computing as a major complementary way to speed up the search. We revisit the design and implementation of the parallel bounding model on GPU accelerators. The proposed approach enables data-access optimization. Extensive experiments were carried out on well‐known FSP benchmarks using an Nvidia Tesla C2050 GPU card. Compared with a single-core CPU execution on an Intel Core i7‐970 processor without a GPU, speedups of more than 100x are achieved for large problem instances. At equivalent peak performance, GPU‐accelerated B&B is twice as fast as its multi‐core counterpart. Copyright © 2013 John Wiley & Sons, Ltd.
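A sketch of the parallel bounding model: the pool of pending subproblems is shipped to the GPU, one thread evaluates the lower bound of one subproblem, and a prune flag is set when the bound cannot beat the incumbent. The lower_bound shown is a trivial placeholder, not an actual flow-shop bound:

```cuda
// Evaluate lower bounds of a subproblem pool in parallel and flag prunable ones.
__device__ float lower_bound(const int* perm, int depth) {
    float lb = 0.0f;
    for (int i = 0; i < depth; ++i) lb += (float)perm[i];  // placeholder cost
    return lb;
}

__global__ void bound_pool(const int* perms, const int* depths, int stride,
                           int num_sub, float incumbent, int* prune)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= num_sub) return;
    float lb = lower_bound(perms + t * stride, depths[t]);
    prune[t] = (lb >= incumbent);   // cannot improve the best known solution
}
```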

20.
In the H.264/AVC standard, the context-based adaptive variable-length coding (CAVLC) decoding algorithm has high complexity. A new entropy decoder based on the entropy decoding algorithm is therefore proposed: it introduces parallel processing into the entropy decoding of the compressed video stream and improves the binary-tree method. Experiments using waveform simulation in Quartus II 7.2 and an FPGA hardware implementation show that the entropy decoder performs well in terms of both hardware-resource savings and decoding speed.
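The binary-tree method referenced above walks the code tree one bit at a time, with leaves holding decoded symbols; a runnable toy version with a three-symbol prefix code (CAVLC's actual tables are far larger and context-dependent):

```cuda
// Bit-serial VLC decoding over a binary code tree.
#include <cstdio>

struct Node { int sym; int child[2]; };       // sym >= 0 marks a leaf

int main() {
    // Toy prefix code: 0 -> A(0), 10 -> B(1), 11 -> C(2)
    Node tree[] = { {-1, {1, 2}},    // 0: root
                    { 0, {-1,-1}},   // 1: leaf A
                    {-1, {3, 4}},    // 2: internal node
                    { 1, {-1,-1}},   // 3: leaf B
                    { 2, {-1,-1}} }; // 4: leaf C
    const int bits[] = {1, 0, 0, 1, 1};       // bitstream: B, A, C
    int node = 0;
    for (int b : bits) {
        node = tree[node].child[b];
        if (tree[node].sym >= 0) {            // reached a leaf: emit symbol
            printf("symbol %d\n", tree[node].sym);
            node = 0;                         // restart at the root
        }
    }
    return 0;
}
```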
