首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 343 毫秒
1.
叶笑春  林伟  范东睿  张浩 《软件学报》2010,21(12):3094-3105
在生物信息学中,蛋白质序列比对是最为重要的算法之一,生物技术的发展使得已知的序列库变得越来越庞大,这类算法本身又具有计算密集型的特点,这导致进行序列比对所消耗的时间也越来越长,目前的单核或者数量较少的多核系统均已经难以满足对计算速度的要求.Godson-T是一个包含诸多创新结构的众核平台,在该系统上实现了对一种蛋白质序列比对算法的并行化,并且结合蛋白质比对算法以及Godson-T结构的特征,针对同步开销、存储访问竞争以及负载均衡3个方面对算法进行了细致的优化,最终并行部分整体也获得了更优的、接近线性的加速比,并且实际性能远远优于基于AMD Opteron处理器的工作站平台.  相似文献   

2.
Godson-T是中国科学院计算技术研究所计算机系统结构重点实验室先进微系统组正在研制开发的适合于超深亚微米工艺实现的大规模片上众核系统.Godson-T片上存储的单端口结构节省了芯片面积但制约了共享数据的读取效率.直接在Godson-T上实现传统的Broadcast算法需要大量的同步互斥开销,无法达到很好的性能提升.基于Godson-T体系结构,对数据共享的重要并行算法Broadcast进行优化,提高了Godson-T体系结构下的数据共读的效率.主要采取了以下3项技术:消除大规模的线程同步,建立源地址到目的地址的映射表和用汇编语言实现Broadcast的核心部分.优化后Broadcast在小核数为32时即可达到5.8倍加速比.  相似文献   

3.
数据流编程模型将程序设计与媒体处理相结合,已大量应用到各个领域.众核处理器已经成为主流和工业标准,如何利用众核架构的特性来提高流应用执行性能已成为目前研究工作的一大难点.文中提出了一个高效的流编译框架来优化流应用的执行,该框架包含3个优化策略:设计一个最优的软件流水调度方法;提出一个高效的数据存储分配算法;并采用合理的众核间的映射策略,减小通信以及同步的开销.文中在Godson-T上实现了该编译器框架,实验结果表明,该方法比优化前有较大性能改进.  相似文献   

4.
处理器阵列的容错重构技术是片上网络多核、众核高性能体系结构的可靠性技术之一。现有的最大逻辑阵列并行重构技术仅对单条逻辑列的构造实现了并行化,而对多条逻辑列的同步并行仍未见可行算法。依据处理器阵列的潜在并行性,在分治策略的基础上,提出了一种阵列分块的并行重构算法。算法对处理器阵列实施横向分块划分,对每个阵列块进行并行重构,并对所得逻辑子阵列进行归并,实现了多条逻辑列的同步并行重构。与现有的并行算法相比,新算法同样能够生成最大逻辑列,并且减少了通信开销与计算中的数据冗余,有效提高了运行速度。实验结果表明,在物理阵列大小为64×64的处理器阵列上,运行速度比现有并行算法提高39.55%,并且具有良好的可扩展性。  相似文献   

5.
针对CESM中的有限差分算法并行过程中存在内存读取冗余过大、通信开销过高的问题,设计出根据数据结构进行数据重构、计算核心捆绑、流水线通信等多种并行优化方案。弥补了申威26010处理器在数据读取过程中缺少共享缓存区、带宽利用率不高等不足,缓解了申威26010处理器在有限差分法求解过程的通信瓶颈。对CESM中以有限差分法为核心计算的两个函数,在申威26010众核处理器上的测试结果表明,提出算法及优化策略拥有21.2倍的性能提升。  相似文献   

6.
基于软硬件的协同支持在众核上对1-DFFT算法的优化研究   总被引:2,自引:0,他引:2  
随着高性能计算需求的日益增加,片上众核(many-core)处理器成为未来处理器架构的发展方向.快速傅立叶变换(FFT)作为高性能计算中的重要应用,对计算能力和通信带宽都有较高的要求.因此基于众核处理器平台,实现高效、可扩展的FFT算法是算法和体系结构设计者共同面临的挑战.文中在众核处理器Godson-T平台上对1-D FFT算法进行了优化和评估,在节省几乎三分之一L2 Cache存储开销的情况下,通过隐藏矩阵转置,计算与通信重叠等优化策略,使得优化后的1-D FFT算法达到3倍以上的性能提升.并通过片上网络拥塞状况的实验分析,发现对于像FFT这样访存带宽受限的应用,增加L2 Cache的访问带宽,可以缓解因为爆发式读写带给片上网络和L2 Cache的压力,进一步提高程序的性能和扩展性.  相似文献   

7.
大整数运算广泛地应用于公钥加密算法、大规模科学计算中高精度浮点数运算类以及构建大特征值等领域,然而其大部分算法空间和时间开销都很大,尤其对于核心运算之一的大整数乘法,当数据达到一定规模时,超长的串行计算时间已成为制约算法应用的巨大瓶颈.近几年来,伴随着多核、众核芯片的迅猛发展,通过充分挖掘算法本身的并行度以利用并行处理器的强大计算能力,进而高效地提升算法性能,成为一种研究趋势.本文基于通用多核并行计算平台,研究了大整数乘法Comba及Karatsuba快速算法的并行化,提出了高效的多核并行算法.在算法实现及性能优化上,采用了OpenMP+SIMD的多级并行技术,使性能获得巨大提升.在性能测试上,我们使用优化的并行算法与原始串行算法进行对比试验,结果显示,8线程并行Comba算法和Karatsuba算法相比串行对应算法分别实现了5.85倍以及6.14倍的性能加速比提升.  相似文献   

8.
众核处理器片上同步机制和评估方法研究   总被引:1,自引:0,他引:1  
同步机制是片上多核/众核处理器正确执行和协同通信的关键,其效率对处理器的性能非常重要.针对片上众核体系结构,提出并实现了两种粗粒度同步机制和一种细粒度同步机制,即片上专用硬件支持的同步机制、基于原语的片上互斥访问同步机制和基于满空标志位的细粒度同步机制;提出了粗粒度同步机制的评估标准和评估方法,并设计了量化评估程序.以片上同构众核处理器Godson-T模拟器和AMD Opteron商业片上多核处理器为平台,评估比较了提出的硬件支持的同步机制与基于原语的同步机制的性能.结果表明,硬件支持可以使得片上众核处理器的同步机制性能明显提高;在传统基于原语的同步机制中,大部分性能损失是由于负载不平衡和同步点的串行化操作而造成的等待时间.  相似文献   

9.
同步操作在保证多核处理器线程的数据一致性和正确性等方面起着重要作用。随着处理器内核数量的不断增加,同步操作的开销也越来越大。栅栏同步是并行应用中多核同步的重要方法之一。软件同步方法通常需要数千个周期才能完成多个内核之间的同步,这种高延迟和串行化同步会导致多核程序性能的显著下降。相比于软件栅栏同步方法,硬件栅栏能够实现较低的同步延迟,然而传统集中式硬件栅栏的可扩展性有限,难以适应众核处理器系统的同步需求。面向众核处理器提出了一种层次化硬件栅栏机制——HSync,它由本地栅栏单元和全局栅栏单元组成,二者协调配合,以实现低硬件开销的快速同步。实验结果表明,与传统的集中式硬件栅栏相比,层次化硬件栅栏机制将众核处理器系统性能提高了1.13倍,同时网络流量减少了74%。  相似文献   

10.
在国产异构众核平台神威·太湖之光上的非结构网格计算具有稀疏存储、离散访存、数据依赖等特点,严重制约了众核处理器的性能发挥。为解决稀疏存储和离散访存问题,提出一种N阶对角染色算法,以有效平衡主从核计算并利用从核将全局访存转化为LDM访问。针对数据依赖造成的计算竞争问题,采用自适应和无依赖的任务划分方法,避免并行计算时的数据冲突。为对处理器架构和非结构网格计算进行优化,采用主核与从核异步并行的方式,差异化使用主从核以充分利用硬件资源,同时,取消处理器提供的寄存器通信机制,降低从核阵列的同步开销同时便于扩展到新一代神威平台。此外,使用计算访存异步重叠技术来充分隐藏访存延迟。利用SpMV、Integration、calcLudsFcc算子进行实验,结果表明,相比主核实现,组合加速算法在不同算例规模下平均取得了10倍的加速效果,加速比最高可达24倍,N阶对角染色算法相比非染色分块算法取得了超过5.8倍的性能加速,有效提升了数据局部性和计算并行度。该算法对有依赖关系的计算冲突算子同样具有良好的加速性能,验证了自适应和无依赖任务划分方法的有效性。  相似文献   

11.
针对H.264视频压缩编码标准中去块效应滤波器部分提出了一种基于YHFT Matrix DSP的并行设计及向量实现方法。重点对H.264协议中去块效应滤波器进行理论分析,并利用向量数据访问单元、向量处理单元、高效的混洗单元和灵活的矩阵对其进行并行算法设计。将去块滤波算法分别映射到YHFT Matrix和TI的TMS320C6415中,通过统计两者性能,表明YHFT Matrix的性能优于TMS320C6415。  相似文献   

12.
This paper proposes a parallelization method for a deblocking filter in a high-efficiency video coding (HEVC) decoder based on complexity estimation. A deblocking filter of HEVC is generally considered to be appropriate for data-level parallelism (DLP) because there are no data-level dependencies among adjacent blocks in the horizontal and vertical filtering processes. However, an imbalanced workload can increase the idle time on some of the threads, and thus, the maximum parallel performance cannot be achieved with only DLP. To alleviate this problem, the proposed method estimates the computational complexity by utilizing the coding unit (CU) segment information and the on/off flag of the deblocking filter in advance. Then, the workload is distributed equally across all threads. The experimental results indicate that the proposed method can accelerate the decoding speed by a factor of 3.97, with six threads on top of the sequential deblocking filtering. In addition, the ratio of the maximum elapsed time to ideal elapsed time is reduced to approximately 21 %, as compared to conventional DLP-based parallel deblocking filtering methods that do not implement a complexity estimation method.  相似文献   

13.
The high parallelism of future Teradevices, which are going to contain more than 1,000 complex cores on a single die, requests new execution paradigms. Coarse-grained dataflow execution models are able to exploit such parallelism, since they combine side-effect free execution and reduced synchronization overhead. However, the terascale transistor integration of such future chips make them orders of magnitude more vulnerable to voltage fluctuation, radiation, and process variations. This means dynamic fault-tolerance mechanisms have to be an essential part of such future system. In this paper, we present a fault tolerant architecture for a coarse-grained dataflow system, leveraging the inherent features of the dataflow execution model. In detail, we provide methods to dynamically detect and manage permanent, intermittent, and transient faults during runtime. Furthermore, we exploit the dataflow execution model for a thread-level recovery scheme. Our results showed that redundant execution of dataflow threads can efficiently make use of underutilized resources in a multi-core, while the overhead in a fully utilized system stays reasonable. Moreover, thread-level recovery suffered from moderate overhead, even in the case of high fault rates.  相似文献   

14.
针对低码率的情况下效应比较严重的情况,提出在低码率下对环路滤波的一种改进的方法。对于空间基本层的环路滤波,采取与H.264/AVC兼容的环路滤波算法。但是对于空间增强层,特别是在低码率的情况下,加入一个对图像边界判断的算法,使其在滤波的时候,不会滤掉真实边界。与原来的算法相比,在低码率的情况下,码率有所降低,而且PSNR值也有增加,主观视觉上,也比原来的算法有提高。  相似文献   

15.
Examines the generation of parallel evaluators for attribute grammars, targeted to shared-memory MIMD computers. Evaluation-time overhead due to process scheduling and synchronization is reduced by detecting coarse-grain parallelism (as opposed to the naive one-process-per-node approach). As a means to more clearly expose inherent parallelism, it is shown how to automatically transform productions of the form XY X into list-productions of the form XY+. This transformation allows for many simplifications to be applied to the semantic rules, which can expose a significant degree of inherent parallelism, and thus further increase the evaluator's performance. Effectively, this constitutes an extension of the concept of attribute grammars to the level of abstract syntax  相似文献   

16.
Block synchronization is an essential component of blockchain systems. Traditionally, blockchain systems tend to send all the transactions from one node to another for synchronization. However, such a method may lead to an extremely high network bandwidth overhead and significant transmission latency. It is crucial to speed up such a block synchronization process and save bandwidth consumption. A feasible solution is to reduce the amount of data transmission in the block synchronization process between any pair of peers. However, existing methods based on the Bloom filter or its variants still suffer from multiple roundtrips of communications and significant synchronization delay. In this paper, we propose a novel protocol named Gauze for fast block synchronization. It utilizes the Cuckoo filter (CF) to discern the transactions in the receiver’s mempool and the block to verify, providing an efficient solution to the problem of set reconciliation in the P2P (Peer-to-Peer Network) network. By up to two rounds of exchanging and querying the CFs, the sending node can acknowledge whether the transactions in a block are contained by the receiver’s mempool or not. Based on this message, the sender only needs to transfer the missed transactions to the receiver, which speeds up the block synchronization and saves precious bandwidth resources. The evaluation results show that Gauze outperforms existing methods in terms of the average processing latency (about 10× lower than Graphene) and the total synchronization space cost (about 10× lower than Compact Blocks) in different scenarios.  相似文献   

17.
王超  胡福乔 《微计算机信息》2007,23(15):283-284
本文介绍了新一代视频编码压缩标准H.264中的去块滤波器的滤波过程,对滤波的复杂度进行了详细分析,并提出了一种新的多档次的去块滤波的方法。试验证明H.264中的去块滤波器在主观及客观图像质量上都获得了提高,新提出的滤波方法较好的解决了图像质量和滤波消耗时间上的平衡性问题。  相似文献   

18.
Exploiting the parallelism available in loops   总被引:1,自引:0,他引:1  
Lilja  D.J. 《Computer》1994,27(2):13-26
Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism. This article explains scheduling techniques and compares results on different architectures. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio to determine a loop's maximum possible parallelism, given infinite hardware. Then we look at specific architectures and techniques. Loops can provide a large portion of the parallelism available in an application program, since the iterations of a loop may be executed many times. To exploit this parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. The choice of technique depends on the underlying architecture of the parallel machine and the characteristics of each individual loop  相似文献   

19.
20.
基于FPGA的H.264去块滤波系统的优化设计   总被引:1,自引:0,他引:1       下载免费PDF全文
欧阳剑  杜学亮 《计算机工程》2008,34(12):239-241
提出一种H.264去块滤波系统的优化设计方法。通过合理设计流水线级数提高并行性,适当增加内部SRAM来提高系统速度和总线利用率,使用一种层次化的有限状态机设计方法,实现对数据流的精确控制并且有效降低硬件实现复杂度。基于FPGA的验证结果显示在最坏情况下滤波每个宏块平均只需220个时钟,比原有方案快10个时钟以上。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号