期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

彭自然王国军《计算机工程与应用》2017,53(4):33-38

快速小波变换是数字信号处理面临的一个重要问题,针对并行小波算法展开研究,缩减小波变换中卷积运算的规模,提高小波变换过程中的并行效能,以实现小波变换的快速并行计算。通过FFT矩阵代入计算,消去了并行计算过程中的同步通信,降低了乘法运算次数。对算法思想进行了理论分析,说明新算法在短小数据分段情况下能够减少50%~75%的乘法操作;通过搭建两种不同平台进行了对比测试,证明了算法的先进性与有效性。基于FFT矩阵的并行小波变换算法是一种稳定有效的经典小波并行算法。相似文献

2.

GRAPES动力框架中大规模稀疏线性系统并行求解及优化

张琨贾金芳严文昕黄建强王晓英《计算机工程》2022,48(1):149-154+162

赫姆霍兹方程求解是GRAPES数值天气预报系统动力框架中的核心部分,可转换为大规模稀疏线性系统的求解问题,但受限于硬件资源和数据规模,其求解效率成为限制系统计算性能提升的瓶颈。分别通过MPI、MPI+OpenMP、CUDA三种并行方式实现求解大规模稀疏线性方程组的广义共轭余差法,并利用不完全分解LU预处理子（ILU）优化系数矩阵的条件数,加快迭代法收敛。在CPU并行方案中,MPI负责进程间粗粒度并行和通信,OpenMP结合共享内存实现进程内部的细粒度并行,而在GPU并行方案中,CUDA模型采用数据传输、访存合并及共享存储器方面的优化措施。实验结果表明,通过预处理优化减少迭代次数对计算性能提升明显,MPI+OpenMP混合并行优化较MPI并行优化性能提高约35%,CUDA并行优化较MPI+OpenMP混合并行优化性能提高约50%,优化性能最佳。相似文献

3.

一种在多核嵌入式平台上实现FFT的快速并行算法

彭自然王国军《计算机应用研究》2017,34(11)

快速傅利叶变换（fast Fourier transform,FFT）算法是对实时数字信号进行快速分析处理的一个基本方法。针对多核嵌入式实时环境下并行FFT算法进行了研究,以有效提高实时信号处理的速度。提出了一种新的静态多项式FFT算法,充分利用静态多项式奇偶项的不同特点直接代入数据计算,免去了层层迭代的计算过程,减少了运算过程中的通信提高并行性能。对该算法思想本文在理论进行了严密论证,通过嵌入式实时平台上运行测试和仿真实验,证实了在数据分段较短的约束条件下,该多项式静态算法较经典的FFT并行算法在时间复杂度上有一定优势。本文结论:多项式静态FFT算法能够有效提高并行FFT运行速度。相似文献

4.

迭代方法中基于渐近规模的通信与计算比分析

徐小文莫则尧武林平《计算机学报》2013,36(4)

迭代方法是科学计算中求解大规模稀疏线性代数方程组最常用的方法.大量实际应用表明,迭代方法通常具有较高的通信与计算比,只有在粗粒度并行下才能取得较好的并行可扩展性能.而实际应用大规模计算的需求和当前多核/众核体系结构的发展趋势要求迭代方法具备细粒度并行可扩展能力.文中引入渐近规模,即满足加速条件的计算规模下界,来反映并行迭代方法适应细粒度并行的能力,并由此刻画通信与计算比.基于矩阵的稀疏模式及其通信模式、机器的通信参数和迭代方法的基本运算,给出了渐近规模的理论预测公式.在一台包含128个双路4核计算节点的并行机上,分别基于纯进程并行(MPI)和进程/线程混合并行(MPI/OpenMP),以实际应用中3种常用迭代方法Jacobi、CG、BiCGSTAB为例,分析其渐近规模.并行可扩展性测试表明了渐近规模用于刻画迭代方法通信与计算比的准确性.对于纯进程情形,给出了渐近规模的理论预测与实际测试的对比,表明了理论预测结果的正确性.最后,基于这些结果,从迭代方法的算法设计和并行实现等方面讨论了面向未来更大规模的计算系统,降低通信与计算比的途径. 相似文献

5.

分支定界算法的分布并行化研究

李一明李毅周明天《计算机应用》2006,26(3):723-0726

介绍了一种专用于计算分支定界算法的机群计算平台，其中所使用的分布并行策略减少了分支定界算法计算时间复杂度，减小了问题的规模；可以把计算平台机群中的任何一台计算机上计算出的当前全局最佳本分值，实时地广播给所有其他并行的计算机，并作为它们新的最佳本分值，实现分支节点的快速并行淘汰；应用启发式算法修改了分支定界算法，提高了分支节点的淘汰效率。选用旅行商问题实例作为测试基准。计算表明，在保证求得最优解的前提下，该平台能很好地提高分支定界算法的效率。相似文献

6.

改进的RSA算法在无线传感器网络中的应用

杜治国胡大辉《计算机应用》2012,32(6):1609-1612

针对公钥密码体制在无线传感器网络密钥管理中存在计算速度慢、能量消耗大等问题,提出将一种改进的公钥算法应用其中。新算法利用蒙哥马利算法把大数的幂模运算转换成模幂运算,并使用中国剩余定理把模幂运算转换成求解同余方程组。算法安全性分析与实验结果表明,新算法能减少55%的运算开销,减少67%的存储空间占用,并增加21%的节点生命周期。新算法在保证密钥安全性的同时减少了运算量和存储空间,更加适合节点运算能力较低且能量有限的无线传感器网络。相似文献

7.

基于GPU的大规模拓扑优化问题并行计算方法

韩琪蔡勇《计算机仿真》2015,32(4):221-226,304

针对进行大规模拓扑优化问题计算量庞大且计算效率低的问题,设计并实现了一种基于图形处理器(GPU)的并行拓扑优化方法.采用双向渐进结构拓扑优化(BESO)为基础优化算法,采用一种基于节点计算的共轭梯度求解方法用于有限元方程组求解.通过对原串行算法的研究,并结合GPU的计算特点,实现了迭代过程全流程的并行计算.上述方法的程序设计和编写采用统一计算架构(CUDA),提出了基于单元和基于节点的两种并行策略.编写程序时充分使用CUDA自带的各种数学运算库,保证了程序的稳定性和易用性.数值算例证明,并行计算方法稳定并且高效,在优化结果一致的前提下,采用GTX580显卡可以取得巨大的计算加速比. 相似文献

8.

一个改进的并行关联规则算法研究

陈涛张玮《微机发展》2007,17(1):139-141

在研究关联规则挖掘算法的基础上,对并行关联规则算法进行了比较全面的分析,并给出了并行数据挖掘的计算框架。提出了一个以计算服务器为中心节点的并行挖掘算法,可以发挥各局部节点的优势,无需各局部节点进行通信,减少了各局部节点的通信负荷。通过理论分析和实验数据验证,该算法具有较好的可扩展性和海量处理能力,特别是在节点数目较多的情况下更显示出优势。相似文献

9.

蚁群算法解决TSP问题的并行化研究与实现 总被引：3，自引：2，他引：1

张军刘羽卢奉良《计算机技术与发展》2011,21(5)

蚁群算法在处理大规模TSP(Traveling Salesman Problem)问题时耗时较长,为了解决这一不足,给出一种基于多核环境下的并行优化算法.采用OpenMp并行优化技术对蚁群算法中最为耗时的循环迭代和循环赋值部分进行改进,减少其运算时间,同时利用粗粒度并行策略和PC机多核的优势将具有一定规模的小蚁群分配到对应的处理器上,使其并行执行,并且在适当时机让各处理器上的蚁群进行相互间的通信.通过实验证明,改进后的并行蚁群算法程序执行时间明显缩短,执行效率显著提高.由此可见,改进后的并行蚁群算法是可行有效的. 相似文献

10.

一个改进的并行关联规则算法研究

陈涛张玮《计算机技术与发展》2007,17(1):139-141

在研究关联规则挖掘算法的基础上，对并行关联规则算法进行了比较全面的分析．井给出了并行数据挖掘的计算框架。提出了一个以计算服务器为中心节点的并行挖掘算法，可以发挥各局部节点的优势，无需各局部节点进行通信，减少了各局部节点的通信负荷。通过理论分析和实验数据验证，该算法具有较好的可扩展性和海量处理能力，特别是在节点数目较多的情况下更显示出优势。相似文献

11.

自适应谱聚类算法并行实现及优化

苏琳赵永华李瑞琳《数据与计算发展前沿》2016,7(6):44-53

谱聚类算法是基于谱图分割理论的聚类方法,其对高维、非凸数据分布问题有很好的聚类效果。但对大规模数据问题的聚类,该方法存在着计算时间和存储空间等方面的瓶颈。本文给出了一个自适应的谱聚类并行算法,通过局部计算和异步循环通信并行方法,最大限度减少了并行谱聚类中数据通信次数,并通过计算与通信重叠策略,进一步降低了并行算法的通信开销。在并行算法实现中,将自主开发的最优预条件共轭梯度法并行求解器 PLOBPCG 用于谱聚类的特征降维。在中科院的“元”超级计算机上,通过对两类大规模数据聚类的测试表明,在 2048 核上的加速比接近线性加速,并行效率达到96%以上。相似文献

12.

KD60集群消息传递接口群集通信算法优化

郑启龙汪睿周寰《计算机应用》2011,31(6):1453-1457

大规模集群已经发展到多核的时代,多核架构对并行计算提出了新的要求。消息传递接口(MPI)是最常用的并行编程模型,而群集通信又是MPI中的重要组成部分。研究高效的群集通信算法对并行计算效率的提升有着重要的作用。KD60平台是采用首款国产多核芯片——龙芯3号搭建的国产万亿次多核集群。首先分析了KD60平台多核集群的体系特征以及多核架构下通信具有的层次性特征;然后分析原有群集通信算法实现原理及其不足;最后以广播为例,在原有算法基础上,采用一种基于片上多核(CMP)架构改进算法,改变原有算法通信模式,同时结合实验平台KD60体系特征,对算法做了体系相关优化。实验结果表明,改进算法能够很好地利用多核结构的特点,提高了群集通信广播算法的性能。相似文献

13.

Parallel MCGLS and ICGLS Methods for Least Squares Problems on Distributed Memory Architectures

Yang Laurence Tianruo Brent Richard P. 《The Journal of supercomputing》2004,29(2):145-156

In this paper we mainly study the parallelization of the CGLS method, a basic iterative method for large and sparse least squares problems in which the conjugate gradient method is applied to solve normal equations. On modern parallel architectures its parallel performance is always limited because of the global communication required for inner products, the main bottleneck of parallel performance. In this paper, we describe a modified CGLS (MCGLS) method which improve parallel performance by assembling the results of a number of inner products collectively and by creating situations where communication can be overlapped with computation. More importantly, we also propose an improved CGLS (ICGLS) method to reduce inner product's global synchronization points to half, then significantly improve the parallel performance accordingly compared with the standard CGLS method and the MCGLS method. 相似文献

14.

工作站网络下三对角方程组并行求解

侯英李晓梅《计算机工程与设计》1998,19(1):52-58

考虑工作站网络（ＮＯＷｓ）中三对角线性方程组的并行求解，基于最小秩解耦算法与分布治之并行计算模式，并行最小秩解耦算法（ＰＭＲＤ）。它在计算过程中保持原矩阵的结构特征，数值稳定性高，本文给出算法的数值特征分析以及计算与通讯复杂性分析并与Ｍｅｈｒｍａｎｎ分治算比较，所有算法由ＰＶＭ软件系统实现并在工作站网络中测试。相似文献

15.

Communication latency tolerant parallel algorithm for particle swarm optimization 总被引：1，自引：0，他引：1

Bo Li Koichi Wada 《Parallel Computing》2011,37(1):1-10

Particle swarm optimization (PSO) algorithm is a population-based algorithm for finding the optimal solution. Because of its simplicity in implementation and fewer adjustable parameters compared to the other global optimization algorithms, PSO is gaining attention in solving complex and large scale problems. However, PSO often requires long execution time to solve those problems. This paper proposes a parallel PSO algorithm, called delayed exchange parallelization, which improves performance of PSO on distributed environment by hiding communication latency efficiently. By overlapping communication with computation, the proposed algorithm extracts parallelism inherent in PSO. The performance of our proposed parallel PSO algorithm was evaluated using several applications. The results of evaluation showed that the proposed parallel algorithm drastically improved the performance of PSO, especially in high-latency network environment. 相似文献

16.

Tuning the granularity of parallelism for distributed graph processing

Xinyuan Luo Sai Wu Wei Wang Lidan Shou 《Distributed and Parallel Databases》2017,35(2):117-148

Popular distributed graph processing frameworks, such as Pregel and GraphLab, are based on the vertex-centric computation model, where users write their customized Compute function for each vertex to process the data iteratively. Vertices are evenly partitioned among the compute nodes, and the granularity of parallelism of the graph algorithm is normally tuned by adjusting the number of compute nodes. Vertex-centric model splits the computation into phases. Inside one specific phase, the computation proceeds as an embarrassingly parallel process, because no communication between compute nodes incurs. By default, current graph engine only handles one iteration of the algorithm in a phase. However, in this paper, we find that we can also tune the granularity of parallelism, by aggregating the computation of multiple iterations into one phase, which has a significant impact on the performance of the graph algorithm. In the ideal case, if all computations are handled in one phase, the whole algorithm turns into an embarrassingly parallel algorithm and the benefit of parallelism is maximized. Based on this observation, we propose two approaches, a function-based approach and a parameter-based approach, to automatically transform a Pregel algorithm into a new one with tunable granularity of parallelism. We study the cost of such transformation and the trade-off between the granularity of parallelism and the performance. We provide a new direction to tune the performance of parallel algorithms. Finally, the approaches are implemented in our graph processing system, N2, and we illustrate their performance using popular graph algorithms. 相似文献

17.

核辐照损伤金属材料的大规模KMC模拟

宋梦召冯仰德《计算机工程与科学》2017,39(7):1211-1218

动力学蒙特卡洛方法可用来模拟核反应堆第一壁材料的辐射效应和缺陷扩散,有助于理解和预测材料在辐照损伤下的微观性质和宏观变化。采用同步子域方法实现了空位跃迁过程的并行模拟。通过采用动态更新通信数据和自适应同步时间步长方法,减少通信次数和通信量,在保证准确性的情况下获得了较好并行性能。实验表明,基于同步子域的串行计算比原始串行算法时间缩短60.31%,并行算法在80核时达到39倍加速比。对于大规模问题,算法也表现出很好的并行效率,适用于大规模问题的模拟。相似文献

18.

Large-scale convex optimal control problems: time decomposition,incentive coordination, and parallel algorithm

Chang T.-S. Jin X.-X. Luh P.B. Miao X. 《Automatic Control, IEEE Transactions on》1990,35(1):108-114

A parallel algorithm based on time decomposition and incentive coordination is developed for long-horizon optimal control problems. This is done by first decomposing the original problem into subproblems with shorter time horizon, and then using the incentive coordination scheme to coordinate the interaction of subproblems. For strictly convex problems it is proved that the decomposed problem with linear incentive coordination is equivalent to the original problem, in the sense that each optimal solution of the decomposed problem produces one global optimal solution of the original problem and vice versa. In other words, linear incentive terms are sufficient in this case and impose no additional computation burden on the subproblems. The high-level parameter optimization problem is shown to be nonconvex, despite the uniqueness of the optimal solution and the convexity of the original problem. Nevertheless, the high-level problem has no local minimum, even though it is nonconvex. A parallel algorithm based on a prediction method is developed, and a numerical example is used to demonstrate the feasibility of the approach 相似文献

19.

Conjugate residual squared method and its improvement for non-symmetric linear systems

《国际计算机数学杂志》2012,89(7):1578-1590

In this paper, conjugate residual squared (CRS) method for solving linear systems with non-symmetric coefficient matrices is proposed. Moreover, based on the ideas by Gu et al. [An improved bi-conjugate residual algorithm suitable for distributed parallel computing, Appl. Math. Comput. 186 (2007), pp. 1243–1253], we present an improved conjugate residual squared (ICRS) method, which is designed for distributed parallel environments. The improved method reduces two global synchronization points to one by changing the computation sequence in the CRS method and all inner products per iteration are independent, and communication time required for inner product can be overlapped with useful computation. Theoretical analysis shows that the ICRS method has better parallelism and scalability than the CRS method. Finally, some numerical experiments clearly show that the ICRS method can achieve better parallel performance with a higher scalability than the CRS method, and also the improvement percentage of communication is up to 47.33%, which meets our theoretical analysis. 相似文献

20.

分布存储系统中优化通信的冗余计算分割 总被引：1，自引：0，他引：1

陈莉张兆庆冯晓兵《计算机学报》2003,26(2):180-187

针对并行循环套序列，提出一种冗余计算分割的通信优化方法，根据数据流分析，文中给出用以确定每个循环套的冗余计算量的一般方法，并在此基础上提出冗余计算分割的实现和判定，针对规则依赖的程序，该文还提出了一个高效的冗余计算分割的实现方法，该技术已经在一个并行编译器中实现，试验结果表明，它比传统的通信优化技术有明显的优越性。相似文献