期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

刘晓坤王建军路民旭张大鸣《数值计算与计算机应用》1995,(1)

边界元法中非对称满系数矩阵方程组的分块求解法刘晓坤，王建军（西安石油学院）路民旭，张大鸣（西北工业大学）ＡＢＬＯＣＫＥＱＵＡＴＩＯＮＳＯＬＵＴＩＯＮＴＥＣＨＮＩＱＵＥＡＮＤＩＴＳＰＲＯＧＲＡＭＦＯＲＵＮＳＹＭＭＥＴＲＩＣＤＥＮＳＥＭＡＴＲＩＸＩＮＴＨ... 相似文献

2.

基于类比的学习式搜索算法AMO.GLSA 总被引：1，自引：0，他引：1

蒋建东陆玮琳《计算机应用与软件》1997,14(3):52-58

本文首先给出了学习式搜索的一个问题模型，然后（５）中ＧＬＳ搜索解题系统的基础上，本文描述了一个多目标学习搜索算法ＭＯ．ＧＬＳＡ，并对该算法作出一性能评价，最后，文中给出了一个基于类比的学习搜索算法ＡＭＯ．ＧＬＳＡ。相似文献

3.

UNIX环境中光标键的获取算法及虚拟键盘的一种实现

阳道善韩世强《小型微型计算机系统》1994,(6)

本文在简要介绍ＴＥＲＭＣＡＰ数据库后，给出在ＵＮＩＸ终端原始工作模式下获取光标键的一种算法。受这个算法的启发，文中又给出虚拟键盘的一种实现方法。这个实现较好地体现了ＵＮＩＸ系统Ｖ中ＳＴＲＥＡＭＳ的精神。相似文献

4.

不连续生产系统的最大加工能力与最优生产安排的强多项式算法

杨承恩梁枢里《数值计算与计算机应用》1996,(1)

不连续生产系统的最大加工能力与最优生产安排的强多项式算法杨承恩，梁枢里（长沙铁道学院）ＴＨＥＭＡＸＩＭＵＭＰＲＯＣＥＳＳＩＮＧＣＡＰＡＣＩＴＹＡＮＤＯＰＴＩＭＡＬＳＣＨＥＤＵＬＥＯＦＡＤＩＳＣＯＮＴＩＮＵＯＵＳＰＲＯＤＵＣＴＩＯＮＳＹＳＴＥＭ￥Ｙａｎ... 相似文献

5.

IMSA110的基本结构及其在图像处理中的应用

李吉成王平沈振康王鲁平《微处理机》2000,(2):22-25

介绍了ＩＭＳＡ１１０的结构和原理。并在此基础上针对图像自动目标识别系统中低层处理算法的需要,了多个ＩＭＳＡ１１０的级联技术,文中最后给出了基于ＩＭＳＡ１１０的图像预处理单元结构框图并验证了该处理单元的图像处理性能。相似文献

6.

一个采用混合式监控策略的网络管理平台

徐波熊萍《计算机研究与发展》1998,35(12):1101-1106

文中针对网络管理中普遍存在的能力与开销间的矛盾，在分析比较目前常用的两类网络监控策略的基础上，提出了混合式监控策略，介绍了一个采用混合式监控策略的网管平台－ＬＡＮ－ＭＡＮ。该平台遵循现行工业标准ＳＮＭＰ所规定的管理模型、服务和协议，同时提供对ＲＭＯＮ远程监控标准的支持，并给出ＬＡＮＭＡＮ在监控性能方面的实验结果，以及与国外现有产品的对照。相似文献

7.

微型计算机与SIMATIC U系列PC通信

王福厚《微计算机应用》1995,16(3):29-32

生产过程控制和显示要求微型计算机与可编程控制器通信，文中对微型计算机和ＳＩＭＡＴＩＣＵ系列可编程控制器通信的协议、发送作业及接收作业作了详细论述，给出的通信程序可直接在应用程序中应用。相似文献

8.

实对称矩阵最大特征值极小化问题的一个BT型ε－次梯度算法

叶东毅《数值计算与计算机应用》1996,(4)

实对称矩阵最大特征值极小化问题的一个ＢＴ型ε－次梯度算法叶东毅（福州大学计算机科学系）ＡＢＴＴＹＰＥε－ＳＵＢＧＲＡＤＩＥＮＴＡＬＧＯＲＩＴＨＭＦＯＲＭＩＮＩＭＩＺＩＮＧＴＨＥＧＲＥＡＴＥＳＴＥＩＧＥＮＶＡＬＵＥＯＦＡＲＥＡＬＳＹＭＭＥＴＲＩＣＭＡＴ... 相似文献

9.

面向SW26010处理器的三维Stencil自适应分块参数算法

朱雨庞建民徐金龙陶小涵王军《计算机科学》2021,48(6):10-18

Stencil计算是科学应用中的一类重要计算,而分块是提升Stencil计算数据局部性的关键技术.针对现有三维Stencil优化在SW26010处理器上缺少时间分块以及分块参数需手工调优的问题,引入时间分块,提出了面向SW26010处理器的三维Stencil自适应分块参数算法.通过建立性能分析模型,结合硬件计算能力及存储容量等限制因素,文中系统地分析了分块参数对模型性能的影响,判断性能瓶颈,指导分块参数的优化方向.基于性能分析模型,自适应分块参数算法可给出预测性能最优时的分块参数,有利于三维Stencil在SW26010处理器上的快速优化部署.选取了三维7点和三维27点Stencil算例进行实验.与未使用时间分块的三维Stencil优化相比,以上两个算例在自适应选择的分块参数下可以达到1.47和1.29的加速比,且实际最优分块参数与理论最佳分块参数一致,这验证了所提性能分析模型及自适应分块参数算法的有效性. 相似文献

10.

加电自检中MMU的功能检测分析

孟永强李曦赵振西《计算机工程》2000,26(2):93-94

结合ＳＰＡＲＣ２０工作站ＭＭＵ的操作及其执行机制，通过对加电自检程序的分析，对ＳＰＡＲＣＲｅｆｅｒｅｎｃｅＭＭ＊检测机制进行了研究，给出了检测的内容及其实现算法。相似文献

11.

A new parallel matrix multiplication algorithm on distributed-memory concurrent computers

Jaeyoung Choi 《Concurrency and Computation》1998,10(8):655-670

We present a new fast and scalable matrix multiplication algorithm called DIMMA (distribution-independent matrix multiplication algorithm) for block cyclic data distribution on distributed-memory concurrent computers. The algorithm is based on two new ideas; it uses a modified pipelined communication scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS (basic linear algebra subprograms) routine in each processor even when the block size is very small or very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer. © 1998 John Wiley & Sons, Ltd. 相似文献

12.

Tiling, block data layout, and memory hierarchy performance

Park N. Hong B. Prasanna V.K. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(7):640-654

Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and translation look-aside buffer (TLB) performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower bound on TLB performance for standard matrix access patterns, and show that block data layout and Morton layout achieve this bound. To improve cache performance, block data layout is used in concert with tiling. Based on the cache and TLB performance analysis, we propose a data block size selection algorithm that finds a tight range for optimal block size. To validate our analysis, we conducted simulations and experiments using tiled matrix multiplication, LU decomposition, and Cholesky factorization. For matrix multiplication, simulation results using UltraSparc II parameters show that tiling and block data layout with a block size given by our block size selection algorithm, reduces up to 93 percent of TLB misses compared with other techniques. The total miss cost is reduced considerably. Experiments on several platforms show that tiling with block data layout achieves up to 50 percent performance improvement over other techniques that use conventional layouts. Morton layout is also analyzed and compared with block data layout. Experimental results show that matrix multiplication using block data layout is up to 15 percent faster than that using Morton data layout. 相似文献

13.

A framework for high‐performance matrix multiplication based on hierarchical abstractions,algorithms and optimized low‐level kernels

Vinod Valsalam Anthony Skjellum 《Concurrency and Computation》2002,14(10):805-839

Despite extensive research, optimal performance has not easily been available previously for matrix multiplication (especially for large matrices) on most architectures because of the lack of a structured approach and the limitations imposed by matrix storage formats. A simple but effective framework is presented here that lays the foundation for building high‐performance matrix‐multiplication codes in a structured, portable and efficient manner. The resulting codes are validated on three different representative RISC and CISC architectures on which they significantly outperform highly optimized libraries such as ATLAS and other competing methodologies reported in the literature. The main component of the proposed approach is a hierarchical storage format that efficiently generalizes the applicability of the memory hierarchy friendly Morton ordering to arbitrary‐sized matrices. The storage format supports polyalgorithms, which are shown here to be essential for obtaining the best possible performance for a range of problem sizes. Several algorithmic advances are made in this paper, including an oscillating iterative algorithm for matrix multiplication and a variable recursion cutoff criterion for Strassen's algorithm. The authors expose the need to standardize linear algebra kernel interfaces, distinct from the BLAS, for writing portable high‐performance code. These kernel routines operate on small blocks that fit in the L1 cache. The performance advantages of the proposed framework can be effectively delivered to new and existing applications through the use of object‐oriented or compiler‐based approaches. Copyright © 2002 John Wiley & Sons, Ltd. 相似文献

14.

基于BLACS的2.5D并行矩阵乘法

廖霞李胜国卢宇彤杨灿群《计算机学报》2021,44(5):1037-1050

并行矩阵乘法是线性代数中最重要的基本运算之一,同时也是许多科学应用的基石.随着高性能计算(HPC)向E级计算发展,并行矩阵乘法的通信开销所占比重越来越大.如何降低并行矩阵乘法的通信开销,提高并行矩阵乘的可扩展性是当前研究的热点之一.本文提出一种新型的分布式并行稠密矩阵乘算法,即2.5D版本的PUMMA(Parallel Universal Matrix Multiplication Algorithm)算法,该算法是通过将初始的进程分成c组,利用计算节点的额外内存,在每个进程组上同时存储矩阵A、B和执行1/c的PUMMA算法,最后通过规约操作来得到矩阵乘的最终结果.本文基于BLACS(Basic Linear Algebra Communication Subprograms)通信库实现了一种从2D到2.5D的新型数据重分配算法,与PUMMA算法相结合,最终得到2.5D PUMMA算法,可直接替换PDGEMM(Parallel Double-precision General Matrix-matrix Multiplication),具有良好的可移植性.与国际标准算法库ScaLAPACK(Scalable Linear Algebra PACKage)中的PDGEMM等经典2D算法相比,本文算法缩减了通信次数,提高了数据局部性,具有更好的可扩展性.在进程数较多时,例如4096进程时,系统测试表明相对PDGEMM的加速比可达到2.20~2.93.进一步地,本文将2.5D PUMMA算法应用于加速计算对称三对角矩阵的特征值分解,其加速比可达到1.2以上.本文通过大量数值算例分析了2.5D PUMMA算法的性能,并给出了实用性建议和总结了未来的工作. 相似文献

15.

Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

《Parallel Computing》2016

We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrarily to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms making coarse grain block algorithms perform more efficiently than fine grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse grain block decomposition. We incrementally build efficient kernels, for matrix multiplication first, then triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants: either iterative or recursive and using different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive–iterative methods for triangular system solve and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32 cores NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms and that exact dense linear algebra matches the performance of full rank reference numerical software even in the presence of rank deficiencies. 相似文献

16.

一种支持优化分块策略的矩阵乘加速器设计

沈俊忠肖涛乔寓然杨乾明文梅《计算机工程与科学》2016,38(9):1748-1754

在许多应用领域中,大规模浮点矩阵乘法往往是最耗时的计算核心之一。在新兴的应用中经常存在至少有一个维度很小的大规模矩阵,我们把具备这种特性的矩阵称为非均匀矩阵。由于FPGA上用以存储中间结果的片上存储器容量十分有限,计算大规模矩阵乘法时往往需要将矩阵划分成细粒度的子块计算任务。当加速非均匀矩阵乘法时,由于只支持固定分块大小,大多数现有的线性阵列结构的硬件矩阵乘法器将遭受很大的性能下降。为了解决这个问题,提出了一种有效的优化分块策略。在此基础上,在Xilinx公司的Zynq XC7Z045FPGA芯片上实现了一个支持可变分块的矩阵乘法器。通过集成224个处理单元,该矩阵乘法器在150 MHz的时钟频率下对于实际应用中的非均匀矩乘达到了48GFLOPS的实测性能,而所需带宽仅为4.8GB/s。实验结果表明,我们提出的分块策略相比于传统的分块算法实现了高达12%的性能提升。相似文献

17.

Efficient mapping and implementation of matrix algorithms on a hypercube

Vladimir Cherkassky Ross Smith 《The Journal of supercomputing》1988,2(1):7-27

It is well known that parallelism by itself does not lead to higher speeds. This study shows how to put parallelism to best use, that is, how to find an optimal balance between communication and computation overheads for two parallel matrix algorithms. The problem graph for matrix algorithms analyzed in this paper is a two-dimensional grid (toroidal mesh) which is mapped onto a hypercube topology. To perform matrix operations on a hypercube, a matrix is partitioned into several submatrices which are stored and manipulated in the nodes. We seek to find an optimal matrix partitioning to minimize overall execution time. The NCUBE parallel machine is used for experimental performance evaluation. For matrix multiplication, we derive an exact analytical model to determine the optimal partitioning size and perform its experimental verification on the NCUBE parallel processor. For a parallel Gaussian elimination known as the balanced algorithm, we present performance measurements and an approximate analytical model for performance evaluation. Our analyses show that the optimal submatrix size is typically small and does not depend on the original matrix size. 相似文献

18.

Improvement of workload balancing using parallel loop self-scheduling on Intel Xeon Phi

Chao-Tung Yang Chao-Wei Huang Shuo-Tsung Chen 《The Journal of supercomputing》2017,73(11):4981-5005

In recent years, Intel promotes its new product Xeon Phi coprocessor, which is similar to the x86 architecture coprocessor. It has about 60 cores and can be regarded as a single computing node, with the computing power that cannot be ignored. This work aims to improve the workload balance by parallel loop self-scheduling scheme performed on Xeon Phi-based computer cluster. The proposed concept is implemented by hybrid MPI and OpenMP parallel programming in C language. Since parallel loop self-scheduling composes of static and dynamic allocation, weighting algorithm is adopted in the static part, while the well-known loop self-scheduling is adopted in dynamic part. The loop block is partitioned according to the weighting of MIC and HOST nodes. Accordingly, Xeon Phi with many-core is adopted to implement parallel loop self-scheduling. Finally, we test the performance in the experiments by four applicable problems: matrix multiplication, sparse matrix multiplication, Mandelbrot set and circuit meet. The experimental results indicate how to do the weight allocation and which scheduling method can achieve the best performance. 相似文献

19.

一种基于遗传算法的BLAS库优化方法

孙成国兰静姜浩《计算机工程与科学》2018,40(5):798-804

基于OpenBLAS和BLIS开源线性代数基础算法库,对稠密矩阵乘法GEMM运算的性能优化展开研究。针对如何选取稠密矩阵分块并行算法的关键分块参数这一问题,建立性能优化模型。采用改进的遗传算法求解上述优化模型,将某一分块参数组合(种群个体)所对应的稠密矩阵乘法的性能值作为该个体的适应度,通过不断迭代地进行选择、交叉、变异操作,找到最优的分块参数组合,使得稠密矩阵运算的性能值最优。数值实验表明,基于遗传算法求解得出最优分块参数下的GEMM性能值优于默认分块参数下的性能值,达到了优化的目的。相似文献

20.

布尔矩阵乘的分布式异构并行优化

朱敏唐波赵娟邹丹李金才《计算机工程与科学》2017,39(4):634-640

布尔多项式求解是当今密码代数分析中的关键步骤,F4算法是布尔多项式求解的高效算法。分析了Lachartre为F4矩阵专门设计的高斯消去算法,针对其中布尔矩阵乘这一耗时的计算步骤,设计并实现了分布式异构(CPU+MIC)并行算法。布尔矩阵相对于普通矩阵主要体现在矩阵元素取值区间不一样上,由于布尔矩阵元素(0,1)导致矩阵乘操作的特殊性,普通矩阵乘的优化方法不能很好地满足布尔矩阵乘的需求。分别从布尔矩阵的存储、OpenMP多线程组织、访存、任务划分和调度等方面进行了性能优化,实现了布尔矩阵乘的分布式异构并行算法。通过随机生成布尔矩阵测试,优化后的分布式异构并行程序相较于分布式同构并行程序达到了2.45的加速比,体现了良好的性能提升。相似文献