Similar Literature
20 similar documents found (search time: 46 ms)
1.
This paper presents a new systolic algorithm for the complete solution of a system of N linear equations in (N²/2 + O(N)) time steps using 2N processing elements (PEs). It is based on a variant of the Gaussian elimination (GE) algorithm called the successive GE and is faster than any existing GE-based algorithm using O(N) PEs. We also suggest two fault-tolerant schemes that tolerate up to N PE failures. The first scheme is a time-redundancy-based approach with no hardware overhead and 100% time overhead; this scheme can tolerate up to N PE failures. The second scheme is based on algorithm-based fault tolerance (ABFT) and uses N extra PEs to tolerate up to N − 1 PE failures with very little time overhead. The number of errors that can be detected/corrected in both schemes is more than that in any existing fault-tolerant systolic array.
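As a point of reference for the GE-based approach above, the following is a minimal sequential sketch of plain Gaussian elimination with partial pivoting and back substitution in Python; it is the O(N³) serial baseline, not the paper's successive-GE variant or its systolic-array mapping, and the example system is illustrative.

# Minimal sequential Gaussian elimination with partial pivoting
# (serial reference only; no systolic or fault-tolerant machinery).
def solve_gaussian(A, b):
    n = len(A)
    A = [row[:] for row in A]                 # work on copies
    b = b[:]
    for k in range(n):                        # forward elimination
        p = max(range(k, n), key=lambda i: abs(A[i][k]))   # partial pivot
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n                             # back substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Example: solve_gaussian([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]) -> [0.8, 1.4]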

2.
A scheduling algorithm for improving the response performance of soft aperiodic tasks (cited 9 times: 0 self-citations, 9 by others)
He Jun, Sun Yufang. Journal of Software, 1998, 9(10): 721-727
Real-time environments often contain both hard periodic tasks and soft aperiodic tasks. This paper introduces an algorithm that improves the response time of soft aperiodic real-time tasks. Existing approaches to this mixed-task scheduling problem are based on the Rate Monotonic policy; among them, the algorithm that "steals" time from periodic tasks has been shown to outperform all others. However, the Rate Monotonic policy limits processor utilization, and therefore limits how much time can be "stolen" from the periodic tasks. The Deadline Driven (DD) policy can raise the potential processor utilization to 100%. The new algorithm incorporates the DD policy into the scheduling of periodic tasks where appropriate, thereby shortening the response time of aperiodic tasks. Simulation results show that this algorithm outperforms all existing algorithms, while the extra overhead it introduces is not very high.
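To make the Deadline Driven (DD) policy concrete (it is essentially earliest-deadline-first dispatching), here is a minimal tick-based EDF sketch in Python. The job format, the discrete time loop, and the idea that idle slots could serve aperiodic requests are illustrative assumptions; this is not the paper's mixed periodic/aperiodic algorithm.

import heapq

# Minimal earliest-deadline-first ("Deadline Driven") dispatcher sketch.
def edf_schedule(jobs, horizon):
    """jobs: list of (release, deadline, exec_time, name); returns a timeline."""
    timeline = []
    ready = []                              # min-heap keyed by absolute deadline
    pending = sorted(jobs)                  # sorted by release time
    for t in range(horizon):
        while pending and pending[0][0] <= t:
            release, deadline, exec_time, name = pending.pop(0)
            heapq.heappush(ready, [deadline, name, exec_time])
        if ready:
            job = ready[0]                  # the earliest deadline runs
            timeline.append(job[1])
            job[2] -= 1
            if job[2] == 0:
                heapq.heappop(ready)
        else:
            timeline.append(None)           # idle slot (could serve aperiodics)
    return timeline

# Example: two jobs released at t=0 with deadlines 4 and 2.
print(edf_schedule([(0, 4, 2, "A"), (0, 2, 1, "B")], horizon=4))
# -> ['B', 'A', 'A', None]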

3.
To address task scheduling across heterogeneous multi-core processors and to better exploit the advantages of such platforms, this paper proposes a scheme based on duplicating dependent tasks that do not reside on the same processor, so that each heterogeneous multi-core processor can execute its tasks independently and the communication overhead between different processors is reduced. A hybrid particle swarm optimization (HPSO) algorithm is then used to schedule the tasks on the heterogeneous multi-core processors, preventing any single processor from being assigned so many tasks that results cannot be produced promptly and accurately. Experiments show that, compared with traditional heuristic allocation schemes and the common genetic algorithm (GA), the task-duplication-based allocation scheme combined with HPSO has stronger solving ability and yields schedules with shorter execution times, demonstrating good practical value.

4.
The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance data locality, we adopt a super-cluster-based neighbor list with an appropriate granularity that fits in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, using a data-privatization force array is a more feasible method to avoid write conflicts, but it results in a large data-reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes on-chip data communication to reduce the data-reduction overhead and provide load balancing. Moreover, we exploit single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of the peak flop rate on the Sunway many-core processor.
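For readers unfamiliar with what a short-range pair-force kernel does, below is a minimal serial Lennard-Jones kernel over a precomputed half neighbor list in Python. The potential, cutoff, and data layout are illustrative assumptions, and none of the Sunway-specific techniques above (super-cluster neighbor lists, data privatization, dual-slice partitioning, SIMD) are reproduced; note that the scattered update to forces[j] is exactly the write that conflicts once the loop over i is parallelized.

# Minimal serial short-range pair-force kernel over a neighbor list
# (Lennard-Jones, illustrative parameters; no many-core optimizations).
def lj_forces(pos, neighbors, eps=1.0, sigma=1.0, rcut=2.5):
    """pos: list of (x, y, z); neighbors[i]: list of j > i within the cutoff."""
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    rcut2 = rcut * rcut
    for i in range(n):
        xi, yi, zi = pos[i]
        for j in neighbors[i]:              # each pair visited once (j > i)
            dx, dy, dz = xi - pos[j][0], yi - pos[j][1], zi - pos[j][2]
            r2 = dx * dx + dy * dy + dz * dz
            if r2 >= rcut2 or r2 == 0.0:
                continue
            s2 = (sigma * sigma) / r2
            s6 = s2 * s2 * s2
            # force/|r| = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2
            fr = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
            fx, fy, fz = fr * dx, fr * dy, fr * dz
            forces[i][0] += fx; forces[i][1] += fy; forces[i][2] += fz
            # scattered write to particle j: the source of write conflicts
            forces[j][0] -= fx; forces[j][1] -= fy; forces[j][2] -= fz
    return forces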

5.
A longest common subsequence (LCS) of two strings is a common subsequence of maximal length. The LCS problem is to find an LCS of two given strings and the length of the LCS (LLCS). In this paper, we present a new linear processor array for solving the LCS problem. The array is based on parallelization of a recent LCS algorithm which consists of two phases, i.e. preprocessing and computation. The computation phase is based on a bit-level dynamic programming approach. Implementations of the preprocessing and computation phases are discussed on the same processor array architecture for the LCS problem. Further, we propose a block processor array architecture which reduces the overall communication and time requirements. Finally, we develop a performance model for estimating the performance of the processor array architecture on Pentium processors.
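As an illustration of bit-level dynamic programming for the LLCS (not necessarily the exact algorithm parallelized in the paper), the Python sketch below uses a well-known bit-vector formulation in the style of Allison-Dix and Hyyrö, in which the LLCS equals the number of zero bits remaining in the state vector.

# Bit-vector LLCS sketch (Allison-Dix / Hyyrö style); Python big integers
# play the role of machine words, so no explicit word partitioning is needed.
def llcs_bitparallel(a, b):
    m = len(a)
    mask = (1 << m) - 1
    # Per-character match masks: bit i of pm[c] is set iff a[i] == c.
    pm = {}
    for i, c in enumerate(a):
        pm[c] = pm.get(c, 0) | (1 << i)
    v = mask                                  # all ones to start
    for c in b:
        u = v & pm.get(c, 0)
        v = ((v + u) | (v - u)) & mask
    return m - bin(v).count("1")              # zero bits = LLCS length

# Example: llcs_bitparallel("ABCBDAB", "BDCABA") -> 4  (e.g. "BCAB")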

6.
We present two fast algorithms for sorting on a linear array with a reconfigurable pipelined bus system (LARPBS), one of the recently proposed parallel architectures based on optical buses. In our first algorithm, we sort N numbers in O(log N log log N) worst-case time using N processors. In our second algorithm, we sort N numbers in O((log log N)²) worst-case time using N^(1+ε) processors, for any fixed ε such that 0 < ε < 1. Our algorithms are based on a novel deterministic sampling scheme for merging two sorted arrays of length N each in O(log log N) time on an LARPBS with N processors. To our knowledge, the previous best sorting algorithm on this architecture has a running time of O((log N)²) using N processors.

7.
Array statements are often used to express data-parallelism in scientific languages such as Fortran 90 and High Performance Fortran. In compiling array statements for a distributed-memory machine, efficient generation of communication sets and local index sets is important. We show that for arrays distributed block-cyclically on multiple processors, the local memory access sequence and communication sets can be efficiently enumerated as closed forms using regular sections. First, closed form solutions are presented for arrays that are distributed using block or cyclic distributions. These closed forms are then used with a virtual processor approach to give an efficient solution for arrays with block-cyclic distributions. This approach is based on viewing a block-cyclic distribution as a block (or cyclic) distribution on a set of virtual processors, which are cyclically (or block-wise) mapped to physical processors. These views are referred to as virtual-block or virtual-cyclic views, depending on whether a block or cyclic distribution of the array on the virtual processors is used. The virtual processor approach permits different schemes based on the combination of the virtual processor views chosen for the different arrays involved in an array statement. These virtualization schemes have different indexing overheads. We present a strategy for identifying the virtualization scheme which will have the best performance. Performance results on a Cray T3D system are presented for hand-compiled code for array assignments. These results show that, using the virtual processor approach, efficient code can be generated for the execution of array statements involving block-cyclically distributed arrays.
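To illustrate the index sets under discussion, here is a small Python sketch that enumerates the global indices owned by one processor under block, cyclic, and block-cyclic distributions; the paper's closed forms and virtual-processor machinery are not reproduced, and the parameters in the example are arbitrary.

# Global indices owned by processor p for an array of n elements on P
# processors, under the three standard HPF-style distributions.
def block_indices(n, P, p):
    b = (n + P - 1) // P                     # block size = ceil(n / P)
    return list(range(p * b, min((p + 1) * b, n)))

def cyclic_indices(n, P, p):
    return list(range(p, n, P))

def block_cyclic_indices(n, P, p, b):
    """CYCLIC(b): blocks of b elements dealt out round-robin."""
    owned = []
    for start in range(p * b, n, P * b):     # first element of each owned block
        owned.extend(range(start, min(start + b, n)))
    return owned

# Example with n=16, P=3, p=1, block size b=2:
# block_indices(16, 3, 1)           -> [6, 7, 8, 9, 10, 11]
# cyclic_indices(16, 3, 1)          -> [1, 4, 7, 10, 13]
# block_cyclic_indices(16, 3, 1, 2) -> [2, 3, 8, 9, 14, 15]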

8.
This paper studies the scheduling of arbitrarily divisible applications that require additional information. It first introduces the concept of additional information and extends the DLS model accordingly; on this basis it re-analyzes the shortcomings of the classical equal-partitioning (EQS) algorithm for this class of applications and proposes a no-idle-time scheduling algorithm (NIS). Based on closed-form analytical solutions of the two algorithms, it is rigorously proven that the scheduling performance of NIS is always better than that of EQS. Since the typical case in such applications is that every processor requires the same additional information, the paper further studies this typical case. The analysis shows that, compared with EQS, a wider range of applications can obtain the benefits of parallel computation with NIS, and NIS can also exploit more resources.

9.
Fault-tolerant dataflow systems are an entirely new field. This paper presents an overview of FTDF-TD, a fault-tolerant dataflow system. Various aspects of FTDF-TD, such as its architecture, fault-tolerance strategy, and reliability, are described. Research on overhead and performance evaluation based on software simulation is introduced. It is shown that, compared with two other fault-tolerant systems, FTDF-TD achieves valuable results in reducing overhead and execution time and in increasing the reliability of the processor array.

10.
Nonclassical parabolic initial-boundary value problems arise in the study of several important physical phenomena. This paper presents a new approach to treating the complicated boundary conditions that appear in parabolic partial differential equations with nonclassical boundary conditions. A new fourth-order finite difference technique, based upon the Noye and Hayman (N-H) alternating direction implicit (ADI) scheme, is used as the basis to solve the two-dimensional time-dependent diffusion equation with an integral condition replacing one boundary condition. This scheme uses less central processor (CPU) time than a second-order fully implicit scheme based on the classical backward time centered space (BTCS) method for two-dimensional diffusion. It also has a larger range of stability than a second-order fully explicit scheme based on the classical forward time centered space (FTCS) method. The basis of the analysis of the finite difference equations considered here is the modified equivalent partial differential equation approach, developed from the 1974 work of Warming and Hyett. This allows direct and simple comparison of the errors associated with the equations as well as providing a means to develop more accurate finite difference methods. The results of numerical experiments for the new method are presented. The central processor times needed are also reported. Error estimates derived in the maximum norm are tabulated.
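As a baseline for comparison with the ADI scheme described above, the sketch below implements the classical fully explicit FTCS update for the 2D diffusion equation u_t = D(u_xx + u_yy) in Python. The integral boundary condition treated in the paper is not handled; the diffusivity, grid, and Dirichlet boundaries are illustrative assumptions.

# Classical FTCS update for the 2D diffusion equation u_t = D * (u_xx + u_yy)
# on a uniform grid with fixed (Dirichlet) boundary values.  Stable only when
# r = D*dt/h^2 <= 1/4, which is the stability limitation alluded to above.
def ftcs_step(u, D, dt, h):
    n = len(u)
    r = D * dt / (h * h)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = u[i][j] + r * (u[i + 1][j] + u[i - 1][j]
                                       + u[i][j + 1] + u[i][j - 1]
                                       - 4.0 * u[i][j])
    return new

# Example: a 5x5 grid with a single hot interior point diffusing outward.
u = [[0.0] * 5 for _ in range(5)]
u[2][2] = 1.0
for _ in range(10):
    u = ftcs_step(u, D=1.0, dt=0.1, h=1.0)   # r = 0.1 <= 0.25, stable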

11.
A simple and efficient algorithm for the bandwidth reduction of sparse symmetric matrices is proposed. It involves column-row permutations and is well suited to mapping onto the linear array topology of SIMD architectures. The efficiency of the algorithm is compared with that of other existing algorithms. The interconnectivity and the memory requirement of the linear array are discussed, and the complexity of its layout area is derived. The parallel version of the algorithm mapped onto the linear array is then introduced and explained with the help of an example. The optimality of the parallel algorithm is proved by deriving the time complexities of the algorithm on a single processor and on the linear array.
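For context, the bandwidth being reduced is the largest distance of a nonzero entry from the diagonal. The short Python sketch below computes it and shows the effect of a symmetric row-column permutation; the paper's permutation-selection algorithm itself is not reproduced, and the example matrix is illustrative.

# Bandwidth of a symmetric matrix and the effect of a symmetric
# row-column permutation (the reordering heuristic itself is not shown).
def bandwidth(A):
    return max((abs(i - j) for i, row in enumerate(A)
                for j, v in enumerate(row) if v != 0), default=0)

def permute_symmetric(A, perm):
    """Apply the same permutation to rows and columns: B = P * A * P^T."""
    n = len(A)
    return [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

# Example: a small matrix whose bandwidth drops after reordering.
A = [[4, 0, 1],
     [0, 4, 1],
     [1, 1, 4]]
print(bandwidth(A))                                  # -> 2
print(bandwidth(permute_symmetric(A, [0, 2, 1])))    # -> 1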

12.
ZD codes (ZigZag-Decodable codes) are a class of erasure codes constructed from the zigzag decoding algorithm. They can repair failed data in a storage system with only a small amount of computation, but they must store more redundant data than other erasure codes to guarantee high system reliability. To reduce the storage overhead of ZD codes, this paper analyzes the zigzag decoding idea currently used in storage systems and proposes an optimized zigzag decoding algorithm. The new decoding algorithm makes fuller use of the information in the parity data to complete data repair. Based on the new decoding algorithm, the paper proposes a corresponding new ZD code construction; owing to the higher information utilization of the new algorithm, the new construction can meet the high-reliability requirements of a storage system with less storage overhead. Experimental results show that the proposed ZD code construction has optimal storage overhead, and its encoding and decoding performance is far higher than that of the widely used RS codes.

13.
We describe a new parallel polygon clipping algorithm based on a novel technique that allows a processor to compute output vertices independently of the results of the other processors. The basis for the method is a collision-free labeling scheme to compute the labels of the vertices of the output polygon; this labeling scheme depends only on the id of the vertex in the input polygon. This procedure allows us to defer the synchronization between processors to the final stages of the algorithm, reduces the amount of overhead due to fine-grain synchronization, and helps make the algorithm efficient.
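For reference, the sketch below is the classical sequential Sutherland-Hodgman algorithm for clipping a polygon against a convex region expressed as half-planes in Python; it is only the serial baseline, not the paper's parallel collision-free labeling method, and the clip window in the example is an assumed unit square.

# Classical sequential Sutherland-Hodgman clipping against a convex clip
# region given as half-planes (baseline only; not the parallel labeling method).
def clip_polygon(poly, half_planes):
    """poly: list of (x, y); half_planes: list of (a, b, c) meaning a*x + b*y + c >= 0."""
    out = poly
    for a, b, c in half_planes:
        inp, out = out, []
        if not inp:
            break
        prev = inp[-1]
        prev_in = a * prev[0] + b * prev[1] + c >= 0
        for cur in inp:
            cur_in = a * cur[0] + b * cur[1] + c >= 0
            if cur_in != prev_in:            # edge crosses the clip boundary
                t = -(a * prev[0] + b * prev[1] + c) / (
                    a * (cur[0] - prev[0]) + b * (cur[1] - prev[1]))
                out.append((prev[0] + t * (cur[0] - prev[0]),
                            prev[1] + t * (cur[1] - prev[1])))
            if cur_in:
                out.append(cur)
            prev, prev_in = cur, cur_in
    return out

# Clip a triangle against the unit square x >= 0, x <= 1, y >= 0, y <= 1.
square = [(1, 0, 0), (-1, 0, 1), (0, 1, 0), (0, -1, 1)]
print(clip_polygon([(-0.5, 0.25), (0.5, 0.25), (0.5, 1.25)], square))
# -> pentagon [(0.25, 1.0), (0.0, 0.75), (0.0, 0.25), (0.5, 0.25), (0.5, 1.0)]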

14.
Ye Xiaochun, Lin Wei, Fan Dongrui, Zhang Hao. Journal of Software, 2010, 21(12): 3094-3105
Protein sequence alignment is one of the most important algorithms in bioinformatics. Advances in biotechnology have made known sequence databases larger and larger, and since such algorithms are inherently compute-intensive, the time consumed by sequence alignment keeps growing; current single-core systems, or systems with only a few cores, can no longer meet the demand for computation speed. Godson-T is a many-core platform with many innovative architectural features. This work parallelizes a protein sequence alignment algorithm on that system and, combining the characteristics of the alignment algorithm with those of the Godson-T architecture, carefully optimizes the algorithm in three respects: synchronization overhead, memory access contention, and load balancing. The parallel portion as a whole ultimately attains a better, near-linear speedup, and its actual performance is far superior to that of a workstation platform based on AMD Opteron processors.
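The abstract does not name the alignment algorithm that was parallelized; as a representative compute-intensive alignment kernel, here is a minimal Smith-Waterman local-alignment scoring sketch in Python with an assumed scoring scheme (match +2, mismatch -1, linear gap -1).

# Minimal Smith-Waterman local-alignment score (illustrative scoring);
# only the best score is returned, not the alignment itself.
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, rows):
        cur = [0] * cols
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,
                         prev[j - 1] + s,      # diagonal: align a[i-1] with b[j-1]
                         prev[j] + gap,        # gap in b
                         cur[j - 1] + gap)     # gap in a
            best = max(best, cur[j])
        prev = cur
    return best

# Example: smith_waterman_score("ACGTT", "CGTAT") -> 7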

15.
This paper discusses a new work-scheduling algorithm for parallel search of single-agent state spaces, called transposition-table-driven work scheduling, that places the transposition table at the heart of the parallel work scheduling. The scheme results in less synchronization overhead, less processor idle time, and less redundant search effort. Measurements on a 128-processor parallel machine show that the scheme achieves close-to-linear speedups; for large problems the speedups are even superlinear due to better memory usage. On the same machine, the algorithm is 1.6 to 12.9 times faster than traditional work-stealing-based schemes.
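The core idea above is that a state's hash determines which processor owns its transposition-table entry, so the work migrates to the data rather than the table entries migrating to the work. The Python sketch below shows only that home-processor mapping and a local duplicate/bound check; the hash, table layout, and asynchronous message passing are illustrative assumptions.

# Sketch of transposition-table-driven partitioning: the hash of a state
# determines its "home" processor, which stores the table entry and expands
# the node.  Communication and asynchrony are omitted; names are illustrative.
NUM_PROCS = 128

def home_processor(state):
    """Ownership by hash: all work on `state` is sent to this processor."""
    return hash(state) % NUM_PROCS

class LocalTranspositionTable:
    def __init__(self):
        self.table = {}                      # state -> best-known cost bound

    def should_expand(self, state, g):
        """Expand only if we reached `state` more cheaply than before."""
        old = self.table.get(state)
        if old is not None and old <= g:
            return False                     # duplicate or worse path: prune
        self.table[state] = g
        return True

# Usage idea: processor home_processor(s) receives (s, g) and calls
# tables[home_processor(s)].should_expand(s, g) before generating successors.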

16.
We propose and evaluate a parallel "decomposite best-first" search branch-and-bound algorithm (dbs) for MIN-based multiprocessor systems. We start with a new probabilistic model to estimate the number of evaluated nodes for a serial best-first search branch-and-bound algorithm. This analysis is used in predicting the parallel algorithm's speed-up. The proposed algorithm initially decomposes a problem into N subproblems, where N is the number of processors available in the multiprocessor. Afterwards, each processor executes a serial best-first search to find a local feasible solution. Local solutions are broadcast through the network to compute the final solution. A conflict-free mapping scheme, known as step-by-step spread, is used for subproblem distribution on the MIN. A speed-up expression for the parallel algorithm is then derived using the serial best-first search node evaluation model. Our analysis considers both computation and communication overheads to provide realistic speed-up estimates. Communication modeling is also extended to the parallel global best-first search technique. All analytical results are validated via simulation. For large systems, when communication overhead is taken into consideration, it is observed that the parallel decomposite best-first search algorithm provides better speed-up than other reported schemes.
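For the serial building block referred to above, here is a generic serial best-first branch-and-bound skeleton in Python using a priority queue ordered by lower bound; the problem-specific callables (branch, lower_bound, is_complete, cost) are hypothetical placeholders, and this is not the dbs decomposition itself.

import heapq
import itertools

# Generic serial best-first branch-and-bound skeleton (minimization).
# The four callables are supplied by the caller for a concrete problem.
def best_first_bnb(root, branch, lower_bound, is_complete, cost):
    counter = itertools.count()              # tie-breaker for the heap
    frontier = [(lower_bound(root), next(counter), root)]
    best_cost, best_sol = float("inf"), None
    evaluated = 0
    while frontier:
        bound, _, node = heapq.heappop(frontier)
        if bound >= best_cost:
            continue                         # pruned: cannot beat the incumbent
        evaluated += 1
        if is_complete(node):
            if cost(node) < best_cost:
                best_cost, best_sol = cost(node), node
            continue
        for child in branch(node):           # expand: push promising children
            b = lower_bound(child)
            if b < best_cost:
                heapq.heappush(frontier, (b, next(counter), child))
    return best_sol, best_cost, evaluated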

17.
This short note presents constant-time algorithms for labeling the connected components of an image on a network of processors with a wide reconfigurable bus. The algorithms are based on a processor indexing scheme which employs constant-weight codes. The use of such codes enables identifying a single representative processor for each component in a constant number of steps. The proposed algorithms can label an N×N image in O(1) time using N² processors, which is optimal. Furthermore, the proposed techniques lead to an O(log N / log log N)-time image labeling algorithm on a network of N² processors with a reconfigurable bus of width log N bits. It is shown that these techniques can be applied to labeling an undirected N-vertex graph represented by an adjacency matrix.
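A constant-weight code is simply a set of codewords that all have the same number of 1 bits. The small Python sketch below enumerates such codewords and assigns them as processor indices, which is the flavor of indexing scheme mentioned above; the paper's specific code and its representative-selection steps are not reproduced, and the parameters are illustrative.

from itertools import combinations

# Enumerate all n-bit codewords of Hamming weight w (a constant-weight code)
# and use them as processor indices.
def constant_weight_codewords(n, w):
    words = []
    for ones in combinations(range(n), w):
        word = 0
        for bit in ones:
            word |= 1 << bit
        words.append(word)
    return words

# Example: 4-bit words of weight 2 -> C(4, 2) = 6 codewords.
codes = constant_weight_codewords(4, 2)
print([format(c, "04b") for c in codes])
# -> ['0011', '0101', '1001', '0110', '1010', '1100']
processor_index = {pid: code for pid, code in enumerate(codes)}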

18.
A fast algorithm for computing a histogram on reconfigurable mesh (cited 1 time: 0 self-citations, 1 by others)
The reconfigurable mesh captures salient features from a variety of sources, including the content addressable array parallel processor, the CHiP, the polymorphic-torus network, and the bus automaton. It consists of an array of processors interconnected by a reconfigurable bus system. The bus system can be used to dynamically obtain various interconnection patterns between the processors. In this paper, we present a fast algorithm for computing the histogram of an N×N image with h grey levels in O(min{√h + log*(N/h), N}) time on an N×N reconfigurable mesh, assuming each PE has a constant amount of local memory. This algorithm runs on the PARBUS and MRN/LRN models. In addition, histogram modification can be performed in O(√h) time on the same model. A variant of our algorithm runs in O(min{√h + log log(N/h), N}) time on an N×N RMESH in which each PE has constant storage. This result improves the known time and memory bounds for histogramming on the RMESH model.
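As a reminder of what the kernel computes, the sketch below is a plain sequential histogram of an N×N image with h grey levels in Python; the reconfigurable-mesh algorithm and its bus manipulation are not reproduced.

# Sequential reference: histogram of an N x N image with h grey levels.
# The reconfigurable-mesh algorithm distributes this count across N*N PEs.
def histogram(image, h):
    counts = [0] * h
    for row in image:
        for pixel in row:
            counts[pixel] += 1               # pixel values assumed in [0, h)
    return counts

# Example: a 2x2 image with 4 grey levels.
print(histogram([[0, 1], [1, 3]], h=4))      # -> [1, 2, 0, 1]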

19.
The design of a floating point matrix-vector multiplication processor array for VLSI, which has an optimal area-time complexity product, is presented. This processor array is capable of performing the matrix-vector product function (where n = 1, …, N) and can be applied in many digital signal processing applications by simply changing the matrix coefficients stored in the array. Each N-bit mantissa, M-bit exponent (N, M) processor element of the array comprises a mantissa multiplier/adder circuit and hardware to handle the floating point control. The multiplier/adder circuit is implemented by a new optimal algorithm, which is regular, recursive and fast. Secondly, the algorithm offers a highly local and regular interconnection network, which is a fundamental requirement in VLSI circuit design methodology.

20.
Array operations are useful in a large number of important scientific codes, such as molecular dynamics, finite element methods, climate modeling, atmosphere and ocean sciences, etc. In our previous work, we have proposed a scheme of extended Karnaugh map representation (EKMR) for multidimensional array representation. We have shown that sequential multidimensional array operation algorithms based on the EKMR scheme have better performance than those based on the traditional matrix representation (TMR) scheme. Since parallel multidimensional array operations have been an extensively investigated problem, we present efficient data parallel algorithms for multidimensional array operations based on the EKMR scheme for distributed memory multicomputers. In a data parallel programming paradigm, in general, we distribute array elements to processors based on various distribution schemes, do local computation in each processor, and collect computation results from each processor. Based on the row, column, and 2D mesh distribution schemes, we design data parallel algorithms for matrix-matrix addition and matrix-matrix multiplication array operations in both TMR and EKMR schemes for multidimensional arrays. We also design data parallel algorithms for six Fortran 90 array intrinsic functions: All, Maxval, Merge, Pack, Sum, and Cshift. We compare the time of the data distribution, the local computation, and the result collection phases of these array operations based on the TMR and the EKMR schemes. The experimental results show that algorithms based on the EKMR scheme outperform those based on the TMR scheme for all test cases.
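To make the three data-parallel phases described above concrete (distribute, compute locally, collect results), here is a small Python sketch of matrix-matrix addition under a block row distribution. The sequentially simulated processors and the plain nested-list (TMR-style) layout are illustrative assumptions; the EKMR layout itself is not implemented.

# Data-parallel matrix-matrix addition written as the three phases above:
# distribute rows, compute locally per "processor", collect the results.
def distribute_rows(A, P):
    """Split A into P contiguous row blocks (block row distribution)."""
    n = len(A)
    b = (n + P - 1) // P
    return [A[p * b:(p + 1) * b] for p in range(P)]

def local_add(A_block, B_block):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A_block, B_block)]

def collect(blocks):
    return [row for block in blocks for row in block]

def parallel_matrix_add(A, B, P=4):
    A_blocks, B_blocks = distribute_rows(A, P), distribute_rows(B, P)
    C_blocks = [local_add(a, b) for a, b in zip(A_blocks, B_blocks)]  # per-PE work
    return collect(C_blocks)

# Example:
A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
print(parallel_matrix_add(A, B, P=2))        # -> [[11, 22], [33, 44]]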
