期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

佟冬方滨兴胡铭曾《计算机研究与发展》2000,37(2):194-200

非线性存储方案能在处理单元数等于存储体数的情况下,使ＳＩＭＤ机实现多种访存模式无冲突,提高其整体性能,文中提出一种用线性存储方案设计ＳＩＭＤ一般方法,在存储方案给定的前提下,针对有限的模板集设计出同时满足存储器访问无冲突和互联网的并行结构,首先,用布尔向量空间表示模板,并指出模板与ＬＣ置换的对应关系,在此基础上,提出设计局部地址生成逻辑和增强的间接二进制Ｎ方体网络的方法,由于板集中任意的访存方式相似文献

2.

Fault-Tolerant Routing in MIN-Based Supercomputers

《Journal of Parallel and Distributed Computing》1994,22(2):154-167

We study methods for routing data in supercomputers that use multistage interconnection networks (MINs), in the presence of faulty components in the network. These methods are applicable to multiprocessors such as IBM GF11 and RP3. These methods are based on the concept of dynamic full access (DFA), which refers to the ability of the network to route data from any processor in the system to any other processor in a finite number of passes through the network. We introduce a graph model called the DFA graph of a MIN and show how it can be used to determine the DFA capability of the MIN under a given set of network faults. When the faults in the network satisfy certain special properties, we present algorithms for routing

•any arbitrary permutation in a faulty Benes network, and
•any Omega permutation in a faulty Omega network.

These algorithms are simple and operate in a distributed fashion. These techniques allow a supercomputer to efficiently realize permutations of data needed in a parallel computing environment despite the presence of faults in the network. 相似文献

3.

O(n) routing in rearrangeable networks

《Journal of Systems Architecture》2000,46(6):529-542

In (2n−1)-stage rearrangeable networks, the routing time for any arbitrary permutation is Ω(n²) compared to its propagation delay O(n) only. Here, we attempt to identify the sets of permutations, which are routable in O(n) time in these networks. We define four classes of self-routable permutations for Benes network. An O(n) algorithm is presented here, that identifies if any permutation P belongs to one of the proposed self-routable classes, and if yes, it also generates the necessary control vectors for routing P. Therefore, the identification, as well as the switch setting, both problems are resolved in O(n) time by this algorithm. It covers all the permutations that are self-routable by anyone of the proposed techniques. Some interesting relationships are also explored among these four classes of permutations, by applying the concept of ‘group-transformations’ [N. Das, B.B. Bhattacharya, J. Dattagupta, Hierarchical classification of permutation classes in multistage interconnection networks, IEEE Trans. Comput. (1993) 665–677] on these permutations. The concepts developed here for Benes network, can easily be extended to a class of (2n−1)-stage networks, which are topologically equivalent to Benes network. As a result, the set of permutations routable in a (2n−1)-stage rearrangeable network, in a time comparable to its propagation delay has been extended to a large extent. 相似文献

4.

Sorting by Short Block-Moves 总被引：1，自引：0，他引：1

L. S. Heath J. P. C. Vergara 《Algorithmica》2000,28(3):323-354

Sorting permutations by operations such as reversals and block-moves has received much interest because of its applications in the study of genome rearrangements and in the design of interconnection networks. A short block-move is an operation on a permutation that moves an element at most two positions away from its original position. This paper investigates the problem of finding a minimum-length sorting sequence of short block-moves for a given permutation. A 4/3 -approximation algorithm for this problem is presented. Woven double-strip permutations are defined and a polynomial-time algorithm for this class of permutations is devised that employs graph matching techniques. A linear-time maximum matching algorithm for a special class of grid graphs improves the time complexity of the algorithm for woven double-strip permutations. Received June 1, 1997; revised July 25, 1998. 相似文献

5.

Performing permutations on interconnection networks by regularlychanging switch states

Wei Kuang Lai 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(8):829-837

In this paper, we present an algorithm for performing permutations of messages on multistage interconnection networks. Permutations of messages are needed in many parallel algorithms. The proposed algorithm is feasible for any networks that can connect each input to each output using a set of N nonblocking connections, where N is the number of ports on the network. Messages are segmented into N submessages that are sent independently in each time step. For any permutation, the settings of switches are changed with fixed patterns. Partitioning of the network into independent subnetworks is also supported, each capable of simultaneously routing a different permutation 相似文献

6.

Efficient permutation instructions for fast software cryptography

Lee R.B. Zhijie Shi Xiiao Yang 《Micro, IEEE》2001,21(6):56-69

Performing permutations in software can facilitate more widespread use of secure information processing and faster multimedia processing, but current instruction set architectures, even when augmented with subword-parallel multimedia instructions, do not provide efficient, bit-level software permutations. Four new instructions each offer a solution. They are: PPERM (a new, lower-cost version of PPERM3R that selects bits for one byte of the result); GRP (a permutation instruction that separates bits into left and right parts); CROSS (a permutation instruction using Benes interconnection network theory); and OMFLIP (a permutation instruction using enhanced Omega-Flip interconnection network theory) 相似文献

7.

A heuristic storage for minimizing access time of arbitrary datapatterns

Al-Mouhamed M.A. Seiden S.S. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(4):441-447

The serialization of memory accesses is a major limiting factor in high performance SIMD computers. The data patterns or templates that are accessed by a program can be perceived by the compiler, and, therefore, the design of dynamic storage schemes that minimize conflicts may dramatically improve performance. The problem of finding storage schemes that minimize the access time of arbitrary sets of power-of-two data patterns is proved to be NP-complete. We propose linear address transformations that can be dynamically applied by each processing element for mapping array references onto memories. An efficient approach for combining the constraints of different access patterns into one single linear address transformation is presented. We prove that finding the transformation that minimizes the access time is reducible to N-coloring, where N is the number of parallel memories. Using coloring heuristics, storage schemes are investigated with respect to minimizing the implementation cost (perfect storage) and overall access conflicts (semiperfect storage). Results show that the perfect-storage may deviate on the average by 20% from the optimum access time in the case of 10 arbitrary data patterns and 16 memories. However, semiperfect schemes lead to dramatic reduction of the degree of conflict compared to perfect-schemes. The proposed heuristic storage largely outperforms interleaving and row-column-diagonals storages. The method can be implemented as compiler procedure for synthesizing storage schemes that promote parallel access to arbitrary sets of data patterns 相似文献

8.

Conflict-free star-access in parallel memory systems

《Journal of Parallel and Distributed Computing》2006,66(11):1431-1441

We study conflict-free data distribution schemes in parallel memories in multiprocessor system architectures. Given a host graph G, the problem is to map the nodes of G into memory modules such that any instance of a template type T in G can be accessed without memory conflicts. A conflict occurs if two or more nodes of T are mapped to the same memory module. The mapping algorithm should: (i) be fast in terms of data access (possibly mapping each node in constant time); (ii) minimize the required number of memory modules for accessing any instance in G of the given template type; and (iii) guarantee load balancing on the modules. In this paper, we consider conflict-free access to star templates, i.e., to any node of G along with all of its neighbors. Such a template type arises in many classical algorithms like breadth-first search in a graph, message broadcasting in networks, and nearest neighbor based approximation in numerical computation. We consider the star-template access problem on two specific host graphs—tori and hypercubes—that are also popular interconnection network topologies. The proposed conflict-free mappings on these graphs are fast, use an optimal or provably good number of memory modules, and guarantee load balancing. 相似文献

9.

Optimal tree access by elementary and composite templates inparallel memory systems

Auletta V. Das S.K. De Vivo A. Pinotti M.C. Scarano V. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(4):399-412

In this paper, we study efficient strategies for mapping onto parallel memory systems complete trees that are accessed by fixed templates (like complete subtrees, paths, or any combinations their of). These mappings are evaluated with respect to the following criteria: (1) the largest number of data items that can be accessed in parallel without memory conflicts; (2) the number of memory conflicts that can occur when accessing templates of size equal to the number of available memory modules, thereby exploiting the full parallelism of the system; (3) the complexity of the memory addressing scheme, i.e., the cost of retrieving the module where a given data item is mapped. We show that there exist trade-offs between these three criteria and the performance of different mapping strategies depends on the emphasis given on each of these criteria. More specifically, we describe an algorithm for mapping complete binary trees of height H onto M memory modules and prove that it achieves the following performance results: (1) conflict-free access to complete subtrees of size K and paths of size N such that N + K - [log K] ⩽ M; (2) at most 1 conflict in accessing complete subtrees and paths of size M; (3) O(K/M + c) conflicts when accessing a composite template of K nodes consisting of c disjoint subsets, each subset being a complete subtree, or a path or a set of consecutive nodes in a level of the tree 相似文献

10.

Fast recognition of Baxter permutations using syntactical and complete bipartite composite dag's

Johnson M. Hart 《International journal of parallel programming》1980,9(4):307-321

A correspondence is developed between permutations and two-dimensional digraphs. When the graphs are also required to be completely bipartite composite (CBC), their permutations then correspond to a class called the Baxter permutations. A permutation can be tested for the Baxter condition by a linear time processing of its digraph. This processing involves a pair of traversals of the digraph and is made possible by exploiting a relationship with the syntactical graphs used in language theory. 相似文献

11.

The composite banyan network 总被引：2，自引：0，他引：2

Seung-Woo Seo Tse-Yun Feng 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(10):1043-1054

A new multipath multistage interconnection network called the composite banyan network is proposed. The network incorporates both the banyan and the reverse banyan networks and is constructed by superimposing the two. The basic building blocks in the composite banyan network are 3×3 switching elements with log₂N stages. A major advantage of the composite banyan network over existing networks with 3×3 SEs is an efficient and fast control algorithm that sets up a path between any source and destination pair. Instead of complex numerical calculations, the network can easily generate a primary routing tag and alternate tags through simple binary operations. Also, the network has a lot of favorable features, including regularity, symmetry, and easy rerouting capability under faults and conflicts. It is shown that at least two totally disjoint paths exist between any source and destination pair, which increase the degree of fault-tolerance. A deterministic permutation routing algorithm is also developed for the 8×8 composite banyan network, Using a simple tabular method, it is shown that the algorithm always finds a set of conflict-free tags 相似文献

12.

Permuting data with the Omega network

Jacques Lenfant Serge Tahé 《Acta Informatica》1985,21(6):629-641

Summary Introduced by Lawrie, the Omega network is a powerful device to connect processing elements in a SIMD computer or in a multiprocessor architecture. Unfortunately it is not rearrangeable and some permutations that are frequently used to align data in a SIMD computer cannot be performed in one pass. Such is the case with the class of permutations induced by a permutation of index digits (PIPID) which includes the perfect shuffle, the bit reversal, etc. ... Using the techniques of linear algebra over the two-element field, we show that PIPIDs can be achieved by the Omega network through which the vector of data is routed twice. 相似文献

13.

A comparison of pivot selection techniques for permutation-based indexing

《Information Systems》2015

Recently, permutation based indexes have attracted interest in the area of similarity search. The basic idea of permutation based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query, following the assumption that similar objects are represented by similar permutations of the pivots. In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection techniques do not reveal better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper we compare five pivot selection techniques on three permutation-based similarity access methods. Among those, we propose a novel technique specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested techniques. Second, there is no technique that is universally the best for all permutation-based access methods; rather different techniques are optimal for different methods. This indicates that the pivot selection technique should be considered as an integrating and relevant part of any permutation-based access method. 相似文献

14.

A memory interference model for regularly patterned multiple streamvector accesses

Qing Yang Tao Yang 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(5):520-530

Most existing analytical models for memory interference generally assume random bank selection for each memory access. In vector computers, however, memory accesses are typically regularly patterned with a number of data items being accessed concurrently from different banks. Very little is known about the queueing behavior of memory interferences in multiple stream vector accesses. This paper presents an analytical model for memory interferences due to vector accesses in multiple vector processor systems. The model captures the effects of both bank conflicts among elements within one vector access stream and conflicts among multiple vector access streams on system performance. The model is based on a closed queueing network assuming an ideal interconnection network. An approximation technique is proposed to solve the memory queueing system that serves customers in a complicated way (non-FIFO). We also carry out extensive simulation experiments to study memory interference and validate our analytical model. Simulation results and analytical results are in a very good agreement, indicating that the model is very accurate. We further validate our analysis by comparing the numerical results obtained from our analytical model with those measurement results that were published by other researchers. Based on our analytical model and simulations, we carry out performance evaluation of the multiple vector processor systems. Our numerical results show that memory access conflicts pose a severe limitation on the number of useful processors in the system, implying that memory system design is essential to high-performance computing 相似文献

15.

Compile-time techniques for improving scalar access performance inparallel memories

Gupta A. Soffa M.L. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(2):138-148

Compile-time techniques for storage allocation of scalar values into memory modules that limit run-time memory-access conflicts are presented. The allocation approach is applicable to those operands in instructions that can be predicted at compile-time, where an instruction is composed of the multiple operations and corresponding operands that execute in parallel. Algorithms to schedule data transfers among memory modules to avoid conflicts that cannot be eliminated by the distribution of values alone are developed. The techniques have been implemented as part of a compiler for a reconfigurable long instruction word architecture. Results of experiments are presented demonstrating that a very high percentage of memory access conflicts can be avoided by scheduling a very low number of data transfers 相似文献

16.

Routing Permutations on Graphs via Factors

Dominique Barth Petri?or Panaite 《Journal of Parallel and Distributed Computing》1998,54(2):162

We deal with the permutation routing problem on graphs modeling interconnection networks. In our model, calledrouting via factors, at each routing step, the communication pattern is a directed 1-factor in a symmetric digraph. This adds a new feature, that of continuous packet movement, to preciously studied routing types, where the routing of a permutation is reduced to a sequence of permutations from a given class. We especially focus on bipartite graphs and we give sufficient conditions for a graph to be rearrangeable in our model. We propose a general technic for routing via factors that we apply to the 2D mesh and the hypercube. 相似文献

17.

基于神威·太湖之光的非结构网格计算加速算法

许乐安虹陈俊仕张鹏飞武铮《计算机工程》2022,48(12):45-53

在国产异构众核平台神威·太湖之光上的非结构网格计算具有稀疏存储、离散访存、数据依赖等特点,严重制约了众核处理器的性能发挥。为解决稀疏存储和离散访存问题,提出一种N阶对角染色算法,以有效平衡主从核计算并利用从核将全局访存转化为LDM访问。针对数据依赖造成的计算竞争问题,采用自适应和无依赖的任务划分方法,避免并行计算时的数据冲突。为对处理器架构和非结构网格计算进行优化,采用主核与从核异步并行的方式,差异化使用主从核以充分利用硬件资源,同时,取消处理器提供的寄存器通信机制,降低从核阵列的同步开销同时便于扩展到新一代神威平台。此外,使用计算访存异步重叠技术来充分隐藏访存延迟。利用SpMV、Integration、calcLudsFcc算子进行实验,结果表明,相比主核实现,组合加速算法在不同算例规模下平均取得了10倍的加速效果,加速比最高可达24倍,N阶对角染色算法相比非染色分块算法取得了超过5.8倍的性能加速,有效提升了数据局部性和计算并行度。该算法对有依赖关系的计算冲突算子同样具有良好的加速性能,验证了自适应和无依赖任务划分方法的有效性。相似文献

18.

Stability analysis of discrete-time recurrent neural networks based on standard neural network models

Meiqin Liu 《Neural computing & applications》2009,18(8):861-874

In order to conveniently analyze the stability of various discrete-time recurrent neural networks (RNNs), including bidirectional associative memory, Hopfield, cellular neural network, Cohen-Grossberg neural network, and recurrent multiplayer perceptrons, etc., the novel neural network model, named standard neural network model (SNNM) is advanced to describe this class of discrete-time RNNs. The SNNM is the interconnection of a linear dynamic system and a bounded static nonlinear operator. By combining Lyapunov functional with S-Procedure, some useful criteria of global asymptotic stability for the discrete-time SNNMs are derived, whose conditions are formulated as linear matrix inequalities. Most delayed (or non-delayed) RNNs can be transformed into the SNNMs to be stability analyzed in a unified way. Some application examples of the SNNMs to the stability analysis of the discrete-time RNNs shows that the SNNMs make the stability conditions of the RNNs easily verified. 相似文献

19.

Odd Memory Systems: A New Approach

Seznec A. Lenfant J. 《Journal of Parallel and Distributed Computing》1995,26(2)

To reject the use of a prime (or odd) number N of memory banks in a vector processor, it is generally advanced that address computation for such a memory system would require systematic Euclidean division by the number N. We first show that the Chinese Remainder Theorem allows one to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean division. Massively parallel SIMD computers may have thousands of processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot be generally computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors from memory to the processors through the interconnection network has very specific properties. The Chinese Remainder Network presented in the paper is able to execute all these permutations in a single path and may be easily controlled. 相似文献

20.

Randomized self-routing algorithms for Clos networks

Abdou Youssef 《Computers & Electrical Engineering》1993,19(6):419-429

As the VLSI technology makes large crossbar switches affordable, Clos networks have become a feasible option of large interconnection networks. However, to make these networks practical and useful, efficient routing algorithms need to be developed. This paper will develop and study several randomized routing algorithms for Clos networks. The algorithms are based on the idea that if the first column of Clos is set to some configuration somehow, then the resulting network becomes self-routed using the destination addresses. Each of the randomized algorithms sets the first column to a configuration selected by a random process. The algorithms are then self-routed and take no computation time to set the switches. Probabilistic analysis and simulation measurements of the communication delay of permutation routing are conducted. It is shown that the communication delay of any permutation is 3–6 cycles in networks of up to 1024 processors. Although other routing algorithms route arbitrary permutations in one cycle over Clos/Benes networks and 2 cycles over δ networks, these algorithms take prohibitively large times to compute the appropriate switch settings, while our randomized algorithms are self-routed and spend NO time on computing the switch settings. This makes our algorithms superior to any universal nonrandomized routing algorithm for Clos/Benes networks or δ networks. The speed, universality and ease of implementation of our randomized algorithms make Clos networks highly attractive for large parallel computer systems. 相似文献