20 similar documents found; search time: 21 ms.
1.
Henry G. Dietz, Abderrazek Zaafrani, Matthew T. O'Keefe. The Journal of Supercomputing, 1992, 5(4): 263-289
In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility. In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant). This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.
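The barrier-placement idea described in this abstract lends itself to a simple illustration. The sketch below is a minimal, hypothetical rendering of the strategy (not the paper's actual algorithm): the scheduler accumulates the worst-case timing spread that variable-latency instructions can introduce between processors, and emits a barrier whenever that spread would exceed a chosen bound. All names and the threshold are assumptions made for the example.

```python
# Minimal sketch of barrier placement driven by timing imprecision.
# Instruction list, latency bounds, and the skew threshold are illustrative.

def schedule_with_barriers(instructions, max_skew):
    """instructions: list of (name, min_cycles, max_cycles).
    Returns a schedule in which 'BARRIER' entries reset the timing spread."""
    schedule = []
    skew = 0  # accumulated worst-case timing imprecision across processors
    for name, lo, hi in instructions:
        # Each variable-latency instruction can widen the spread by (hi - lo).
        if skew + (hi - lo) > max_skew:
            schedule.append("BARRIER")   # hardware barrier resynchronizes
            skew = 0                     # imprecision drops back to zero
        schedule.append(name)
        skew += hi - lo
    return schedule

if __name__ == "__main__":
    instrs = [("add", 1, 1), ("load", 2, 8), ("mul", 3, 5), ("load", 2, 8)]
    print(schedule_with_barriers(instrs, max_skew=8))
    # ['add', 'load', 'mul', 'BARRIER', 'load']
```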
2.
A. I. Galushkin, L. V. Grachev, M. M. Tolstykh, V. A. Tochenov. Cybernetics and Systems Analysis, 1990, 26(2): 184-193
Some issues of reconfigurable computing systems with MIMD architecture are considered. A number of reconfiguration algorithms are analyzed by simulation with respect to the attainable fault tolerance. Translated from Kibernetika, No. 2, pp. 35–41, March–April, 1990.
3.
P. Cremonesi, C. Gennaro. IEEE Transactions on Parallel and Distributed Systems, 2002, 13(12): 1320-1332
This paper introduces queuing network models for the performance analysis of SPMD applications executed on general-purpose parallel architectures such as MIMD machines and clusters of workstations. The models are based on the pattern of computation, communication, and I/O operations of typical parallel applications. Analysis of the models leads to the definition of speedup surfaces, which capture the relative influence of processor and I/O parallelism and show the effects of different hardware and software components on performance. Since the parameters of the models correspond to measurable program and hardware characteristics, the models can be used to anticipate the performance behavior of a parallel application as a function of the target architecture (i.e., number of processors, number of disks, I/O topology, etc.).
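As a rough illustration of the kind of speedup surface such models produce, the fragment below evaluates a hedged, Amdahl-style approximation in which computation scales with the number of processors and I/O with the number of disks. The model form and all parameter values are assumptions for the sketch, not the paper's queuing network model.

```python
# Hedged, Amdahl-style approximation of a speedup surface S(p, d):
# a fixed serial part, computation divided across p processors, and
# I/O divided across d disks. Parameter values are illustrative only.

def speedup(p, d, t_serial=1.0, t_comp=40.0, t_io=10.0):
    t1 = t_serial + t_comp + t_io            # one processor, one disk
    tpd = t_serial + t_comp / p + t_io / d   # p processors, d disks
    return t1 / tpd

if __name__ == "__main__":
    for p in (1, 4, 16):
        print(p, [round(speedup(p, d), 2) for d in (1, 2, 4)])
```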
4.
Teo Milanez, Sylvain Collange, Fernando Magno Quintão Pereira, Wagner Meira Jr., Renato Ferreira. Parallel Computing, 2014
Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is a recently proposed architecture that shares instruction decoding and execution between threads running the same program in an SMT processor, thereby generalizing the approach followed by Graphics Processing Units to general-purpose processors. In this paper we propose new ways to expose redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Our heuristic only inspects the program counter and the stack frame to reconverge threads; hence, it is amenable to efficient and inexpensive hardware implementation. Second, we demonstrate that this heuristic is able to reveal the existence of substantial regularity in inter-thread memory access patterns. We validate our results on data-parallel applications from the PARSEC and SPLASH suites. Our new reconvergence heuristic increases the throughput of our MMT model by 7% compared to a previous, substantially more complex approach due to Long et al. Moreover, it gives us an effective way to increase regularity in memory accesses. We have observed that over 70% of simultaneous memory accesses are either the same for all the threads or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.
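A program-counter/stack-based reconvergence rule can be illustrated very compactly. The sketch below is a hypothetical simplification in the spirit of the heuristic described above, not the authors' hardware design: it prefers the thread that is deepest in the call stack (assuming a downward-growing stack, so a smaller stack pointer means a deeper call) and breaks ties with the smallest program counter, so diverged threads tend to catch up and reconverge.

```python
# Hypothetical sketch of a PC/stack-frame reconvergence priority.
# Field names and the selection policy are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    pc: int   # program counter
    sp: int   # stack pointer (smaller = deeper call nesting, stack grows down)

def pick_next(threads):
    """Select the thread to fetch for next, so diverged threads reconverge."""
    return min(threads, key=lambda t: (t.sp, t.pc))

if __name__ == "__main__":
    ts = [Thread(0, pc=0x420, sp=0x7F00), Thread(1, pc=0x400, sp=0x7E80)]
    print(pick_next(ts).tid)   # thread 1: deeper in a call, advanced first
```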
5.
Hans Henrik Niemann, Per Søgaard-Andersen, Jakob Stoustrup. International Journal of Control, 2013, 86(5): 1177-1203
A general and concise formulation is given of the loop transfer recovery (LTR) design problem based on recovery errors. Three types of recovery errors are treated: open-loop recovery, sensitivity recovery, and input-output recovery errors. The three corresponding versions of the asymptotic recovery problem turn out to be equivalent, since minimizing any of the recovery errors amounts to minimizing a certain matrix, the recovery matrix. Using the recovery error definitions, simple necessary and sufficient conditions on the controllers are derived for both the exact and the asymptotic recovery cases. This general recovery formulation covers all known observer-based compensator types as special cases. The conditions given in this setting are effectively the aim of all known LTR design methods. The recovery formulation is also interpreted in terms of a model-matching problem, which is examined by means of the Q-parametrization. It is shown how the general controller obtained by the Q-parametrization can be written as a Luenberger observer based controller. In all cases, n controller states suffice to achieve recovery. The compensators are characterized for errors both at the input node and at the output node (dual case).
6.
Extensive research has been done on extracting parallelism from single instruction stream processors. This paper presents our investigation into ways to modify MIMD architectures to allow them to extract the instruction-level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from performing ILP operations. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.
7.
A new optimization framework to maximize the performance and efficiency of morphable many-core accelerators is proposed. The devised methodology supports the co-existence of multiple optimization goals and constraints (e.g., computational performance, power, energy consumption, and runtime reconfiguration overhead) by relying on a design space exploration approach based on a convenient adaptation of a Multi-Objective Evolutionary Algorithm. Accordingly, the proposed algorithm allows the generation of a comprehensive set of execution plans, specifically targeting an efficient runtime adaptation of the processing elements instantiated in morphable slots of the processing structure. The conducted experimental evaluation shows significant gains in the attained performance and energy efficiency for both highly parallel and data-dependent applications, achieving reductions in peak power dissipation and energy consumption as high as 54% and 45%, respectively.
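Design-space exploration of this kind rests on Pareto dominance over the competing objectives (performance, power, energy, reconfiguration overhead). The fragment below is a generic sketch of that test, not the paper's adapted MOEA: it keeps only the non-dominated execution plans from a candidate set, assuming every objective is expressed as a cost to be minimized. Plan names and objective values are made up for the example.

```python
# Generic Pareto-filtering sketch over minimization objectives.
# Objective tuples stand in for metrics such as runtime, power,
# energy, and reconfiguration overhead (illustrative values only).

def dominates(a, b):
    """True if plan a is no worse than b in every objective and
    strictly better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(plans):
    """plans: dict name -> tuple of objective costs. Returns the non-dominated set."""
    return {n: v for n, v in plans.items()
            if not any(dominates(w, v) for m, w in plans.items() if m != n)}

if __name__ == "__main__":
    plans = {"planA": (10.0, 5.0), "planB": (12.0, 4.0), "planC": (13.0, 6.0)}
    print(sorted(pareto_front(plans)))   # planC is dominated by planA
```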
8.
The authors present a compile-time scheduling heuristic called dynamic level scheduling, which accounts for interprocessor communication overhead when mapping precedence-constrained, communicating tasks onto heterogeneous processor architectures with limited or possibly irregular interconnection structures. This technique uses dynamically changing priorities to match tasks with processors at each step, and schedules over both spatial and temporal dimensions to eliminate shared resource contention. This method is fast, flexible, widely targetable, and displays promising performance.
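The core of a dynamic-level heuristic can be illustrated compactly. The sketch below is a simplified, hypothetical version of the idea rather than the authors' published algorithm: at each step it pairs a ready task with a processor by maximizing a "dynamic level", here taken as the task's static level minus the earliest time the task could start on that processor once data arriving over the interconnect are accounted for. All data structures and values are assumptions for the example.

```python
# Simplified sketch of dynamic-level list scheduling (illustrative only).
# static_level[t]: length of the longest path from task t to an exit task.
# comm[t][p]: time at which t's input data can be available on processor p.
# proc_free[p]: time at which processor p finishes its current work.

def pick_pair(ready, static_level, comm, proc_free):
    """Return the (task, processor) pair with the highest dynamic level."""
    best = None
    for t in ready:
        for p, free in proc_free.items():
            start = max(free, comm[t][p])   # earliest start of t on p
            dl = static_level[t] - start    # dynamic level of the pair (t, p)
            if best is None or dl > best[0]:
                best = (dl, t, p)
    return best[1], best[2]

if __name__ == "__main__":
    ready = ["t1", "t2"]
    static_level = {"t1": 9, "t2": 6}
    comm = {"t1": {"p0": 0, "p1": 3}, "t2": {"p0": 2, "p1": 0}}
    proc_free = {"p0": 1, "p1": 0}
    print(pick_pair(ready, static_level, comm, proc_free))   # ('t1', 'p0')
```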
9.
Functional parallelism may be supported on SIMD machines by interpretation. The programs and data of each function are loaded on the processing elements (PEs), and the control unit of the machine executes a central control algorithm that causes the concurrent interpretation of these functions. The performance of this paradigm has been shown to benefit considerably from a variable instruction issue schedule that delays execution of expensive and rarely occurring operations. Two new features of the interpretation paradigm, namely pipelined SIMD machines and compositional instruction sets, change the nature of the mathematical model used for variable instruction scheduling significantly. In the paper, a previously developed mathematical model of the interpretation process is extended to allow for compositional instructions and pipelining. We develop and present algorithms that produce variable instruction schedules for the extended model and investigate whether the variable instruction issue is useful for these cases. We show that the variable instruction issue improves the performance of pipelined machines but is not very effective for compositional instruction sets, especially when the composition matrix is not sparse.
10.
Strictly periodic scheduling in IMA-based architectures
Ahmad Al-Sheikh, Olivier Brun, Pierre-Emmanuel Hladik, Balakrishna J. Prabhu. Real-Time Systems, 2012, 48(4): 359-386
The avionics industry has recently adopted Integrated Modular Avionics (IMA). Such architectures allow the execution of avionic functions on a shared computing platform while avoiding any interference between them; this is achieved through hard memory and temporal segregation constraints. Although IMA reduces weight and power consumption and shortens design-cycle times, it gives rise to a complex multiprocessor scheduling problem. One of the key difficulties of this problem is the strict periodicity of tasks, meaning that the time separating two successive executions of the same task is strictly equal to the associated period. In order to help the system designer produce a proper schedule, an exact formulation based on Integer Linear Programming and a heuristic inspired by Game Theory are proposed. To enhance the solution quality of the heuristic, a multi-start method, which gives some probabilistic guarantees on the optimality of the solutions, is also introduced.
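A useful building block for strictly periodic problems of this kind is the classical pairwise non-overlap test: two strictly periodic tasks with execution times C_i, C_j, periods T_i, T_j and start offsets s_i, s_j sharing a processor never overlap if and only if C_i <= ((s_j - s_i) mod g) <= g - C_j, where g = gcd(T_i, T_j). The sketch below checks a candidate offset assignment with that condition; it illustrates the constraint structure only and is neither the paper's ILP formulation nor its game-theoretic heuristic.

```python
# Pairwise feasibility check for strictly periodic tasks on one processor,
# using the classical gcd-based non-overlap condition. Task data are
# illustrative; this is not the paper's solver.

from math import gcd
from itertools import combinations

def pair_ok(ti, tj):
    (si, ci, Ti), (sj, cj, Tj) = ti, tj
    g = gcd(Ti, Tj)
    return ci <= (sj - si) % g <= g - cj

def schedule_ok(tasks):
    """True if no two strictly periodic tasks ever overlap."""
    return all(pair_ok(a, b) for a, b in combinations(tasks, 2))

if __name__ == "__main__":
    # Each task: (start offset, execution time, period)
    tasks = [(0, 2, 8), (2, 1, 4)]
    print(schedule_ok(tasks))   # True: their executions never collide
```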
11.
Meikang Qiu, Minyi Guo, Meiqin Liu, Chun Jason Xue, Laurence T. Yang, Edwin H.-M. Sha. Journal of Parallel and Distributed Computing, 2009
Many high-performance DSP processors employ multi-bank on-chip memory to improve performance and energy efficiency. This architectural feature supports higher memory bandwidth by allowing multiple data memory accesses to be executed in parallel. However, making effective use of multi-bank memory remains difficult when the combined effect of performance and energy requirements is considered. This paper studies the scheduling and assignment problem of minimizing the total energy consumption while satisfying a timing constraint with heterogeneous multi-bank memory for applications with loops. An algorithm, TASL (Type Assignment and Scheduling for Loops), is proposed. The algorithm uses bank type assignment together with variable partitioning to find the best configuration for both memory and ALU. The experimental results show that the average improvement in energy saving achieved by TASL is significant.
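The variable-partitioning aspect can be illustrated with a small, generic sketch (not the TASL algorithm itself): variables that a loop accesses in the same schedule step are placed in different banks so the accesses can proceed in parallel, here via a greedy colouring of an interference graph. The bank count, variable names, and access pattern are all assumptions for the example.

```python
# Generic sketch of variable-to-bank partitioning (illustrative, not TASL).
# Two variables interfere if some schedule step accesses both; interfering
# variables are placed in different memory banks by greedy colouring.

def assign_banks(parallel_accesses, num_banks):
    """parallel_accesses: list of sets of variables touched in the same cycle."""
    interferes = {}
    for group in parallel_accesses:
        for v in group:
            interferes.setdefault(v, set()).update(group - {v})
    bank = {}
    for v in sorted(interferes):                  # deterministic order
        used = {bank[u] for u in interferes[v] if u in bank}
        free = [b for b in range(num_banks) if b not in used]
        bank[v] = free[0] if free else 0          # spill: reuse bank 0
    return bank

if __name__ == "__main__":
    steps = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
    print(assign_banks(steps, num_banks=2))   # e.g. {'A': 0, 'B': 1, 'C': 0}
```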
12.
Qunyan Sun, Qingfeng Zhuge, Jingtong Hu, Juan Yi, Edwin H.-M. Sha. Computers & Electrical Engineering, 2014
Heterogeneous clusters of computers usually provide high computing power for large-scale applications, but at high cost. Researchers currently face two challenges. One is how to map large applications, modeled by Directed Acyclic Graphs (DAGs), to heterogeneous architectures with minimal cost. The other is how to schedule tasks on each cluster to further decrease the total cost.
13.
Suleyman Tosun. The Journal of Supercomputing, 2012, 62(1): 265-289
Scheduling periodic tasks onto a multiprocessor architecture under several constraints such as performance, cost, energy, and reliability is a major challenge in embedded systems. In this paper, we present an Integer Linear Programming (ILP) based framework that maps a given task set onto a Heterogeneous Multiprocessor System-on-Chip (HMPSoC) architecture. Our framework can be used with several objective functions: minimizing energy consumption, minimizing cost (i.e., the number of heterogeneous processors), and maximizing reliability of the system under performance constraints. We use Dynamic Voltage Scaling (DVS) to reduce energy consumption, while we employ task duplication to maximize reliability. We illustrate the effectiveness of our approach through several experiments, each with a different number of tasks to be scheduled. We also propose two heuristics based on the Earliest Deadline First (EDF) algorithm for minimizing energy under performance and cost constraints. Our experiments on generated task sets show that the ILP-based method reduces energy consumption by up to 62% compared to a method that does not apply DVS. The heuristic methods obtain promising results when compared to the optimal results generated by our ILP-based method.
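To make the flavour of EDF-plus-DVS heuristics concrete, the fragment below is a hypothetical, much-simplified sketch: tasks are dispatched in earliest-deadline-first order on a single processor, and each task is slowed to the lowest available voltage/frequency level that still meets its deadline, since dynamic energy falls sharply at lower supply voltages. The level table, energy factors, and task data are assumptions; this is not the paper's ILP formulation or its heuristics.

```python
# Hypothetical EDF + DVS sketch for one processor (illustrative only).
# Each task: (name, cycles, deadline). LEVELS: (frequency, relative energy
# per cycle); lower frequency -> lower energy but longer execution.

LEVELS = [(1.0, 1.00), (0.75, 0.60), (0.5, 0.30)]   # assumed V/f settings

def edf_dvs(tasks):
    time, plan, energy = 0.0, [], 0.0
    for name, cycles, deadline in sorted(tasks, key=lambda t: t[2]):  # EDF order
        # pick the slowest (most frugal) level that still meets the deadline
        for freq, e_per_cycle in sorted(LEVELS):                      # slow first
            if time + cycles / freq <= deadline:
                break
        else:
            freq, e_per_cycle = LEVELS[0]      # no level fits: run at full speed
        time += cycles / freq
        energy += cycles * e_per_cycle
        plan.append((name, freq, time))
    return plan, energy

if __name__ == "__main__":
    tasks = [("t1", 10, 30), ("t2", 5, 12), ("t3", 8, 60)]
    print(edf_dvs(tasks))
```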
14.
The author studies dynamic scheduling of computational tasks with communication costs on a nonuniform memory access architecture. The computing model assumes that data transfer can be partitioned into parts that are parallel and sequential with respect to the task execution. A scheduling heuristic, called least-communication (LC), together with a two-level scheduler, is proposed in an attempt to minimize the finish time. LC selects the task that removes the largest amount of remaining data transfer; if no such task is available, the task that became ready to run earliest is selected first. The time complexity of LC is O(n²). Testing the finish time of LC against first-come first-served scheduling (FCFS) shows that LC is useful for tasks having moderate granularity and whose computation and communication requirements vary widely for different data sets.
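The LC selection rule described above can be sketched directly. The code below is an illustrative reconstruction under assumed data structures (the field names and task representation are not from the paper): among the ready tasks it picks the one whose execution removes the most remaining data transfer, and falls back to the earliest-ready task when no ready task removes any.

```python
# Illustrative sketch of the least-communication (LC) selection rule.
# Each ready task carries: remaining_transfer (data its execution would
# remove from the outstanding communication) and ready_time (when it
# became ready). Field names are assumptions for the sketch.

def lc_select(ready_tasks):
    """Pick the ready task removing the most remaining data transfer;
    if none removes any, fall back to first-come first-served."""
    candidates = [t for t in ready_tasks if t["remaining_transfer"] > 0]
    if candidates:
        return max(candidates, key=lambda t: t["remaining_transfer"])
    return min(ready_tasks, key=lambda t: t["ready_time"])

if __name__ == "__main__":
    ready = [
        {"name": "a", "remaining_transfer": 0,   "ready_time": 1},
        {"name": "b", "remaining_transfer": 120, "ready_time": 4},
        {"name": "c", "remaining_transfer": 80,  "ready_time": 2},
    ]
    print(lc_select(ready)["name"])   # "b": removes the most data transfer
```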
15.
The methods of cyclic reduction, Fourier analysis and the resulting hybrid algorithm FACR(l) are formulated for a shared memory multiprocessor and applied to the two-dimensional Poisson equation. The complexity of the algorithms is analyzed, performance curves are given and optimal values of the parameter l are determined for various problem sizes and degrees of parallelism. The algorithms are implemented on a shared memory multiprocessor and performance results are discussed.
16.
The performance of modern microprocessors depends considerably on the efficient utilization of their execution units, and performance in modern applications is considerably affected by instruction stalls. Until recently, the problem of instruction stalls was mainly studied for superscalar microprocessors. A software instruction prefetching method for VLIW/EPIC architectures is described that makes it possible to improve performance for a certain class of problems.
17.
An environment that lets system applications be expressed as virtual machines, through which architecture-independent multiple-instruction, multiple-data stream (MIMD) programs are written, is described. The virtual machine hides the hardware configuration from the programmer so that the MIMD programming environment always appears the same, regardless of the actual hardware. The data-definition and procedural high-level languages used in the environment and the generation of object code in the environment are discussed. The runtime configuration of the system and an implemented prototype of the system are described.
18.
The question of whether prefetching blocks of a file into the block cache can effectively reduce the overall execution time of a parallel computation, even under favorable assumptions, is considered. Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that (1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, (2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O (input/output) operation, and (3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study). The authors explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in this environment.
19.
Parallel two-stage reduction to Hessenberg form using dynamic scheduling on shared-memory architectures
We consider parallel reduction of a real matrix to Hessenberg form using orthogonal transformations. Standard Hessenberg reduction algorithms reduce the columns of the matrix from left to right in either a blocked or unblocked fashion. However, the standard blocked variant performs 20% of the computations in terms of matrix-vector multiplications. We show that a two-stage approach consisting of an intermediate reduction to block Hessenberg form speeds up the reduction by avoiding matrix-vector multiplications. We describe and evaluate a new high-performance implementation of the two-stage approach that attains significant speedups over the one-stage approach. The key components are a dynamically scheduled implementation of Stage 1 and a blocked, adaptively load-balanced implementation of Stage 2.
20.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1983, (4): 436-445
This paper presents a recognition procedure for parallel tasks in user programs written in a conventional programming language. To establish the program model, it describes the parallelism of the program in terms of a process flow graph in which the relationships among processes are those of predecessors and successors. Finally, it presents a parallel processing scheme which automatically recognizes parallel tasks and schedules these tasks for parallel execution.
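A process flow graph of predecessors and successors makes parallel-task recognition a matter of grouping processes by dependence level: all processes whose predecessors have completed may run concurrently. The sketch below is a generic illustration of that idea (a topological layering of the graph), not the paper's recognition procedure; the process names and dependences are made up for the example.

```python
# Generic sketch: group processes of a predecessor/successor graph into
# levels; processes in the same level have no dependences between them
# and can be dispatched for parallel execution. Illustrative only.

def parallel_levels(preds):
    """preds: dict mapping each process to the set of its predecessors."""
    indeg = {p: len(ps) for p, ps in preds.items()}
    succs = {p: [] for p in preds}
    for p, ps in preds.items():
        for q in ps:
            succs[q].append(p)
    level = [p for p, d in indeg.items() if d == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        nxt = []
        for p in level:
            for s in succs[p]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    nxt.append(s)
        level = nxt
    return levels

if __name__ == "__main__":
    preds = {"P1": set(), "P2": set(), "P3": {"P1"},
             "P4": {"P1", "P2"}, "P5": {"P3", "P4"}}
    print(parallel_levels(preds))   # [['P1', 'P2'], ['P3', 'P4'], ['P5']]
```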