首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到10条相似文献,搜索用时 125 毫秒
1.
The Data Locality of Work Stealing   总被引:1,自引:0,他引:1  
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental validation. {As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations G n whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ω(n) . On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large, important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on P processors a multiprocessor execution incurs an expected
more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T fty is the number of nodes on the longest chain of dependencies. Based on this we give strong execution time bounds for nested-parallel computations using work stealing.} For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.  相似文献   

2.
Software transactional memory   总被引:1,自引:0,他引:1  
Summary.  As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a LoadLinked/StoreConditional operation on a single word. Building on the hardware based transactional synchronization methodology of Herlihy and Moss, we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is non-blocking, and can be implemented on existing machines using only a LoadLinked/StoreConditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to non-blocking ones based on implementing a k-word compare&swap STM-transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms the non-blocking translation methods in the style of Barnes, and outperforms Herlihy’s translation method for sufficiently large numbers of processors. The key to the efficiency of our software-transactional approach is that unlike Barnes style methods, it is not based on a costly “recursive helping” policy. Received: January 1996 / Revised: June 1996 / Accepted: August 1996  相似文献   

3.
Shared counters are among the most basic coordination structures in distributed computing. Known implementations of shared counters are either blocking, non-linearizable, or have a sequential bottleneck. We present the first counter algorithm that is both linearizable, non-blocking, and can provably achieve high throughput in k-synchronous executions—executions in which process speeds vary by at most a constant factor k. The algorithm is based on a novel variation of the software combining paradigm that we call bounded-wait combining (BWC). It can thus be used to obtain implementations, possessing the same properties, of any object that supports combinable operations, such as a stack or a queue. Unlike previous combining algorithms where processes may have to wait for each other indefinitely, in the BWC algorithm, a process only waits for other processes for a bounded period of time and then “takes destiny in its own hands”. In order to reason rigorously about the parallelism attainable by our algorithm, we define a novel metric for measuring the throughput of shared objects, which we believe is interesting in its own right. We use this metric to prove that our algorithm achieves throughput of Ω(N/ log N) in k-synchronous executions, where N is the number of processes that can participate in the algorithm. Our algorithm uses two tools that we believe may prove useful for obtaining highly parallel non-blocking implementation of additional objects. The first are “synchronous locks”, locks that are respected by processes only in k-synchronous executions and are disregarded otherwise; the second are “pseduo-transactions”—a weakening of regular transactions that allows higher parallelism. A preliminary version of this paper appeared in [11]. D. Hendler is supported in part by a grant from the Israel Science Foundation. S. Kutten is supported in part by a grant from the Israel Science Foundation.  相似文献   

4.
The importance of symbolic data structures such as Ordered Binary Decision Diagrams (OBDD) is rapidly growing in many areas of Computer Science where the large dimensions of the input models is a challenging feature: OBDD based graph representations allowed to define truly new standards in the achievable dimensions for the Model Checking verification technique. However, OBDD representations pose strict constraints in the algorithm design issue. For example, Depth-First Search (DFS) is not feasible in a symbolic framework and, consequently, many state-of-the-art DFS based algorithms (e.g., connectivity procedures) cannot be easily rearranged to work on symbolically represented graphs. We devise here a symbolic algorithmic strategy, based on the new notion of spine-set, that is general enough to be the engine of linear symbolic step algorithms for both strongly connected components and biconnected components. Our procedures improve on previously designed connectivity symbolic algorithms. Moreover, by an application to the so-called “bad cycle detection problem”, our technique can be used to efficiently solve the emptiness problem for various kinds of ω-automata. This work is a revised and extended version of [22,23]. It is partially supported by the projects PRIN 2005015491 and BIOCHECK.  相似文献   

5.
This paper describes a formalisation of the lambda-calculus in a HOL-based theorem prover using nominal techniques. Central to the formalisation is an inductive set that is bijective with the alpha-equated lambda-terms. Unlike de-Bruijn indices, however, this inductive set includes names and reasoning about it is very similar to informal reasoning with “pencil and paper”. To show this we provide a structural induction principle that requires to prove the lambda-case for fresh binders only. Furthermore, we adapt work by Pitts providing a recursion combinator for the inductive set. The main technical novelty of this work is that it is compatible with the axiom of choice (unlike earlier nominal logic work by Pitts et al); thus we were able to implement all results in Isabelle/HOL and use them to formalise the standard proofs for Church-Rosser, strong-normalisation of beta-reduction, the correctness of the type-inference algorithm W, typical proofs from SOS and much more. This paper is a revised and much extended version of Urban and Berghofer [32], and Urban and Tasson [36].  相似文献   

6.
We present an adaptive algorithm for N-process mutual exclusion under read/write atomicity in which all busy waiting is by local spinning. In our algorithm, each process p performs O(k) remote memory references to enter and exit its critical section, where k is the maximum “point contention” experienced by p. The space complexity of our algorithm is Θ(N), which is clearly optimal. Our algorithm is the first mutual exclusion algorithm under read/write atomicity that is adaptive when time complexity is measured by counting remote memory references.A preliminary version of this paper was presented at the 14th International Symposium on Distributed Computing [6].  相似文献   

7.
The computer industry is currently examining the use of strong synchronization operations such as double compare-and-swap (DCAS) as a means of supporting non-blocking synchronization on future multiprocessor machines. However, before such a strong primitive will be incorporated into hardware design, its utility needs to be proven by developing a body of effective non-blocking data structures using DCAS. As part of this effort, we present two new linearizable non-blocking implementations of concurrent deques using the DCAS operation. The first uses an array representation, and improves on previous algorithms by allowing uninterrupted concurrent access to both ends of the deque while correctly handling the difficult boundary cases when the deque is empty or full. The second uses a linked-list representation, and is the first non-blocking, dynamically-sized deque implementation. It too allows uninterrupted concurrent access to both ends of the deque. We have proved these algorithms correct with the aid of a mechanical theorem prover; we describe these proofs in the paper.  相似文献   

8.
A new fourth order box-scheme for the Poisson problem in a square with Dirichlet boundary conditions is introduced, extending the approach in Croisille (Computing 78:329–353, 2006). The design is based on a “hermitian box” approach, combining the approximation of the gradient by the fourth order hermitian derivative, with a conservative discrete formulation on boxes of length 2h. The goal is twofold: first to show that fourth order accuracy is obtained both for the unknown and the gradient; second, to describe a fast direct algorithm, based on the Sherman-Morrison formula and the Fast Sine Transform. Several numerical results in a square are given, indicating an asymptotic O(N 2log 2(N)) computing complexity.  相似文献   

9.
The approach of learning multiple “related” tasks simultaneously has proven quite successful in practice; however, theoretical justification for this success has remained elusive. The starting point for previous work on multiple task learning has been that the tasks to be learned jointly are somehow “algorithmically related”, in the sense that the results of applying a specific learning algorithm to these tasks are assumed to be similar. We offer an alternative approach, defining relatedness of tasks on the basis of similarity between the example generating distributions that underlie these tasks. We provide a formal framework for this notion of task relatedness, which captures a sub-domain of the wide scope of issues in which one may apply a multiple task learning approach. Our notion of task similarity is relevant to a variety of real life multitask learning scenarios and allows the formal derivation of generalization bounds that are strictly stronger than the previously known bounds for both the learning-to-learn and the multitask learning scenarios. We give precise conditions under which our bounds guarantee generalization on the basis of smaller sample sizes than the standard single-task approach. Editors: Daniel Silver, Kristin Bennett, Richard Caruana. A preliminary version of this paper appears in the proceedings of COLT’03, (Ben-David and Schuller 2003).  相似文献   

10.
Temporal logics are commonly used for reasoning about concurrent systems. Model checkers and other finite-state verification techniques allow for automated checking of system model compliance to given temporal properties. These properties are typically specified as linear-time formulae in temporal logics. Unfortunately, the level of inherent sophistication required by these formalisms too often represents an impediment to move these techniques from “research theory” to “industry practice”. The objective of this work is to facilitate the nontrivial and error prone task of specifying, correctly and without expertise in temporal logic, temporal properties. In order to understand the basis of a simple but expressive formalism for specifying temporal properties we critically analyze commonly used in practice visual notations. Then we present a scenario-based visual language called Property Sequence Chart (PSC) that, in our opinion, fixes the highlighted lacks of these notations by extending a subset of UML 2.0 Interaction Sequence Diagrams. We also provide PSC with both denotational and operational semantics. The operational semantics is obtained via translation into Büchi automata and the translation algorithm is implemented as a plugin of our Charmy tool. Expressiveness of PSC has been validated with respect to well known property specification patterns. Preliminary results appeared in (Autili et al. 2006a).  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号