共查询到10条相似文献,搜索用时 125 毫秒
1.
The Data Locality of Work Stealing 总被引:1,自引:0,他引:1
This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines,
where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the
number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental
validation.
{As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit
poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations G
n
whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total
number of cache misses soars to Ω(n) . On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large,
important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that
on P processors a multiprocessor execution incurs an expected
more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T
∈
fty is the number of nodes on the longest chain of dependencies. Based on this we give strong execution time bounds for nested-parallel
computations using work stealing.}
For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality
of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative
data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads
but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided
work stealing improves the performance of work stealing up to 80%. 相似文献
2.
Software transactional memory 总被引:1,自引:0,他引:1
Summary. As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing
highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a Load–Linked/Store–Conditional operation on a single word. Building on the hardware based transactional synchronization methodology of Herlihy and Moss,
we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is non-blocking,
and can be implemented on existing machines using only a Load–Linked/Store–Conditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to
non-blocking ones based on implementing a k-word compare&swap STM-transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms
the non-blocking translation methods in the style of Barnes, and outperforms Herlihy’s translation method for sufficiently
large numbers of processors. The key to the efficiency of our software-transactional approach is that unlike Barnes style
methods, it is not based on a costly “recursive helping” policy.
Received: January 1996 / Revised: June 1996 / Accepted: August 1996 相似文献
3.
Shared counters are among the most basic coordination structures in distributed computing. Known implementations of shared
counters are either blocking, non-linearizable, or have a sequential bottleneck. We present the first counter algorithm that
is both linearizable, non-blocking, and can provably achieve high throughput in k-synchronous executions—executions in which process speeds vary by at most a constant factor k. The algorithm is based on a novel variation of the software combining paradigm that we call bounded-wait combining (BWC). It can thus be used to obtain implementations, possessing the same properties, of any object that supports combinable
operations, such as a stack or a queue. Unlike previous combining algorithms where processes may have to wait for each other
indefinitely, in the BWC algorithm, a process only waits for other processes for a bounded period of time and then “takes
destiny in its own hands”. In order to reason rigorously about the parallelism attainable by our algorithm, we define a novel
metric for measuring the throughput of shared objects, which we believe is interesting in its own right. We use this metric
to prove that our algorithm achieves throughput of Ω(N/ log N) in k-synchronous executions, where N is the number of processes that can participate in the algorithm. Our algorithm uses two tools that we believe may prove
useful for obtaining highly parallel non-blocking implementation of additional objects. The first are “synchronous locks”,
locks that are respected by processes only in k-synchronous executions and are disregarded otherwise; the second are “pseduo-transactions”—a weakening of regular transactions
that allows higher parallelism.
A preliminary version of this paper appeared in [11].
D. Hendler is supported in part by a grant from the Israel Science Foundation. S. Kutten is supported in part by a grant from
the Israel Science Foundation. 相似文献
4.
The importance of symbolic data structures such as Ordered Binary Decision Diagrams (OBDD) is rapidly growing in many areas
of Computer Science where the large dimensions of the input models is a challenging feature: OBDD based graph representations
allowed to define truly new standards in the achievable dimensions for the Model Checking verification technique. However,
OBDD representations pose strict constraints in the algorithm design issue. For example, Depth-First Search (DFS) is not feasible
in a symbolic framework and, consequently, many state-of-the-art DFS based algorithms (e.g., connectivity procedures) cannot
be easily rearranged to work on symbolically represented graphs. We devise here a symbolic algorithmic strategy, based on
the new notion of spine-set, that is general enough to be the engine of linear symbolic step algorithms for both strongly connected components and biconnected
components. Our procedures improve on previously designed connectivity symbolic algorithms. Moreover, by an application to
the so-called “bad cycle detection problem”, our technique can be used to efficiently solve the emptiness problem for various
kinds of ω-automata.
This work is a revised and extended version of [22,23]. It is partially supported by the projects PRIN 2005015491 and BIOCHECK. 相似文献
5.
Christian Urban 《Journal of Automated Reasoning》2008,40(4):327-356
This paper describes a formalisation of the lambda-calculus in a HOL-based theorem prover using nominal techniques. Central
to the formalisation is an inductive set that is bijective with the alpha-equated lambda-terms. Unlike de-Bruijn indices,
however, this inductive set includes names and reasoning about it is very similar to informal reasoning with “pencil and paper”.
To show this we provide a structural induction principle that requires to prove the lambda-case for fresh binders only. Furthermore,
we adapt work by Pitts providing a recursion combinator for the inductive set. The main technical novelty of this work is
that it is compatible with the axiom of choice (unlike earlier nominal logic work by Pitts et al); thus we were able to implement all results in Isabelle/HOL and use them to formalise the standard proofs for Church-Rosser,
strong-normalisation of beta-reduction, the correctness of the type-inference algorithm W, typical proofs from SOS and much
more.
This paper is a revised and much extended version of Urban and Berghofer [32], and Urban and Tasson [36]. 相似文献
6.
We present an adaptive algorithm for N-process mutual exclusion under read/write atomicity in which all busy waiting is by local spinning. In our algorithm, each process p performs O(k) remote memory references to enter and exit its critical section, where k is the maximum “point contention” experienced by p. The space complexity of our algorithm is Θ(N), which is clearly optimal. Our algorithm is the first mutual exclusion algorithm under read/write atomicity that is adaptive when time complexity is measured by counting remote memory references.A preliminary version of this paper was presented at the 14th International Symposium on Distributed Computing [6]. 相似文献
7.
Ole Agesen David L. Detlefs Christine H. Flood Alexander T. Garthwaite Paul A. Martin Mark Moir Nir N. Shavit Guy L. Steele Jr. 《Theory of Computing Systems》2002,35(3):349-386
The computer industry is currently examining the use of strong synchronization operations such as double compare-and-swap
(DCAS) as a means of supporting non-blocking synchronization on future multiprocessor machines. However, before such a strong
primitive will be incorporated into hardware design, its utility needs to be proven by developing a body of effective non-blocking
data structures using DCAS.
As part of this effort, we present two new linearizable non-blocking implementations of concurrent deques using the DCAS
operation. The first uses an array representation, and improves on previous algorithms by allowing uninterrupted concurrent
access to both ends of the deque while correctly handling the difficult boundary cases when the deque is empty or full. The
second uses a linked-list representation, and is the first non-blocking, dynamically-sized deque implementation. It too allows
uninterrupted concurrent access to both ends of the deque. We have proved these algorithms correct with the aid of a mechanical
theorem prover; we describe these proofs in the paper. 相似文献
8.
A new fourth order box-scheme for the Poisson problem in a square with Dirichlet boundary conditions is introduced, extending
the approach in Croisille (Computing 78:329–353, 2006). The design is based on a “hermitian box” approach, combining the approximation of the gradient by the fourth order hermitian
derivative, with a conservative discrete formulation on boxes of length 2h. The goal is twofold: first to show that fourth order accuracy is obtained both for the unknown and the gradient; second,
to describe a fast direct algorithm, based on the Sherman-Morrison formula and the Fast Sine Transform. Several numerical
results in a square are given, indicating an asymptotic O(N
2log 2(N)) computing complexity. 相似文献
9.
The approach of learning multiple “related” tasks simultaneously has proven quite successful in practice; however, theoretical
justification for this success has remained elusive. The starting point for previous work on multiple task learning has been
that the tasks to be learned jointly are somehow “algorithmically related”, in the sense that the results of applying a specific learning algorithm to these tasks are assumed to be similar. We offer an alternative approach, defining
relatedness of tasks on the basis of similarity between the example generating distributions that underlie these tasks.
We provide a formal framework for this notion of task relatedness, which captures a sub-domain of the wide scope of issues
in which one may apply a multiple task learning approach. Our notion of task similarity is relevant to a variety of real life
multitask learning scenarios and allows the formal derivation of generalization bounds that are strictly stronger than the
previously known bounds for both the learning-to-learn and the multitask learning scenarios. We give precise conditions under
which our bounds guarantee generalization on the basis of smaller sample sizes than the standard single-task approach.
Editors: Daniel Silver, Kristin Bennett, Richard Caruana.
A preliminary version of this paper appears in the proceedings of COLT’03, (Ben-David and Schuller 2003). 相似文献
10.
Temporal logics are commonly used for reasoning about concurrent systems. Model checkers and other finite-state verification
techniques allow for automated checking of system model compliance to given temporal properties. These properties are typically
specified as linear-time formulae in temporal logics. Unfortunately, the level of inherent sophistication required by these
formalisms too often represents an impediment to move these techniques from “research theory” to “industry practice”. The
objective of this work is to facilitate the nontrivial and error prone task of specifying, correctly and without expertise
in temporal logic, temporal properties.
In order to understand the basis of a simple but expressive formalism for specifying temporal properties we critically analyze
commonly used in practice visual notations. Then we present a scenario-based visual language called Property Sequence Chart
(PSC) that, in our opinion, fixes the highlighted lacks of these notations by extending a subset of UML 2.0 Interaction Sequence Diagrams. We also provide PSC with both denotational and operational semantics. The operational semantics is obtained via translation into Büchi automata
and the translation algorithm is implemented as a plugin of our Charmy tool. Expressiveness of PSC has been validated with respect to well known property specification patterns.
Preliminary results appeared in (Autili et al. 2006a). 相似文献