Similar Documents
20 similar documents found (search time: 931 ms).
1.
This work analyses the effects of sequential-to-parallel synchronization and inter-core communication on multicore performance, speedup and scaling from the perspective of Amdahl's law. Analytical modeling supported by simulation leads to a modification of Amdahl's law, reflecting lower speedup than originally predicted due to these effects. In applications with a high degree of data sharing, which leads to intense inter-core connectivity requirements, the workload should be executed on a smaller number of larger cores. Applications requiring intense sequential-to-parallel synchronization, even highly parallelizable ones, may be better executed on the sequential core. To improve the scalability and performance speedup of a multicore, addressing the synchronization and connectivity intensities of parallel algorithms is as important as addressing their parallelization factor.
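For reference, the classical Amdahl formula the abstract starts from, together with one hedged, illustrative way to fold synchronization and communication costs into it (the paper's actual model is not reproduced here; f is the parallelizable fraction, n the number of cores, and t_sync(n), t_comm(n) are normalized overhead terms introduced only for illustration):

    \[
    S_{\mathrm{Amdahl}}(n) = \frac{1}{(1-f) + \frac{f}{n}},
    \qquad
    S_{\mathrm{overhead}}(n) = \frac{1}{(1-f) + \frac{f}{n} + t_{\mathrm{sync}}(n) + t_{\mathrm{comm}}(n)}
    \]

Because the overhead terms typically grow with n, the second curve saturates (and can even fall) well before the classical one, which is the qualitative behavior the abstract describes.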

2.
Microprocessor architecture has entered the multicore era. Recently, Hill and Marty presented a pessimistic view of multicore scalability. Their analysis was based on Amdahl's law (i.e., the fixed-workload condition) and challenged readers to develop better models. In this study, we analyze multicore scalability under fixed-time and memory-bound conditions and from the data-access (memory wall) perspective. We use the same hardware cost model of multicore chips as Hill and Marty, but arrive at very different and more optimistic performance models. These models show that there is no inherent, immovable upper bound on the scalability of multicore architectures. These results complement existing studies and demonstrate that multicore architectures are capable of extensive scalability.
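For context, the Hill–Marty symmetric-multicore speedup under the fixed-workload (Amdahl) assumption, and the Gustafson-style fixed-time speedup that a fixed-time analysis builds on; both are the standard published forms, shown here only as background (the paper's memory-bound extensions are not reproduced). Here n is the chip budget in base-core equivalents (BCEs), r the BCEs spent per core, perf(r) the performance of one r-BCE core, and f the parallelizable fraction:

    \[
    S_{\mathrm{symmetric}}(f,n,r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f\,r}{\mathrm{perf}(r)\,n}},
    \qquad
    S_{\mathrm{fixed\text{-}time}}(f,n) = (1-f) + f\,n
    \]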

3.
We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil-based finite-difference time-domain code. Weak-scaling parallel efficiency is over 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains a 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve a 7.7-fold reduction of the last-level cache miss rate on Intel Nehalem, whereas register blocking increases data parallelism and thereby achieves 5.9 Gflops performance on a single core. Register blocking + multithreading optimizations achieve a 5.8-fold speedup on a single quad-core Nehalem.
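A minimal sketch of the tiling idea in (iii), assuming a simple 2D 5-point Jacobi stencil rather than the paper's sixth-order 3D FDTD kernel; the grid size N and the tile sizes TI, TJ are illustrative placeholders that would be auto-tuned in practice:

    /* Hedged sketch: loop tiling plus OpenMP threading for a 2D 5-point stencil.
       Compile with: gcc -O2 -fopenmp stencil_sketch.c */
    #include <stdio.h>

    #define N  1024          /* grid size (illustrative) */
    #define TI 64            /* tile sizes chosen so a tile's working set fits in cache */
    #define TJ 256

    static double a[N][N], b[N][N];

    static void step_tiled(void)
    {
        /* Tiles are independent for one Jacobi step, so distribute them across cores. */
        #pragma omp parallel for collapse(2) schedule(static)
        for (int ii = 1; ii < N - 1; ii += TI)
            for (int jj = 1; jj < N - 1; jj += TJ)
                for (int i = ii; i < ii + TI && i < N - 1; i++)
                    for (int j = jj; j < jj + TJ && j < N - 1; j++)
                        b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                          a[i][j-1] + a[i][j+1]);
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = (double)(i + j);
        step_tiled();
        printf("%f\n", b[N/2][N/2]);   /* touch the result so the loop is not elided */
        return 0;
    }

Each tile's working set is sized to stay resident in a core's private cache while tiles are spread across threads, combining points (ii) and (iii) above; NUMA placement and SIMD register blocking are omitted from the sketch.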

4.
This paper extends research into rhombic overlapping-connectivity interconnection networks into the area of parallel applications. As a foundation for a shared-memory non-uniform-access bus-based multiprocessor, these interconnection networks create overlapping groups of processors, buses, and memories, forming a clustered computer architecture in which the clusters overlap. This overlapping-membership characteristic is shown to be useful for matching a parallel application's communication topology to the architecture's bandwidth characteristics. Many parallel applications can be mapped to the architecture topology so that most or all communication is localized within an overlapping cluster, at the low latency of a processor accessing cache (or memory) directly over a bus. The latency of communication between parallel threads then neither degrades parallel performance nor limits the granularity of applications. Parallel applications can execute with good speedup and scaling on a proposed architecture that is designed to obtain maximum advantage from the overlapping-cluster characteristic and that also allows dynamic workload migration without moving instructions or data. Scalability limitations of bus-based shared-memory multiprocessors are overcome by judicious workload allocation schemes that take advantage of the overlapping cluster memberships. Bus-based rhombic shared-memory multiprocessors are examined in terms of parallel speedup models to explain their advantages and justify their use as a foundation for the proposed computer architecture. Interconnection bandwidth is maximized with bidirectional circular and segmented overlapping buses. Strategies for mapping parallel application communication topologies to rhombic architectures are developed. Analytical models of enhanced rhombic multiprocessor performance are developed with a unique bandwidth modeling technique and compared with simulation results.

5.
Transactional memory in multicore processors has been a major research area over the past several years, and many transactional memory systems have been proposed to solve the synchronization problem of multicore processors. Hardware transactional memory is one of the critical methods for speeding up communication in a multicore environment. In this paper, we review current hardware transactional memory systems for multicore processors. We take a top-down approach to characterizing and classifying the various hardware transactional design issues and present a taxonomy of hardware transactional memory systems that consists of five fundamental design issues: version management, conflict detection, contention management, virtualization, and nesting. Finally, we discuss an active research challenge: the relationship between transactional memory and input/output operations and system calls.
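As one concrete illustration of how hardware transactional memory is exposed to software, here is a minimal sketch using Intel's RTM intrinsics (requires a TSX-capable CPU and compilation with -mrtm); it shows a generic transaction-with-lock-fallback pattern, not any particular system from the survey's taxonomy:

    /* Hedged sketch of HTM usage: gcc -O2 -mrtm -pthread htm_sketch.c */
    #include <immintrin.h>
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;

    static void increment(void)
    {
        unsigned status = _xbegin();          /* try to start a hardware transaction */
        if (status == _XBEGIN_STARTED) {
            counter++;                        /* speculative write, tracked in the cache */
            _xend();                          /* commit; a conflicting access would have aborted us */
        } else {
            /* Abort path: fall back to a lock. A production lock-elision scheme would also
               read the lock inside the transaction; that subtlety is omitted here. */
            pthread_mutex_lock(&fallback);
            counter++;
            pthread_mutex_unlock(&fallback);
        }
    }

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            increment();
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);
        return 0;
    }

The hardware handles version management (speculative writes buffered in the cache) and conflict detection (aborting on conflicting accesses), which are two of the design axes in the taxonomy above.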

6.
IEEE Micro, 2004, 24(6): 74-82
Memory latency dominates the performance of many applications on modern processors, despite advances in caches and prefetching techniques. Numerous prefetching techniques, both in hardware and software, try to alleviate the memory bottleneck. One such technique, known as helper threading, improves single-thread performance on a simultaneous multithreaded (SMT) architecture, which shares processor resources, including caches, among logical threads. It uses otherwise idle hardware thread contexts to execute speculative threads on behalf of the main thread. Helper threading accelerates a program by exploiting a processor's multithreading capability to run assist threads. Based on the helper threading usage model, virtual multithreading (VMT), a form of switch-on-event user-level multithreading, can improve performance for real-world workloads with a wall-clock speedup of 5.0 to 38.5 percent.

7.
Energy costs have become increasingly problematic for high-performance processors, but the rising number of on-chip cores offers promising opportunities for energy reduction. Further, emerging architectures such as heterogeneous multicores present new opportunities for improved energy efficiency. While previous work has presented novel memory architectures, multithreading techniques, and data mapping strategies for reducing energy, consideration of thread generation mechanisms that take data locality into account for this purpose has been limited. This study presents methodologies for the joint partitioning of data and threads to parallelize sequential codes across an innovative heterogeneous multicore processor called the Passive/Active Multicore (PAM), reducing the energy consumed by on-chip data transport and cache access components while also improving execution time. Experimental results show that the design with automatic thread partitioning reduces energy-delay product (EDP) by up to 48%.

8.
The current trend in research on multithreading processors is toward chip multithreading (CMT), which exploits thread-level parallelism (TLP) and improves the performance of software built on traditional threading components, e.g., Pthreads. Commercially available processors already support simultaneous multithreading (SMT) on multicore processors, but they are largely based on the conventional sequential execution model and execute multiple threads in parallel under the control of an OS that handles interruptions. Moreover, few languages or programming techniques exist to utilize multicore processors effectively. We take another approach and develop a multithreading processor dedicated to TLP. Our processor, named Fuce, is based on continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions that are executed without interruption. Every thread execution is triggered only by an event called a continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture, then presents multithreaded programming techniques and the continuation-based multithreading language system CML. Finally, the performance of the Fuce processor is evaluated by means of clock-level software simulation.

9.
Modern processors are realized as multicore architectures with an increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, the last-level cache is shared among several or all cores (e.g., the L3 cache) and each core possesses private low-level caches (e.g., L1 and L2 caches). Superlinear speedup is possible for the matrix multiplication algorithm executed on a shared-memory multiprocessor due to the existence of a superlinear region: a region where the matrix-storage cache requirements of the sequential execution incur more cache misses than the parallel execution does. This paper shows theoretically and experimentally that there is a region where superlinear speedup can be achieved. We provide a theoretical proof of the existence of a superlinear speedup and determine the boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have an impact on future software development and the exploitation of parallel hardware based on a shared-memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd.
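As a hedged illustration of the cache-capacity argument (not the paper's exact boundary derivation): let two n-by-n input matrices and the result, with element size s, form the working set, and let each of p cores have a private cache of capacity C_core. A superlinear region can open when

    \[
    3 n^2 s > C_{\mathrm{core}} \quad \text{(the sequential run overflows one core's cache)}
    \qquad \text{and} \qquad
    \frac{3 n^2 s}{p} \le C_{\mathrm{core}} \quad \text{(each parallel partition fits)} .
    \]

In that range the parallel run eliminates capacity misses rather than merely dividing the work, so the measured speedup can exceed p.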

10.
Implementations of relational operators on GPUs have resulted in order-of-magnitude speedups compared to their multicore CPU counterparts. Here we focus on the efficient implementation of the string matching operators common in SQL queries. Due to different architectural features, the optimal algorithm for CPUs might be suboptimal for GPUs. GPUs achieve high memory bandwidth by running thousands of threads, so it is not feasible to keep the working set of all threads in the cache in a naive implementation. On GPUs the unit of execution is a group of threads, and in the presence of loops and branches, threads in a group have to follow the same execution path; if some threads diverge, the different paths are serialized. We study the cache efficiency of single- and multi-pattern string matching algorithms for conventional and pivoted string layouts in GPU memory. We evaluate the memory efficiency in terms of memory access pattern and achieved memory bandwidth for different parallelization methods. To reduce thread divergence, we split string matching into multiple steps. We evaluate the different matching algorithms in terms of average- and worst-case performance and compare them against state-of-the-art CPU and GPU libraries. Our experimental evaluation shows that thread and memory efficiency affect performance significantly and that our proposed methods outperform previous CPU and GPU algorithms in terms of raw performance and power efficiency. The Knuth–Morris–Pratt algorithm is a good choice for GPUs because its regular memory access pattern makes it amenable to several GPU optimizations.
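For reference, the textbook CPU-side Knuth–Morris–Pratt algorithm whose regular access pattern the abstract highlights; this is the classic sequential form, not the paper's GPU kernel or pivoted string layout:

    /* Hedged sketch: classic KMP counting (possibly overlapping) matches. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void build_failure(const char *pat, int m, int *fail)
    {
        fail[0] = 0;
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pat[i] != pat[k])
                k = fail[k - 1];            /* fall back to the longest proper border */
            if (pat[i] == pat[k])
                k++;
            fail[i] = k;
        }
    }

    static int kmp_count(const char *text, const char *pat)
    {
        int n = (int)strlen(text), m = (int)strlen(pat), hits = 0;
        int *fail = malloc((size_t)m * sizeof *fail);
        build_failure(pat, m, fail);
        for (int i = 0, k = 0; i < n; i++) {
            while (k > 0 && text[i] != pat[k])
                k = fail[k - 1];
            if (text[i] == pat[k])
                k++;
            if (k == m) {                   /* full match ending at position i */
                hits++;
                k = fail[k - 1];
            }
        }
        free(fail);
        return hits;
    }

    int main(void)
    {
        printf("%d\n", kmp_count("abababca", "aba"));  /* prints 2 */
        return 0;
    }

The key property for GPUs is that the scan never moves backwards in the text, so each thread reads its text chunk in a streaming, predictable pattern.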

11.
The sort operation is a core part of many critical applications (e.g., database management systems). Despite large efforts to parallelize it, its heavy data dependencies vastly limit its performance. Multithreaded architectures are emerging as a leading technology in high-end processors; they include simultaneous multithreading, chip multiprocessors, and machines combining different multithreading technologies. In this paper, we analyze the memory behavior and improve the performance of the most recent parallel radix and quick integer sort algorithms on modern multithreaded architectures. We achieve speedups of up to 4.69× for radix sort and up to 4.17× for quicksort on a machine with 4 multithreaded processors compared to the single-threaded versions. We find that since radix sort is CPU-intensive, it exhibits better results on chip multiprocessors, where multiple CPUs are available, while quicksort achieves speedups on all types of multithreading processors due to its ability to overlap memory miss latencies with other useful processing.
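A minimal sequential LSD radix sort on 32-bit keys with 8-bit digits, shown as a baseline for the kind of integer radix sort the paper parallelizes; the key width, digit size, and test data are illustrative assumptions:

    /* Hedged sketch: sequential LSD radix sort (histogram, prefix sums, stable scatter). */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NKEYS 8

    static void radix_sort(uint32_t *a, size_t n)
    {
        uint32_t *buf = malloc(n * sizeof *buf);
        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[256] = {0};
            for (size_t i = 0; i < n; i++)                 /* histogram of the current digit */
                count[(a[i] >> shift) & 0xFF]++;
            size_t pos = 0;
            for (int d = 0; d < 256; d++) {                /* prefix sums -> start offsets */
                size_t c = count[d];
                count[d] = pos;
                pos += c;
            }
            for (size_t i = 0; i < n; i++)                 /* stable scatter by digit */
                buf[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, buf, n * sizeof *a);
        }
        free(buf);
    }

    int main(void)
    {
        uint32_t a[NKEYS] = {170, 45, 75, 90, 802, 24, 2, 66};
        radix_sort(a, NKEYS);
        for (int i = 0; i < NKEYS; i++)
            printf("%u ", a[i]);
        printf("\n");
        return 0;
    }

The histogram and scatter passes are the CPU-intensive, cache-friendly parts that make radix sort profit from chip multiprocessors, as the abstract notes.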

12.
In Solaris, threads are frequently relocated, and the data associated with a relocated thread have to be moved from the cache of the old processor to the new processor. To avoid poor memory performance due to thread relocation, threads can be bound to processors, i.e., static scheduling. Finding a static schedule that results in maximum speedup is NP-hard; it is even difficult to determine whether a static schedule is close to optimal or not. Here, a technique for predicting the speedup of multithreaded Solaris programs is presented. Based on an existing theoretical result, a lower bound on the maximal speedup is also obtained. The predicted speedup and the bound are based on recordings from a single-processor execution. When comparing the predictions with the real speedup on a multiprocessor with eight processors, we see that the predictions are very good. By comparing the speedup of a static schedule with the bound, we see that it is worthwhile to look for other schedules.
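A minimal sketch of binding threads to fixed processors ("static scheduling"); the paper targets Solaris, where processor_bind serves this purpose, while the sketch below uses the Linux analogue pthread_setaffinity_np and assumes at least two online CPUs:

    /* Hedged sketch: pin each worker thread to one core so its working set
       stays in that core's caches (Linux analogue of Solaris binding). */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg)
    {
        long cpu = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);   /* bind to one core */
        printf("thread pinned to cpu %ld\n", cpu);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Once every thread has a fixed core, the assignment of threads to cores is exactly the static schedule whose speedup the paper predicts from a single-processor recording.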

13.
Parallel workloads on shared-memory multicore processors often suffer from performance degradation. Cache eviction, true/false sharing, and bus contention are among the well-understood causes of this problem. This paper presents a study showing that the L2 DPL (data prefetch logic) in processors based on the Intel Core microarchitecture can be a cause of this problem as well. Through a case study of image integration, the study finds a non-scaling problem in the parallel integration of images whose size exceeds the capacity of the processor's L2 cache. An analysis of the relevant performance events using the Intel VTune™ Performance Analyzer shows that the L2 DPL prefetcher is less effective at prefetching needed data during parallel integration than during serial integration. To resolve the problem, a novel parallel image reverse loading scheme is developed to reduce the number of memory accesses during parallel integration and the associated delay. Experimental results demonstrate that parallel integration after parallel reverse loading shows significant speedup against the same parallel integration after serial loading.

14.
This paper presents a helper thread prefetching scheme designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have the advantage that resources such as the processor and L1 cache are not contended for by the application and helper threads, preserving the speed of the application. However, interprocessor communication is expensive in such a system, and we present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important for ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over hardware L2 prefetching alone for the subset of applications with high L2 cache misses per cycle.
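A minimal sketch of the helper-thread idea: a second thread prefetches array elements a bounded distance ahead of the application thread. The AHEAD cap stands in for the paper's synchronization mechanism, and the array, sizes, and distance are illustrative assumptions:

    /* Hedged sketch: helper-thread prefetching with a run-ahead distance cap.
       Compile with: gcc -O2 -pthread helper_sketch.c */
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N      (1 << 22)
    #define AHEAD  4096               /* how far the helper may run ahead (tuned) */

    static int *data;
    static atomic_long progress = 0;  /* index the application thread has reached */
    static long sum;

    static void *helper(void *arg)
    {
        (void)arg;
        for (long i = 0; i < N; i++) {
            while (i - atomic_load(&progress) > AHEAD)   /* stay within the cap */
                sched_yield();
            __builtin_prefetch(&data[i], 0, 1);          /* warm caches shared with the app thread */
        }
        return NULL;
    }

    int main(void)
    {
        data = malloc(N * sizeof *data);
        for (long i = 0; i < N; i++) data[i] = (int)i;

        pthread_t h;
        pthread_create(&h, NULL, helper, NULL);
        for (long i = 0; i < N; i++) {                   /* application thread */
            sum += data[i];
            atomic_store(&progress, i);
        }
        pthread_join(h, NULL);
        printf("%ld\n", sum);
        free(data);
        return 0;
    }

Capping the run-ahead distance is what keeps prefetches timely and avoids the cache pollution the abstract warns about; too small a cap hides little latency, too large a cap evicts data before it is used.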

15.
The traditional dynamic random-access memory (DRAM) storage medium can be integrated on chip via emerging 3D-stacking technology to architect a DRAM shared cache in multicore systems. Compared with static random-access memory (SRAM), DRAM is larger but slower. Much existing work has been devoted to improving workload performance using SRAM and stacked DRAM together in shared cache systems, ranging from SRAM structure improvement to optimizing cache tags and data access. However, little attention has been paid to designing a shared cache scheduling scheme for multiprogrammed workloads with different memory footprints in multicore systems. Motivated by this, we propose a hybrid shared cache scheduling scheme that allows a multicore system to utilize SRAM and 3D-stacked DRAM efficiently, thus achieving better workload performance. This scheduling scheme employs (1) a cache monitor, which collects cache statistics; (2) a cache evaluator, which evaluates the cache information while programs execute; and (3) a cache switcher, which self-adaptively chooses SRAM or DRAM shared cache modules. A cache data migration policy is developed to guarantee that the scheduling scheme works correctly. Extensive experiments are conducted to evaluate the workload performance of our proposed scheme. The experimental results show that our method can improve multiprogrammed workload performance by up to 25% compared with state-of-the-art methods (including conventional and DRAM cache systems).

16.
Effective use of cache memory is becoming more important with the increasing gap between processor speed and memory access speed. Also, the use of multigrain parallelism is becoming more important for improving effective performance beyond the limits of loop-iteration-level parallelism. Considering these factors, this paper proposes a coarse-grain task static scheduling scheme that takes cache optimization into account. The proposed scheme schedules coarse-grain tasks to threads so that shared data among coarse-grain tasks can be passed via the cache, after task and data decomposition that considers the cache size at compile time. It is implemented in the OSCAR Fortran multigrain parallelizing compiler and evaluated on a Sun Ultra80 four-processor SMP workstation using Swim and Tomcatv from SPEC fp 95. The proposed scheme yields speedups of 4.56 times for Swim and 2.37 times for Tomcatv on 4 processors against the Sun Forte HPC Ver. 6 update 1 loop parallelizing compiler.

17.
In recent years, processor technology has evolved towards multicore processors, which include multiple processing units (cores) in a single package. Those cores have their own private caches and often share a higher-level cache memory dedicated to each processor die. This multi-level cache hierarchy in multicore processors raises the importance of the cache utilization problem. Assigning parallel-running software components with common data to processor cores that do not share a common cache increases the number of cache misses. In this paper we present a novel approach that uses model-based information to guide the OS scheduler in assigning appropriate core affinities to software objects at run time. We build graph models of the software and of the processors' cache hierarchies and devise a graph matcher algorithm that provides a mapping between these two graphs. Using this mapping we obtain candidate core sets with which each software object can be affiliated at run time. These affiliations are determined based on the idea that software components that have the potential to share common data at run time should run on cores that share a common cache. We also develop an object dispatcher algorithm that keeps track of object affiliations at run time and dispatches objects using the information from the compile-time graph matcher. We apply our approach to design pattern implementations and two different application programs running on servers using CFS scheduling. Our results show that cache-aware dispatching based on information obtained from the software model decreases the number of cache misses significantly and improves CFS scheduling performance.

18.
Higher transistor counts, lower voltage levels, and reduced noise margins increase the susceptibility of multicore processors to transient faults. Redundant hardware modules can detect such faults, but software techniques are more appealing for their low cost and flexibility. Recent software proposals have not achieved widespread acceptance because they either increase register pressure, double memory usage, or are too slow in the absence of hardware extensions. This paper presents DAFT, a fast, safe, and memory-efficient transient fault detection framework for commodity multicore systems. DAFT replicates computation across multiple cores and schedules fault detection off the critical path. Where possible, values are speculated to be correct and only communicated to the redundant thread at essential program points. DAFT is implemented in the LLVM compiler framework and evaluated using SPEC CPU2000 and SPEC CPU2006 benchmarks on a commodity multicore system. Evaluation results demonstrate that speculation allows DAFT to improve the performance of software redundant multithreading by 2.17× with no degradation of fault coverage.
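A minimal sketch of the underlying software-redundant-multithreading idea: run the computation twice on different threads and compare results to flag a transient fault. This illustrates only the redundancy-and-check concept; DAFT's compiler transformation, value speculation, and communication scheduling are not modeled, and work() is a stand-in function:

    /* Hedged sketch: duplicated computation across two threads plus a result check. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000

    static long r1[N], r2[N];

    static long work(long x) { return x * x + 3 * x + 7; }   /* stand-in computation */

    static void *leading(void *arg)
    {
        (void)arg;
        for (long i = 0; i < N; i++) r1[i] = work(i);         /* primary copy */
        return NULL;
    }

    static void *trailing(void *arg)
    {
        (void)arg;
        for (long i = 0; i < N; i++) r2[i] = work(i);         /* redundant copy */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, leading, NULL);
        pthread_create(&b, NULL, trailing, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        for (long i = 0; i < N; i++)
            if (r1[i] != r2[i]) {                             /* divergence => transient fault */
                fprintf(stderr, "fault detected at %ld\n", i);
                return 1;
            }
        puts("no divergence detected");
        return 0;
    }

DAFT's contribution is making this style of redundancy cheap by moving the comparison off the critical path and communicating only the values that must be checked.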

19.
Performance tradeoffs in multithreaded processors
An analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects is presented. The model is validated through the author's simulations and by comparison with previously published simulation results. The results indicate that processors can substantially benefit from multithreading, even in systems with small caches, provided sufficient network bandwidth exists. Caches that are much larger than the working-set sizes of individual processes yield close to full processor utilization with as few as two to four contexts. Smaller caches require more contexts to keep the processor busy, while caches that are comparable in size to the working sets of individual processes cannot achieve high utilization regardless of the number of contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limit the best possible utilization.
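A hedged, simplified model in the same spirit (not the paper's full model, which also captures cache interference and network contention): with N hardware contexts, an average run length R between long-latency misses, memory latency L, and context-switch overhead C, processor utilization is roughly

    \[
    U(N) \;\approx\; \min\!\left( \frac{N R}{R + L},\; \frac{R}{R + C} \right),
    \]

growing linearly in N while latency is not yet fully hidden and then saturating at a ceiling set by the switch overhead, which matches the abstract's observation that two to four contexts suffice when caches are large.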

20.
Data prefetching is an effective data access latency hiding technique to mask the CPU stalls caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some of the single-core prefetching t...
