
1.

Recognition of circular patterns on GPUs: Performance analysis and contributions

Antonio Ruiz Nicolás Guil Manuel Ujaldón 《Journal of Parallel and Distributed Computing》2008

We develop a novel approach for computing the circle Hough transform entirely on graphics hardware (GPU). A primary role is assigned to vertex processors and the rasterizer, overshadowing the traditional foreground of pixel processors and enhancing parallel processing. Resources like the vertex cache or blending units are studied too, with our set of optimizations leading to extraordinary peak gain factors exceeding 358x over a typical CPU execution. Software optimizations, like the use of precomputed tables or gradient information, and hardware improvements, like hyperthreading and multicores, are explored on CPUs as well. Overall, the GPU exhibits better scalability and much greater parallel performance, making it a solid alternative for computing the classical circle Hough transform versus those optimal methods run on emerging multicore architectures.
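As a point of reference for what the circle Hough transform computes, here is a minimal sequential sketch in Python. It is not the GPU method described in the abstract. It assumes the radius is known in advance and that edge points have already been extracted; the function names and the test circle are hypothetical.

```python
import math


def circle_hough(edge_points, radius, width, height, n_angles=360):
    """Accumulate votes for circle centres of one known radius (assumption)."""
    acc = [[0] * width for _ in range(height)]
    for x, y in edge_points:
        seen = set()  # let each edge point vote at most once per cell
        for k in range(n_angles):
            theta = 2.0 * math.pi * k / n_angles
            # A centre candidate lies at distance `radius` from the edge point.
            a = int(round(x - radius * math.cos(theta)))
            b = int(round(y - radius * math.sin(theta)))
            if 0 <= a < width and 0 <= b < height and (a, b) not in seen:
                seen.add((a, b))
                acc[b][a] += 1
    return acc


def best_centre(acc):
    """Return the (x, y) cell with the most votes."""
    votes, x, y = max(
        (v, x, y) for y, row in enumerate(acc) for x, v in enumerate(row)
    )
    return x, y


def sample_circle(cx, cy, radius, n_points=72):
    """Hypothetical helper: integer edge points on a synthetic circle."""
    pts = set()
    for k in range(n_points):
        t = 2.0 * math.pi * k / n_points
        pts.add((int(round(cx + radius * math.cos(t))),
                 int(round(cy + radius * math.sin(t)))))
    return sorted(pts)


if __name__ == "__main__":
    points = sample_circle(20, 15, 8)
    accumulator = circle_hough(points, radius=8, width=40, height=30)
    print(best_centre(accumulator))
```

Every edge point votes independently of the others, which is why the voting loop is a natural candidate for parallel hardware.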

2.

Software design and development for embedded multicore systems

肖红 周朴雄 《现代计算机》 (Modern Computer) 2008,(5):62-66

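Embedded multicore systems combine a variety of homogeneous and heterogeneous computing, storage, and communication resources to carry out complex functions. They need a relatively complex software stack to handle task scheduling, resource management, and communication. The key problems to solve in software design and development include task decomposition and scheduling, parallelization methods, communication latency, and the design of the operating system and middleware. Reasonable solutions must be chosen to meet the real-time and resource constraints of the embedded environment.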

3.

Fuzzy logic based energy and throughput aware design space exploration for MPSoCs

《Microprocessors and Microsystems》2016

Multicore architectures were introduced to mitigate the increase in power dissipation with clock frequency. Deeper pipelines, speculative threading, and similar techniques for single-core systems did not bring much performance gain relative to their associated power overhead. For multicore architectures, however, scaling performance with the number of cores has always been a challenge. Amdahl’s law shows that the theoretical maximum speedup of a multicore architecture is not even close to a multiple of the number of cores. When only a small amount of code runs in parallel, adding more cores for an application may simply increase power dissipation instead of bringing a performance advantage. Therefore, there is a need for an adaptive multicore architecture that can be tailored to the application in use for higher energy efficiency. In this paper, a fuzzy logic based design space exploration technique is presented that optimizes a multicore architecture according to the workload requirements in order to achieve an optimum balance between the throughput and energy of the system.
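The Amdahl's law argument in this abstract can be made concrete with a small Python sketch. The formula is the standard one, speedup = 1 / ((1 - p) + p / n) for parallel fraction p and n cores; the function name and the sample values are illustrative assumptions, not taken from the paper.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup under Amdahl's law for a fixed workload."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload in which half of the code can run in parallel.
    for n in (1, 2, 4, 16, 64):
        print(n, round(amdahl_speedup(0.5, n), 3))
```

With half of the code serial, the speedup can never reach 2 however many cores are added, which is the situation in which extra cores mostly add power dissipation.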

4.

Extending τ-Lop to model concurrent MPI communications in multicore clusters

《Future Generation Computer Systems》2016


5.

System design of full HD MVC decoding on mesh-based multicore NoCs

Ning Ma Zhonghai Lu Lirong Zheng 《Microprocessors and Microsystems》2011,35(2):217-229

Future multimedia applications such as full HD (1920 × 1080) multiview video coding (MVC) present great challenges for computing architectures. Even with state-of-the-art ASIC technology, which can process single-view HD decoding, dealing with multiple views would require computation capacity in proportion to the number of views, which is difficult to achieve. In this paper, we explore the system-level design space for full HD MVC applications mapped onto mesh-based multicore Network-on-Chip (NoC) architectures. To this end, we establish a simulation framework capable of simulating the combination of communication networks with computing cores. We investigate two task assignment schemes: picture-level assignment and view-level assignment. With an eight-view MVC decoding, we explore the design options with respect to network size, single-core performance and link bandwidth under both task assignment schemes. Our studies show that, to achieve a certain decoding performance, the computation capability and communication capacity should be balanced in the system. Also, to realize the eight-view HD decoding, the system requires at most twice the single-core processing capacity required by single-view decoding, thanks to the parallel computation and communication enabled by the multicore NoC architectures. Our results demonstrate the feasibility and potential of efficiently implementing full HD MVC decoding on multicore NoC architectures.

6.

Good programming in transactional memory : Game theory meets multicore architecture

Raphael Eidenbenz Roger Wattenhofer 《Theoretical computer science》2011,412(32):4136-4150

In a multicore transactional memory (TM) system, concurrent execution threads interact and interfere with each other through shared memory. The less interference a thread provokes the better for the system. However, as a programmer is primarily interested in optimizing her individual code’s performance rather than the system’s overall performance, she does not have a natural incentive to provoke as little interference as possible. Hence, a TM system must be designed to be compatible with good programming incentives (GPI), i.e., writing efficient code for the overall system should coincide with writing code that optimizes an individual thread’s performance. We show that with most contention managers (CM) proposed in the literature so far, TM systems are not GPI compatible. We provide a generic framework for CMs that base their decisions on priorities and explain how to modify Timestamp-like CMs so as to feature GPI compatibility. In general, however, priority-based conflict resolution policies are prone to be exploited by selfish programmers. In contrast, a simple non-priority-based manager that resolves conflicts at random is GPI compatible.
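To illustrate the contrast between priority-based and random conflict resolution, the following Python sketch compares two toy policies. It is not the framework from the paper: the `Transaction` type, both function names, and the "older transaction wins" rule used for the Timestamp-style policy are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    thread_id: int
    start_time: int  # logical timestamp; smaller means older (assumption)


def timestamp_winner(a: Transaction, b: Transaction) -> Transaction:
    """Priority-based policy (assumed rule): the older transaction wins."""
    return a if a.start_time <= b.start_time else b


def random_winner(a: Transaction, b: Transaction,
                  rng: random.Random) -> Transaction:
    """Non-priority policy: pick the winner uniformly at random."""
    return a if rng.random() < 0.5 else b


if __name__ == "__main__":
    old = Transaction(thread_id=1, start_time=5)
    young = Transaction(thread_id=2, start_time=9)
    rng = random.Random(0)
    print(timestamp_winner(old, young).thread_id)
    wins = sum(random_winner(old, young, rng) is old for _ in range(1000))
    print(wins, 1000 - wins)
```

Under the timestamp rule a transaction's priority grows the longer it runs, which hints at how a selfish programmer might exploit priorities; the random rule gives no such lever.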

7.

HiCOO: Hierarchical cooperation for scalable communication in Global Address Space programming models on Cray XT systems

Weikuan Yu Xinyu Que Vinod Tipparaju Jeffrey S. Vetter 《Journal of Parallel and Distributed Computing》2012

Global Address Space (GAS) programming models enable a convenient, shared-memory style addressing model. Typically this is realized through one-sided operations that can enable asynchronous communication and data movement. With the size of petascale systems reaching 10,000s of nodes and 100,000s of cores, the underlying runtime systems face critical challenges in (1) scalably managing resources (such as memory for communication buffers), and (2) gracefully handling unpredictable communication patterns and any associated contention. For any solution that addresses these resource scalability challenges, equally important is the need to maintain the performance of GAS programming models. In this paper, we describe a Hierarchical COOperation (HiCOO) architecture for scalable communication in GAS programming models. HiCOO formulates a cooperative communication architecture with inter-node cooperation amongst multiple nodes (a.k.a. multinode) and hierarchical cooperation among multinodes that are arranged in various virtual topologies. We have implemented HiCOO for a popular GAS runtime library, Aggregate Remote Memory Copy Interface (ARMCI). By extensively evaluating different virtual topologies in HiCOO in terms of their impact on memory scalability, network contention, and application performance, we identify MFCG as the most suitable virtual topology. The resulting HiCOO architecture is able to realize scalable resource management and achieve resilience to network contention, while at the same time maintaining or enhancing the performance of scientific applications. In one case, it reduces the total execution time of an NWChem application by 52%.

8.

GPGPU implementation of growing neural gas: Application to 3D scene reconstruction

Sergio Orts Jose Garcia-Rodriguez Diego Viejo Miguel Cazorla Vicente Morell 《Journal of Parallel and Distributed Computing》2012

Self-organising neural models have the ability to provide a good representation of the input space. In particular the Growing Neural Gas (GNG) is a suitable model because of its flexibility, rapid adaptation and excellent quality of representation. However, this type of learning is time-consuming, especially for high-dimensional input data. Since real applications often work under time constraints, it is necessary to adapt the learning process in order to complete it in a predefined time. This paper proposes a Graphics Processing Unit (GPU) parallel implementation of the GNG with Compute Unified Device Architecture (CUDA). In contrast to existing algorithms, the proposed GPU implementation allows the acceleration of the learning process keeping a good quality of representation. Comparative experiments using iterative, parallel and hybrid implementations are carried out to demonstrate the effectiveness of CUDA implementation. The results show that GNG learning with the proposed implementation achieves a speed-up of 6× compared with the single-threaded CPU implementation. GPU implementation has also been applied to a real application with time constraints: acceleration of 3D scene reconstruction for egomotion, in order to validate the proposal.

9.

Energy saving strategies for parallel applications with point-to-point communication phases

Vaibhav Sundriyal Masha Sosonkina Alexander Gaenko Zhao Zhang 《Journal of Parallel and Distributed Computing》2013

Although high-performance computing traditionally focuses on the efficient execution of large-scale applications, both energy and power have become critical concerns when approaching exascale. Drastic increases in the power consumption of supercomputers significantly affect their operating costs and failure rates. In modern microprocessor architectures, equipped with dynamic voltage and frequency scaling (DVFS) and CPU clock modulation (throttling), the power consumption may be controlled in software. Additionally, network interconnect, such as InfiniBand, may be exploited to maximize energy savings while the application performance loss and frequency switching overheads must be carefully balanced. This paper advocates for a runtime assessment of such overheads by means of characterizing point-to-point communications into phases followed by analyzing the time gaps between the communication calls. Certain communication and architectural parameters are taken into consideration in the three proposed frequency scaling strategies, which differ with respect to their treatment of the time gaps. The experimental results are presented for NAS parallel benchmark problems as well as for the realistic parallel electronic structure calculations performed by the widely used quantum chemistry package GAMESS. For the latter, three different process-to-core mappings were studied as to their energy savings under the proposed frequency scaling strategies and under the existing state-of-the-art techniques. Close to the maximum energy savings were obtained with a low performance loss of 2% on the given platform.

10.

MLMIN: A multicore processor and parallel computer network topology for multicast

Dietmar Tutsch Günter Hommel 《Computers & Operations Research》2008,35(12):3807

In the future, multicore processors with hundreds of cores will collaborate on a single chip. More advanced network-on-chip (NoC) topologies will then be needed than today's shared buses for dual-core processors. Multistage interconnection networks, which are already used in parallel computers, seem to be a promising alternative. In this paper, a new network topology is introduced that particularly applies to multicast traffic in multicore systems and parallel computers. These multilayer multistage interconnection networks are described by defining the main parameters of such a topology. Performance and costs of the new architecture are determined and compared to other network topologies. Network traffic consisting of constant size packets and of varying size packets is investigated. It is shown that all kinds of multicast traffic particularly benefit from the new topology.