Similar Documents
20 similar documents found.
1.
Network of Workstations (NOW) platforms, assembled from off-the-shelf workstations and networking hardware, have become a cost-effective, scalable, and flexible platform for video processing applications. Still, one has to manually schedule an algorithm onto the available processors of the NOW to make efficient use of the resources. This approach is time-consuming and impractical for a video processing system that must perform a variety of different algorithms, with new algorithms being constantly developed. Improved support for program development is absolutely necessary before the full benefits of parallel architectures can be realized for video processing applications. Toward this goal, an automatic compile-time scheduler has been developed that schedules the input tasks of video processing applications, with precedence constraints, onto the available processors. The scheduler exploits both spatial concurrency (parallelism) and temporal concurrency (pipelining) to make the best use of machine resources. Two important scheduling problems are addressed. First, given a task graph and a desired throughput, a schedule is constructed that achieves the desired throughput with the minimum number of processors. Second, given a task graph and a finite set of available resources, a schedule is constructed that maximizes throughput while meeting the resource constraints. Simulation results show that the scheduler and the proposed optimization techniques effectively tackle these problems by maximizing processor utilization. A code generator has been developed to generate parallel programs automatically. Together, these tools make it much easier for a programmer to develop video processing applications on such parallel architectures.
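The first of the two scheduling problems can be illustrated with a toy model. Below is a minimal sketch, not the authors' algorithm: stages slower than the target period are replicated (temporal + spatial concurrency), and the replica loads are packed onto processors. The packing strategy and all names are illustrative assumptions.

```python
from math import ceil

def min_processors_for_throughput(task_times, target_throughput):
    # Time budget each processor has per data item at the target rate.
    period = 1.0 / target_throughput
    # A stage slower than the period is replicated so copies alternate
    # over successive data items.
    loads = []
    for t in task_times:
        r = ceil(t / period)       # replicas needed for this stage
        loads.extend([t / r] * r)  # per-item load of each replica
    # First-fit-decreasing packing of replica loads onto processors.
    bins = []
    for load in sorted(loads, reverse=True):
        for i, used in enumerate(bins):
            if used + load <= period:
                bins[i] += load
                break
        else:
            bins.append(load)
    return len(bins)

# Three stages (seconds per item), target 10 items/sec.
print(min_processors_for_throughput([0.4, 0.1, 0.25], target_throughput=10))
```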

2.
The efficient design of computation-intensive multidimensional signal processing applications requires dealing with three kinds of constraints: those implied by the data dependencies, the non-functional requirements (real-time behavior, power consumption), and the resource availability of the execution platform. The Modeling and Analysis of Real-time and Embedded systems (MARTE) UML profile, through its repetitive structure modeling (RSM) package, is well suited to modeling the inherent parallelism of these applications, to a compact representation of parallel execution platforms, and to the distributed mapping of one onto the other. The execution of such a specification respects the whole set of constraints defined above, while the quality of the scheduling is directly linked to the quality of the mapping of the multidimensional structures (data arrays or parallel loop nests) into time and space. We propose a strategy for using a refactoring tool dedicated to this kind of application that makes it possible to find good trade-offs in the usage of storage and computation resources and in the exploitation of parallelism (both task and data parallelism). The strategy is illustrated on an industrial radar application.

3.
Heterogeneous multiprocessor systems, where commodity multicore processors are coupled with graphics processing units (GPUs), have been widely used in high-performance computing (HPC). In this work, we focus on the design and optimization of Computational Fluid Dynamics (CFD) applications on such HPC platforms. To fully utilize the computational power of these heterogeneous platforms, we propose to design the performance-critical part of CFD applications, namely the linear equation solvers, in a hybrid way. A hybrid linear solver includes both a CPU version and a GPU version of the code for solving a system of linear equations. When a hybrid linear equation solver is invoked during the CFD simulation, the CPU portion and the GPU portion run in parallel on their respective processing devices according to the execution configuration. Furthermore, we propose to build functional performance models (FPMs) of the processing devices and to use an FPM-based heterogeneous decomposition method to distribute the workload between them, ensuring balanced workload and reduced communication overhead. The efficiency of this approach is demonstrated by experiments with numerical simulation of lid-driven cavity flow on both a hybrid server and a hybrid cluster.
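In simplified form, FPM-based decomposition chooses a workload split so that all devices are predicted to finish at the same time, where each device's speed is a function of its share of the problem. A rough sketch follows; the fixed-point iteration and the speed-function shapes are assumptions, not the paper's exact procedure.

```python
def fpm_split(total_work, speed_fns, iterations=100):
    """Split total_work among devices so predicted finish times equalize.
    speed_fns: list of functions mapping a work share -> speed (work/sec).
    A simplified stand-in for FPM-based partitioning."""
    shares = [total_work / len(speed_fns)] * len(speed_fns)
    for _ in range(iterations):
        # Speed of each device at its current share (the FPM lookup).
        speeds = [f(w) for w, f in zip(shares, speed_fns)]
        # Re-apportion work in proportion to those speeds.
        total_speed = sum(speeds)
        shares = [total_work * s / total_speed for s in speeds]
    return shares

# Hypothetical speed models: a flat CPU, and a GPU whose speed ramps up
# with problem size until saturation (a typical FPM shape).
cpu = lambda w: 2.0
gpu = lambda w: min(0.02 * w, 16.0)
print(fpm_split(10000, [cpu, gpu]))  # -> shares with equal finish times
```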

4.
Many complex applications in scientific and engineering computing rely on scientific workflow technology. In supercomputing, scientific workflows are commonly modeled as parallel task graphs, and scheduling these graphs effectively is essential for efficient application execution. This paper presents a scheduling model for parallel task graphs under resource constraints, derives several optimal-scheduling results for Fork-Join task graphs, and proposes a new scheduling algorithm for general parallel task graphs. The algorithm takes into account the impact of data-communication overhead on resource allocation and scheduling performance, and improves on the existing CPA algorithm in certain cases. Experiments comparing it against the commonly used CPR and CPA algorithms verify that the new algorithm achieves good scheduling results. The proposed algorithm and the optimality results offer useful guidance for building high-performance scheduling functionality into workflow application systems.

5.
We propose a massively parallel framework, termed the parallel-pipeline model of execution, that can be employed on a homogeneous computational cluster. We show that, under this model, speedups near-linear in the number of processors are achievable for applications involving reduction operations. As computational clusters become viable platforms for solving large computational problems, the research community acknowledges that the cluster environment can be used effectively only when adaptive resource management is employed. This requires the ability to estimate the resource requirements of applications before scheduling decisions are made. We therefore propose a resource estimation model for applications executed in the parallel-pipeline model: a performance model that predicts the resource utilization (i.e., computation and communication complexity) of applications executing under the parallel-pipeline model on a homogeneous computational cluster. This prediction model can supply information to a cluster scheduler.
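The abstract does not give the estimator itself; as an illustration only, here is the classic textbook pipeline estimate on a homogeneous cluster, which shows why reduction-style pipelines approach linear speedup once the pipeline fills. All parameters are assumptions.

```python
def pipeline_makespan(n_items, stage_times, comm_time=0.0):
    """Total time = pipeline fill time + (n-1) iterations of the
    bottleneck stage. A generic model, not the paper's exact estimator."""
    fill = sum(stage_times) + comm_time * (len(stage_times) - 1)
    bottleneck = max(stage_times) + comm_time
    return fill + (n_items - 1) * bottleneck

def predicted_speedup(n_items, stage_times, comm_time=0.0):
    serial = n_items * sum(stage_times)
    return serial / pipeline_makespan(n_items, stage_times, comm_time)

# Four balanced 0.2 s stages, small communication cost: speedup ~3.8 of 4.
print(predicted_speedup(10000, [0.2, 0.2, 0.2, 0.2], comm_time=0.01))
```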

6.
Mixed-machine heterogeneous computing (HC) environments utilize a distributed suite of different high-performance machines, interconnected with high-speed links, to perform groups of computation-intensive applications that have diverse computational requirements and constraints. The problem of optimally mapping a class of independent tasks onto the machines of an HC environment has been proved, in general, to be NP-complete, thus requiring heuristic techniques for practical use. If the mapping has real-time requirements, such that the mapping process is performed during task execution, fast greedy heuristics must be adopted. This paper investigates fast greedy heuristics for this problem and identifies the importance of the concept of task consistency in designing such mapping heuristics. We further propose fast greedy heuristics based on task priority graphs, which consider both task consistency and machine consistency (the same notion of consistency as in previous studies). A collection of 20 greedy heuristics, including 17 newly proposed ones, are implemented, analyzed, and systematically compared within a uniform model of task execution time, implemented with the coefficient-of-variation based method. The experimental results illuminate the circumstances under which a specific greedy heuristic outperforms the other 19.
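The coefficient-of-variation method for generating an expected-time-to-compute (ETC) matrix is standard in this literature: gamma-distributed task baselines, then gamma-distributed machine variation around each baseline. A minimal sketch, with illustrative parameter values:

```python
import random

def cv_etc_matrix(n_tasks, n_machines, mean=100.0, v_task=0.5, v_machine=0.5):
    """ETC matrix via the coefficient-of-variation method. v_task and
    v_machine are the desired CVs of task and machine heterogeneity."""
    def gamma(mu, v):
        shape = 1.0 / (v * v)   # CV of gamma(shape, scale) is 1/sqrt(shape)
        scale = mu / shape      # so mean = shape * scale = mu
        return random.gammavariate(shape, scale)
    etc = []
    for _ in range(n_tasks):
        base = gamma(mean, v_task)  # this task's nominal execution time
        etc.append([gamma(base, v_machine) for _ in range(n_machines)])
    return etc

for row in cv_etc_matrix(5, 3):
    print([round(x, 1) for x in row])
```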

7.
The development of IC technology makes Network-on-Chip (NoC) an attractive architecture for future massively parallel systems. Task migration optimizes the overall communication performance of NoCs, since changing phases of execution make a static task mapping insufficient. It is well known that the communication behavior of many applications is predictable, which makes it feasible to use prediction to guide task migration. The mechanism that triggers a migration is also important. In this paper, we first define and analyze the predictability of applications, and then compare different ways of triggering migrations. We then modify the Genetic Algorithm (GA) based task remapping and propose two further task migration algorithms: Simple Exchange (SE) and Benefit Assess (BA). A mechanism called node lock is also used to reduce unnecessary and costly migrations. Simulation results on real applications from the PARSEC benchmark suite show that the SE, BA, and GA algorithms reduce hop count by 21.4%, 34.0%, and 34.9%, and average latency by 17.3%, 27.2%, and 26.3%, respectively, compared with the system without task migration; BA and SE eliminate 72.0% and 78.7% of migrations without significant performance degradation compared with GA, and the node lock mechanism removes a further 37.3% and 46.0% of migrations while achieving almost the same performance.

8.
Stream computing applications require minimum latency and high throughput to process real-time data efficiently. Data-intensive applications, in which large datasets must be moved across execution nodes, typically demand low latency. In this paper, a stream-based data processing model is adopted to develop an algorithm for optimally partitioning the input data such that the inter-partition data flow remains minimal. The proposed algorithm improves the execution of data-intensive workflows in heterogeneous computing environments by partitioning the data-intensive workflow and mapping each partition onto the available heterogeneous resources that offer minimum execution time. Minimal data movement between the partitions reduces the latency, which can be further reduced by applying data parallelism techniques. In this paper, we apply a data parallelism technique to the bottleneck (most compute-intensive) task in each partition, which significantly reduces the latency. We study the effectiveness and performance of the proposed approach using synthesized workflows and real-world applications such as Montage and CyberShake. Our evaluation shows that the proposed algorithm provides schedules with approximately 12% lower latency and nearly 17% higher throughput compared with existing state-of-the-art algorithms.
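The latency effect of the data-parallelism step can be sketched in a few lines, assuming an ideally divisible bottleneck task; this is a simplified reading of the mechanism, and the numbers are hypothetical:

```python
def partition_latency(task_times, bottleneck_replicas=1):
    """Per-item latency of one partition: the sum of its task times, with
    the bottleneck (largest) task split across bottleneck_replicas
    workers under ideal data parallelism."""
    times = sorted(task_times, reverse=True)
    times[0] = times[0] / bottleneck_replicas
    return sum(times)

base = partition_latency([8.0, 2.0, 1.5])
split = partition_latency([8.0, 2.0, 1.5], bottleneck_replicas=4)
print(f"latency {base:.1f}s -> {split:.1f}s")  # 11.5s -> 5.5s
```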

9.
To address energy-efficient task scheduling under deadline constraints in mobile micro-clouds, this paper proposes a distributed task-scheduling algorithm based on adaptive probability. The algorithm has two phases: resource discovery and adaptive probabilistic scheduling. In the first phase, a modified QoS OLSR protocol lets the source node that issues a task-execution request periodically collect resource information from neighboring processing nodes. In the second phase, based on the source node's task arrival rate, the algorithm probabilistically selects the optimal processing node to execute each task, achieving optimal energy efficiency while satisfying the deadline constraint. Extensive simulations show that the algorithm maintains a high task-completion rate while reducing the average energy consumed per completed task.
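One simplified reading of the second phase: filter the discovered nodes down to those predicted to meet the deadline, then sample among them with probabilities that favor low energy cost. The scoring rule below is an assumption, not the paper's exact arrival-rate-based formula.

```python
import random

def pick_node(nodes, deadline):
    """nodes: dicts with predicted 'exec_time' (s) and 'energy' (J).
    Keep deadline-feasible nodes, then choose one with probability
    inversely proportional to its energy cost. Illustrative only."""
    feasible = [n for n in nodes if n["exec_time"] <= deadline]
    if not feasible:
        return None  # no neighbor can meet the deadline
    weights = [1.0 / n["energy"] for n in feasible]
    return random.choices(feasible, weights=weights, k=1)[0]

nodes = [{"id": 1, "exec_time": 0.8, "energy": 2.0},
         {"id": 2, "exec_time": 0.5, "energy": 3.5},
         {"id": 3, "exec_time": 1.4, "energy": 1.0}]
print(pick_node(nodes, deadline=1.0)["id"])  # node 3 is filtered out
```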

10.
The paper focuses on the problem of partitioning and mapping parallel programs onto heterogeneous embedded multiprocessor architectures for real-time applications. Such applications present unique constraints and challenges. In addition to heterogeneity, the proposed partitioning and mapping algorithms satisfy memory, task throughput, task placement, intertask communication bandwidth, and co-location constraints. They do so for architectures that use circuit-switched (rather than packet-switched) interprocessor communication, and they optimize latency and throughput in addition to load balancing. Finally, these mapping algorithms use knowledge of the local scheduling discipline to accommodate real-time scheduling constraints. Our focus is on unstructured parallel programs that fall into one of two classes: (i) computations characteristic of control applications in a real-time environment, where tasks execute concurrently and periodically exchange information, and (ii) pipelined computation graphs found in sensor data processing applications. The algorithms are implemented in a set of tools that interoperate with commercial CASE tools at one end and present an interface to multiprocessor simulators at the other. Collectively, the algorithms form a significant component of an interactive design environment for the development and mapping of real-time embedded parallel programs. The paper describes the algorithms and the encapsulating toolset, and presents an example of their application to an existing embedded application: an Autonomous Underwater Vehicle.

11.
Advanced driver assistance systems increasingly use cameras and image processing algorithms. To embed these algorithms and achieve real-time execution, semiconductor companies offer heterogeneous systems-on-chip (SoCs). Embedding algorithms on this type of hardware is not trivial: one needs to determine how to partition the computational load across the different processing units. In addition, it is not easy to predict whether a given algorithm can be executed on a given heterogeneous SoC while meeting real-time constraints. We propose a novel global methodology to assist with embedding image processing algorithms on heterogeneous SoCs while meeting real-time constraints (using a soft real-time analysis). Our approach provides several heuristics for predicting delays and execution times and is based on a set of multi-level test vectors that extract key features of heterogeneous architectures.

12.
In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring parallel techniques and high-performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm in which resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was to provide high-throughput computing, clouds are already being used to provide an HPC environment where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many open, yet important, challenges, such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task, since one has to take many different criteria into account and exploit the cloud's elasticity to optimize workflow execution. In this paper, we introduce an adaptive scheduling heuristic for parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability, and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, the approach also scales resources up and down according to restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.
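A 3-objective cost model of this kind is typically a weighted combination of the criteria. The sketch below shows the shape of such a model; the weights, normalization, and VM figures are illustrative assumptions, not the SciCumulus defaults.

```python
def activity_cost(makespan, price, reliability, w=(0.5, 0.3, 0.2)):
    """Weighted 3-objective cost of one scheduling choice: execution
    time, financial cost, and (un)reliability. Lower is better.
    In practice the three terms would be normalized to comparable scales."""
    w_time, w_money, w_rel = w
    return w_time * makespan + w_money * price + w_rel * (1.0 - reliability)

# Pick the VM type minimizing combined cost for one workflow activity.
# (makespan in minutes, price in $/activity, reliability in [0, 1])
vm_options = {"small": (120.0, 0.10, 0.99),
              "large": ( 45.0, 0.40, 0.97),
              "spot":  ( 50.0, 0.08, 0.85)}
best = min(vm_options, key=lambda v: activity_cost(*vm_options[v]))
print(best)
```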

13.
Improving the throughput of a parallel application that processes an input data stream is a difficult challenge. For typical coarse-grain applications, where the computation time of tasks is greater than their communication time, the maximum achievable throughput is determined by the maximum task computation time, and improving throughput beyond this limit would eventually require modifying the source code of the tasks. In this work, we address throughput improvement by proposing two task replication methodologies that take the target throughput as an input parameter. They proceed by generating a new task graph structure that permits the target throughput to be achieved. The first replication mechanism, DPRM (Data Parallel Replication Mechanism), exploits the data parallelism inside each task. The second, TCRM (Task Copy Replication Mechanism), creates new execution paths in the application task graph that allow more than one data instance to be processed concurrently. We evaluate the effectiveness of these mechanisms with three real applications executed on a cluster system: the MPEG2 video compressor, the IVUS (Intra-Vascular Ultra-Sound) medical imaging application, and the BASIZ (Bright and Saturated Images Zone) video processing application. In all cases, the throughput obtained after applying the proposed replication mechanisms was greater than what the original implementation could provide.
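The core arithmetic behind TCRM-style task copies can be stated directly: if a task takes t seconds per data item and the target throughput is R items per second, it needs at least ceil(t * R) concurrent instances. A sketch under that reading (the stage names and times are hypothetical, and the full graph rewriting of the paper is omitted):

```python
from math import ceil

def tcrm_replicas(task_times, target_throughput):
    """Number of concurrent instances per task needed to sustain
    target_throughput items/sec. A simplified reading of TCRM."""
    return {name: ceil(t * target_throughput)
            for name, t in task_times.items()}

# Hypothetical MPEG2-like stage times, in seconds per frame.
stages = {"read": 0.01, "motion_est": 0.12, "dct_quant": 0.05, "vlc": 0.02}
print(tcrm_replicas(stages, target_throughput=25))  # 25 frames/sec
```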

14.
A Task Load-Balancing Algorithm for Large-Scale CFD on Multi-Block Structured Grids
To address the shortcomings of existing load-balancing algorithms, namely poor adaptability, poor scalability, and inaccurate measurement of communication overhead, this paper proposes a task load-balancing algorithm for large-scale CFD on multi-block structured grids. Load balance of parallel CFD tasks is achieved by splitting grid blocks, combining and mapping blocks, and adjusting the computational load of the grids assigned to each process. Experimental results show that the algorithm suits both homogeneous and heterogeneous platforms, and handles both the case of more grid blocks than processes and the case of fewer grid blocks than processes. It balances the computational load assigned to the processes across the whole computational domain while minimizing the maximum communication overhead between processes.
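A common greedy baseline for the block-to-process mapping step is shown below: assign the largest remaining block to the least-loaded process. This is the classic LPT heuristic, offered only as an illustration; the paper's algorithm additionally splits blocks and accounts for inter-block communication.

```python
import heapq

def map_blocks(block_cells, n_procs):
    """Greedily assign grid blocks (workload = cell count) to processes,
    always giving the next-largest block to the least-loaded process."""
    heap = [(0, p, []) for p in range(n_procs)]  # (load, proc, blocks)
    heapq.heapify(heap)
    for b, cells in sorted(enumerate(block_cells), key=lambda x: -x[1]):
        load, p, blocks = heapq.heappop(heap)
        blocks.append(b)
        heapq.heappush(heap, (load + cells, p, blocks))
    return sorted(heap)

for load, p, blocks in map_blocks([90, 70, 40, 40, 30, 20, 10], 3):
    print(f"proc {p}: load {load}, blocks {blocks}")  # 100 / 100 / 100
```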

15.
We consider a cluster-based multimedia Web server that dynamically generates video units to satisfy the bit-rate and bandwidth requirements of a variety of clients. The media server partitions each job into several tasks and schedules them on the back-end computing nodes for processing. For stream-based applications, the main design criteria of the scheduling are to minimize the total processing time and to maintain the order of media units within each outgoing stream. In this paper, we first design, implement, and evaluate three scheduling algorithms for multimedia transcoding in a cluster environment: first fit (FF), stream-based mapping (SM), and adaptive load sharing (ALS). We determined that, owing to the variability of individual jobs/tasks, it is necessary to predict the CPU load of each multimedia task and schedule accordingly. We therefore propose an online prediction algorithm that dynamically predicts the processing time of each individual task (media unit), and two new load scheduling algorithms, prediction-based least load first (P-LLF) and prediction-based adaptive partitioning (P-AP), which use the prediction to improve performance. System performance is evaluated in terms of throughput, the out-of-order rate of outgoing media streams, and load balancing overhead, through real measurements on a cluster of computers. Compared with all the other load balancing schemes, P-AP greatly reduces delay jitter and achieves high throughput for a variety of workloads in a heterogeneous cluster; it strikes a good balance between throughput and the output order of the processed media units.
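The pairing of online prediction with least-load-first dispatch can be sketched compactly. Here an exponentially weighted moving average stands in for the predictor (one plausible choice; the paper's predictor and load accounting are more elaborate):

```python
class PredictingDispatcher:
    """Least-load-first dispatch driven by an online EWMA estimate of
    per-unit processing time. A schematic of the P-LLF idea."""
    def __init__(self, n_nodes, alpha=0.3):
        self.loads = [0.0] * n_nodes                 # outstanding predicted work
        self.charged = [[] for _ in range(n_nodes)]  # predictions charged per node
        self.pred = 1.0                              # per-unit time estimate (s)
        self.alpha = alpha

    def dispatch(self):
        # Send the next media unit to the node with the least predicted load.
        node = min(range(len(self.loads)), key=self.loads.__getitem__)
        self.loads[node] += self.pred
        self.charged[node].append(self.pred)
        return node

    def complete(self, node, measured_time):
        # Remove exactly what was charged, then refine the estimate.
        self.loads[node] -= self.charged[node].pop(0)
        self.pred = (1 - self.alpha) * self.pred + self.alpha * measured_time

d = PredictingDispatcher(n_nodes=4)
for t in [0.9, 1.4, 1.1, 0.7]:       # measured per-unit times
    d.complete(d.dispatch(), t)
print(round(d.pred, 3))
```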

16.
Embedded systems designers are turning to multicore architectures to satisfy the ever-growing computational needs of applications within a reasonable power envelope. One of the most daunting challenges for MultiProcessor System-on-Chip (MPSoC) platforms is the development of tools for efficiently mapping multi-task applications onto hardware platforms. Software mapping can be formulated as an optimal allocation and scheduling problem, where the application is modeled as a task graph, the target hardware is modeled as a set of heterogeneous resources, and the objective function represents a design goal α (e.g., minimum execution time, minimum usage of communication resources). Conditional task graphs, where inter-task edges represent data as well as control dependencies, are a well-known computational model for describing complex real-life applications in which alternative execution paths, guarded by conditionals, can be specified. Each condition has a probability associated with each possible outcome.

17.
Mapping applications onto a Multi-processor System-on-Chip (MP-SoC) is a crucial step in optimizing performance, energy, and memory constraints at the same time. The problem is formulated as finding solutions to a cost function of the algorithm performing mapping and scheduling under strict constraints. Our solution is based on simultaneous optimization of execution time and memory consumption, whereas traditional methods concentrate only on execution time. Applications are modeled as static acyclic task graphs that are mapped onto the MP-SoC with customized simulated annealing. The automated mapping presented in this paper is intended especially for MP-SoC architecture exploration, which typically requires a large number of trials without human interaction. For this reason, a new parameter selection scheme for simulated annealing is proposed that sets task-mapping-specific optimization parameters automatically. The scheme bounds optimization iterations to a reasonable limit and defines an annealing schedule that scales with application and architecture complexity. Compared to extensive optimization, the presented parameter selection scheme achieves 90% of the result quality in only 5% of the optimization time, which helps large-scale architecture exploration where optimization time matters. The optimization procedure is analyzed with simulated annealing, group migration, and random mapping algorithms, using test graphs from the Standard Task Graph Set. Simulated annealing is found to be better than the other algorithms in terms of both optimization time and result quality. The simultaneous time and memory optimization method with simulated annealing is shown to speed up execution by 63% without increasing memory buffer sizes. By comparison, optimizing execution time alone yields a 112% speedup but also increases memory buffers by 49%.
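The central idea, an annealing schedule whose iteration budget scales with application and architecture size, can be sketched as follows. The scaling rule, cost function, and all parameters here are assumptions for illustration, not the paper's scheme.

```python
import math, random

def anneal_mapping(n_tasks, n_procs, cost, moves_per_temp=None,
                   t0=1.0, t_final=1e-3, decay=0.95):
    """Map task -> processor with simulated annealing. The per-temperature
    move budget scales with n_tasks * n_procs, echoing the idea of sizing
    the schedule to problem complexity. `cost` scores a mapping (lower
    is better)."""
    if moves_per_temp is None:
        moves_per_temp = n_tasks * n_procs  # complexity-scaled budget
    mapping = [random.randrange(n_procs) for _ in range(n_tasks)]
    best, best_cost = mapping[:], cost(mapping)
    temp, cur_cost = t0, best_cost
    while temp > t_final:
        for _ in range(moves_per_temp):
            t = random.randrange(n_tasks)
            old = mapping[t]
            mapping[t] = random.randrange(n_procs)  # random remap move
            new_cost = cost(mapping)
            accept = (new_cost <= cur_cost or
                      random.random() < math.exp((cur_cost - new_cost) / temp))
            if accept:
                cur_cost = new_cost
                if new_cost < best_cost:
                    best, best_cost = mapping[:], new_cost
            else:
                mapping[t] = old  # reject the move
        temp *= decay  # geometric cooling
    return best, best_cost

# Toy cost: load imbalance across 4 processors for unit-weight tasks.
def imbalance(mapping, n_procs=4):
    loads = [mapping.count(p) for p in range(n_procs)]
    return max(loads) - min(loads)

print(anneal_mapping(16, 4, imbalance))
```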

18.
In the past decade, heterogeneous multicore architectures with support for Single Instruction Multiple Thread (SIMT) style computing have become the standard platform of choice for scheduling HPC applications. Applications are typically modeled as a set of data-parallel tasks with dependencies represented in the form of a directed acyclic graph (DAG). The execution time of each constituent task in the DAG is assumed known beforehand and is leveraged by scheduling algorithms (list or cluster based) to ascertain near-optimal schedules at runtime. However, in an online setting, where applications are submitted by multiple users and the types of applications are not restricted, knowing the execution time of every program beforehand is highly unlikely. In this context, we propose a class of intelligent algorithms for heterogeneous CPU-GPU platforms that leverage static-analysis-assisted machine learning techniques to decide at runtime how device assignments should be made, thus bypassing the requirement for expensive offline profiling passes. We formalize relevant task-level ranking metrics and discuss how existing scheduling techniques can be adapted for the proposed class of algorithms. We also devise an online cluster scheduling algorithm that supports dynamic task arrival by determining, in any given scheduling epoch, mapping decisions for a subset of tasks in a DAG. We perform a detailed comparative analysis between our proposed cluster and list scheduling heuristics via extensive simulation experiments using a variety of heterogeneous multicore platform configurations, and observe performance speedups in the range of 1.1–1.5× for cluster scheduling over list scheduling.
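At its core, the runtime decision reduces, per task, to comparing predicted CPU and GPU execution times from models trained on static-analysis features. A skeletal sketch follows; the feature set, the linear stand-in models, and the queue-delay term are all assumptions, not the paper's concrete algorithm.

```python
def choose_device(task_features, cpu_model, gpu_model, gpu_queue_delay=0.0):
    """Pick CPU or GPU for a task using predicted execution times from
    models trained offline on static-analysis features (e.g. instruction
    mix, memory ops -- illustrative features)."""
    t_cpu = cpu_model(task_features)
    t_gpu = gpu_model(task_features) + gpu_queue_delay
    return "gpu" if t_gpu < t_cpu else "cpu"

# Stand-in linear models; in practice these would be learned regressors.
cpu_model = lambda f: 2.0 * f["flops"] + 0.5 * f["mem_ops"]
gpu_model = lambda f: 0.1 * f["flops"] + 1.5 * f["mem_ops"]
task = {"flops": 10.0, "mem_ops": 3.0}
print(choose_device(task, cpu_model, gpu_model, gpu_queue_delay=2.0))
```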


19.
In this paper, we describe how our computational model can be used for the problems of processor allocation and task mapping. The intended applications for this model include dynamic mapping problems: shrinking or spreading an existing mapping when the available pool of processors changes during execution. The concept of the problem edge class and other features of the model are developed to support task partitioning and merging realistically and efficiently for both static and dynamic mapping. The model dictates realistic changes in the computation and communication characteristics of a problem when its partitioning is modified dynamically. It forms the basis of our algorithms for shrinking and spreading, and yields realistic results for a variety of problems mapped onto real systems. An emulation program running on a network of workstations under PVM is used to measure execution times for the mapping solutions found by the algorithms. The results indicate that the problem edge class is a crucial consideration for processor allocation and task mapping.

20.
Multiple performance requirements need to be guaranteed in some real-time applications, such as multimedia data processing and real-time signal processing, in addition to timing constraints. Unfortunately, most conventional scheduling algorithms take only one or two of these dimensions into account. Motivated by this fact, this paper investigates the problem of providing multiple performance guarantees, including timeliness, QoS, throughput, QoS fairness, and load balancing, for a set of independent tasks by dynamic ...
