首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 734 毫秒
1.
Parallel algorithms, based on a distributed memory machine model, for an exhaustive search technique for motion vector estimation in video compression are being designed and evaluated. Results from the execution on a 16,384 processor MasPar MP-1 (an SIMD machine), a 140 node Intel Paragon XP/S and a 16 node IBM SP2 (two M IMD machines), and the 16 processor PASM prototype (a partitionable SIMD/MIMD mixed-mode machine) are presented. The trade-offs of using different modes of parallelism (SIMD, SPMD, and mixed-mode) and different data partitioning schemes (the rectangular and stripe subimage methods) are examined. The analytical and experimental results shown in this application study will help practitioners to predict and contrast the performance of different approaches to parallel implementation of this important video compression technique. The results presented are also applicable to a large class of image and video processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping a set of independent application tasks, or the subtasks of a single application task, onto a heterogeneous suite of parallel machines.  相似文献   

2.
This paper examines the applicability of fine-grained tree-structured SIMD machines, which are amenable to highly efficient VLSI implementation, to several low-level image understanding tasks. Algorithms are presented for histogramming, thresholding, image correlation, connected component labeling, and computing Euler number. A particular massively parallel machine called NON-VON is used for purposes of explication and performance evaluation. Only NON-VON tree-structured communication capabilities and its SIMD mode of execution are considered in this paper. Novel algorithmic techniques are described, such as vertical pipelining, subproblem partitioning, associative matching, and data duplication, that effectively exploit the massive parallelism available in fine-grained SIMD tree machines while avoiding communication bottlenecks. Simulation results are presented and compared with results obtained or forecast for other highly parallel machines. The relative advantages and limitations of the class of machines under consideration are outlined; except for some types of image correlation, the fine-grained SIMD tree is exceptionally fast.  相似文献   

3.
A mesh-connected single-input multiple-data (SIMD) architecture called a sliding memory plane (SliM) array processor is proposed. Differing from existing mesh-connected SIMD architectures, SliM has several salient features such as a sliding memory plane that provides inter-PE communication during computation. Two I/O planes provide an I/O overlapping capability. Thus, inter-PE communication and I/O overhead can be overlapped with computation. Inter-PE communication time is invisible in most image processing tasks because the computation time is larger than the communication time on SliM. The ability to overlap inter-PE communication with computation, regardless of window size and shape and without using a coprocessor or an on-chip DMA controller is unique to SliM  相似文献   

4.
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to the programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The parameters of the performance for such stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct since multiple data sets can be pipelined or processed in parallel. The central contribution of this research is a new algorithm to determine a processor mapping for a chain of tasks that optimizes latency in the presence of a throughput constraint. We also discuss how this algorithm can be applied to solve the converse problem of optimizing throughput with a latency constraint. The problem formulation uses a general and realistic model of intertask communication and addresses the entire problem of mapping, which includes clustering tasks into modules, assigning of processors to modules, and possible replicating of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.  相似文献   

5.
The performance of conjugate gradient (CG) algorithms for the solution of the system of linear equations that results from the finite-differencing of the neutron diffusion equation was analyzed on SIMD, MIMD, and mixed-mode parallel machines. A block preconditioner based on the incomplete Cholesky factorization was used to accelerate the conjugate gradient search. The issues involved in mapping both the unpreconditioned and preconditioned conjugate gradient algorithms onto the mixed-mode PASM prototype, the SIMD MasPar MP-1, and the MIMD Intel Paragon XP/S are discussed. On PASM , the mixed-mode implementation outperformed either SIMD or MIMD alone. Theoretical performance predictions were analyzed and compared with the experimental results on the MasPar MP-1 and the Paragon XP/S. Other issues addressed include the impact on execution time of the number of processors used, the effect of the interprocessor communication network on performance, and the relationship of the number of processors to the quality of the preconditioning. Applications studies such as this are necessary in the development of software tools for mapping algorithms onto either a single parallel machine or a heterogeneous suite of parallel machines.  相似文献   

6.
In order to minimize the execution time of a parallel application running on a heterogeneously distributed computing system, an appropriate mapping scheme is needed to allocate the application tasks to the processors. The general problem of mapping tasks to machines is a well‐known NP‐hard problem and several heuristics have been proposed to approximate its optimal solution. In this paper we propose a static graph‐based mapping algorithm, called Heterogeneous Multi‐phase Mapping (HMM), which permits suboptimal mapping of a parallel application onto a heterogeneous computing distributed system by using a local search technique together with a tabu search meta‐heuristic. HMM allocates parallel tasks by exploiting the information embedded in the parallelism forms used to implement an application, and considering an affinity parameter, that identifies which machine in the heterogeneous computing system is most suitable to execute a task. We compare HMM with some leading techniques and with an exhaustive mapping algorithm. We also give an example of mapping of two real applications using HMM. Experimental results show that HMM performs well demonstrating the applicability of our approach. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

7.
提出了一种面向SIMD机器的全局数据自动分割算法,该算法能处理多个非紧嵌折循环嵌套,并且数组下标存取为循环变量的线性式,首先通过数据与迭代映射抽象了计算中的通信方式,然事提出识别规则模式通信模式的形式比条件,接着建立包含对准信息和相应通信开销的数据迭代图,并在数据迭代图的基础上提出了一个启发式算法来计算较优的数据分布和迭代分布,以优化处理单元之间的通信开销,通过发析多个循环嵌套所涉及的多个数组映和  相似文献   

8.
Improving scheduling of tasks in a heterogeneous environment   总被引:1,自引:0,他引:1  
Optimal scheduling of parallel tasks with some precedence relationship, onto a parallel machine is known to be NP-complete. The complexity of the problem increases when task scheduling is to be done in a heterogeneous environment, where the processors in the network may not be identical and take different amounts of time to execute the same task. We introduce a task duplication-based scheduling algorithm for network of heterogeneous systems (TANH), with complexity O(V/sup 2/), which provides optimal results for applications represented by directed acyclic graphs (DAGs), provided a simple set of conditions on task computation and network communication time could be satisfied. The performance of the algorithm is illustrated by comparing the scheduling time with an existing "best imaginary level scheduling (BIL)" scheme for heterogeneous systems. The scalability for a higher or lower number of processors, as per their availability is also discussed. We have shown to provide substantial improvement over existing work on the task duplication-based scheduling algorithm (TDS).  相似文献   

9.
In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to different tasks of an input application at run time. The proposed mechanism targets optimizing the on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm to allocate the free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when some application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for network-on-chips, to virtually connect the disjoint regions and make the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms.  相似文献   

10.
This paper presents architectures and implementation of a Sliding Memory Plane (SliM) Image Processor to build a SIMD parallel computer. The paper also proposes an enhanced multiplication algorithm to reduce the gate count and the number of cycles. The SliM chip consists of mesh-connected 5×5 PEs. Due to the idea ofsliding, that is, overlapping the inter-PE communication time with the computation time, SliM can greatly reduce the inter-PE communication overhead. In addition, four operations corresponding to ALU, shift, data I/O, and inter-PE communication can be grouped into an instruction to be executed in a cycle simultaneously. The implemented SliM chip operates at 25 MHz and gives 625 MIPS. Because of a mesh topology, a large number of chips can be easily connected to form a SIMD parallel computer. We have implemented the scalable SliM Array Processor and developed parallel algorithms for real-time image processing.  相似文献   

11.
基于SIMD机器的优化数据传输的并行循环分割   总被引:2,自引:1,他引:2  
本文提出一个基于分布式局存的SIMD机器的循环分割理论体系以优化运算中所需要的数据传输。该体系使用矩阵表示迭代空间、数据空间和数组存取式。我们引入数据传输概念,并建立一个简单有效的数据传输模型来评估数据在全局内存和局部内存之间的传输开销。最后,对于给定的循环嵌套,我们给出一个循环分割算法以获得优化循环块,使得循环嵌套中所需要的数据传输开销最小,并且大大减少了数据传输和计算的同步开销。实验结果证明了  相似文献   

12.
面向算法的SIMD计算机数学模型及其应用研究   总被引:1,自引:0,他引:1  
针对数据并行计算在图像处理中的应用研究,提出了数据并行计算机的面向算法的数学模型,以及利用该模型得到的一种新颖的、数据并行算法的数学描述方法.采用该数学描述方法对数据并行图像处理中的灰度直方图运算、区域增长法图像分割以及图像卷积运算等3类图像处理方法进行了描述.结果表明,该数学描述方法不仅简单可行和精确,而且,可以从数学公式中直接得到算法的通信复杂性和计算复杂性.该方法可以应用到数据并行计算的应用研究中作为数学描述的工具.  相似文献   

13.
数字图像几何变换的数据并行方法研究   总被引:2,自引:0,他引:2  
张发存  王馨梅  张毅坤 《计算机工程》2005,31(22):159-161,196
针对SIMD计算机上的数字图像的几何变换问题,提出了一个新颖的基于阵列平移的数据并行实现方法。在此基础上,给出了数字图像几何变换的数据并行实现算法,并详细分析和讨论了算法的复杂性以及精度要求。  相似文献   

14.
为了解决网格仿真中的静态任务调度问题,提出一种创新的基于交互优先算法的实体调度策略.该策略使用了一种兼顾通信优化和计算平衡的评价标准.调度策略分为聚类和映射两阶段进行.聚类过程使用一种交互优先算法,将交互密集的实体聚类到一个实体组中并且将被调度到同一处理器上运行;映射过程则使用一种计算优先的启发式算法,优先映射计算消耗大的聚类实体到处理能力强的处理器上.最后,将使用上述两种算法的调度策略同贪婪对分法进行对比实验,结果证明本文的方法能够较好地优化仿真中的通信和计算性能并且更加适合于网格仿真.  相似文献   

15.
图像恢复的高效并行算法及关键技术   总被引:6,自引:0,他引:6  
首次从并行处理的途径分析了能产生高恢复质量,但具有高计算复杂性的图像恢复算法BNM的并行性,并对影响该算法并行效率的关键问题,提出了有效的解决方案:①采用条状重叠的数据分配方案,减少了并行处理中的通信量;②给出了不同读取策略的内部实现模型,分析了不同读取策略对I/O带宽产生的影响,提出了能够获得高I/O性能的读取策略;⑧提出了降低通信量的“关键位通信”方法.综合运用上述策略,设计并实现了高效的并行BNM算法.理论分析和实验表明,该并行BNM算法具有很高的加速比、并行效率及很好的可扩展性,是解决图像恢复实用性的有效途径。  相似文献   

16.
《Real》2000,6(5):375-389
This paper presents an approach to parallel implementation of wavelet transforms in a distributed computing environment. To achieve robustness and efficiency, we proposed a parallel algorithm for wavelet transform which can be implemented in SIMD, MIMD and pipeline architectures on the configured system. Our experimental results show that our proposed algorithm will speed up the wavelet-based image processing tasks on a network of computer workstation clusters.  相似文献   

17.
《Parallel Computing》2002,28(7-8):1079-1093
Vector quantization (VQ) is a widely used algorithm in speech and image data compression. One of the problems of the VQ methodology is that it requires large computation time especially for large codebook size. This paper addresses two issues. The first deals with the parallel construction of the VQ codebook which can drastically reduce the training time. A master/worker parallel implementation of a VQ algorithm is proposed. The algorithm is executed on the DM-MIMD Alex AVX-2 machine using a pipeline architecture. The second issue deals with the ability of accurately predicting the machine performance. Using communication and computation models, a comparison between expected and real performance is carried out. Results show that the two models can accurately predict the performance of the machine for image data compression. Analysis of metrics normally used in parallel realization is conducted.  相似文献   

18.
We design a task mapper TPCM for assigning tasks to virtual machines, and an application-aware virtual machine scheduler TPCS oriented for parallel computing to achieve a high performance in virtual computing systems. To solve the problem of mapping tasks to virtual machines, a virtual machine mapping algorithm (VMMA) in TPCM is presented to achieve load balance in a cluster. Based on such mapping results, TPCS is constructed including three components: a middleware supporting an application-driven scheduling, a device driver in the guest OS kernel, and a virtual machine scheduling algorithm. These components are implemented in the user space, guest OS, and the CPU virtualization subsystem of the Xen hypervisor, respectively. In TPCS, the progress statuses of tasks are transmitted to the underlying kernel from the user space, thus enabling virtual machine scheduling policy to schedule based on the progress of tasks. This policy aims to exchange completion time of tasks for resource utilization. Experimental results show that TPCM can mine the parallelism among tasks to implement the mapping from tasks to virtual machines based on the relations among subtasks. The TPCS scheduler can complete the tasks in a shorter time than can Credit and other schedulers, because it uses task progress to ensure that the tasks in virtual machines complete simultaneously, thereby reducing the time spent in pending, synchronization, communication, and switching. Therefore, parallel tasks can collaborate with each other to achieve higher resource utilization and lower overheads. We conclude that the TPCS scheduler can overcome the shortcomings of present algorithms in perceiving the progress of tasks, making it better than schedulers currently used in parallel computing.  相似文献   

19.
大规模CFD多区结构网格任务负载平衡算法   总被引:1,自引:0,他引:1  
针对现有负载平衡算法的适应度低、可扩展性差、通信开销度量不准确的缺陷, 提出一种大规模CFD多区结构网格任务负载平衡算法。通过对网格块的分割、网格块之间的组合映射、进程上网格计算量的调整来实现并行CFD任务负载平衡。实验结果表明, 该算法既适应同构平台也适应异构平台, 既适应网格块数多于进程数的情况也适应网格块数少于进程数的情况, 该算法可使得整个计算空间分配到各进程上的计算量负载平衡, 同时使得各进程间的最大通信开销最小。  相似文献   

20.
Triggered by the ever increasing advancements in processor and networking technology, a cluster of PCs connected by a high-speed network has become a viable and cost-effective platform for the execution of computation intensive parallel multithreaded applications. However, there are two research issues to be tackled in the scheduling problem for PC cluster computing: (1) how to reduce the communication overhead of executing a multithreaded application on the cluster; (2) how to exploit the heterogeneity, which is unavoidable in an evolving PC cluster, for the application. In this paper, we propose to use a duplication based approach in scheduling tasks/threads to a heterogeneous cluster of PCs. In duplication based scheduling, critical tasks are redundantly scheduled to more than one machine, in order to reduce the number of inter-task communication operations. The start times of the succeeding tasks are also reduced. The task duplication process is guided given the system heterogeneity in that the critical tasks are scheduled or replicated in faster machines. The algorithm has been implemented in our experimental application parallelization system for generating multithreaded parallel code executable on a cluster of Pentium PCs. Our experiments, using three numerical applications and one protocol processing kernel (multithreading per request), have indicated that heterogeneity of PC cluster is indeed useful for optimizing the execution of parallel multithreaded programs.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号