Similar Articles
Found 19 similar articles (search time: 125 ms)
1.
Speculative multithreading performs thread partitioning mainly on compiler-generated instructions, automatically parallelizing sequential programs on the basis of control-flow and data-flow analysis. A simulator is an effective means of validating thread-partitioning algorithms: it can verify that program results are correct, evaluate the speedup of concurrent execution, and thereby also reflect how well-founded the partitioning algorithm is. Using runtime statistics collected from running the Olden suite on a simulator, this paper analyzes the register-dependence problem that arises in thread partitioning and discusses the main causes of register dependences in detail with examples. Finally, an improved thread-partitioning method is proposed to address the register-dependence problem.
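The register-dependence problem above can be sketched with a toy intermediate representation: a register written in one candidate thread and read in a later one forces communication or synchronization between the threads. The tuple-based IR and register names below are hypothetical, for illustration only.

```python
def register_dependences(partition_a, partition_b):
    """Find registers written in an earlier candidate thread and read
    in a later one -- the inter-thread register dependences that limit
    thread partitioning.  Instructions are (dest_reg, [src_regs])
    tuples in a hypothetical IR."""
    written = {dst for dst, _ in partition_a}
    return sorted({src for _, srcs in partition_b for src in srcs
                   if src in written})
```

A partitioner could use such a scan to reject or repair splits with many cross-thread dependences.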

2.
To analyze and predict the kernel performance of CUDA parallel programs, and thereby guide parallel program design and performance optimization, a performance-prediction framework is proposed. 1) Starting from the GPU programming model and device-architecture details, and taking the warp as the unit of study, the framework integrates the basic software and hardware characteristics closely related to GPU program runtime to define high-level performance features such as parallel-space idleness, per-SM warp load, and a parallel-effect factor. 2) Based on these features, the framework estimates the execution time of a kernel under different problem sizes and launch configurations for load-balanced GPU programs. 3) An optimization strategy for kernel launch-configuration parameters is derived from the evaluation principles. Experiments show average prediction accuracies of 89% and 94% in two typical scenarios; the framework objectively captures the correlation between high-level features and program performance and can qualitatively assess the performance of parallel algorithms.
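As a rough illustration of warp-level modeling of this kind (not the paper's actual model), kernel time can be estimated from the warp load of the most heavily loaded streaming multiprocessor. The fixed per-warp cost and all parameters below are simplifying assumptions.

```python
import math

def estimate_kernel_time(n_threads, block_size, n_sms, warp_size=32,
                         cycles_per_warp=1000, clock_hz=1.0e9):
    """Toy warp-level execution-time model for a load-balanced kernel.

    Warps are assumed to be distributed evenly over the SMs and each
    warp is charged a fixed cycle cost; the busiest SM bounds the
    kernel's execution time."""
    warps_per_block = math.ceil(block_size / warp_size)
    n_blocks = math.ceil(n_threads / block_size)
    total_warps = n_blocks * warps_per_block
    max_warps_on_one_sm = math.ceil(total_warps / n_sms)
    return max_warps_on_one_sm * cycles_per_warp / clock_hz
```

Sweeping `block_size` in such a model is how a launch-configuration optimizer would compare candidate configurations.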

3.
Cheng Xu, 《电子器件》 (Electronic Devices), 1997, 20(1): 423-428
Instruction-level parallelism is an important way to sustain continued improvements in processor performance. Using the S&S simulation system we developed, we quantitatively analyzed, from several angles, how different execution mechanisms affect the presence and exploitation of potential instruction-level parallelism in eight benchmark programs. The results indicate that speculative execution and multiple-control-flow parallelism are key to fully exploiting the potential parallelism in programs.

4.
Loops account for most of the running time of FORTRAN programs, so effective exploitation of loop-level parallelism is one of the most critical steps in program parallelization. Although loop parallelization has been studied extensively and deeply, existing techniques still handle many loops unsatisfactorily. This paper presents a new technique that effectively and fully exploits irregular parallelism across loop iterations.

5.
Parallelized programs greatly improve application execution efficiency, and multi-core program design must take performance into account. This paper focuses on how parallelization overhead, load balancing, and thread-synchronization overhead affect the performance of multi-core, multithreaded programs under the OpenMP programming model.
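The three cost sources named above can be folded into a simple analytical speedup model. The model below is an illustrative sketch, not taken from the paper; all parameters are assumptions.

```python
def parallel_time(serial_time, parallel_fraction, n_threads,
                  fork_join_overhead=0.0, sync_per_thread=0.0,
                  imbalance=0.0):
    """Toy model of multithreaded runtime.

    imbalance: extra fraction of the ideal parallel slice carried by
    the slowest thread (0.0 = perfectly balanced).  Synchronization
    cost is assumed to grow linearly with the thread count."""
    serial_part = serial_time * (1.0 - parallel_fraction)
    ideal_slice = serial_time * parallel_fraction / n_threads
    slowest = ideal_slice * (1.0 + imbalance)
    return (serial_part + slowest
            + fork_join_overhead + sync_per_thread * n_threads)

def speedup(serial_time, **kw):
    return serial_time / parallel_time(serial_time, **kw)
```

With zero overheads the model reduces to Amdahl's law; nonzero fork/join, sync, or imbalance terms pull the speedup below the ideal curve.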

6.
To address the shortcomings of most existing helper-thread data-prefetching methods for pointer-based applications in controlling prefetch distance, this paper proposes a prefetch-distance control strategy based on cache-behavior characteristics. The strategy builds a prefetch-distance control model from the data-cache behavior observed while the pointer application executes, so as to avoid shared-cache pollution and reduce contention for system resources; by skipping prefetches for some non-loop-carried-dependence data, it balances the work between the helper thread and the main thread and improves the timeliness of thread-based data prefetching. Experimental results show that controlling prefetch distance with this strategy further improves prefetching performance.
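One simple way to realize prefetch-distance control of the kind described (a sketch under assumed parameters, not the paper's model) is to bound the distance below by the miss latency to be hidden and above by a cache-pollution budget:

```python
import math

def prefetch_distance(miss_latency_cycles, iter_cycles,
                      shared_cache_lines, lines_per_iter,
                      max_fraction=0.5):
    """Illustrative prefetch-distance choice.

    Lower bound: far enough ahead to hide the miss latency.
    Upper bound: close enough that prefetched lines occupy at most a
    bounded fraction of the shared cache, avoiding pollution; the
    pollution cap wins when the two conflict."""
    lower = max(1, math.ceil(miss_latency_cycles / iter_cycles))
    upper = max(1, int(shared_cache_lines * max_fraction / lines_per_iter))
    return min(lower, upper)
```

The `max_fraction` budget is the knob that trades latency hiding against shared-cache pollution.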

7.
Targeting the characteristics of the HEVC intra-prediction angular-mode algorithm, a parallelization method for angular prediction is proposed. Based on the BWDSP1041 simulation platform, the parallelism of the angular-mode algorithm is analyzed, a data-distribution scheme suited to parallel computation on multiple multipliers is proposed, and, in combination with the processor's hardware resources, an algorithm in which multiple functional units work in parallel is designed. Experiments show that angular mode 20 and vertical mode 26, implemented in parallel on the BWDSP1041 hardware, achieve speedups of 161.68 and 344.65, respectively. The parallelized algorithm reduces video-encoding time, and its data-distribution scheme is also a useful reference for parallelizing intra-prediction algorithms on multi-core and multi-functional-unit architectures.

8.
With the rapid growth of network bandwidth, multi-pattern matching (MPM), the core module of network-security devices, faces severe performance challenges. This paper proposes an efficient packet-splitting and parallel-matching algorithm, the distance-comparison parallel matching (DCPM) algorithm. Compared with existing methods, parallel DCPM threads incur no synchronization overhead, and the redundant-detection overhead introduced reaches the theoretical minimum. Based on the Aho-Corasick (AC) algorithm, DCPM is compared with existing packet-splitting methods on an 8-core platform. Experiments show that DCPM adapts better: its performance is less affected by the proportion of pattern strings in the traffic, the pattern length, and the number of automaton states; on real data sets, DCPM improves the speedup by a factor of 1.3 to 3.5.
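For context, the baseline packet-splitting scheme that DCPM improves on can be sketched as follows: each thread scans its chunk plus an overlap of (longest pattern − 1) bytes so that matches spanning a chunk boundary are not lost; DCPM's contribution is to drive such redundant scanning down to the theoretical minimum without inter-thread synchronization. The naive matcher below stands in for an Aho-Corasick automaton.

```python
from concurrent.futures import ThreadPoolExecutor

def find_all(chunk, base, patterns):
    """Naive matcher standing in for an Aho-Corasick automaton."""
    hits = set()
    for p in patterns:
        start = 0
        while (i := chunk.find(p, start)) != -1:
            hits.add((base + i, p))
            start = i + 1
    return hits

def parallel_match(payload, patterns, n_workers=4):
    """Baseline packet splitting with (max pattern length - 1) overlap."""
    overlap = max(len(p) for p in patterns) - 1
    step = max(1, -(-len(payload) // n_workers))  # ceiling division
    tasks = [(payload[s:s + step + overlap], s)
             for s in range(0, len(payload), step)]
    hits = set()
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        for part in ex.map(lambda t: find_all(t[0], t[1], patterns), tasks):
            hits |= part
    return sorted(hits)
```

The overlap bytes are exactly the redundant work that a scheme like DCPM seeks to minimize.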

9.
A hybrid parallel programming model for multi-core clusters is constructed. The model combines shared-memory, task-oriented TBB programming with message-passing MPI programming, exploiting the strengths of both to achieve two-level parallelism: processes mapped to compute nodes and, within each process, threads mapped to processor cores. Compared with programs written in either model alone, algorithms using this hybrid model not only reduce execution time and achieve better speedup and efficiency, but also noticeably improve cluster performance.

10.
To reduce the computation of convolutional neural networks (CNNs), this paper introduces a 2-D fast filtering algorithm into CNNs and proposes a hardware architecture for layer-by-layer CNN acceleration on an FPGA. First, a line-buffer loop-control unit designed with loop transformations manages the input feature-map data across different convolution windows and different layers, and flag signals start the convolution acceleration unit to achieve layer-by-layer acceleration. Second, a convolution acceleration unit based on a 4-parallel fast filtering algorithm is designed, implemented as a low-complexity parallel filtering structure composed of several small filters. Tests of the CNN accelerator circuit on the MNIST handwritten-digit set show that, on a Xilinx Kintex-7 platform with a 100 MHz input clock, the circuit reaches a computing performance of 20.49 GOPS with a recognition rate of 98.68%, demonstrating that reducing the computation of the CNN improves circuit performance.
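The "fast filtering" idea can be illustrated in 1-D with the standard Winograd minimal filtering kernel F(2,3), which produces two FIR outputs with four multiplications instead of six; the paper's accelerator applies a 2-D, 4-parallel variant of the same principle. This sketch is illustrative, not the paper's exact formulation.

```python
def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 3-tap FIR
    from four inputs using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: direct sliding-window correlation (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In hardware, the filter-side transforms (sums of `g` terms) are computed once per layer, so only the four products cost a multiplier each cycle.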

11.
To speed up data-intensive programs, two complementary techniques, namely nested-loop parallelization and data-locality optimization, should be considered. Effective parallelization techniques distribute the computation and the necessary data across different processors, whereas locality optimization places data on the same processor as the computation that uses it. Therefore, locality and parallelization may demand different loop transformations, and an integrated approach that combines the two can generate much better results than either individual approach. This paper proposes a unified approach that integrates these two techniques to obtain an appropriate loop transformation. Applying this transformation yields coarse-grain parallelism, by exploiting the largest possible groups of outer permutable loops, as well as data locality, through dependence satisfaction at inner loops. These groups can be further tiled to improve data locality by exploiting data reuse in multiple dimensions.
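Tiling a group of permutable loops can be illustrated on matrix multiplication: the outer tile loops are permutable (and could be distributed across processors), while tiling keeps a small block of each operand in fast memory. A minimal sketch, not the paper's algorithm:

```python
def matmul_tiled(A, B, tile=2):
    """Tiled matrix multiply: the ii/jj tile loops are permutable and
    parallelizable; tiling the k loop exploits data reuse so each
    block of A and B stays in fast memory while it is used."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # outer permutable tile loops
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        s = C[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C
```

Because the loop body is unchanged, any tile size produces the same result; only the memory-access order (and hence locality) differs.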

12.
Efficient address-register allocation has been shown to be a central problem in code generation for processors with restricted addressing modes. This paper extends previous work on Global Array Reference Allocation (GARA), the problem of allocating address registers to array references in loops. It describes two heuristics for the problem, presenting experimental data to support them. In addition, it proposes an approach that solves GARA optimally and which, albeit computationally exponential, is useful for measuring the efficiency of other methods. Experimental results, using the MediaBench benchmark and profiling information, reveal that the proposed heuristics solve the majority of the benchmark loops near-optimally in polynomial time. A substantial execution-time speedup is reported for the benchmark programs after they are compiled with the original and the optimized versions of GCC.

13.
This paper studies how to parallelize the emerging media mining workloads on existing small-scale multi-core processors and future large-scale platforms. Media mining is an emerging technology to extract meaningful knowledge from large amounts of multimedia data, aiming at helping end users search, browse, and manage multimedia data. Many of the media mining applications are very complicated and require a huge amount of computing power. The advent of multi-core architectures provides the acceleration opportunity for media mining. However, to efficiently utilize the multi-core processors, we must effectively execute many threads at the same time. In this paper, we present how to explore the multi-core processors to speed up the computation-intensive media mining applications. We first parallelize two media mining applications by extracting the coarse-grained parallelism and evaluate their parallel speedups on a small-scale multi-core system. Our experiment shows that the coarse-grained parallelization achieves good scaling performance, but not perfect. When examining the memory requirements, we find that these coarse-grained parallelized workloads expose high memory demand. Their working set sizes increase almost linearly with the degree of parallelism, and the instantaneous memory bandwidth usage prevents them from perfect scalability on the 8-core machine. To avoid the memory bandwidth bottleneck, we turn to exploit the fine-grained parallelism and evaluate the parallel performance on the 8-core machine and a simulated 64-core processor. Experimental data show that the fine-grained parallelization demonstrates much lower memory requirements than the coarse-grained one, but exhibits significant read-write data sharing behavior. Therefore, the expensive inter-thread communication limits the parallel speedup on the 8-core machine, while excellent speedup is observed on the large-scale processor as fast core-to-core communication is provided via a shared cache. 
Our study suggests that (1) extracting coarse-grained parallelism scales well on small-scale platforms but poorly on large-scale systems; (2) exploiting fine-grained parallelism is suitable for realizing the power of large-scale platforms; and (3) future many-core chips can provide a shared cache and sufficient on-chip interconnect bandwidth to enable efficient inter-core communication for applications with significant amounts of shared data. In short, this work demonstrates that proper parallelization techniques are critical to the performance of multi-core processors. We also demonstrate that performance analysis is one of the important factors in parallelization. The parallelization principles, practice, and performance-analysis methodology presented in this paper are also useful to anyone exploiting thread-level parallelism in their applications.

14.
In this paper, we propose an automatic facial expression exaggeration system, consisting of face detection, facial expression recognition, and facial expression exaggeration components, that generates exaggerated views of different expressions for an input face video. In addition, parallelized algorithms for the system are developed to reduce execution time on a multi-core embedded system. The experimental results show satisfactory exaggeration quality and computational efficiency under cluttered environments, and quantitative comparisons show that the proposed parallelization strategies provide significant speedup over the single-processor implementation on a multi-core embedded platform.

15.
A fast radar-echo clustering method based on a GPU-CPU pipeline is proposed. Exploiting the asynchronous execution of the GPU and the CPU, the method organizes the steps of clustering into a pipeline, greatly increasing the parallelism of the whole clustering process. Experiments show that with this GPU-CPU pipeline mechanism, the method outperforms an ordinary GPU-based parallel clustering algorithm by 38% and achieves a 47x speedup over the traditional serial CPU program, meeting the real-time requirements of meteorological real-time analysis applications.
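The benefit of overlapping asynchronous stages can be quantified with a standard linear-pipeline timing model (an illustrative sketch, not the paper's measurement):

```python
def pipeline_time(stage_times, n_items):
    """Time to push n_items through a linear pipeline whose stages
    (e.g. CPU pre-processing and GPU clustering) run asynchronously:
    fill time for the first item, then one result per bottleneck
    period."""
    return sum(stage_times) + (n_items - 1) * max(stage_times)

def serial_time(stage_times, n_items):
    """Time with no overlap: every item pays every stage in turn."""
    return n_items * sum(stage_times)
```

The speedup approaches sum(stage_times) / max(stage_times) as the batch count grows, which is why balancing the CPU and GPU stage lengths matters.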

16.
Coprocessor design is one application of high-level synthesis. We focus on high-performance coprocessors that speed up time-critical parts in hardware-software codesign of embedded controllers. Time-critical software parts often contain nested loops, often with data-dependent branches and data-dependent numbers of iterations. When (loop) pipelining is employed for high performance, the control dependencies become a dominant limitation on pipeline utilization. Branch prediction is a possible approach but is usually restricted to few instructions and to one branch because of hardware and control overhead. Multiple branch prediction and speculative computation take a more global view of the program flow. We give practical examples of how speculative computation with multiple branch prediction increases performance far beyond usual ASAP scheduling based on a CDFG. For scheduling, speculative computation requires a modification of the CDFG and, in the allocation phase, the insertion of register sets to save the processor status. The controller needs only slight modification. We conclude that manual application of our approach will in general be too difficult, so it can only be used in connection with synthesis.

17.
Magnetic resonance elastography can be limited by the computationally intensive nonlinear inversion schemes that are sometimes employed to estimate shear modulus from externally induced internal tissue displacements. Consequently, we have developed a parallelized partial volume reconstruction approach to overcome this limitation. In this paper, we report results from experiments conducted on breast phantoms and human volunteers to validate the proposed technique. More specifically, we demonstrate that computational cost is linearly related to the number of subzones used during image recovery and that both subzone parallelization and partial volume domain reduction decrease execution time accordingly. Importantly, elastograms computed based on the parallelized partial volume technique are not degraded in terms of either image quality or accuracy relative to their full volume counterparts provided that the estimation domain is sufficiently large to negate the effects of boundary conditions. The clinical results presented in this paper are clearly preliminary; however, the parallelized partial volume reconstruction approach performs sufficiently well to warrant more in-depth clinical evaluation.

18.
In the area of automatic parallelization of programs, analyzing and transforming loop nests with parametric affine loop bounds requires fundamental mathematical results. The most common geometrical model of iteration spaces, called the polytope model, is based on mathematics dealing with convex and discrete geometry, linear programming, combinatorics and geometry of numbers. In this paper, we present automatic methods for computing the parametric vertices and the Ehrhart polynomial, i.e., a parametric expression of the number of integer points, of a polytope defined by a set of parametric linear constraints. These methods have many applications in analysis and transformations of nested loop programs. The paper is illustrated with exact symbolic array dataflow analysis, estimation of execution time, and with the computation of the maximum available parallelism of given loop nests.
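The parametric point-counting idea can be checked on the simplest example: for the triangle {x ≥ 0, y ≥ 0, x + y ≤ n}, the Ehrhart polynomial is (n+1)(n+2)/2, which a brute-force count confirms for small n. An illustrative sketch:

```python
def lattice_points_triangle(n):
    """Count integer points in {0 <= x, 0 <= y, x + y <= n} by brute
    force -- e.g. the iteration count of a triangular loop nest
    'for i in 0..n: for j in 0..n-i'."""
    return sum(1 for x in range(n + 1) for y in range(n + 1) if x + y <= n)

def ehrhart_triangle(n):
    """Closed-form Ehrhart polynomial for the same parametric family."""
    return (n + 1) * (n + 2) // 2
```

In the polytope model, such closed forms replace enumeration, giving symbolic iteration counts usable for execution-time estimates and parallelism bounds.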

19.
In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler‐hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write‐back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single‐instruction multiple‐data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32‐bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2‐way MLEP and 33.7% faster with a 4‐way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler.
