Similar Literature
20 similar records found.
1.
This paper presents an analytically robust, globally convergent approach to managing the use of approximation models of varying fidelity in optimization. By robust global behaviour we mean the mathematical assurance that the iterates produced by the optimization algorithm, started at an arbitrary initial iterate, will converge to a stationary point or local optimizer for the original problem. The approach presented is based on the trust region idea from nonlinear programming and is shown to be provably convergent to a solution of the original high-fidelity problem. The proposed method for managing approximations in engineering optimization suggests ways to decide when the fidelity, and thus the cost, of the approximations might be fruitfully increased or decreased in the course of the optimization iterations. The approach is quite general. We make no assumptions on the structure of the original problem, in particular, no assumptions of convexity and separability, and place only mild requirements on the approximations. The approximations used in the framework can be of any nature appropriate to an application; for instance, they can be represented by analyses, simulations, or simple algebraic models. This paper introduces the approach and outlines the convergence analysis. This research was supported by the Dept. of Energy grant DEFG03-95ER25257 and Air Force Office of Scientific Research grant F49620-95-1-0210. This research was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681, USA. This research was supported by the Air Force Office of Scientific Research grant F49620-95-1-0210 and by the National Aeronautics and Space Administration under NASA Contract No. NAS1-19480 while the author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681, USA.
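The abstract stays at the level of the convergence theory; the loop below is an illustrative sketch (not the authors' algorithm) of the basic trust-region model-management step it builds on: optimize a cheap low-fidelity surrogate inside the trust region, check the step against the expensive high-fidelity function, and grow or shrink the region according to how well the two agree. The functions `f_hi` and `f_lo`, the grid search, and all constants are hypothetical choices for the sketch.

```python
import numpy as np

def trust_region_opt(f_hi, f_lo, x0, delta=1.0, max_iter=50):
    """Sketch of trust-region model management: optimize the cheap surrogate
    f_lo inside the trust region, accept the step only if the expensive
    high-fidelity f_hi confirms the predicted improvement, and grow or
    shrink the region according to how well the two agree."""
    x = x0
    for _ in range(max_iter):
        # Minimize the low-fidelity model on a coarse grid inside the region.
        cand = x + np.linspace(-delta, delta, 41)
        s = cand[np.argmin([f_lo(c) for c in cand])]
        pred = f_lo(x) - f_lo(s)               # reduction predicted by the surrogate
        if pred <= 1e-12:                      # surrogate proposes no progress
            delta *= 0.5
            continue
        actual = f_hi(x) - f_hi(s)             # reduction measured at high fidelity
        rho = actual / pred                    # agreement ratio
        if rho > 0.1:
            x = s                              # enough real improvement: accept step
        if rho >= 0.75:
            delta *= 2.0                       # surrogate trustworthy here: grow region
        elif rho < 0.25:
            delta *= 0.5                       # surrogate misleading: shrink region
        if delta < 1e-8:
            break
    return x

# Toy usage: expensive quartic objective, cheap quadratic surrogate.
print(trust_region_opt(lambda x: (x - 3)**4 + x**2,
                       lambda x: (x - 2)**2, x0=0.0))
```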

2.
In this paper we propose a fast method for solving wave guide problems. In particular, we consider the guide to be inhomogeneous, and allow propagation of waves of higher-order modes. Such techniques have been handled successfully for acoustic wave propagation problems with a single mode and finite length. This paper extends this concept to electromagnetic wave guides with several modes and infinite length. The method is described and results of computations are presented. Research was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-18107 while the first author was in residence at ICASE, NASA Langley Research Center, Hampton, VA 23665-5225, and by NASA Grant No. NAG-1-624.

3.
Sufficient conditions are presented for a two-dimensional system with output to be locally observable. Known results depend on time derivatives of the output and the inverse function theorem. In some cases, no information is provided by these theories, and one must study observability by other methods. We dualize the observability problem to the controllability problem, and apply the deep results of Hermes on local controllability to prove a theorem concerning local observability. Research supported by NASA Ames Research Center under Grant NAG2-189 and the Joint Services Electronics Program under ONR Contract N0014-76-C1136. Research supported by NASA Ames Research Center under Grant NAG2-203 and the Joint Services Electronics Program under ONR Contract N0014-76-C1136.

4.
An efficient three-dimensional unstructured Euler solver is parallelized on a CRAY Y-MP C90 shared-memory computer and on an Intel Touchstone Delta distributed-memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between the two differing architectures are made. This work was sponsored in part by ARPA (NAG-1-1485) and by NASA Contract No. NAS1-19480 while authors Mavriplis, Saltz and Das were in residence at ICASE, NASA Langley Research Center, Hampton, Virginia. This research was performed in part using the Intel Touchstone Delta System operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by NASA Langley Research Center and the Center for Research in Parallel Processing. The content of the information does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

5.
In irregular scientific computational problems one is periodically forced to choose a delay point, where some overhead cost is incurred either to ensure correctness or to improve subsequent performance. Examples of delay points are problem remappings and global synchronizations. One sometimes has considerable latitude in choosing the placement and frequency of delay points; we consider the problem of scheduling delay points so as to minimize the overall execution time. We illustrate the problem with two examples: a regridding method that changes the problem discretization during the course of the computation, and a method for solving sparse triangular systems of linear equations. We show that one can choose delay points optimally in polynomial time using dynamic programming. However, the cost models underlying this approach are often unknown. We consequently examine a scheduling heuristic based on maximizing performance locally, and empirically show it to be nearly optimal on both problems. We explain this phenomenon analytically by identifying underlying assumptions which imply that overall performance is maximized asymptotically if local performance is maximized. This research was supported in part by the National Aeronautics and Space Administration under NASA contract NAS1-18107 while the author consulted at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, Virginia 23665. Supported in part by NASA contract NAS1-18107, the Office of Naval Research under Contract No. N00014-86-K-0654, and NSF Grant DCR 8106181.
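As a concrete illustration of the dynamic-programming formulation mentioned above (a sketch under an assumed cost model, not the paper's code): `step_cost(j, k)` is taken to be the cost of computation step j when the most recent delay point was placed just before step k, and `delay_cost` is the overhead of a delay point such as a remapping or global synchronization. All names and the toy cost function are hypothetical.

```python
from functools import lru_cache

def schedule_delay_points(n, step_cost, delay_cost):
    """Choose delay-point positions for n steps to minimize total time.
    Returns (minimum total time, tuple of positions where a delay point
    is taken just before that step)."""
    @lru_cache(maxsize=None)
    def best(k):
        # Minimum cost of steps 0..k-1, given a delay point is taken right before step k.
        if k == 0:
            return 0.0, ()
        options = []
        for m in range(k):            # m = position of the previous delay point
            prev_cost, prev_pts = best(m)
            run = sum(step_cost(j, m) for j in range(m, k))
            options.append((prev_cost + run + delay_cost, prev_pts + (k,)))
        return min(options)

    totals = []
    for k in range(n + 1):            # k = position of the last delay point (0 = none)
        c, pts = best(k)
        tail = sum(step_cost(j, k) for j in range(k, n))
        totals.append((c + tail, pts))
    return min(totals)

# Toy model: steps get slower the longer it has been since the last remapping.
cost, points = schedule_delay_points(
    n=12, step_cost=lambda j, k: 1.0 + 0.3 * (j - k), delay_cost=2.0)
print(cost, points)
```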

6.
We present a demand-driven memory-leak detection algorithm based on flow- and context-sensitive pointer analysis. The detection algorithm first assumes the presence of a memory leak at some program point and then runs a backward analysis to see whether this assumption can be disproved. Our algorithm computes the memory abstraction of programs based on the points-to graphs resulting from flow- and context-sensitive pointer analysis. We have implemented the algorithm in the SUIF2 compiler infrastructure and used the implementation to analyze a set of C benchmark programs. The experimental results show that the approach achieves better precision with satisfactory scalability, as expected. This work is supported by the National Natural Science Foundation of China under Grant Nos. 60725206, 60673118, and 90612009, the National High-Tech Research and Development 863 Program of China under Grant No. 2006AA01Z429, the National Basic Research 973 Program of China under Grant No. 2005CB321802, the Program for New Century Excellent Talents in University under Grant No. NCET-04-0996, and the Hunan Natural Science Foundation under Grant No. 07JJ1011.
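A heavily simplified sketch of the demand-driven idea (not the SUIF2 implementation; aliasing and the context sensitivity handled by the paper's points-to analysis are ignored here): assume a leak at a program point, then walk the control-flow graph backwards and try to disprove the assumption. The mini statement language and all names are hypothetical.

```python
def may_leak(stmts, preds, leak_node, var):
    """Assume the object referenced by `var` leaks at `leak_node`, then walk
    the control-flow graph backwards trying to disprove the assumption.
    stmts[n] is the statement at node n, preds[n] its predecessor nodes.
    Statements: ('malloc', x), ('copy', dst, src), ('free', x), ('escape', x).
    Returns True if some backward path reaches the allocation without the
    object being freed or escaping, i.e. the leak report survives."""
    work, seen = [(leak_node, frozenset([var]))], set()
    while work:
        node, held = work.pop()
        if (node, held) in seen:
            continue
        seen.add((node, held))
        stmt = stmts[node]
        if stmt[0] in ('free', 'escape') and stmt[1] in held:
            continue                    # this path disproves the assumed leak
        if stmt[0] == 'copy' and stmt[1] in held:
            # Backwards through dst = src: the tracked value came from src.
            held = (held - {stmt[1]}) | {stmt[2]}
        if stmt[0] == 'malloc' and stmt[1] in held:
            return True                 # reached the allocation: leak not disproved
        for p in preds.get(node, []):
            work.append((p, held))
    return False                        # every backward path disproved the leak

# malloc; free; exit -> the report is disproved.
stmts = {1: ('malloc', 'p'), 2: ('free', 'p'), 3: ('exit',)}
preds = {3: [2], 2: [1], 1: []}
print(may_leak(stmts, preds, leak_node=3, var='p'))    # False
# Same path without the free -> the report survives.
stmts[2] = ('escape', 'q')
print(may_leak(stmts, preds, leak_node=3, var='p'))    # True
```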

7.
Parallel loop partitioning with optimized data transfer for SIMD machines
This paper proposes a loop-partitioning framework for SIMD machines with distributed local memory that optimizes the data transfers required by a computation. The framework uses matrices to represent the iteration space, the data space, and array access patterns. We introduce the notion of data transfer and build a simple, effective data-transfer model to estimate the cost of moving data between global memory and local memory. Finally, for a given loop nest, we give a loop-partitioning algorithm that derives optimized loop tiles, minimizing the data-transfer cost of the loop nest and greatly reducing the synchronization overhead between data transfer and computation. Experimental results demonstrate the…

8.
Fine-grain multithreaded architectures expose an enormous amount of parallelism for covering latencies, and managing that degree of parallelism by hand is a demanding task for the programmer. To use multithreaded architectures efficiently, it is essential to have compiler support for automatically partitioning programs into threads. This paper addresses a fundamental problem in compiling for multithreaded architectures: automatically partitioning a program into threads so as to overlap remote communication latency and minimize total execution time. We first formulate the partitioning problem based on a multithreaded execution cost model and then prove that this formulation is NP-hard. We therefore propose two heuristic thread-partitioning methods to solve the problem in practice. The advanced partitioning algorithm is a novel extension of list scheduling that exploits the cost model to generate near-optimum partitions. The remote-path-based partitioning algorithm is a simplified version of the advanced one that is easier to implement in a compiler. The two partitioning algorithms were implemented, respectively, in a thread-partitioning testbed and a research EARTH-C compiler. The experimental results show that both partitioning algorithms are effective in generating efficient threaded code, and that code generated by the compiler is comparable to hand-written code.

9.
This paper describes a verified compiler for PreScheme, the implementation language for the vlisp run-time system. The compiler and proof were divided into three parts: a transformational front end that translates source text into a core language, a syntax-directed compiler that translates the core language into a combinator-based tree-manipulation language, and a linearizer that translates combinator code into code for an abstract stored-program machine with linear memory for both data and code. This factorization enabled different proof techniques to be used for the different phases of the compiler, and also allowed the generation of good code. Finally, the whole process was made possible by carefully defining the semantics of vlisp PreScheme rather than just adopting Scheme's. We believe that the architecture of the compiler and its correctness proof can easily be applied to compilers for languages other than PreScheme. This work was supported by Rome Laboratory of the United States Air Force, contract No. F19628-89-C-0001, through the MITRE Corporation, and by NSF and DARPA under NSF grants CCR-9002253 and CCR-9014603. Author's current address: Department of Computer Science and Engineering, Oregon Graduate Institute, P.O. Box 91000, Portland, OR 97291-1000. The work reported here was supported by Rome Laboratory of the United States Air Force, contract No. F19628-89-C-0001. Preparation of this paper was generously supported by The MITRE Corporation. This work was supported by Rome Laboratory of the United States Air Force, contract No. F19628-89-C-0001, through the MITRE Corporation, and by NSF and DARPA under NSF grants CCR-9002253 and CCR-9014603.

10.
This paper presents the results of multitasking a Navier-Stokes algorithm on the CRAY-2. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. Two implementations of multitasking on the CRAY-2 are considered: macrotasking (parallelism at the subroutine level) and microtasking (parallelism at the do-loop level). These two techniques are briefly described. The implementation of the algorithm is discussed in relation to these techniques, and the results for three problem sizes are presented. The timing results for both techniques are, in general, comparable, with differences ranging between 2% and 14%, depending on the problem size. The best achieved speedup in a dedicated environment is 3.62 for macrotasking and 3.32 for microtasking. The task granularity for both techniques is computed, and the synchronization costs are estimated. For macrotask granularities of up to 0.5 msec, microtasking outperformed macrotasking, while the latter outperformed the former for granularities of over one msec. This research was supported by NASA Contract No. NAS2-11555 while the author was an employee of Sterling Software under contract to the Numerical Aerodynamic Simulation Systems Division at NASA Ames Research Center, Moffett Field, CA 94035.

11.
We propose an automatic global data-partitioning algorithm for SIMD machines. The algorithm handles multiple imperfectly nested loops in which array subscripts are linear expressions of the loop variables. It first abstracts the communication in a computation through data and iteration mappings, then gives formal conditions for recognizing regular communication patterns, and next builds a data-iteration graph that captures alignment information and the corresponding communication costs. On the basis of this graph, a heuristic algorithm computes good data and iteration distributions so as to reduce the communication cost among processing elements. By analyzing the mappings of the multiple arrays involved in multiple loop nests and…

12.
GPGPU accelerators are currently the mainstream platform for accelerating image-processing algorithms. On a GPGPU, however, an optimized version of a program that fully exploits the hardware architecture and software characteristics can outperform a naive implementation by orders of magnitude. GPGPU accelerators provide a large number of execution threads organized in multiple dimensions and levels, together with a hierarchical memory system whose levels differ in capacity, bandwidth, latency, and access permissions. At the same time, image-processing applications involve complex computations, boundary-handling rules, and data-access characteristics. Consequently, the concurrent execution pattern of tasks, the organization of threads, and the mapping of concurrent tasks onto the device affect not only the program's degree of concurrency, scheduling, communication, and synchronization, but also memory bandwidth and latency. Program optimization on GPGPU platforms is therefore a difficult, complex, and inefficient process. This paper proposes ParaC, a domain-specific programming model based on language extensions. Using the program semantics described by the high-level language extensions, the ParaC environment automatically analyzes program characteristics such as the application's operations, data reuse among concurrent tasks, and memory-access behavior; combined with the characteristics of the hardware platform, a compilation optimization model driven by domain prior knowledge automatically generates optimized code for the GPGPU platform, and a source-to-source compiler finally emits standard OpenCL programs. Experimental results on our test cases show that the optimized versions ParaC generates automatically on a GPGPU achieve speedups of up to 3.22x over hand-optimized versions, while requiring only 1.2% to 39.68% of the latter's lines of code.

13.
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
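A small sketch of the data-centric idea on a familiar kernel (illustrative only; the paper's data-shackling operates inside the MIPSPro compiler): pick a blocking of the shackled array, visit its blocks in order, and execute together all iterations that touch the current block. The function name and block size are assumptions for the sketch.

```python
import numpy as np

def shackled_matmul(A, B, block=32):
    """Data-centric sketch: choose an order for blocks of C (the shackled
    array), then execute all iterations that touch the current block while
    it is resident in cache.  Equivalent to C = A @ B."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    # Traverse the data blocks of C; all work on one block is done together.
    for ib in range(0, n, block):
        for jb in range(0, p, block):
            i_end, j_end = min(ib + block, n), min(jb + block, p)
            for kb in range(0, m, block):
                k_end = min(kb + block, m)
                # Iterations (i, j, k) writing into this block of C.
                C[ib:i_end, jb:j_end] += A[ib:i_end, kb:k_end] @ B[kb:k_end, jb:j_end]
    return C

A, B = np.random.rand(100, 80), np.random.rand(80, 60)
assert np.allclose(shackled_matmul(A, B), A @ B)
```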

14.
Overlapping communication with computation is a well-known approach to improving performance. Previous research has focused on optimizations performed by the programmer. This paper presents a compiler algorithm that automatically determines the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. The algorithm avoids generating redundant communication by providing a framework for combining information on data dependence, communication, and reuse. It also describes a method of generating messages to exchange data between processors for tiled loops on distributed memory machines. The algorithm has been implemented in our High Performance Fortran (HPF) compiler, and experimental results have shown its effectiveness on distributed memory machines, such as the RISC System/6000 Scalable POWERparallel System. This paper also discusses the architectural problems of efficient optimization.
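The transformation the compiler automates can be sketched by hand as follows (illustrative; `fetch_tile` stands in for the generated message exchange and, like the other names, is hypothetical): after tiling, the communication for tile t+1 is issued before tile t is computed, so the two overlap.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_tile(t):
    """Stand-in for the communication the compiler would generate,
    e.g. receiving the remote array section needed by tile t."""
    return [x * 1.0 for x in range(t * 100, (t + 1) * 100)]

def compute_tile(data):
    return sum(v * v for v in data)

def tiled_loop_with_overlap(num_tiles):
    """After tiling, the fetch for tile t+1 is started before the
    computation of tile t, so communication and computation overlap."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_tile, 0)              # prefetch first tile
        for t in range(num_tiles):
            data = pending.result()                       # wait for tile t's data
            if t + 1 < num_tiles:
                pending = pool.submit(fetch_tile, t + 1)  # overlap the next fetch
            total += compute_tile(data)                   # compute while it runs
    return total

print(tiled_loop_with_overlap(8))
```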

15.
We study lazy structure sharing as a tool for optimizing equivalence testing on complex data types. We investigate a number of strategies for implementing lazy structure sharing and provide upper and lower bounds on their performance (how quickly they effect ideal configurations of our data structure). In most cases when the strategies are applied to a restricted case of the problem, the bounds provide nontrivial improvements over the naïve linear-time equivalence-testing strategy that employs no optimization. Only one strategy, however, which employs path compression, seems promising for the most general case of the problem. Work completed while at Princeton University and supported by a Fannie and John Hertz Foundation Fellowship, National Science Foundation Grant No. CCR-8920505, and the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) under NSF-STC-91-19999. Work completed while at Princeton University and DIMACS and supported by DIMACS under NSF-STC-91-19999. Research at Princeton University partially supported by the National Science Foundation, Grant No. CCR-8920505, the Office of Naval Research, Contract No. N00014-91-J-1463, and by DIMACS under NSF-STC-91-19999.
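The path-compression strategy the abstract singles out is easiest to see in the classic union-find setting; the paper's lazy structure sharing on complex data types is more general, so the following is only a minimal sketch for intuition.

```python
def find(parent, x):
    """Follow parent pointers to the representative, compressing the path
    so that later equivalence tests touch far fewer nodes."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:          # second pass: point everything at the root
        parent[x], x = root, parent[x]
    return root

def union(parent, x, y):
    parent[find(parent, x)] = find(parent, y)

def equivalent(parent, x, y):
    return find(parent, x) == find(parent, y)

parent = {i: i for i in range(6)}
union(parent, 0, 1); union(parent, 1, 2); union(parent, 3, 4)
print(equivalent(parent, 0, 2), equivalent(parent, 2, 3))   # True False
```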

16.
Algorithms for parallel memory, II: Hierarchical multilevel memories
In this paper we introduce parallel versions of two hierarchical memory models and give optimal algorithms in these models for sorting, FFT, and matrix multiplication. In our parallel models, there are P memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using ℓ times the optimal running time is exponentially small in ℓ(log ℓ) log P. A summarized version of this research was presented at the 22nd Annual ACM Symposium on Theory of Computing, Baltimore, MD, May 1990. This work was done while the first author was at Brown University. Support was provided in part by a National Science Foundation Presidential Young Investigator Award with matching funds from IBM, by NSF Research Grants DCR-8403613 and CCR-9007851, by Army Research Office Grant DAAL03-91-G-0035, and by the Office of Naval Research and the Defense Advanced Research Projects Agency under Contract N00014-91-J-4052, ARPA Order 8225. This work was done in part while the second author was at Brown University supported by a Bellcore graduate fellowship and at Bellcore.

17.
We study strategies for converting randomized algorithms of the Las Vegas type into randomized algorithms with small tail probabilities. Supported by ESPRIT II Basic Research Actions Program of the EC under Contract No. 3075 (project ALCOM). Supported by ESPRIT II Basic Research Actions Program of the EC under Contract No. 3075 (Project ALCOM). Research supported by NSF Grant No. CCR-9005448. Partially supported by a Wolfson Research Award administered by the Israel Academy of Sciences and Humanities.
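One standard conversion of the kind studied here, shown as a sketch under an assumed per-run success probability (not the paper's specific strategies): cap each run of the Las Vegas algorithm and restart with fresh randomness, so the probability of needing many restarts decays exponentially. The budget and the toy routine are illustrative assumptions.

```python
import random

def with_restarts(las_vegas_step, budget, max_restarts=1000):
    """Run the Las Vegas algorithm for at most `budget` steps; on failure,
    restart with fresh random choices.  If a single run finishes within the
    budget with probability p, then P(more than k restarts) <= (1 - p)**k."""
    for attempt in range(max_restarts):
        result = las_vegas_step(budget)
        if result is not None:
            return result, attempt
    raise RuntimeError("exceeded restart limit")

def toy_las_vegas(budget):
    """Toy Las Vegas routine: always correct when it answers, but its
    running time is random; returns None if the budget is exhausted."""
    for _ in range(budget):
        if random.randrange(10) == 0:
            return "done"
    return None

print(with_restarts(toy_las_vegas, budget=20))
```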

18.
Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.

19.
Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications: the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data-parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster.
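The diminishing-returns argument can be made concrete with a toy calculation (assumed numbers and speedup model, not from the paper): when a data-parallel task's speedup saturates, running two such tasks concurrently on half the processors each finishes sooner than running them one after the other on all processors.

```python
def task_time(work, procs, overhead=0.05):
    """Assumed data-parallel time model with diminishing returns:
    parallel work plus a per-processor coordination overhead."""
    return work / procs + overhead * procs

P, work = 32, 100.0
pure_data_parallel = 2 * task_time(work, P)       # two tasks run one after another on all P
task_plus_data = task_time(work, P // 2)          # two tasks run concurrently on P/2 each
print(pure_data_parallel, task_plus_data)         # the mixed schedule finishes sooner
```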

20.
Exploiting the cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit the cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and targeted architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The executions of tasks in the partitions are scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimCS-simulated SMP. Our simulation and measurement results show that our runtime approach can achieve performance comparable to that of compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our experimental results show that our approach is able to significantly improve the memory performance of applications with irregular computation and dynamic memory access patterns. These types of programs are usually hard to optimize with static compiler optimizations.
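A much-simplified sketch of the runtime idea (not the Cacheminer library; the region size, the hashing by touched address, and the greedy placement are illustrative assumptions): bin loop tasks by the memory region they touch, then deal whole regions out to processors so that tasks sharing data land on the same cache while the regions are spread for load balance.

```python
from collections import defaultdict

def partition_tasks(tasks, touch_addr, num_procs, region_bytes=256 * 1024):
    """Group loop tasks by the memory region their accesses fall into, then
    assign whole regions to processors: tasks in one region reuse the same
    cached data, and regions go to the least-loaded processor first."""
    regions = defaultdict(list)
    for t in tasks:
        regions[touch_addr(t) // region_bytes].append(t)
    assignment = defaultdict(list)
    # Largest regions first, each to the currently least-loaded processor.
    for _, ts in sorted(regions.items(), key=lambda kv: -len(kv[1])):
        target = min(range(num_procs), key=lambda p: len(assignment[p]))
        assignment[target].extend(ts)
    return assignment

# Toy irregular loop: task i touches element idx[i] of an array of 8-byte entries.
idx = [7, 3, 99999, 5, 100001, 2, 100010, 8]
plan = partition_tasks(range(len(idx)), lambda i: idx[i] * 8, num_procs=2)
print(dict(plan))
```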
