
1.

Dataflow formalisation of real-time streaming applications on a Composable and Predictable Multi-Processor SOC

《Journal of Systems Architecture》2015,61(9):435-448

Embedded systems often contain multiple applications, some of which have real-time requirements and whose performance must be guaranteed. To efficiently execute applications, modern embedded systems contain Globally Asynchronous Locally Synchronous (GALS) processors, network on chip, DRAM and SRAM memories, and system software, e.g. microkernel and communication libraries. In this paper we describe a dataflow formalisation to independently model real-time applications executing on the CompSOC platform, including new models of the entire software stack. We compare the guaranteed application throughput as computed by our tool flow to the throughput measured on an FPGA implementation of the platform, for both synthetic and real H.263 applications. The dataflow formalisation is composable (i.e. independent for each real-time application), conservative, models the impact of GALS on performance, and correctly predicts trends, such as application speed-up when mapping an application to more processors.

2.

Parallel Molecular Dynamics: Implications for Massively Parallel Machines

Valerie E. Taylor Rick L. Stevens Kathryn E. Arnold 《Journal of Parallel and Distributed Computing》1997,45(2):159

Molecular dynamics simulation is a class of applications that require reducing the execution time of fixed-size problems. This reduction in execution time is important to drug design and protein interaction studies. Many implementations of parallel molecular dynamics have been developed, but very little work has addressed issues related to the use of machines with 50,000 processors for modest-sized problems in the range of 50,000 atoms. Current massively parallel machines present a major obstacle to achieving good performance: communication overhead. In this paper we quantify the communication latency and network bandwidth necessary to achieve 30–40% efficiency on future message-passing machines with sizes on the order of tens of thousands of processors, for executing molecular dynamics problems with the same order of atoms. We derive an analytical model of a benchmark application that simulates a system of helium atoms executing on the Intel Touchstone Delta using an interaction decomposition method. This model is validated and used to extrapolate information on the startup time and network bandwidth. The results indicate that for an MPP with a four-dimensional mesh topology using 400 MHz processors, the communication startup time must be at most 30 clock cycles and the network bandwidth at least 2.3 GB/s. This configuration results in 30–40% efficiency of the MPP for a problem with 50,000 atoms executing on 50,000 processors.

3.

Throughput bounds for closed queueing networks

Jiri Kriz 《Performance Evaluation》1984,4(1):1-10

Analytical lower and upper bounds for the throughput of closed queueing networks with single and delay (infinite) servers are studied in this paper. The numerical evaluation of these bounds requires a small number of significant operations which is independent of the population N. This is in contrast to the exact computation of the throughput, which requires at least O(N) operations as N tends to infinity. The bounds are given by simple closed-form analytical expressions and may be more suitable for various performance studies than the algorithmic form of the exact solution. In this paper, the previously known balanced-job bounds are generalized to networks containing delay servers (terminals) and a hierarchy of bounds is obtained for single and multiple class networks. For the single class network, further new bounds are derived: lower and upper bounds that require the evaluation of one square root and an upper bound that requires a constant number of exponentiations. This upper bound does not employ the balancing of server loadings and is especially useful for asymptotic analysis in the case of a large number of customers N.
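To make the contrast between O(N) exact computation and constant-time bounds concrete, here is a minimal sketch. It does not implement the bounds of this paper: it uses the standard textbook asymptotic bounds and exact Mean Value Analysis (MVA) for a single-class closed network. The function names, the service-demand list `demands`, and the delay-server time `think_time` are assumptions made for illustration.

```python
def mva_throughput(demands, think_time, population):
    """Exact throughput by Mean Value Analysis: one pass per customer, so O(N * K)."""
    queue = [0.0] * len(demands)  # mean queue length at each single server
    throughput = 0.0
    for n in range(1, population + 1):
        # Residence time at each server seen by an arriving customer.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(residence))
        queue = [throughput * r for r in residence]
    return throughput


def asymptotic_bounds(demands, think_time, population):
    """Textbook asymptotic bounds: cost does not grow with the population N."""
    total = sum(demands)
    lower = population / (population * total + think_time)
    upper = min(population / (total + think_time), 1.0 / max(demands))
    return lower, upper


if __name__ == "__main__":
    demands = [0.2, 0.1]  # hypothetical service demands per visit cycle
    print(asymptotic_bounds(demands, 1.0, 10))
    print(mva_throughput(demands, 1.0, 10))
```

The exact value always lies between the two bounds; tighter bounds such as the balanced-job bounds mentioned in the abstract narrow that interval at a similar constant cost.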

4.

User-process communication performance in networks of computers

Cabrera L.-F. Hunter E. Karels M.J. Mosher D.A. 《IEEE Transactions on Software Engineering》1988,14(1):38-53

The authors present a study of the performance achieved by user processes when using the IPC mechanisms as implemented in Berkeley Unix 4.2BSD in Ethernet-based environments. The authors assess not only the impact that different processors, network hardware interfaces, and Ethernets have on the communication across machines, but also the effect of the loading of the hosts and communication media that participate in the interprocess communication mechanism. The measurements highlight the ultimate bounds on performance that may be achieved by user process applications communicating across machines, and serve as a guide in designing performance-critical applications. A detailed timing analysis is presented of the dynamic behavior of the implementation of the TCP/IP and UDP/IP network communication protocols in Berkeley Unix 4.2BSD.

5.

FEADS: A Framework for Exploring the Application Design Space on Network Processors

Rajani Pai R. Govindarajan 《International journal of parallel programming》2007,35(1):1-31

Network processors are designed to handle the inherently parallel nature of network processing applications. However, partitioning and scheduling of application tasks and data allocation to reduce memory contention remain major challenges in realizing the full performance potential of a given network processor. The large variety of processor architectures in use and the increasing complexity of network applications further aggravate the problem. This work proposes a novel framework, called FEADS, for automating the task of application partitioning and scheduling for network processors. FEADS uses the simulated annealing approach to perform design space exploration of application mapping onto processor resources. Further, it uses cyclic and r-periodic scheduling to achieve higher throughput schedules. To evaluate dynamic performance metrics such as throughput and resource utilization under realistic workloads, FEADS automatically generates a Petri net (PN) which models the application, architectural resources, mapping and the constructed schedule and their interaction. The throughput obtained by schedules constructed by FEADS is comparable to that obtained by manual scheduling for linear task flow graphs; for more complicated task graphs, FEADS’ schedules have a throughput which is up to 2.5 times higher than that of the manual schedules. Further, static scheduling of tasks results in an increase in throughput of up to 30% compared to an implementation of the same mapping without task scheduling.

6.

Co-scheduling algorithms for high-throughput workload execution

Guillaume Aupy Manu Shantharam Anne Benoit Yves Robert Padma Raghavan 《Journal of Scheduling》2016,19(6):627-640

This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several applications concurrently. We partition the original application set into a series of packs, which are executed one by one. A pack comprises several applications, each of them with an assigned number of processors, with the constraint that the total number of processors assigned within a pack does not exceed the maximum number of available processors. The objective is to determine a partition into packs, and an assignment of processors to applications, that minimize the sum of the execution times of the packs. We thoroughly study the complexity of this optimization problem, and propose several heuristics that exhibit very good performance on a variety of workloads, whose application execution times model profiles of parallel scientific codes. We show that co-scheduling leads to faster workload completion time (40% improvement on average over traditional scheduling) and to faster response times (50% improvement). Hence, co-scheduling increases system throughput and saves energy, leading to significant benefits from both the user and system perspectives.
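The objective described in this abstract can be sketched as a small cost function: each pack costs as much as its slowest application, and the schedule costs the sum over packs. This is only an illustration of the objective, not one of the paper's heuristics. The Amdahl-style timing model, the `WORK` table, and the 20% sequential fraction are hypothetical values chosen for the example.

```python
def pack_schedule_cost(packs, exec_time, max_processors):
    """Sum of pack execution times; a pack runs as long as its slowest application."""
    total = 0.0
    for pack in packs:
        # Enforce the constraint that a pack fits on the available processors.
        if sum(procs for _, procs in pack) > max_processors:
            raise ValueError("pack exceeds the available processors")
        total += max(exec_time(app, procs) for app, procs in pack)
    return total


# Hypothetical workload: sequential work per application, 20% of it not parallelisable.
WORK = {"A": 100.0, "B": 60.0, "C": 40.0}


def amdahl_time(app, procs, sequential_fraction=0.2):
    return WORK[app] * (sequential_fraction + (1.0 - sequential_fraction) / procs)


if __name__ == "__main__":
    one_by_one = [[("A", 4)], [("B", 4)], [("C", 4)]]
    co_scheduled = [[("A", 4)], [("B", 2), ("C", 2)]]
    print(pack_schedule_cost(one_by_one, amdahl_time, 4))
    print(pack_schedule_cost(co_scheduled, amdahl_time, 4))
```

Under this assumed model, sharing the machine between B and C is cheaper than running them one after the other, because neither application uses four processors efficiently.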

7.

Edge-Cut Bounds on Network Coding Rates (Cited by: 1; self-citations: 0; citations by others: 1)

Gerhard Kramer Serap A. Savari 《Journal of Network and Systems Management》2006,14(1):49-67

Active networks are network architectures with processors that are capable of executing code carried by the packets passing through them. A critical network management concern is the optimization of such networks and tight bounds on their performance serve as useful design benchmarks. A new bound on communication rates is developed that applies to network coding, which is a promising active network application that has processors transmit packets that are general functions, for example a bit-wise XOR, of selected received packets. The bound generalizes an edge-cut bound on routing rates by progressively removing edges from the network graph and checking whether certain strengthened d-separation conditions are satisfied. The bound improves on the cut-set bound and its efficacy is demonstrated by showing that routing is rate-optimal for some commonly cited examples in the networking literature.

8.

Performance evaluation of TCP connections in ideal and non-ideal network environments

Hala ElAarag Mostafa Bassiouni 《Computer Communications》2001,24(18):1769-1779

In this paper, we study the performance of TCP in both ideal and non-ideal network environments. For the ideal environments, we develop a simple analytical model for the throughput and transfer time of TCP as a function of the file size and TCP parameters. Our simulation measurements demonstrate that this model can accurately predict the throughput for ideal TCP connections characterized by no packet loss due to congestion or bit errors. If these ideal conditions are not met, the model gives an upper bound for throughput and a lower bound for transfer time. For the non-ideal environments, we concentrate on wireless links. While our ideal model provides an easy-to-use tool to calculate bounds on the performance of all TCP implementations in such environments, we also show through simulation the relative performance of four well-known TCP implementations. We also present simulation results that demonstrate the dominant factors affecting the performance of wireless TCP.

9.

DLL-conscious instruction fetch optimization for SMT processors

Fayez Mrinmoy Hsien-Hsin S. 《Journal of Systems Architecture》2008,54(12):1089-1100

Simultaneous multithreading (SMT) processors can issue multiple instructions from distinct processes or threads in the same cycle. This technique effectively increases the overall throughput by keeping the pipeline resources more occupied, at the potential expense of reducing single-thread performance due to resource sharing. In the software domain, an increasing number of dynamically linked libraries (DLLs) are used by applications and operating systems, providing better flexibility and modularity, and enabling code sharing. It is observed that a significant amount of execution time in software today is spent executing standard DLL instructions that are shared among multiple threads or processes. However, for an SMT processor with a virtually-indexed cache implementation, existing instruction fetching mechanisms can induce unnecessary false I-TLB and I-Cache misses caused by the DLL-based instructions that are intended to be shared. This problem is more prominent when multiple independent threads are executing concurrently on an SMT processor. In this work, we investigate a neglected form of contention between running threads in the I-TLB and I-Cache (including both VIVT and VIPT) due to DLLs. To address these shortcomings, we propose a system-level technique involving a light-weight modification in the microarchitecture and the OS. By exploiting the nature of the DLLs in our optimized system, we can reinstate the intended sharing of the DLLs in an SMT machine. Using Microsoft Windows based applications, our simulation results show that the optimized instruction fetching mechanism can reduce the number of DLL misses by up to 5.5 times and improve the instruction cache hit rates by up to 62%, resulting in up to 30% DLL IPC improvements and up to 15% overall IPC improvements.

10.

Bound performance models of heterogeneous parallel processing systems

Balsamo S. Donatiello L. Van Dijk N.M. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(10):1041-1056

Heterogeneous parallel processing systems, such as those arising in parallel programs executed on distributed systems, are studied. A lower and an upper bound model are suggested to obtain secure lower and upper bounds on the performance of these systems. The bounding models are solved by using a matrix-geometric algorithmic approach. Formal proofs of the bounds are provided along with error bounds on the accuracy of the bounds. These error bounds in turn are reduced to simple computational expressions. Numerical results are included. The results are of interest for application to arbitrary fork-join models with parallel heterogeneous processors and synchronization.

11.

Speedup and scalability analysis of Master–Slave applications on large heterogeneous clusters

Eduardo Javier Huerta Yero Marco Aurélio Amaral Henriques 《Journal of Parallel and Distributed Computing》2007

Although cluster environments have an enormous potential processing power, real applications that take advantage of this power remain an elusive goal. This is due, in part, to the lack of understanding about the characteristics of the applications best suited for these environments. This paper focuses on Master/Slave applications for large heterogeneous clusters. It defines application, cluster and execution models to derive an analytic expression for the execution time. It defines speedup and derives speedup bounds based on the inherent parallelism of the application and the aggregated computing power of the cluster. The paper derives an analytical expression for efficiency and uses it to define scalability of the algorithm–cluster combination based on the isoefficiency metric. Furthermore, the paper establishes necessary and sufficient conditions for an algorithm–cluster combination to be scalable which are easy to verify and use in practice. Finally, it covers the impact of network contention as the number of processors grows.

12.

Throughput bounding and simple approximation methods for exponential fork/join queueing networks with blocking

Chun-Hyun Paik 《Computers & Industrial Engineering》1998,35(3-4):563-566

Exponential fork/join queueing networks (FJQNs) with finite buffers have been used as a major tool for evaluating the performance of manufacturing systems. In this study, we first propose upper and lower bounds on throughput. Our upper-bounding method is developed for general (acyclic) network configurations, while our lower bounds can be obtained only for networks with more specialized configurations. Next, we develop a simple approximation method for throughput, which is based on decomposition/aggregation principles and structurally equivalent relations between different configurations.

13.

Optimizing the Reliability of Streaming Applications Under Throughput Constraints

Anne Benoit Hinde Lilia Bouziane Yves Robert 《International journal of parallel programming》2011,39(5):584-614

Mapping a pipelined application onto a distributed and parallel platform is a challenging problem. The problem becomes even more difficult when multiple optimization criteria are involved, and when the target resources are heterogeneous (processors and communication links) and subject to failures. This paper investigates the problem of mapping pipelined applications, consisting of a linear chain of stages executed in a pipelined fashion, onto such platforms. The objective is to optimize the reliability under a performance constraint, i.e., while guaranteeing a threshold throughput. In order to increase reliability, we replicate the execution of stages on multiple processors. We compare interval mappings, where the application is partitioned into intervals of consecutive stages, with general mappings, where stages may be partitioned without any constraint, thereby allowing a better usage of processors and communication network capabilities. However, the price to pay for general mappings is a dramatic increase in the problem complexity. We show that computing the period of a given general mapping is an NP-complete problem, and we give polynomial bounds to determine a (conservative) approximated value. In contrast, the period of an interval mapping obeys a simple formula, and we provide an optimal dynamic programming algorithm for the bi-criteria interval mapping problem on homogeneous platforms. On the more practical side, we design a set of efficient heuristics, and we compare the performance of interval and general mapping strategies through extensive simulations.

14.

Routing schemes for multiple random broadcasts in arbitrary network topologies

Varvarigos E.A. Banerjee A. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(8):886-895

We consider the problem where packets are generated at each node of a network according to a Poisson process with rate λ, and each of them has to be broadcast to all the other nodes. The network topology is assumed to be an arbitrary bidirectional graph. We derive upper bounds on the maximum achievable broadcast throughput, and lower bounds on the average time required to complete a broadcast. These bounds apply to any network topology, independently of the scheme used to perform the broadcasts. We also propose two dynamic broadcasting schemes, called the indirect and the direct broadcasting scheme, that can be used in a general topology, and we evaluate analytically their throughput and average delay. The throughput achieved by the proposed schemes is equal to the maximum possible, if a half-duplex link model is assumed, and is at least equal to one half of the maximum possible, if a full-duplex model is assumed. The average delay of both schemes is of the order of the diameter of the trees used to perform the broadcasts. The analytical results obtained do not use any approximating assumptions.

15.

Top-Down Characterization Approximation based on performance counters architecture for AMD processors

《Simulation Modelling Practice and Theory》2016

Due to the increasing complexity of processors, developers often seek tools that simplify the process of finding bottlenecks while executing applications. Although more and more data may be collected from processors, detailed knowledge of the internals of a given architecture is usually required to understand them. This paper introduces a Top-Down Characterization Approximation for analysing the performance of applications executed on AMD processors; it is an extension of the Top-Down Method initially developed by Intel. Since not all performance counters required to calculate the exact values of the metrics are available on AMD processors, the method is called an approximation. It allows one to get a deeper understanding of different stages of program execution, compare different architectures and identify bottlenecks in out-of-order processors. It hides the complexity of microarchitecture details from the user and at the same time exposes the main contributors to inefficient program execution. The method aims at defining a few main metrics on top of performance counters to easily locate the main efficiency issues. Until now, the Top-Down Method had been applied to Intel processors only, mainly because it relies on designated performance counters that differ between processors, so porting it is not straightforward. Positive feedback from users encouraged the authors to develop a similar technique for AMD processors.

16.

A measurement-based model to predict the performance impact of system modifications: a case study

Dimpsey R.T. Iyer R.K. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(1):28-40

The paper presents a performance case study of parallel jobs executing in real multi-user workloads. The study is based on a measurement-based model capable of predicting the completion time distribution of the jobs executing under real workloads. The model constructed is also capable of predicting the effects of system design changes on application performance. The model is a finite-state, discrete-time Markov model with rewards and costs associated with each state. The Markov states are defined from real measurements and represent system/workload states in which the machine has operated. The paper places special emphasis on choosing the correct number of states to represent the workload measured. Specifically, the performance of computationally bound, parallel applications executing in real workloads on an Alliant FX/80 is evaluated. The constructed model is used to evaluate scheduling policies, the performance effects of multiprogramming overhead, and the scalability of the Alliant FX/80 in real workloads. The model identifies a number of available scheduling policies which would improve the response time of parallel jobs. In addition, the model predicts that doubling the number of processors in the current configuration would only improve response time for a typical parallel application by 25%. The model recommends a different processor configuration to more fully utilize extra processors. The paper also presents empirical results which validate the model created.

17.

Performance of shared memory in a parallel computer

Donovan K. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(2):253-256

A method for analyzing the lengths of memory queues when the network is conflict-free is described. An algorithm based on this method is shown to efficiently determine the upper and lower bounds of the queue length. Analysis indicates that the strategy of using hashing to spread data across memory modules is a good one. Results show that if the size of the system is increased while maintaining a constant ratio of numbers of processors to memories, then, asymptotically, the slowdown in performance from conflicts at the memory modules is Θ(log m / log log m). For m and n less than 100000 and λ between 0.25 and 4.0, the graphical data confirm this growth rate.

18.

Weighted Shared Cache Partitioning for Multithreaded Multiprogrammed Workloads (Cited by: 5; self-citations: 1; citations by others: 4)

Suo Guang, Yang Xuejun 《Chinese Journal of Computers (计算机学报)》2008,31(11)

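When parallel applications execute on multi-core processors with a shared cache, conflicting accesses to the shared cache cause performance degradation and non-deterministic execution times. Shared cache partitioning, which allocates the shared cache exclusively among multiple processes, is an effective solution to this problem. Because threads share data, applications with different numbers of threads utilize the shared cache differently, but traditional shared cache partitioning algorithms that aim to minimize the miss rate (e.g. UCP) do not distinguish between applications by thread count. This paper designs a weighted shared cache partitioning framework for multithreaded multiprogrammed workloads (Weighted Cache Partitioning, WCP), comprising an application-oriented miss-rate monitor and a weighted cache partitioning algorithm. The miss-rate monitor dynamically tracks, per process, the miss rate of the application under different cache capacities. The weighted cache partitioning algorithm extends the traditional miss-rate-optimal cache partitioning algorithm by assigning applications different weights according to their thread counts during partitioning, so that applications with more threads receive more of the shared cache, thereby improving overall system performance. Experimental results show that although the weighted cache partitioning algorithm increases the miss rate somewhat, it improves IPC throughput, weighted speedup, and fairness. On multiprogrammed test cases composed of scientific and engineering computing applications, the IPC throughput of WCP-1 is up to 10.8% higher, and on average 5.5% higher, than that of the shared cache partitioning algorithm whose objective function is the minimum miss rate.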
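The idea of weighting a miss-rate-driven partition can be illustrated with a simplified greedy allocator. This is not the WCP algorithm itself, whose details the abstract does not give, and it is simpler than UCP's lookahead allocation. It assumes that each application has a monitored miss curve `miss_curves[i][w]` (misses with `w` cache ways) and that the weight is simply the thread count; both are assumptions for illustration.

```python
def weighted_partition(miss_curves, weights, total_ways):
    """Greedy way allocation by weighted marginal miss reduction.

    miss_curves[i][w] is the miss count of application i with w ways, for
    w = 0 .. total_ways. Requires total_ways >= number of applications.
    """
    apps = range(len(miss_curves))
    alloc = [1] * len(miss_curves)  # every application keeps at least one way

    def gain(i):
        # Misses saved by one extra way, scaled by the application's weight.
        w = alloc[i]
        return weights[i] * (miss_curves[i][w] - miss_curves[i][w + 1])

    for _ in range(total_ways - len(miss_curves)):
        best = max(apps, key=gain)
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    curves = [
        [100, 60, 40, 30, 25],  # hypothetical miss curve of application 0
        [100, 70, 45, 24, 10],  # hypothetical miss curve of application 1
    ]
    print(weighted_partition(curves, [1, 1], 4))  # unweighted baseline
    print(weighted_partition(curves, [3, 1], 4))  # application 0 has more threads
```

With equal weights the allocator favours the application whose misses drop fastest; giving application 0 a larger weight shifts ways toward it even though the total miss count rises, which mirrors the trade-off reported in the abstract.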

19.

Lower and upper bounds on time for multiprocessor optimal schedules

Kumar Jain K. Rajaraman V. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(8):879-886

The lower and upper bounds on the minimum time needed to process a given directed acyclic task graph for a given number of processors are derived. It is proved that the proposed lower bound on time is not only sharper than the previously known values but also easier to calculate. The upper bound on time, which is useful in determining the worst case behavior of a given task graph, is presented. The lower and upper bounds on the minimum number of processors required to process a given task graph in the minimum possible time are also derived. It is seen with a number of randomly generated dense task graphs that the lower and upper bounds we derive are equal, thus giving the optimal time for scheduling directed acyclic task graphs on a given set of processors.
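For orientation, the classic baseline that sharper bounds of this kind improve on can be sketched in a few lines: no schedule can finish before the critical path completes, nor before the total work divided evenly over the processors. This is the textbook bound, not the bound derived in the paper. Integer task durations and the dictionary-based graph representation are assumptions for illustration.

```python
import math
from functools import lru_cache


def schedule_time_lower_bound(durations, successors, processors):
    """Baseline lower bound: max(critical path length, ceil(total work / processors)).

    durations maps task -> integer duration; successors maps task -> list of
    tasks that depend on it. The graph must be acyclic.
    """

    @lru_cache(maxsize=None)
    def longest_from(task):
        # Longest chain of dependent work starting at this task.
        tail = max((longest_from(s) for s in successors.get(task, ())), default=0)
        return durations[task] + tail

    critical_path = max(longest_from(task) for task in durations)
    work_bound = math.ceil(sum(durations.values()) / processors)
    return max(critical_path, work_bound)


if __name__ == "__main__":
    durations = {"a": 2, "b": 3, "c": 1, "d": 4}
    successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
    print(schedule_time_lower_bound(durations, successors, 2))
    print(schedule_time_lower_bound(durations, successors, 1))
```

With many processors the critical path dominates; with few, the work term does.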

20.

Research on the Bound Performance of Network RAID Storage Systems (Cited by: 2; self-citations: 0; citations by others: 2)

Cui Baojiang, Liu Jun, Wang Gang, Liu Jing 《Journal of Computer Research and Development (计算机研究与发展)》2005,42(6):1039-1046

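Current research on the performance of network storage systems mostly focuses on qualitative study and lacks effective quantitative analysis methods and models. Based on the theory of finite-capacity closed queueing networks, a quantitative analysis model of network RAID storage system performance is proposed. A new analysis method for computing the bound performance of finite-capacity closed queueing network systems, the APBA method, is also proposed; compared with other approximate analysis methods, the APBA method has lower computational time complexity. Test results show that, using the APBA method, the system performance values obtained from the quantitative analysis model can effectively reflect the performance bounds of the network RAID storage system in the light-load, heavy-load, and overload regions, as well as the maximum load of the system.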