Similar Documents
20 similar documents found (search time: 46 ms)
1.
Commercial transaction processing applications are an important workload running on symmetric multiprocessor systems (SMPs). They differ dramatically from scientific, numeric-intensive, and engineering applications because they are I/O bound and contain more system software activity. Most SMP servers on the market have been designed and optimized for scientific and engineering workloads. A major challenge in studying architectural effects on the performance of a commercial workload is the lack of easy access to large-scale, complex database engines running on a multiprocessor system with powerful I/O facilities. Experimental case studies have been shown to be highly time-consuming and expensive. In this paper, we investigate the feasibility of using queueing network models, supported by simulation, to study the impact of SMP architecture on the performance of commercial workloads. We use the commercial benchmark TPC-C as the workload and a bus-based SMP machine as the target platform. Queueing network modeling is employed to characterize the TPC-C workload on the SMP. System components such as processors, memory, the memory bus, I/O buses, and disks are modeled as service centers, and their effects on performance are analyzed. Simulations are conducted as well to collect the workload-specific parameters (model parameterization) and to verify the accuracy of the model. Our studies find that among disk-related parameters, disk rotation latency affects TPC-C performance most significantly, and that the number of I/O buses has a greater impact on performance than the number of disks. This study also demonstrates that our modeling approach is feasible, cost-effective, and accurate for evaluating the performance of commercial workloads on SMPs, and that it is complementary to measurement-based experimental approaches.
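The kind of closed queueing network analysis this entry describes can be illustrated with exact Mean Value Analysis (MVA). Below is a minimal Python sketch; the service centers mirror those named in the abstract, but the per-transaction service demands are hypothetical placeholders, not the paper's measured TPC-C parameters.

```python
# Exact MVA for a closed queueing network: each service center's residence
# time depends on the queue length a new arrival sees, per the MVA recursion.

def mva(demands, n_customers):
    """demands: center name -> service demand (seconds per transaction)."""
    queue = {k: 0.0 for k in demands}                       # Q_k(0) = 0
    for n in range(1, n_customers + 1):
        # Residence time at each center: demand * (1 + queue on arrival)
        resid = {k: d * (1.0 + queue[k]) for k, d in demands.items()}
        x = n / sum(resid.values())                         # throughput X(n)
        queue = {k: x * resid[k] for k in resid}            # Q_k(n)
    return x, queue

# Hypothetical per-transaction demands for the SMP service centers.
demands = {"cpu": 0.010, "memory_bus": 0.004, "io_bus": 0.006, "disk": 0.020}
x, _ = mva(demands, n_customers=32)
print(f"throughput = {x:.1f} tx/s, bottleneck = {max(demands, key=demands.get)}")
```

With these placeholder demands, throughput saturates near 1/0.020 = 50 tx/s, identifying the disk as the bottleneck, which is consistent with the abstract's finding that disk-related parameters dominate.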

2.
Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of multiprocessor file systems. The design of a high-performance multiprocessor file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of multiprocessor file systems had been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describes the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper compares and contrasts these two workloads to understand their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we gain more insight into the general principles that should guide multiprocessor file-system design.

3.
Kunkel, S., Armstrong, B., Vitale, P. Micro, IEEE, 1999, 19(3): 56-64
Major performance enhancements in large commercial systems are best achieved when advances in hardware technology are matched with advances in software technology. This article connects recent AS/400 hardware advances with the corresponding approaches used to tune system performance for large online transaction processing (OLTP) workloads, with particular emphasis on tuning efforts that affect the memory system. OLTP workloads are large and complex, stressing many parts of both the software and the hardware. These workloads quickly expose software bottlenecks caused by contention on software locks. They also have large working sets with hard-to-predict access patterns that keep cache miss rates high, causing the processor to spend a significant part of its execution time waiting for memory accesses. In multiprocessor systems, compilers alone have minimal effect on cycles spent waiting on storage latency; other optimizations, many requiring direct involvement of the system software, are needed to address this portion of the execution time.

4.
Real-time concurrency control in a multiprocessor environment
Although many high-performance computer systems are now multiprocessor-based, little work has been done on real-time concurrency control of transaction executions in a multiprocessor environment. Real-time concurrency control protocols designed for uniprocessor or distributed environments may not fit the needs of multiprocessor-based real-time database systems because of a lower degree of concurrency in transaction executions and a larger number of priority inversions. This paper proposes the concept of a priority cap, which bounds the maximum number of priority inversions in multiprocessor-based real-time database systems so that transaction deadlines can be met. We also explore the concept of two-version data to increase the system concurrency level and to exploit the abundant computing resources of multiprocessor systems. The capability of the proposed methodology is evaluated in a multiprocessor real-time database system under different workloads, database sizes, and processor configurations. It is shown that the benefits of the priority cap in reducing the blocking time of urgent transactions far outweigh the losses involved in committing less urgent transactions. The idea of two-version data also greatly improves system performance because of the much higher degree of concurrency in the system.
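To make the priority-cap idea concrete, here is a toy Python sketch in which an urgent transaction tolerates at most `cap` priority inversions before a lower-priority lock holder is restarted in its favor. The class, counting rule, and restart policy are illustrative assumptions, not the paper's actual protocol.

```python
# Toy lock table: a high-priority requester may be blocked by lower-priority
# holders at most `cap` times; beyond that, the holder is restarted.

class PriorityCapLockTable:
    def __init__(self, cap):
        self.cap = cap
        self.holders = {}       # lock id -> (txn id, priority)
        self.inversions = {}    # txn id -> inversions suffered so far

    def acquire(self, lock, txn, prio):
        held = self.holders.get(lock)
        if held is None:
            self.holders[lock] = (txn, prio)
            return "granted"
        holder_txn, holder_prio = held
        if prio > holder_prio:                    # a priority inversion
            n = self.inversions.get(txn, 0) + 1
            self.inversions[txn] = n
            if n > self.cap:                      # cap exceeded: preempt
                self.holders[lock] = (txn, prio)
                return f"granted after restarting T{holder_txn}"
        return "blocked"

locks = PriorityCapLockTable(cap=1)
locks.acquire("x", txn=1, prio=1)                 # low-priority holder
print(locks.acquire("x", txn=2, prio=9))          # blocked (inversion 1)
print(locks.acquire("x", txn=2, prio=9))          # granted after restarting T1
```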

5.
The processor queueing model provides memory-hierarchy and system-design evaluation of memory-intensive commercial online-transaction-processing workloads on large multiprocessor systems. It differs from detailed cycle-accurate and direct-execution simulations in that it does not simulate instruction execution. Instead, as in analytical models, the authors build the model from processor and workload characteristics that are easy to collect and estimate. Because the authors believe that the processor model's function is to accurately generate memory traffic to the rest of the system, they model a minimal set of processor and workload characteristics that captures the important interactions between a complex processor and the system memory hierarchy.

6.
Parallel Computing, 1999, 25(13-14): 1517-1544
In this paper we analyze the major trends and changes in the High-Performance Computing (HPC) marketplace since the beginning of the journal 'Parallel Computing'. The initial success of vector computers in the 1970s was driven by raw performance. The introduction of this type of computer system started the era of 'supercomputing'. In the 1980s, the availability of standard development environments and of application software packages became more important; together with performance, these factors determined the success of MP vector systems, especially with industrial customers. MPPs became successful in the early 1990s due to their better price/performance ratios, made possible by the attack of the 'killer micros'. In the lower and medium market segments, MPPs were replaced by microprocessor-based symmetric multiprocessor (SMP) systems in the mid-1990s. Their success formed the basis for the use of new cluster concepts for very high-end systems. In the last few years, only the companies that have entered the emerging markets for massively parallel database servers and financial applications attract enough business volume to be able to support hardware development for the numerical high-end computing market as well. Success in the traditional floating-point-intensive engineering applications seems to be no longer sufficient for survival in the market.

7.
This paper describes the results of controlled experiments with a Honeywell H6000 series multiprocessor computer system. In total, 212 experiments were performed using 14 test workloads on 8 H6000 configurations. The number of processors varied from 1 to 4, system controller units (SCUs) from 1 to 4, and main memory from 256K to 1024K words. The ratio (P) of I/O time to CPU time for the test workloads varied from 0.01 to 5.07. The improvement in throughput is expressed in terms of relative throughput (φ), defined as the ratio of the elapsed time for a given test workload on a single-processor configuration to that on a multiprocessor configuration. The relative throughput increased monotonically with the number of processors for test workloads in the range 0 < P < 0.4 (CPU-bound), and φ exhibited asymptotic behaviour for test workloads in the range 0.4 < P < 5.07 (I/O-bound).
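For clarity, the two quantities the study defines can be restated in standard notation (this only rewrites the abstract's own definitions):

```latex
\[
  P \;=\; \frac{T_{\mathrm{I/O}}}{T_{\mathrm{CPU}}}, \qquad
  \varphi(m) \;=\; \frac{T_{\mathrm{elapsed}}(1~\text{processor})}
                        {T_{\mathrm{elapsed}}(m~\text{processors})},
\]
% phi(m) rises monotonically in m for CPU-bound workloads (0 < P < 0.4)
% and flattens toward an asymptote for I/O-bound workloads (0.4 < P < 5.07).
```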

8.
Inverted file partitioning schemes in multiple disk systems
Multiple-disk I/O systems (disk arrays) have been an attractive approach to meeting high-performance I/O demands in data-intensive applications such as information retrieval systems. When we partition and distribute files across multiple disks to exploit the potential for I/O parallelism, a balanced I/O workload distribution becomes important for good performance. Naturally, the performance of a parallel information retrieval system using an inverted file structure is affected by the partitioning scheme of the inverted file. In this paper, we propose two different partitioning schemes for an inverted file system on a shared-everything multiprocessor machine with multiple disks. We study the performance of these schemes by simulation under a number of workloads in which the term frequencies in the documents, the term frequencies in the queries, the number of disks, and the multiprogramming level are varied.
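Two natural ways to partition an inverted file across disks are term-wise (each posting list lives whole on one disk) and document-wise (every posting list is split by document id across all disks). The Python sketch below illustrates the distinction; the hashing choices are illustrative assumptions, not necessarily the paper's exact schemes.

```python
from collections import defaultdict

def partition_by_term(inverted_file, n_disks):
    """Each term's full posting list is placed on a single disk."""
    disks = defaultdict(dict)
    for term, postings in inverted_file.items():
        disks[sum(term.encode()) % n_disks][term] = postings
    return dict(disks)

def partition_by_document(inverted_file, n_disks):
    """Each posting list is split by document id across all disks."""
    disks = defaultdict(lambda: defaultdict(list))
    for term, postings in inverted_file.items():
        for doc_id in postings:
            disks[doc_id % n_disks][term].append(doc_id)
    return {d: dict(terms) for d, terms in disks.items()}

inv = {"cache": [1, 2, 5, 8], "disk": [2, 3, 8], "bus": [4]}
print(partition_by_term(inv, n_disks=2))
print(partition_by_document(inv, n_disks=2))
```

Term partitioning concentrates each query term's I/O on one disk, while document partitioning spreads every posting list, balancing load at the cost of involving more disks per term; workload skew in term frequencies determines which wins, which is what the simulations vary.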

9.
An accurate and efficient model of a commercial multiprocessor bus is developed. Four important characteristics of the bus design are modeled: asynchronous memory write operations; in-order delivery of responses to processor read requests; priority scheduling of memory responses; and upper bounds on the number of outstanding processor requests. A two-level hierarchical model employing both Markov chain and mean value analysis techniques for analyzing queueing networks is used. The model is shown to accurately predict measured system performance for two parallel program workloads that have different memory access characteristics. The results provide evidence that analytic queueing models can be extremely accurate in spite of the simplifying assumptions required for model tractability. Model estimates are compared against a detailed simulation of the bus to investigate in more detail the likely sources of small model inaccuracies. The use of the analytical model for assessing system design tradeoffs is illustrated.

10.
Over the past two decades, rollback-recovery via checkpoint-restart has been used with reasonable success for long-running applications, such as scientific workloads that take from a few hours to a few months to complete. Currently, several commercial systems and publicly available libraries exist to support various flavors of checkpointing. Programmers typically use these systems if they are satisfactory, or otherwise embed checkpointing support themselves within the application. In this paper, we project the performance and functionality of checkpointing algorithms and systems as we know them today into the future. We start by surveying the current technology roadmap, and particularly how petaflop-capable systems may plausibly be constructed in the next few years. We consider how rollback-recovery as practiced today will fare when systems may have to be constructed out of thousands of nodes. Our projections predict that, unlike current practice, rollback-recovery may play a more prominent role in how systems are configured to reach the desired performance level. System planners may have to devote additional resources to enable rollback-recovery, and the current practice of using "cheap commodity" systems to form large-scale clusters may face serious obstacles. We suggest new avenues for research to react to these trends.
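The flavor of projection the entry describes can be illustrated with Young's first-order approximation for the optimal checkpoint interval, t_opt ≈ sqrt(2 * checkpoint_cost * MTBF). This formula is a standard result, not the paper's own model, and the parameters below are hypothetical.

```python
import math

def optimal_interval(checkpoint_cost_s, node_mtbf_h, n_nodes):
    """Young's approximation; system MTBF shrinks as nodes are added."""
    system_mtbf_s = node_mtbf_h * 3600.0 / n_nodes
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

# Hypothetical cluster: 10,000 nodes, 25-year per-node MTBF,
# 5-minute checkpoint cost.
t = optimal_interval(checkpoint_cost_s=300,
                     node_mtbf_h=25 * 365 * 24,
                     n_nodes=10_000)
print(f"checkpoint every ~{t / 3600:.1f} h")   # roughly every 2 hours
```

Even with optimistic per-node reliability, aggregating thousands of nodes forces frequent checkpoints, which is why the abstract argues rollback-recovery will shape how large systems are configured.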

11.
We investigate techniques for efficiently executing multiquery workloads from data- and computation-intensive applications in parallel and/or distributed computing environments. In this context, we describe a database optimization framework that supports data and computation reuse, query scheduling, and active semantic caching to speed up the evaluation of multiquery workloads. Its most striking feature is the ability to optimize the execution of queries in the presence of application-specific constructs by employing a customizable data and computation reuse model. Furthermore, we discuss how the proposed optimization model is flexible enough to work efficiently irrespective of the underlying parallel/distributed environment. To evaluate the proposed optimization techniques, we present experimental evidence using real data analysis applications. For this purpose, a common implementation of the queries under study was provided according to the database optimization framework and deployed on top of three distinct experimental configurations: a shared-memory multiprocessor, a cluster of workstations, and a distributed computational Grid-like environment.
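The core reuse idea can be sketched as caching expensive sub-computations under a semantic key (operation plus input region) so that later, overlapping queries hit the cache. The key structure and function names below are illustrative assumptions, not the framework's API.

```python
# Minimal sketch of data/computation reuse across a multiquery workload.
cache = {}

def evaluate(op, region, compute):
    key = (op, region)              # semantic key: what was computed on what
    if key not in cache:            # miss: compute once and remember
        cache[key] = compute(region)
    return cache[key]               # hit: reuse the earlier result

def average(region):
    lo, hi = region
    return sum(range(lo, hi)) / (hi - lo)

q1 = evaluate("avg", (0, 1_000_000), average)   # computed
q2 = evaluate("avg", (0, 1_000_000), average)   # reused from cache
print(q1 == q2, len(cache))                     # True 1
```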

12.
The importance of reporting is ever increasing in today's fast-paced market environments, and the availability of up-to-date information for reporting has become indispensable. Current reporting systems are separated from online transaction processing (OLTP) systems, with periodic updates pushed in. A pre-defined and aggregated subset of the OLTP data, however, does not provide the flexibility, detail, and timeliness needed for today's operational reporting. As technology advances, this separation has to be re-evaluated, and means to study and evaluate new trends in data storage management have to be provided. This article proposes a benchmark for combined OLTP and operational reporting, providing a means to evaluate the performance of enterprise data management systems under mixed workloads of OLTP and operational reporting queries. Such systems offer up-to-date information and the flexibility of the entire data set for reporting. We describe how the benchmark provokes the conflicts that are the reason for separating the two workloads onto different systems. In this article, we introduce the concepts, logical data schema, transactions, and queries of the benchmark, which are entirely based on the original data sets and real workloads of existing, globally operating enterprises.

13.
Consolidated environments are progressively accommodating diverse and unpredictable workloads in conjunction with virtual desktop infrastructure and cloud computing. Unpredictable workloads, however, aggravate the semantic gap between the virtual machine monitor and guest operating systems, leading to inefficient resource management. In particular, CPU management for virtual machines has a critical impact on I/O performance in cases where the virtual machine monitor is agnostic about the internal workloads of each virtual machine. This paper presents virtual machine scheduling techniques that transparently bridge the semantic gap caused by consolidated workloads. To achieve this goal, we make the virtual machine monitor aware of task-level I/O-boundedness inside each virtual machine using inference techniques, thereby improving I/O performance without compromising CPU fairness. In addition, we address performance anomalies arising from the indirect use of I/O devices via a driver virtual machine at the scheduling level. The proposed techniques are implemented on the Xen virtual machine monitor and evaluated with micro-benchmarks and real workloads on Linux and Windows guest operating systems.
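One way such inference can work is to classify guest tasks by externally observable behavior: tasks that block after very short CPU bursts look I/O-bound and deserve prompt dispatch. The Python sketch below is a toy illustration of that idea; the thresholds and classification rule are assumptions, not the paper's actual inference technique.

```python
# Classify a task as I/O-bound if most of its CPU bursts are very short,
# i.e., it runs briefly and then blocks waiting for I/O.

def classify(cpu_bursts_ms, threshold_ms=1.0, short_ratio=0.8):
    short = sum(1 for b in cpu_bursts_ms if b < threshold_ms)
    return "io-bound" if short / len(cpu_bursts_ms) >= short_ratio else "cpu-bound"

print(classify([0.2, 0.4, 0.3, 0.1, 0.5]))     # -> io-bound
print(classify([12.0, 9.5, 30.0, 0.2, 8.8]))   # -> cpu-bound
```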

14.
A NUCA Substrate for Flexible CMP Cache Sharing
We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency; completely shared, in which every processor shares the entire cache, thus minimizing misses; and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads, at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses.
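The "sharing degree" parameter can be made concrete with a small sketch: with 16 processors and 64 banks, sharing degree S groups processors into clusters of S, each cluster owning 64*S/16 banks. The mapping below is an illustrative assumption, not the paper's exact bank-mapping policy.

```python
N_PROCS, N_BANKS = 16, 64

def bank_partition(sharing_degree):
    """Group processors into clusters of `sharing_degree`, splitting banks."""
    banks_per_cluster = N_BANKS * sharing_degree // N_PROCS
    clusters = {}
    for p in range(N_PROCS):
        c = p // sharing_degree
        lo = c * banks_per_cluster
        clusters.setdefault(c, {"procs": [], "banks": list(range(lo, lo + banks_per_cluster))})
        clusters[c]["procs"].append(p)
    return clusters

for degree in (1, 2, 4, 16):          # unshared ... completely shared
    clusters = bank_partition(degree)
    print(f"sharing degree {degree:2}: {len(clusters):2} clusters, "
          f"{N_BANKS // len(clusters)} banks each")
```

Low degrees keep banks close to their owner (low hit latency); high degrees pool capacity (fewer misses); the paper's result is that S = 2 or 4 balances the two best for its workload suite.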

15.
RPM enables rapid prototyping of different multiprocessor architectures. It uses hardware emulation for reliable design verification and performance evaluation. The major objective of the RPM project is to develop a common, configurable hardware platform to accurately emulate different MIMD systems with up to eight execution processors. Because emulation is orders of magnitude faster than simulation, an emulator can run problems with large data sets that are more representative of the workloads for which the target machine is designed. Because an emulation is closer to the target implementation than an abstracted simulation, it can accomplish more reliable performance evaluation and design verification. Finally, an emulator is a real computer with its own I/O; the code running on the emulator is not instrumented. As a result, the emulator looks exactly like the target machine (to the programmer) and can run several different workloads, including code from production compilers, operating systems, databases, and software utilities.

16.
Cloud computing has permeated the information technology industry in the last few years and is now emerging in scientific environments. Science user communities demand a broad range of computing power to satisfy the needs of high-performance applications, drawing on resources such as local clusters, high-performance computing systems, and computing grids. Different computational models give rise to different workloads, and the cloud is already considered a promising paradigm for serving them. The scheduling and allocation of resources is always a challenging matter in any form of computation, and clouds are no exception. Science applications have unique features that differentiate their workloads; hence, their requirements have to be taken into consideration and fulfilled when building a Science Cloud. This paper discusses the main scheduling and resource-allocation challenges for any Infrastructure as a Service provider supporting scientific applications.

17.
Both parallel and distributed network environment systems play a vital role in the improvement of high-performance computing. Of primary concern when analyzing these systems is multiprocessor task scheduling. This paper therefore addresses the challenge of scheduling parallel programs, represented as directed acyclic task graphs (DAGs), for execution on multiprocessors with communication costs. We investigate genetic algorithms (GAs), a class of robust stochastic search algorithms for various combinatorial optimization problems that has recently received much attention. We design a new encoding mechanism with a multi-functional chromosome that uses a priority representation, the so-called priority-based multi-chromosome (PMC). The PMC can efficiently represent a task schedule and assign tasks to processors. The proposed priority-based GA has shown effective performance in various parallel environments.
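A priority-based multi-chromosome can be decoded into a schedule by list scheduling: one gene vector supplies task priorities, another the processor assignment. The Python sketch below shows such a decoding; the encoding details are illustrative assumptions, and communication costs and GA operators are omitted for brevity.

```python
# Decode (priorities, assignment) chromosomes into a DAG schedule.

def decode(priorities, assignment, dag, exec_time, n_procs):
    """dag: task -> list of predecessor tasks. Returns the makespan."""
    done, finish, proc_free = set(), {}, [0.0] * n_procs
    while len(done) < len(dag):
        ready = [t for t in dag if t not in done
                 and all(p in done for p in dag[t])]
        t = max(ready, key=lambda t: priorities[t])   # highest priority first
        p = assignment[t]                             # processor gene
        start = max(proc_free[p],
                    max((finish[d] for d in dag[t]), default=0.0))
        finish[t] = start + exec_time[t]
        proc_free[p] = finish[t]
        done.add(t)
    return max(finish.values())

dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
makespan = decode({"a": 3, "b": 2, "c": 1, "d": 0},   # priority chromosome
                  {"a": 0, "b": 0, "c": 1, "d": 0},   # assignment chromosome
                  dag, {"a": 2, "b": 3, "c": 3, "d": 1}, n_procs=2)
print(makespan)   # 6.0 for this toy instance
```

A GA then searches over the two gene vectors, evaluating each chromosome pair by the makespan this decoder produces.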

18.
Dynamic, high-performance, or real-time applications require scheduling latencies and throughput not typically offered by current kernel-level or user-level thread schedulers. Moreover, it is widely accepted that it is important to be able to specialize scheduling policies for specific target applications and their execution environments. This paper presents one solution to the construction of such high-performance, application-specific thread schedulers. Specifically, scheduler implementations are composed from modular components, where individual scheduler modules may be specialized to underlying hardware characteristics or may implement precisely the mechanisms and policies desired by application programs. The resulting user-level scheduler implementations can provide resource guarantees by interacting with kernel-level facilities that provide means of resource reservation. This paper demonstrates the concept of composable schedulers by constructing several compositions for highly dynamic target applications in which low scheduling latencies are critical to application performance. Claims about the importance and effectiveness of scheduler composition are validated experimentally on a shared-memory multiprocessor. Scheduler compositions are optimized to take advantage of different low-level hardware attributes and of application-specific knowledge, including the requirements of a Time Warp-based real-time discrete event simulator. Experimental evaluations are based on synthetic workloads, on a real-time simulation blending simulated with implemented control system components, and on a dynamic robot control program. Measurements indicate that schedulers can be composed and specialized to offer performance similar to that of dedicated scheduling co-processors. Copyright © 1999 John Wiley & Sons, Ltd.
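The composition idea can be sketched as assembling a scheduler from small policy modules, each narrowing the set of candidate threads. The module names and interfaces below are illustrative assumptions, not the paper's actual component library.

```python
# Each module maps (ready list, scheduler state) -> narrowed ready list;
# compose() chains modules and picks the surviving candidate.

def earliest_deadline(ready, _state):
    d = min(t["deadline"] for t in ready)
    return [t for t in ready if t["deadline"] == d]

def prefer_warm_cache(ready, state):
    warm = [t for t in ready if t["last_cpu"] == state["cpu"]]
    return warm or ready           # fall back if no warm candidate

def compose(*modules):
    def schedule(ready, state):
        for m in modules:
            ready = m(ready, state)
        return ready[0]
    return schedule

scheduler = compose(earliest_deadline, prefer_warm_cache)
ready = [{"id": 1, "deadline": 5, "last_cpu": 0},
         {"id": 2, "deadline": 5, "last_cpu": 1},
         {"id": 3, "deadline": 9, "last_cpu": 1}]
print(scheduler(ready, {"cpu": 1})["id"])   # -> 2 (tight deadline, warm cache)
```

Swapping or reordering modules specializes the policy to an application or to hardware attributes without rewriting the scheduler core, which is the paper's central claim.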

19.
This work examines scheduling for a real-time multiprocessor (MAFT) in which both hard deadlines and fault-tolerance are necessary system components. A workload for this system consists of a set of concurrent dependent tasks, each with some execution frequency; tasks are also fully ordered by priority. Fault tolerance mechanisms include hardware-supported voting on computation results as well as on task starts, task completions, and branch conditions. The distributed agreement mechanism used on system-level decisions adds a variable threading delay to the run time of each copy of a task. These delays make current schedule verification techniques inapplicable. In the most general execution profile, each processor in the system runs a subset of the tasks, with different tasks possibly having different frequencies. In this work, however, we restrict attention to a special class of workloads, termed uni-schedule, in which each processor executes the entire task set, using the multiple processors to implement full redundancy. In addition, all tasks are assumed to have the same periodicity. Given these restrictions, we produce stable schedules consistent with the initial workload specifications. Algorithms are first given for uni-schedule workloads with no run-time branches, and then for uni-schedule workloads with branches.

20.
The availability of low-cost, high-performance microprocessors has led to various designs of shared-memory multiprocessor systems, and commercial products based on shared memory have proliferated. Such a multiprocessor system is heavily influenced by the structure of its memory system, and most configurations include local cache memories. The more processors a system carries, the larger the local cache memories needed to keep the traffic to and from the shared memory at a reasonable level. The implementation of local cache memories, however, is not a simple task because of environmental limitations. In particular, the general lack of available board space presents a formidable problem: a cache memory system needs space mostly to support the complex control logic circuits for the cache itself and network interfaces such as the snooping logic circuits for the shared bus. Although packaging can be made denser to reduce system size, there are still multiple processors per board, which calls for a more area-efficient cache memory architecture. This paper presents a design of a shared cache for the dual-processor boards of bus-based symmetric multiprocessors. The design and implementation issues are described first, and then the evaluation and measurement results are discussed. The shared cache proposed in this paper has been determined to be quite area-efficient without significant loss of throughput or scalability. It has been implemented as a plug-in unit for TICOM, a prevalent commercial multiprocessor system.


