期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Exploiting locality for scalable information retrieval in peer-to-peer networks

《Information Systems》2005,30(4):277-298

An important problem in unstructured peer-to-peer (P2P) networks is the efficient content-based retrieval of documents shared by other peers. However, existing searching mechanisms are not scaling well because they are either based on the idea of flooding the network with queries or because they require some form of global knowledge.We propose the Intelligent Search Mechanism (ISM) which is an efficient, scalable yet simple mechanism for improving the information retrieval problem in P2P systems. Our mechanism is efficient since it is bounded by the number of neighbors and scalable because no global knowledge is required to be maintained.ISM consists of four components: A Profiling Structure which logs queryhit messages coming from neighbors, a Query Similarity function which calculates the similarity queries to a new query, RelevanceRank which is an online neighbor ranking function and a Search Mechanism which forwards queries to selected neighbors.We deploy and compare ISM with a number of other distributed search techniques over static and dynamic environments. Our experiments are performed with real data over Peerware, our middleware simulation infrastructure which is deployed on 75 workstations. Our results indicate that ISM outperforms its competitors and that in some cases it manages to achieve 100% recall rate while using only half of the network resources required by its competitors. Further, its performance is also superior with respect to the total query response time and our algorithm exhibits a learning behavior as nodes acquire more knowledge. Finally ISM works well in dynamic network topologies and in environments with replicated data sources. 相似文献

2.

Exploiting transition locality in automatic verification of finite-state concurrent systems

Giuseppe?Della Penna Email author Benedetto?Intrigila Igor?Melatti Enrico?Tronci Marisa?Venturini Zilli 《International Journal on Software Tools for Technology Transfer (STTT)》2004,6(4):320-341

In this paper we show that statistical properties of the transition graph of a system to be verified can be exploited to improve memory or time performances of verification algorithms.We show experimentally that protocols exhibit transition locality. That is, with respect to levels of a breadth-first state space exploration, state transitions tend to be between states belonging to close levels of the transition graph. We support our claim by measuring transition locality for the set of protocols included in the Mur verifier distribution .We present a cache-based verification algorithm that exploits transition locality to decrease memory usage and a disk-based verification algorithm that exploits transition locality to decrease disk read accesses, thus reducing the time overhead due to disk usage. Both algorithms have been implemented within the Mur verifier.Our experimental results show that our cache-based algorithm can typically save more than 40% of memory with an average time penalty of about 50% when using (Mur) bit compression and 100% when using bit compression and hash compaction, whereas our disk-based verification algorithm is typically more than ten times faster than a previously proposed disk-based verification algorithm and, even when using 10% of the memory needed to complete verification, it is only between 40 and 530% (300% on average) slower than (RAM) Mur with enough memory to complete the verification task at hand. Using just 300 MB of memory our disk-based Mur was able to complete verification of a protocol with about 10⁹ reachable states. This would require more than 5 GB of memory using standard Mur . 相似文献

3.

Exploiting locality and tolerating remote memory access latency using thread migration

Stephen Jenks Jean-Luc Gaudiot 《International journal of parallel programming》1997,25(4):281-304

Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed memory parallel computers. 相似文献

4.

Integrating load balancing and locality in the parallelization of irregular problems

Fabrizio Sarah Paolo Laura 《Future Generation Computer Systems》2001,17(8):969-975

An irregular problem models the evolution of a system where several elements are irregularly distributed in a domain. The evolution modifies this distribution in a way that cannot be foreseen and the behavior of each element depends upon the elements close to it according to a problem dependent relation. Starting from a hierarchical representation of the domain, we define a parallelization methodology that includes a load balancing strategy that preserves this locality property and a strategy to collect information distributed onto the processing nodes. 相似文献

5.

Exploiting the performance gains of modern disk drives by enhancing data locality 总被引：2，自引：0，他引：2

Yuhui Deng 《Information Sciences》2009,179(14):2494-2511

Due to the widening performance gap between RAM and disk drives, a large number of I/O optimization methods have been proposed and designed to alleviate the impact of this gap. One of the most effective approaches of improving disk access performance is enhancing data locality. This is because the method could increase the hit ratio of disk cache and reduce the seek time and rotational latency. Disk drives have experienced dramatic development since the first disk drive was announced in 1956. This paper investigates some important characteristics of modern disk drives. Based on the characteristics and the observation that data access on disk drives is highly skewed, the frequently accessed data blocks and the correlated data blocks are clustered into objects and moved to the outer zones of a modern disk drive. The idea attempts to enhance spatial locality, improve the efficiency of aggressive sequential prefetch, and take advantage of Zoned Bit Recording (ZBR). An experimental simulation is employed to investigate the performance gains generated by the enhanced data locality. The performance gains are analyzed by breaking down the disk access time into seek time, rotational latency, data transfer time, and hit ratio of the disk cache. Experimental results provide useful insights into the performance behaviours of a modern disk drive with enhanced data locality. 相似文献

6.

Rappel: Exploiting interest and network locality to improve fairness in publish-subscribe systems

Jay A. Patel Étienne Rivière Indranil Gupta Anne-Marie Kermarrec 《Computer Networks》2009,53(13):2304-2320

相似文献

7.

Exploiting application locality to design low-complexity, highly performing, and power-aware embedded classifiers 总被引：1，自引：0，他引：1

Alippi C. Scotti F. 《Neural Networks, IEEE Transactions on》2006,17(3):745-754

Temporal and spatial locality of the inputs, i.e., the property allowing a classifier to receive the same samples over time-or samples belonging to a neighborhood-with high probability, can be translated into the design of embedded classifiers. The outcome is a computational complexity and power aware design particularly suitable for implementation. A classifier based on the gated-parallel family has been found particularly suitable for exploiting locality properties: Subclassifiers are generally small, independent each other, and controlled by a master-enabling module granting that only a subclassifier is active at a time, the others being switched off. By exploiting locality properties we obtain classifiers with accuracy comparable with the ones designed without integrating locality but gaining a significant reduction in computational complexity and power consumption. 相似文献

8.

Parallel techniques in irregular codes: cloth simulation as case of study

《Journal of Parallel and Distributed Computing》2005,65(4):424-436

When parallelizing irregular applications on ccNUMA machines several issues should be taken into account in order to achieve high code performance. These factors include locality exploitation and parallelism, as well as careful use of memory resources (memory overhead). An important number of numerical simulation codes are clear examples of irregular applications. Frequently these kinds of codes include reduction operations in their core, so that an important fraction of the computational time is spent on such operations. Specifically, cloth simulation belongs to this class of applications, being a topic of increasing interest in diverse areas, like in the multimedia industry. Moreover, when real time simulation is the aim, its parallelization becomes an important option.This paper discusses and compares different irregular reduction parallelization techniques on ccNUMA share memory machines. Broadly speaking, we may classify them into two groups: privatization-based and data partitioning-based methods. In this paper we describe a framework, based on data affinity, that permits to develop various algorithms inside the group of the data partitioning-based techniques. All these techniques and approaches are analyzed and adapted to the computational structure of a real, physically based, cloth simulator. 相似文献

9.

Numerical ‘health check’ for scientific codes: the CADNA approach

N.S. Scott F. Jézéquel J.-M. Chesneaux 《Computer Physics Communications》2007,176(8):507-521

Scientific computation has unavoidable approximations built into its very fabric. One important source of error that is difficult to detect and control is round-off error propagation which originates from the use of finite precision arithmetic. We propose that there is a need to perform regular numerical ‘health checks’ on scientific codes in order to detect the cancerous effect of round-off error propagation. This is particularly important in scientific codes that are built on legacy software. We advocate the use of the CADNA library as a suitable numerical screening tool. We present a case study to illustrate the practical use of CADNA in scientific codes that are of interest to the Computer Physics Communications readership. In doing so we hope to stimulate a greater awareness of round-off error propagation and present a practical means by which it can be analyzed and managed. 相似文献

10.

Memory optimization in FPGA-accelerated scientific codes based on unstructured meshes

《Journal of Systems Architecture》2014,60(7):579-591

This paper approaches the memory bottleneck problem in FPGA-accelerated codes processing unstructured meshes. A methodology to reduce the required memory bandwidth is presented and evaluated, based on the combined application of data sorting, coding and compression techniques. Sorting allows efficient streaming between the memory and the FPGA, improving data locality and avoiding redundant data requests. Coding achieves a compact representation of the mesh connectivity. Differential compression reduces the size of the mesh data. We propose a hardware implementation with low resource requirements, tailored to accelerators based on reconfigurable devices. The combination of techniques reduces the memory traffic of two computational problems down to an average 34% and 19% of their original sizes, respectively. 相似文献

11.

Translating submachine locality into locality of reference

《Journal of Parallel and Distributed Computing》2006,66(5):633-646

In this work, we show that the submachine locality exposed by hierarchical bulk-synchronous computations can be efficiently turned into locality of reference on arbitrarily deep hierarchies. Specifically, we develop efficient schemes to simulate parallel programs written for the Decomposable BSP (a BSP variant which features a hierarchical decomposition into submachines) on the sequential Hierarchical Memory Model (HMM), which rewards the exploitation of temporal locality, and on its extension with block transfer, the BT model, which also rewards the exploitation of spatial locality. The simulations yield good hierarchy-conscious sequential algorithms from parallel ones, and provide evidence of the strict relation between submachine locality in parallel computation and locality of reference in the hierarchical memory setting. We also devise a generalization of the HMM result to the self-simulation of D-BSP augmented with hierarchical memory modules, which yields an interesting analog of Brent's lemma, thus proving that the enhanced model features a seamless integration of memory and network hierarchies. 相似文献

12.

Optimal locality preserving projection for face recognition

Yu Chen Xiao-hong Xu Jian-huang LaiAuthor vitae 《Neurocomputing》2011,74(18):3941-3945

In face recognition, when the number of images in the training set is much smaller than the number of pixels in each image, Locality Preserving Projections (LPP) often suffers from the singularity problem. To overcome singularity problem, principal component analysis is applied as a preprocessing step. But this procession may discard some important discriminative information. In this paper, a novel algorithm called Optimal Locality Preserving Projections (O-LPP) is proposed. The algorithm transforms the singular eigensystem computation to eigenvalue decomposition problems without losing any discriminative information, which can reduce the computation complexity. And the theoretical analysis related to the algorithm is also obtained. Extensive experiments on face databases demonstrate the proposed algorithm is superior to the traditional LPP algorithm. 相似文献

13.

Predicting locality phases for dynamic memory optimization

Xipeng Shen Yutao Zhong Chen Ding 《Journal of Parallel and Distributed Computing》2007

Dynamic data, cache, and memory adaptation can significantly improve program performance when they are applied on long continuous phases of execution that have dynamic but predictable locality. To support phase-based adaptation, this paper defines the concept of locality phases and describes a four-component analysis technique. Locality-based phase detection uses locality analysis and signal processing techniques to identify phases from the data access trace of a program; frequency-based phase marking inserts code markers that mark phases in all executions of the program; phase hierarchy construction identifies the structure of multiple phases; and phase-sequence prediction predicts the phase sequence from program input parameters. The paper shows the accuracy and the granularity of phase and phase-sequence prediction as well as its uses in dynamic data packing, memory remapping, and cache resizing. 相似文献

14.

Exploiting purity for atomicity

Flanagan C. Freund S.N. Qadeer S. 《IEEE transactions on pattern analysis and machine intelligence》2005,31(4):275-291

Multithreaded programs often exhibit erroneous behavior because of unintended interactions between concurrent threads. This paper focuses on the noninterference property of atomicity. A procedure is atomic if, for every execution, there is an equivalent serial execution in which the actions of the atomic procedure are not interleaved with actions of other threads. This key property makes atomic procedures amenable to sequential reasoning techniques, which significantly facilitates subsequent validation activities such as code inspection and testing. Several existing tools verify atomicity by using commutativity of actions to show that every execution reduces to a corresponding serial execution. However, experiments with these tools have highlighted a number of interesting procedures that, while intuitively atomic, are not reducible. In this paper, we exploit the notion of pure code blocks to verify the atomicity of such irreducible procedures. If a pure block terminates normally, then its evaluation does not change the program state and, hence, these evaluation steps can be removed from the program trace before reduction. We develop a static typed-based analysis for atomicity based on this insight, and we illustrate this analysis on a number of interesting examples that could not be verified using earlier tools based purely on reduction. 相似文献

15.

Quasidynamic layout optimizations for improving data locality

Kadayif I. Kandemir M. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(11):996-1011

Compiler-directed locality optimization techniques are effective in reducing the number of cycles spent in off-chip memory accesses. Recently, methods have been developed that transform memory layouts of data structures at compile-time to improve spatial locality of nested loops beyond current control-centric (loop nest-based) optimizations. Most of these data-centric transformations use a single static (program-wide) memory layout for each array. A disadvantage of these static layout-based locality enhancement strategies is that they might fail to optimize codes that manipulate arrays, which demand different layouts in different parts of the code. We introduce a new approach, which extends current static layout optimization techniques by associating different memory layouts with the same array in different parts of the code. We call this strategy "quasidynamic layout optimization." In this strategy, the compiler determines memory layouts (for different parts of the code) at compile time, but layout conversions occur at runtime. We show that the possibility of dynamically changing memory layouts during the course of execution adds a new dimension to the data locality optimization problem. Our strategy employs a static layout optimizer module as a building block and, by repeatedly invoking it for different parts of the code, it checks whether runtime layout modifications bring additional benefits beyond static optimization. Our experiments indicate significant improvements in execution time over static layout-based locality enhancing techniques. 相似文献

16.

Learning a locality discriminating projection for classification

Jiani Hu Weihong Deng Jun Guo Weiran Xu 《Knowledge》2009,22(8):562-568

This paper introduces a new algorithm called locality discriminating projection (LDP) for subspace learning, which provides a new scheme for discriminant analysis by considering both the manifold structure and the prior class information. In the LDP algorithm, the overlap among the class-specific manifolds is approximated by an invader graph, and a locality discriminant criterion is proposed to find the projections that best preserve the within-class local structures while decrease the between-class overlap. The feasibility of the LDP algorithm has been successfully tested in text data and visual recognition experiments. Experiment results show it is an effective technique for data modeling and classification comparing to linear discriminant analysis, locality preserving projection, and marginal Fisher analysis. 相似文献

17.

Improving cache locality for GPU-based volume rendering

《Parallel Computing》2014,40(5-6):59-69

We present a cache-aware method for accelerating texture-based volume rendering on a graphics processing unit (GPU). Because a GPU has hierarchical architecture in terms of processing and memory units, cache optimization is important to maximize performance for memory-intensive applications. Our method localizes texture memory reference according to the location of the viewpoint and dynamically selects the width and height of thread blocks (TBs) so that each warp, which is a series of 32 threads processed simultaneously, can minimize memory access strides. We also incorporate transposed indexing of threads to perform TB-level cache optimization for specific viewpoints. Furthermore, we maximize TB size to exploit spatial locality with fewer resident TBs. For viewpoints with relatively large strides, we synchronize threads of the same TB at regular intervals to realize synchronous ray propagation. Experimental results indicate that our cache-aware method doubles the worst rendering performance compared to those provided by the CUDA and OpenCL software development kits. 相似文献

18.

Domains of locality 总被引：1，自引：0，他引：1

Aravind K. Joshi 《Data & Knowledge Engineering》2004,50(3):277-289

相似文献

19.

Fusion of loops for parallelism and locality

Manjikian N. Abdelrahman T.S. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(2):193-209

Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which prevent parallelism. In addition, performance losses result from cache conflicts in fused loops. In this paper, we present new techniques to: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) maintain parallelism and allow the parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We describe algorithms for implementing these techniques in compilers. The techniques are evaluated on a 56-processor KSR2 multiprocessor and on a 18-processor Convex SPP-1000 multiprocessor. The results demonstrate performance improvements for both kernels and complete applications. The results also indicate that careful evaluation of the profitability of fusion is necessary as more processors are used 相似文献

20.

Improving performance of codes with large/irregular stride memory access patterns via high performance reconfigurable computers

Khalid H. Abed Gerald R. Morris 《Journal of Parallel and Distributed Computing》2013

Codes that have large-stride/irregular-stride (L/I) memory access patterns, e.g., sparse matrix and linked list codes, often perform poorly on mainstream clusters because of the general purpose processor (GPP) memory hierarchy. High performance reconfigurable computers (HPRC) contain both GPPs and field programmable gate arrays (FPGAs) connected via a high-speed network. In this research, simple 64-bit floating-point codes are used to illustrate the runtime performance impact of L/I memory accesses in both software-only and FPGA-augmented codes and to assess the benefits of mapping L/I-type codes onto HPRCs. The experiments documented herein reveal that large-stride software-only codes experience severe performance degradation. In contrast, large-stride FPGA-augmented codes experience minimal performance degradation. For experiments with large data sizes, the unit-stride FPGA-augmented code ran about two times slower than software. On the other hand, the large-stride FPGA-augmented code ran faster than software for all the larger data sizes. The largest showed a 17-fold runtime speedup. 相似文献