Similar Documents
 20 similar documents found; search time: 375 ms
1.
In the current key management system for HDFS (Hadoop Distributed File System), all encryption-zone keys are loaded into memory at startup to provide the key service. As the number of keys grows, so does the memory they occupy, leading to memory shortage and a key-indexing bottleneck. Three issues are central to removing this bottleneck: how to organize the cached data and efficiently handle queries for keys that miss the cache, how to adjust the key resources held in the cache, and how to accurately predict key usage. To achieve fine-grained, efficient caching and improve key-usage efficiency, a module architecture for key cache replacement is designed from three aspects: the key-index data structure, the key replacement algorithm, and the key prefetching strategy; key hotness is computed and a key replacement algorithm is configured. Specifically, for hotness computation and cache replacement, the potential factors that influence a key's cache hotness are analyzed from the perspective of the file system and users bound to the key, a basic model of key-usage hotness is built, and a combination of a hash table and a min-heap linked list is used to maintain the hotness of in-use keys. An eviction algorithm is defined on top of hotness identification, a timer controller adjusts key usage, and the keys in the cache are updated dynamically, realizing hotness-based differentiated key replacement. For the prefetching strategy, the time-periodic regularities of business processes and user accesses are considered together, key-usage patterns are mined from logs, and a key preloading strategy is derived. Experiments show that the proposed key replacement algorithm, while reducing memory usage, ...
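A minimal Python sketch of the hash-table-plus-min-heap arrangement the abstract describes. The hotness formula (a recency-decayed hit count), the lazy-deletion eviction, and all names (HotKeyCache, fetch_from_kms) are illustrative assumptions, not the paper's actual model.

import heapq

class HotKeyCache:
    """Toy encryption-zone key cache: dict for O(1) lookup, min-heap ordered
    by a hotness score so the coldest key is evicted first (lazy deletion)."""

    def __init__(self, capacity, backing_store, decay=0.9):
        self.capacity = capacity
        self.fetch_from_kms = backing_store   # called on a cache miss
        self.decay = decay                    # assumed exponential decay of old hits
        self.table = {}                       # key name -> [hotness, key material]
        self.heap = []                        # (hotness snapshot, key name)

    def _touch(self, name):
        entry = self.table[name]
        entry[0] = entry[0] * self.decay + 1.0        # recency-weighted hit count
        heapq.heappush(self.heap, (entry[0], name))   # stale snapshots cleaned lazily

    def get(self, name):
        if name not in self.table:                    # miss: load key, maybe evict
            if len(self.table) >= self.capacity:
                self._evict()
            self.table[name] = [0.0, self.fetch_from_kms(name)]
        self._touch(name)
        return self.table[name][1]

    def _evict(self):
        while self.heap:
            hot, name = heapq.heappop(self.heap)
            # Skip snapshots that no longer match the live hotness value.
            if name in self.table and abs(self.table[name][0] - hot) < 1e-9:
                del self.table[name]
                return

# Usage: cache = HotKeyCache(2, lambda n: b"key-" + n.encode()); cache.get("zone1")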

2.
Modern Internet routers have to handle a large number of packet classification rules, which requires classification schemes to be scalable both in time and space. In this paper, we present a scalable packet classification algorithm that is developed by combining two new concepts with the well-known bit vector (BV) scheme. We propose a range search method based on a cache-aware tree (CATree) which makes full use of the processor's cache line to reduce the number of dynamic random access memory (DRAM) accesses. Theoretically, the number of DRAM accesses of CATree is about log(m+1) times lower than that of the widely used binary search algorithm, where m is the number of keys in a single cache line. Based on our computational results on a set of 1024 keys, the CATree algorithm is 36% faster than the binary search algorithm, and the advantage grows when it is applied to a larger set of keys. In addition, we develop a rule re-arrangement algorithm to reduce the bitmap space of BV. With this re-arrangement, the rules for the same action may be assigned an identical priority. This reduces the number of priorities as well as the memory space of the bitmap. Furthermore, it also reduces the number of memory accesses and hence increases the CPU cache utilization. With CATree and rule re-arrangement, the cache-aware bit vector with rule re-arrangement algorithm achieves better performance than the regular BV scheme, both in space and time. In our experiments, the proposed algorithm reduces the bitmap memory space of a practical set of firewall rules by two orders of magnitude and is 91% faster than the regular BV.
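A rough numeric illustration of the cache-line argument behind CATree, in Python. The geometry (64-byte lines, 4-byte keys) and the "one line read per tree level" cost model are assumptions for illustration, not measurements from the paper.

import math

LINE_BYTES, KEY_BYTES = 64, 4
M = LINE_BYTES // KEY_BYTES          # keys packed into one cache-line node (16 here)

def line_reads_catree(n):
    # One cache-line read per level of an (m+1)-ary search tree over n keys.
    return math.ceil(math.log(n + 1, M + 1))

def line_reads_binary(n):
    # Worst case: every probe of a plain binary search touches a different line.
    return math.ceil(math.log2(n + 1))

for n in (1024, 1 << 20):
    print(n, line_reads_catree(n), line_reads_binary(n))
# 1024 keys: ~3 line reads vs ~11 -- roughly the log(m+1) gap the abstract cites.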

3.
Bruce W. Watson 《Software》2004,34(3):239-248
New applications of finite automata, such as computational linguistics and asynchronous circuit simulation, can require automata of millions or even billions of states. All known construction methods (in particular, the most effective reachability-based ones that save memory, such as the subset construction, and simultaneously minimizing constructions, such as Brzozowski's) have intermediate memory usage much larger than the final automaton, thereby restricting the maximum size of the automata which can be built. In this paper, I present a reachability-based optimization which can be used in most of these construction algorithms to reduce the intermediate memory requirements. The optimization is presented in conjunction with an easily understood (and implemented) canonical automaton construction algorithm. Copyright © 2003 John Wiley & Sons, Ltd.

4.
Existing key-value storage systems lack hotspot awareness, so they perform poorly and unreliably under highly skewed workloads. To address this, an adaptive hotspot-aware hash index model is proposed that implements a high-performance hash table based on key digests. First, the digest of a key is used in place of the key itself, compressing the key storage space and optimizing the bucket data structure of the hash table. Second, the hash table's probe operation is optimized using the CPU's data-level parallelism and the CPU cache line. Finally, because digests prevent exact key comparison and can incur extra disk I/O, an adaptive key scheduling algorithm is designed that dynamically adjusts where key values are stored according to the currently available memory, the hash-index load, and the access hotspots. Experiments on the YCSB benchmark datasets show that, at the same memory utilization, the adaptive hotspot-aware hash index is up to 1.2x faster than state-of-the-art hash tables.
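A sketch of digest-based bucket probing in Python, assuming one-byte key fingerprints per bucket and a numpy vector comparison standing in for the CPU's data-level parallelism. The names (Bucket, SLOTS, probe) and the 16-slot bucket size are illustrative, not the paper's.

import numpy as np

SLOTS = 16                                   # slots per bucket, sized to a cache line

class Bucket:
    def __init__(self):
        self.digests = np.zeros(SLOTS, dtype=np.uint8)   # compact fingerprints
        self.used = np.zeros(SLOTS, dtype=bool)
        self.keys = [None] * SLOTS                       # full keys (may live on disk)
        self.values = [None] * SLOTS

def digest(key: str) -> int:
    return (hash(key) >> 8) & 0xFF or 1      # keep 0 reserved for "empty"

def probe(bucket: Bucket, key: str):
    # One vectorized compare filters candidate slots; only candidates pay the
    # cost of a full (possibly out-of-memory) key comparison.
    candidates = np.flatnonzero(bucket.used & (bucket.digests == digest(key)))
    for slot in candidates:
        if bucket.keys[slot] == key:
            return bucket.values[slot]
    return None

def insert(bucket: Bucket, key: str, value):
    free = np.flatnonzero(~bucket.used)
    if free.size == 0:
        raise RuntimeError("bucket full; a real table would split or rehash")
    slot = free[0]
    bucket.digests[slot] = digest(key)
    bucket.used[slot] = True
    bucket.keys[slot], bucket.values[slot] = key, value

b = Bucket()
insert(b, "user:42", {"name": "alice"})
print(probe(b, "user:42"))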

5.
This paper examines the performance and memory-access behavior of the C4.5 decision-tree induction program, a representative example of data mining applications, for both uniprocessor and parallel implementations. The goals of this paper are to characterize C4.5, in particular its memory hierarchy usage, and to decrease the run-time of C4.5 via algorithmic improvement and parallelization. Performance is studied via RSIM, an execution-driven simulator, for three uniprocessor models that exploit instruction-level parallelism to varying degrees. This paper makes the following four contributions. The first contribution is presenting a complete characterization of the C4.5 decision-tree induction program. The results show that, with the exception of the input data set, the working set fits into an 8-Kbyte data cache; the instruction working set also fits into an 8-Kbyte instruction cache. For data sets larger than the L2 cache, performance is limited by accesses to main memory. The results further establish that four-way issue can provide up to a factor of two performance improvement over single-issue for larger L2 caches; for smaller L2 caches, out-of-order dispatch provides a large performance improvement over in-order dispatch. The second contribution is examining the effect on the memory hierarchy of changing the layout of the input dataset in memory, showing again that performance is limited by memory accesses. One proposed data layout decreases the dynamic instruction count by up to 24%, but usually results in lower performance due to worse cache behavior. Another proposed data layout does not improve the dynamic instruction count over the original layout, but has better cache behavior and decreases the run-time by up to a factor of two. Third, this paper presents the first decision-tree induction program parallelized for a ccNUMA architecture. A method for splitting the decision tree hash table is discussed that allows the hash table to be updated and accessed simultaneously without the use of locks. The performance of the parallel version is compared to the original version of C4.5 and a uniprocessor version of C4.5 using the best data layout found. Speedup curves from a six-processor Sun E4000 SMP system show a speedup on the induction step of 3.99, and simulation results show that the performance is mostly unaffected by increasing the remote memory access time until it is over a factor of ten greater than the local memory access time. Last, this paper characterizes the parallelized decision-tree induction program. Compared to the uniprocessor version, the parallel version exerts significantly less pressure on the memory hierarchy, with the exception of having a much larger first-level data working set.

6.
Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel(R) Many Integrated Core (MIC) Architecture provides not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data three times. In each pass, it computes multiple 1D FFTs along one of three dimensions, giving rise to plenty of strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the remaining dimensions respectively. The difference in the amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse of local data. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel(R) Xeon Phi(TM) coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which beats the vendor-specific Intel(R) MKL library by a factor of up to 2.22X.

7.
8.
In-network caching is one of the most important features of Information-Centric Networking (ICN); it greatly reduces the response time of information requests and the traffic inside the network. Allocating a reasonable cache size to each router has a considerable impact on network performance and can also save network cost. To configure router cache sizes appropriately, a new metric called node weight is first defined, which combines a router's degree weight, closeness, network centrality, and request influence. A node-weight-based cache size allocation scheme is then proposed that distributes the total capacity required by the network to routers in proportion to their weights. Simulation results show that, compared with uniform allocation, router cache utilization improves by at least 8% and the hit rate by at least 6%; compared with the request-influence-based allocation scheme, cache utilization improves by at least 3% and the hit rate by at least 3%.
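A minimal Python sketch of proportional allocation by node weight. The component coefficients (alpha..delta), their equal default values, and the metric normalization are assumptions; the abstract only lists the factors that enter the weight.

def node_weight(degree, closeness, centrality, request_influence,
                alpha=0.25, beta=0.25, gamma=0.25, delta=0.25):
    # Weighted sum of the four factors named in the abstract (coefficients assumed).
    return (alpha * degree + beta * closeness +
            gamma * centrality + delta * request_influence)

def allocate_cache(total_capacity, routers):
    """routers: dict name -> (degree, closeness, centrality, request_influence).
    Returns dict name -> cache size, proportional to node weight."""
    weights = {name: node_weight(*metrics) for name, metrics in routers.items()}
    total_w = sum(weights.values())
    return {name: total_capacity * w / total_w for name, w in weights.items()}

routers = {"r1": (0.9, 0.6, 0.8, 0.7), "r2": (0.3, 0.4, 0.2, 0.1)}
print(allocate_cache(10_000, routers))   # r1 receives the larger share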

9.
张鸿骏  武延军  张珩  张立波 《软件学报》2020,31(10):3038-3055
A hash table is a data-indexing structure that provides efficient data access by key value. It is widely used in all kinds of computer applications, especially in performance-critical system software, databases, and high-performance computing. In networking, cloud computing, and IoT services, hash-table-centric structures have become essential components of caching systems. However, with the dramatic growth of data volumes, systems that build hash tables around multi-core CPUs have begun to hit performance bottlenecks, and the performance and scalability of hash tables urgently need further improvement. With the increasing popularity of general-purpose graphics processing units (GPUs) and the large gains in hardware compute capability and concurrency, many parallel system software tasks have been redesigned for and substantially accelerated on GPUs. Because of sparsity and randomness, directly porting existing parallel hash-table structures to the GPU inevitably causes frequent memory accesses and bus transfers, limiting the performance hash tables can achieve on GPUs. This paper analyzes the memory accesses, hit rate, and indexing overhead of hash-table indexes in caching systems, and proposes CCHT (cache cuckoo hash table), a hybrid-access cache indexing framework adapted to the GPU. CCHT provides two caching policies for different hit-rate and overhead requirements, allows writes and queries to execute concurrently, makes maximal use of the GPU's compute and concurrency capabilities, and reduces memory accesses and bus transfers. Implementation and experiments on GPU hardware show that CCHT outperforms other hash tables used for cache indexing while maintaining the cache hit rate.
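A minimal cuckoo-hashing sketch in Python (two tables, two hash functions, displacement on collision). It shows only the cuckoo-hash core that CCHT builds on; the GPU concurrency, the two caching policies, and the eviction logic of CCHT are not modeled, and the class name is made up.

class CuckooHash:
    def __init__(self, capacity=8, max_kicks=32):
        self.cap = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        return hash((which, key)) % self.cap   # stand-in for two independent hashes

    def get(self, key):
        for which in (0, 1):
            item = self.tables[which][self._slot(which, key)]
            if item is not None and item[0] == key:
                return item[1]
        return None

    def put(self, key, value):
        for which in (0, 1):                  # update in place if the key is present
            slot = self._slot(which, key)
            item = self.tables[which][slot]
            if item is not None and item[0] == key:
                self.tables[which][slot] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            slot = self._slot(which, entry[0])
            entry, self.tables[which][slot] = self.tables[which][slot], entry
            if entry is None:                 # displaced nothing: done
                return
            which ^= 1                        # push the victim into the other table
        raise RuntimeError("cycle detected; a real table would rehash or grow")

t = CuckooHash()
t.put("a", 1); t.put("b", 2)
print(t.get("a"), t.get("b"))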

10.
Main memory cache performance continues to play an important role in determining the overall performance of object-oriented, object-relational and XML databases. An effective method of improving main memory cache performance is to prefetch or pre-load pages in advance of their usage, in anticipation of main memory cache misses. In this paper we describe a framework for creating prefetching algorithms with the novel features of path and cache consciousness. Path consciousness refers to the use of short sequences of object references at key points in the reference trace to identify paths of navigation. Cache consciousness refers to the use of historical page access knowledge to guess which pages are likely to be main memory cache resident most of the time, and then assumes these pages do not exist in the context of prefetching. We have conducted a number of experiments comparing our approach against four highly competitive prefetching algorithms. The results show our approach outperforms existing prefetching techniques in some situations while performing worse in others. We provide guidelines as to when our algorithm should be used and when others may be more desirable.

11.
In embedded systems, caches are very precious for keeping memory bandwidth low and for allowing the use of slow and narrow off-chip devices. Conversely, the power and die size consumed by the cache force embedded system designers to use small and simple cache memories. Such caches can deliver poor performance because of their inflexible placement policy. In this scenario, a large fraction of the misses can originate from the mismatch between the cache behavior and the locality features of the memory accesses (conflict misses). In this paper we analyze the conflict miss phenomenon and define a cache utilization measure. We then propose an object-level Cache Aware allocation Technique (CAT) to transform the application to fit the cache structure, minimize the number of conflict misses and maximize cache exploitation. The solution transforms the program layout using the standard functionality of a linker. The CAT approach allowed the considered applications to deliver the same performance on caches two and sometimes four times smaller. Moreover, the CAT-improved programs on direct-mapped caches outperformed the original versions on set-associative caches. These results highlight that our approach can help embedded system designers meet the system requirements with smaller and simpler cache memories.
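A toy Python model of the conflict misses CAT targets: in a direct-mapped cache, two hot objects whose addresses map to the same set evict each other on every alternating access. The cache geometry (32 KiB, 64-byte lines, direct-mapped) is an assumption; CAT itself changes the link-time layout, which is only mimicked here by shifting one object's base address.

LINE = 64
SETS = 32 * 1024 // LINE        # 512 sets, direct-mapped

def cache_set(addr):
    return (addr // LINE) % SETS

def misses(trace):
    tags = [None] * SETS        # one tag per set (direct-mapped)
    miss = 0
    for addr in trace:
        s, tag = cache_set(addr), addr // LINE // SETS
        if tags[s] != tag:
            miss += 1
            tags[s] = tag
    return miss

def trace(base_a, base_b, rounds=1000):
    # Alternate accesses to one line of object A and one line of object B.
    return [base_a if i % 2 == 0 else base_b for i in range(rounds)]

conflicting = trace(0x0000, SETS * LINE)          # same set -> thrashing
relocated   = trace(0x0000, SETS * LINE + LINE)   # shifted by one line at link time
print(misses(conflicting), misses(relocated))     # 1000 vs 2 misses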

12.
To reduce the high cost of star joins between the fact table and multiple dimension tables in online analytical processing (OLAP), a star-join optimization method for modern multi-core CPUs and GPUs is proposed. First, to address the materialization cost of star joins on multi-core CPU and GPU platforms, a vectorized star-join algorithm based on vector indexes is proposed for both platforms. Then, vectors are partitioned according to the size of the CPU cache and the GPU shared memory, yielding a vector-granularity star-join operation that reduces the materialization cost of the vector indexes. Finally, a compressed-vector star-join algorithm is proposed that compresses fixed-length vector indexes into variable-length binary vector indexes, improving the in-cache storage and access efficiency of the vector indexes at low selectivity. Experimental results show that on the CPU platform the vectorized star-join algorithm improves performance by more than 40% over conventional row-wise or column-wise joins, and on the GPU platform by more than 15% over the conventional star-join algorithm. Compared with mainstream in-memory and GPU databases, the optimized star join is 130% faster than Hyper, the best-performing in-memory database, and 80% faster than OmniSci, the best-performing GPU database. Vector-index-based vectorized star-join optimization thus effectively improves multi-way join performance: compared with traditional optimizations, vector-index-based vectorized processing improves data access efficiency in small caches, and vector compression further improves the in-cache access efficiency of the vector indexes.
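A toy numpy sketch of the vector-index idea behind the vectorized star join: each dimension predicate is compiled into a dense boolean vector over the dimension's surrogate-key domain, the fact table's foreign-key columns are probed with vectorized gathers, and the per-dimension results are AND-ed into one index vector before any payload columns are materialized. Table shapes and column names are made up; the vector partitioning for cache/shared memory and the vector compression steps are not modeled.

import numpy as np

rng = np.random.default_rng(0)
N_FACT, N_DATE, N_CUST = 1_000_000, 365, 1_000

fact = {
    "date_fk": rng.integers(0, N_DATE, N_FACT),
    "cust_fk": rng.integers(0, N_CUST, N_FACT),
    "revenue": rng.random(N_FACT),
}

# Dimension predicates compiled into boolean vectors over the surrogate-key domain.
date_pass = np.zeros(N_DATE, dtype=bool); date_pass[90:180] = True   # "Q2 only"
cust_pass = np.zeros(N_CUST, dtype=bool); cust_pass[:100] = True     # "top customers"

# Star join as two gathers plus a bitwise AND -- no row-at-a-time probing.
vec_index = date_pass[fact["date_fk"]] & cust_pass[fact["cust_fk"]]
print(vec_index.sum(), fact["revenue"][vec_index].sum())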

13.
Iyer  Ravi 《World Wide Web》2004,7(3):259-280
As Internet usage continues to expand rapidly, careful attention needs to be paid to the design of Internet servers for achieving high performance and end-user satisfaction. Currently, the memory system remains a significant performance bottleneck for Internet servers employing multi-GHz processors. In this paper, our aim is two-fold: (1) to characterize the cache/memory performance of web server workloads and (2) to propose and evaluate cache design alternatives for future web servers. We chose SPECweb99 as the representative web server workload, and our entire characterization and evaluation methodology is based on our CASPER simulation framework. We begin by exploring the processor cache design space for single- and dual-processor servers. Based on our observations, we then evaluate other cache hierarchy alternatives such as chipset caches, coherence filters and decompressed page stores. We show the sensitivity of these components to basic organization parameters such as cache size, line size and degree of associativity. We also present the performance implications of routing memory requests initiated by I/O devices through these caches. Based on detailed simulation data and its implications on system-level performance, this paper shows that chipset caches have significant potential for improving future web server performance.

14.
Nesbit  K.J. Smith  J.E. 《Micro, IEEE》2005,25(1):90-97
Over the past couple of decades, trends in both microarchitecture and underlying semiconductor technology have significantly reduced microprocessor clock periods. These trends have significantly increased relative main-memory latencies as measured in processor clock cycles. To avoid large performance losses caused by long memory access delays, microprocessors rely heavily on a hierarchy of cache memories. But cache memories are not always effective, either because they are not large enough to hold a program's working set, or because memory access patterns don't exhibit behavior that matches a cache memory's demand-driven, line-structured organization. To partially overcome cache memories' limitations, we organize data cache prefetch information in a new way: a GHB (global history buffer) supports existing prefetch algorithms more effectively than conventional prefetch tables. It reduces stale table data, improving accuracy and reducing memory traffic. It contains a more complete picture of cache miss history and is smaller than conventional tables.
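A minimal Python sketch of a global history buffer: a FIFO of recent miss addresses, an index table pointing at the newest entry per key (here the load PC), and per-key link pointers chaining older entries. A stride detector walks the chain of the current PC and issues a prefetch when the last two deltas agree. The buffer size, the choice of PC as the index key, and the "prefetch one stride ahead" policy are illustrative assumptions.

from collections import deque

GHB_SIZE = 8

class GHB:
    def __init__(self):
        self.buf = deque(maxlen=GHB_SIZE)   # entries: (pc, miss_addr, prev_position)
        self.head = 0                       # global position of buf[0]
        self.index = {}                     # pc -> global position of newest entry

    def _get(self, pos):
        off = pos - self.head
        return self.buf[off] if 0 <= off < len(self.buf) else None

    def record_miss(self, pc, addr):
        if len(self.buf) == self.buf.maxlen:
            self.head += 1                  # oldest entry is about to fall out
        prev = self.index.get(pc)
        self.buf.append((pc, addr, prev))
        self.index[pc] = self.head + len(self.buf) - 1
        return self._stride_prefetch(pc)

    def _stride_prefetch(self, pc):
        # Walk the last three misses of this PC; links to evicted entries resolve to None.
        chain, pos = [], self.index.get(pc)
        while pos is not None and len(chain) < 3:
            entry = self._get(pos)
            if entry is None or entry[0] != pc:
                break
            chain.append(entry[1])
            pos = entry[2]
        if len(chain) == 3 and chain[0] - chain[1] == chain[1] - chain[2] != 0:
            return chain[0] + (chain[0] - chain[1])   # predicted next miss address
        return None

ghb = GHB()
for addr in (0x100, 0x140, 0x180):
    hint = ghb.record_miss(pc=0x4004, addr=addr)
print(hex(hint))   # 0x1c0 -- one stride ahead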

15.
Routing table lookup is an important operation in packet forwarding. This operation has a significant influence on the overall performance of network processors. Routing tables are usually stored in main memory, which has a large access time. Consequently, small fast cache memories are used to improve access time. In this paper, we propose a novel routing table compaction scheme to reduce the number of entries in the routing table. The proposed scheme has three versions. The scheme takes advantage of ternary content addressable memory (TCAM) features: two or more routing entries are compacted into one using don't-care elements in TCAM. A small compacted routing table helps to increase the cache hit rate, which in turn provides fast address lookups. We have evaluated this compaction scheme through extensive simulations involving IPv4 and IPv6 routing tables and routing traces. The original routing tables have been compacted by over 60% of their original sizes. The average cache hit rate has improved by up to 15% over the original tables. We have also analyzed port errors caused by caching, and developed a new sampling technique to alleviate this problem. The simulations show that sampling is an effective scheme for port error control without degrading cache performance.
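A Python sketch of the TCAM-style compaction idea: two entries with the same next hop whose match patterns differ in exactly one bit can be merged into one pattern with a don't-care ('*') in that position. Patterns are fixed-width bit strings here; real TCAM entries, prefix priorities, and the paper's three scheme variants are not modeled.

def try_merge(p1, p2):
    diff = [i for i, (a, b) in enumerate(zip(p1, p2)) if a != b]
    if len(diff) == 1:
        i = diff[0]
        return p1[:i] + "*" + p1[i + 1:]
    return None

def compact(entries):
    """entries: list of (pattern, next_hop). Greedily merge mergeable pairs."""
    changed = True
    while changed:
        changed = False
        for i in range(len(entries)):
            for j in range(i + 1, len(entries)):
                if entries[i][1] != entries[j][1]:
                    continue                      # different action: cannot merge
                merged = try_merge(entries[i][0], entries[j][0])
                if merged is not None:
                    nh = entries[i][1]
                    entries = [e for k, e in enumerate(entries) if k not in (i, j)]
                    entries.append((merged, nh))
                    changed = True
                    break
            if changed:
                break
    return entries

table = [("110100**", "portA"), ("110101**", "portA"), ("111000**", "portB")]
print(compact(table))   # the two portA entries collapse into '11010***'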

16.
As the real-world databases to be mined keep growing, the memory available to the system becomes the bottleneck for association-rule mining with the FP-GROWTH algorithm. To escape this memory constraint and mine association rules from large-scale databases, disk-based association-rule mining has become an important research direction. This paper improves the original FP-TREE data structure and proposes a novel disk-table-based algorithm, DTRFP-GROWTH (disk table resident FP-TREE growth). The algorithm stores the FP-TREE in a disk table to reduce memory usage; when the traditional FP-GROWTH algorithm consumes too much memory and mining cannot proceed, DTRFP-GROWTH can continue the mining work thanks to its disk-table-resident FP-TREE, making it suitable for scenarios where space efficiency takes priority. Moreover, the algorithm integrates association-rule mining with a relational database, overcoming the lower efficiency and higher development difficulty of file-system-based approaches. Validation experiments and performance analyses on real datasets show that, when memory is limited, DTRFP-GROWTH is an effective disk-based association-rule mining algorithm.
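A small Python sketch of the "FP-tree in a relational table" idea: each tree node becomes a row (id, item, count, parent), so inserting a transaction's frequent items walks or extends the on-disk tree instead of an in-memory one. The schema and the use of SQLite are illustrative assumptions; the paper targets a full RDBMS, and the FP-GROWTH mining phase itself is not shown.

import sqlite3

con = sqlite3.connect(":memory:")            # use a file path for a real disk table
con.execute("""CREATE TABLE fp_node (
    id INTEGER PRIMARY KEY, item TEXT, count INTEGER, parent INTEGER)""")
con.execute("INSERT INTO fp_node VALUES (0, NULL, 0, NULL)")   # root node

def insert_transaction(items):
    parent = 0
    for item in items:                       # items assumed pre-sorted by frequency
        row = con.execute(
            "SELECT id FROM fp_node WHERE parent = ? AND item = ?",
            (parent, item)).fetchone()
        if row:                              # shared prefix: bump the count
            node_id = row[0]
            con.execute("UPDATE fp_node SET count = count + 1 WHERE id = ?", (node_id,))
        else:                                # new branch
            node_id = con.execute(
                "INSERT INTO fp_node (item, count, parent) VALUES (?, 1, ?)",
                (item, parent)).lastrowid
        parent = node_id

for tx in (["f", "c", "a", "m"], ["f", "c", "a", "b"], ["f", "b"]):
    insert_transaction(tx)
print(con.execute("SELECT item, count, parent FROM fp_node ORDER BY id").fetchall())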

17.
LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms in order to perform matrix-matrix operations, that is, to obtain the highest performance by maximizing reuse of data in the upper levels of memory, such as cache. Since parallel computers have different ratios of computation to communication performance, the optimal computational block sizes differ from machine to machine in order to generate the maximum performance of an algorithm. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. Too small or too large a block size makes achieving good performance on a machine nearly impossible. In such a case, getting better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with an 'algorithmic blocking' on a two-dimensional block-cyclic data distribution. With algorithmic blocking, it is possible to obtain near-optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines. Copyright © 2001 John Wiley & Sons, Ltd.
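A serial numpy sketch of a right-looking blocked Cholesky factorization with a tunable "algorithmic" block size nb that is independent of how the matrix is stored — the property the paper exploits so the distribution block size need not equal the computational block size. This is a plain shared-memory illustration; the block-cyclic distribution and ScaLAPACK comparison are not modeled.

import numpy as np

def blocked_cholesky(a, nb=2):
    """Right-looking blocked Cholesky: returns lower-triangular L with A = L L^T."""
    a = np.array(a, dtype=float)
    n = a.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        a[k:e, k:e] = np.linalg.cholesky(a[k:e, k:e])      # factor the diagonal block
        if e < n:
            l11 = a[k:e, k:e]
            # Panel solve: L21 = A21 * L11^{-T}
            a[e:, k:e] = np.linalg.solve(l11, a[e:, k:e].T).T
            l21 = a[e:, k:e]
            # Trailing update with a matrix-matrix product (the cache-friendly bulk).
            a[e:, e:] -= l21 @ l21.T
    return np.tril(a)

rng = np.random.default_rng(1)
m = rng.random((6, 6)); spd = m @ m.T + 6 * np.eye(6)       # symmetric positive definite
L = blocked_cholesky(spd, nb=2)
print(np.allclose(L @ L.T, spd))                            # True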

18.
A prefetch method that enables stride prefetching at the secondary cache without accessing the processor's internal resources is developed and evaluated. It uses a data-range table that enables it to detect usable strides and memory access streams which fall into the same data range. Using program-driven simulation of scientific applications in the context of shared-memory multiprocessors, it is shown that the proposed method can reduce load stall times by an amount comparable to a conventional stride-driven prefetching method which requires access to the processor's instruction address register.
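A Python sketch of range-based stride detection at the secondary cache: incoming miss addresses are binned by data range (fixed 4 KiB regions standing in for data-range-table entries, since no PC is available outside the core), and a repeated delta within a range triggers a prefetch one stride ahead. The range size and the single-confirmation policy are illustrative assumptions.

RANGE_BITS = 12                      # 4 KiB ranges -- an assumed granularity

class RangeStrideTable:
    def __init__(self):
        self.entries = {}            # range id -> (last_addr, last_stride)

    def observe(self, addr):
        rid = addr >> RANGE_BITS
        last = self.entries.get(rid)
        prefetch = None
        if last is not None:
            last_addr, last_stride = last
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride          # confirmed stream: prefetch ahead
            self.entries[rid] = (addr, stride)
        else:
            self.entries[rid] = (addr, 0)
        return prefetch

t = RangeStrideTable()
for a in (0x8000, 0x8040, 0x8080, 0x80c0):
    hint = t.observe(a)
print(hex(hint))   # 0x8100 once the 0x40 stride has repeated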

19.
A Region-Based Compilation Framework   (Cited by: 2; self-citations: 2; citations by others: 2)
The traditional function-scoped back-end compilation framework is a convenient way to partition a program. However, considering the resource requirements of compilation (such as compile time and memory usage), code performance, and compiler functionality, the size and structure of a function are not the most suitable program partition for analysis and optimization. This mismatch becomes even more pronounced as modern compilers adopt increasingly complex algorithms of higher time and space complexity in order to exploit as much instruction-level parallelism as possible. When a function's scope is large, applying such algorithms with the function as the basic compilation unit usually leads to excessive compile time and/or memory consumption. Hank proposed a compilation framework in which the scope and structure of optimization can be controlled to some extent. Taking both compile time and optimization opportunities into account, this paper proposes a new region-based compilation framework that also allows region-based optimization-guiding attributes to be passed and observed across different optimization phases. The region-based compilation framework has been implemented in ORC (Open Research Compiler), a compiler targeting the Itanium processor. Experimental results show that the framework is successful in controlling the time and space complexity of compilation.

20.
To boost the performance of massive data processing, solid-state drives (SSDs) have been used as a kind of cache in the Hadoop system. However, most existing SSD cache management algorithms are ignorant of the characteristics of upper-level applications. In this paper, we propose a novel SSD cache management algorithm called DSA, which can exploit application-level data similarity to improve SSD cache performance in Hadoop. Our algorithm takes both temporal similarity and user similarity in querying behaviors into account. We evaluate the effectiveness of the proposed DSA algorithm in a small-scale Hadoop cluster. Our experimental results show that the algorithm achieves much better performance than other well-known algorithms (e.g., LRU, FIFO). We also clearly point out the underlying tradeoff between cache performance and SSD deployment cost, and identify a number of key factors that affect SSD cache performance. Our findings can provide useful guidelines on how to effectively integrate SSDs into Hadoop.


