期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

《计算机工程》2019,(6)

硬件数据预取技术可以有效提升处理器的访存性能,但传统流预取策略存在预取不及时的问题。为此,提出一种双倍步长流预取策略,并设计对应的预取部件结构。预取部件自动检测数据流的固定步长并将该步长扩大为原有的2倍,以计算预取地址。实验结果表明,加入该预取部件后,运行SPEC2006测试集的整数应用与浮点应用时,处理器性能最高可分别提升45%与57%,针对Cache Miss率较高的应用,该预取部件可以有效隐藏访存延时。相似文献

2.

基于申威GCC编译器的间接预取算法

余龙龙韩林《计算机系统应用》2022,31(8):203-211

对间接存储器的访问延迟往往会影响应用程序的执行性能, 一种有效的解决方案是使用预取技术. 国产申威平台中支持常规访问模式的软件预取和硬件预取机制, 但是其GCC编译器中缺少为间接存储器访问模式自动插入预取的方法. 为了解决这个问题, 基于申威GCC开发了一个完整间接预取优化遍, 它利用深度优先搜索算法查找引用循环归纳变量的间接内存引用并为之生成合适的软件预取. 在一组内存受限的基准测试中, 自动预取遍对SW1621处理器的平均加速比达到1.16倍. 相似文献

3.

预取技术分析 总被引：1，自引：0，他引：1

隋然张铮张为华《电子技术应用》2015,41(6)

内存时延是制约现代处理器性能的主要因素之一.预取技术通过提前从内存读取将来可能使用的数据降低内存时延对处理器执行的影响,是一种被广泛应用的提升处理器性能的技术.探讨了当前主流硬件平台的预取技术,分析了现有预取技术的不足并展望了预取技术的发展趋势. 相似文献

4.

申威1621处理器上矩阵乘法优化研究

下载免费PDF全文

闫昊刘芳芳马文静陈道琨《软件学报》2023,34(7):3451-3463

稠密矩阵乘法(GEMM)是很多科学与工程计算应用中大量使用的函数,也是很多代数函数库中的基础函数,其性能高低对整个应用往往有决定性的影响.另外,因其计算密集的特点,矩阵乘法效率往往也是体现硬件平台性能的重要指标.针对国产申威1621处理器,对稠密矩阵乘法进行了系统性地优化.基于对各部分开销的分析,以及对体系结构特点与指令集的充分利用,对DGEMM函数从循环与分块方案,打包方式,核心计算函数实现,数据预取等方面进行了深入优化.此外,开发了代码生成器,为不同的输入参数生成不同版本的汇编代码和C语言代码,配合自动调优脚本,选取最佳参数.经过优化和调优,单线程DGEMM性能达到了单核浮点峰值性能的85%,16线程DGEMM性能达到16核浮点峰值性能的80%.对DGEMM函数的优化不仅提高了申威1621平台BLAS函数库性能,也为国产申威系列多核处理器上稠密数据计算优化提供了重要参考. 相似文献

5.

面向非一致Cache的任意步长预提升技术 总被引：2，自引：0，他引：2

下载免费PDF全文

吴俊杰杨学军《计算机科学与探索》2010,4(7):577-588

随着微电子工艺的不断进步,片上大容量非一致cache的研究受到广泛关注。提出了一种面向非一致cache的任意步长预提升技术,它能够优化非一致cache中的数据组织,使得即将访问的数据被放置在距离处理器较近的cachebank中,从而降低访存延迟,提升系统性能。详细介绍了任意步长预提升技术的设计,比较了预提升技术与预取技术的差别,并提出了二者的结合技术。通过对来自NPB和SPEC2000的11个基准测试程序在全系统模拟器上的实验评测,发现任意步长预提升技术能够有效减小访存延迟,在访存预测表尺寸为16和32的情况下,系统IPC分别平均增长4.17%和4.91%;在结合预提升和预取技术的情况下,系统IPC分别平均增长8.84%和11.06%。相似文献

6.

Intel64体系结构的数据预取机制及效果

董钰山李春江《计算机科学》2016,43(5):34-41

数据预取是为缓解微处理器与DRAM之间速度差异而出现的隐藏访存延迟的方法。当前Intel各系列处理器都采用多种预取机制来加速数据和代码向Cache的移动,从而提升程序的性能。通过对Intel64体系结构存储层次的分析,剖析了X86/X64体系的数据预取机制,包括硬件预取和软件预取,并且分析了编译器对软件预取机制的支持。最后测试了Intel64体系结构数据预取对科学计算程序中紧嵌套循环性能的影响,总结出了影响数据预取有效性的几个因素。此项工作对在Intel平台上进行循环数组预取优化有指导意义。相似文献

7.

一种基于线程的数据预取方法 总被引：1，自引：0，他引：1

下载免费PDF全文

欧国东张民选《计算机工程与科学》2008,30(1):119-122

多线程、多核处理器的推广受限于应用。目前,大部分应用尤其是桌面应用都是单线程程序,不能充分利用多线程处理器提供的多个现场并行执行来提高速度。使用空闲现场加速单线程应用是目前研究的一个热点,研究主要集中在提高传统串行应用存储访问的效率和分支预测的精度。在基于线程的数据预取方法中,数据预取线程是从主线程的执执行踪迹中提取的。它们使用空闲的现场,和主线程并行执行,在主线程需要数据之前把数据取到离处理器更近的存储层次。基于线程的数据预取方法能够有效地解决传统数据预取方法难以处理的诸多问题,如不规则内存访问模式。本文具体分析了应用程序中访存行为的特点,结合控制流处理,设计并验证了一种基于线程的数据预取方法TDP。模拟结果显示,使用TDP可以获得7％左右的性能提升。相似文献

8.

片上多处理器中基于步长和指针的预取 总被引：1，自引：1，他引：0

下载免费PDF全文

肖俊华冯子军章隆兵《计算机工程》2009,35(4):58-60

在对大量程序访存行为进行分析的基础上,提出基于步长和指针的预取方法。能捕获规整的数据访问模式和指针访问模式。在L2cache和内存之间采用全局历史缓存实现该预取方法。全系统模拟结果表明,该预取方法对商业应用测试程序的性能平均提高14％,对科学计算测试程序的性能平均提高34．5％。相似文献

9.

基于RAID的适度贪婪并行预取技术 总被引：1，自引：0，他引：1

吴志刚冯丹张江陵《计算机工程》2003,29(18):164-165,176

Prefetching(预取)技术是在计算机体系设计中为提高系统性能而通常采用的一项重要技术。在RAID(廉价冗余磁盘阵列）系统中采用有效的预取技术可以缩短主机读请求的平均响应时间，提高磁盘阵列的数据吞吐率。在分析了一些主要应用模型的数据请求特性的基础上，实现了一种适度贪婪的并行预取算法，实验证明该预取技术对主机的连续大量数据读请求是十分有效的。相似文献

10.

动态二进制翻译中数据预取优化研究*

罗琼程吴强《计算机应用研究》2009,26(12):4572-4576

动态优化是动态二进制翻译研究中一个十分重要的课题,数据预取优化能提高现代处理器体系结构应用程序性能。基于超级块(Superblock)的动态数据预取优化采用软件插桩方式收集应用程序的load访存延迟信息并构造Superblock;然后根据延迟信息以及Superblock数据流分析得出的寄存器定值引用关系,对延迟load指令进行预取优化。通过在龙芯DigitalBridge动态二进制翻译系统上实验验证,数据预取优化可以提高翻译后SPEC2000浮点测试程序代码的平均性能3.3%,开销远小于0.5%。相似文献

11.

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Lee Jaejin Jung Changhee Lim Daeseob Solihin Yan 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(9):1309-1324

This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended by the application and helper threads, hence preserving the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluating our scheme with nine memory-intensive applications with the memory processor in DRAM achieves an average speedup of 1.25. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching plus hardware L2 prefetching achieves an average speedup of 1.15 over the hardware L2 prefetching for the subset of applications with high L2 cache misses per cycle. 相似文献

12.

Correlation prefetching with a user-level memory thread

Solihin Y. Lee J. Torrellas J. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):563-580

This paper proposes using a user-level memory thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: The correlation table is a software data structure that resides in main memory, while the main processor only needs a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach has wide applicability, as it can effectively prefetch even for irregular applications. Finally, it is very flexible, as the prefetching algorithm can be customized by the user on an application basis. Our simulation results show that, through a new design of the correlation table and prefetching algorithm, our scheme delivers good results. Specifically, nine mostly-irregular applications show an average speedup of 1.32. Furthermore, our scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53. 相似文献

13.

Adaptive prefetching using global history buffer in multicore processors

Mahmood Naderan-Tahan Hamid Sarbazi-Azad 《The Journal of supercomputing》2014,68(3):1302-1320

Data prefetching is a well-known technique to hide the memory latency in the last-level cache (LCC). Among many prefetching methods in recent years, the Global History Buffer (GHB) proves to be efficient in terms of cost and speedup. In this paper, we show that a fixed value for detecting patterns and prefetch degree makes GHB to (1) be conservative while there are more opportunities to create new addresses and (2) generate wrong addresses in the presence of constant strides. To resolve these problems, we separate the pattern length from the prefetching degree. The result is an aggressive prefetcher that can generate more addresses with a given pattern length. Furthermore with a variable pattern length mechanism, constant strides are grouped, such that more accurate patterns are detected. As the aggressiveness of this prefetcher is relatively high, we further propose an efficient throttling procedure to reduce the negative effects of wrong prefetching using a new measure of cache pollution. This adaptive method is suitable for CMP processors where the prefetcher resides in the shared LCC. Simulation results with a mixed suite of integer and floating point benchmarks from SPEC CPU2006 show that on a single-core processor both aggressive and adaptive methods outperform existing prefetchers by 48 and 28 %, respectively, while increasing the memory traffic by 20 and 14 %, respectively. Further on an 8-core CMP with a mix of multiprogrammed workloads, the adaptive method outperforms the state-of-the-art throttling methods by 8 % in speedup, while reducing the memory traffic by 3 %. 相似文献

14.

Design and evaluation of a hierarchical decoupled architecture

Won W. Ro Stephen P. Crago Alvin M. Despain Jean-Luc Gaudiot 《The Journal of supercomputing》2006,38(3):237-259

The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2%. 相似文献

15.

Estimating Effective Prefetch Distance in Threaded Prefetching for Linked Data Structures

Yan Huang Zhi-Min Gu Jie Tang Min Cai Jianxun Zhang Ninghan Zheng 《International journal of parallel programming》2012,40(5):465-487

Helper threaded prefetching based on chip multiprocessor has been shown to reduce memory latency and improve overall system performance, and has been explored in linked data structures accesses. In our earlier work, we had proposed an effective threaded prefetching technique that balances delinquent loads between main thread and helper thread to improve effectiveness of prefetching. In this paper, we analyze memory access characteristic of specific application to estimate effective prefetch distance range for our proposed threaded prefetching technique. The effect of hardware prefetchers on the estimation is also exploited. We discuss key design issues of our proposed method and present preliminary experimental results. Our experimental evaluations indicated that the bounded range of effective prefetch distance can be determined using our method, and the optimal prefetch distances can be determined based on the estimated effective prefetch distance range by few trial runs. 相似文献

16.

Performance optimization problem in speculative prefetching

Tuah N.J. Kumar M. Venkatesh S. Das S.K. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(5):471-484

Speculative prefetching has been proposed to improve the response time of network access. Previous studies in speculative prefetching focus on building and evaluating access models for the purpose of access prediction. This paper investigates a complementary area which has been largely ignored, that of performance modeling. We analyze the performance of a prefetcher that has uncertain knowledge about future accesses. Our performance metric is the improvement in access time, for which we derive a formula in terms of resource parameters (time available and time required for prefetching) and speculative parameters (probabilities for next access). We develop a prefetch algorithm to maximize the improvement in access time. The algorithm is based on finding the best solution to a stretch knapsack problem, using theoretically proven apparatus to reduce the search space. An integration between speculative prefetching and caching is also investigated 相似文献

17.

Stream data prefetcher for the GPU memory interface

Nuno Neves Pedro Tomás Nuno Roma 《The Journal of supercomputing》2018,74(6):2314-2328

相似文献