排序方式: 共有58条查询结果,搜索用时 0 毫秒
1.
由于主存系统的性能受到多种因素的影响,现有方法不能快速地得到可靠的分析结果,从而影响芯片质量和上市时机.为解决此问题,提出将带时序的程序员视图(PVT)模拟和排队论相结合的方法--ComPQ.首先从PVT模拟中提取与访存相关的系统级实时参数,然后将主存系统抽象为非抢占优先的M/G/1排队模型,再结合实时参数进行性能分析,得到平均访存延迟结果.由于PVT的建模和模拟代价小,从中得到的实时参数弥补了静态理论分析的不足;同时,排队论也提高了纯PVT模拟的精度.实验结果表明,ComPQ与周期精确级模拟相比平均误差为6. 38%,最后用主存系统设计空间探索的实例验证了ComPQ的有效性. 相似文献
2.
Nowadays energy-efficiency becomes the first design metric in chip development. To pursue higher energy efficiency, the processor architects should reduce or eliminate those unnecessary energy dissipations. Indirect-branch pre- diction has become a performance bottleneck, especially for the applications written in object-oriented languages. Previous hardware-based indirect-branch predictors are generally inefficient, for they either require significant hardware storage or predict indirect-branch targets slowly. In this paper, we propose an energy-efficient indirect-branch prediction technique called TAP (target address pointer) prediction. Its key idea includes two parts: utilizing specific hardware pointers to accelerate the indirect branch prediction flow and reusing the existing processor components to reduce additional hardware costs and power consumption. When fetching an indirect branch, TAP prediction first gets the specific pointers called target address pointers from the conditional branch predictor, and then uses such pointers to generate virtual addresses which index the indirect-branch targets. This technique spends similar time compared to the dedicated storage techniques without requiring additional large amounts of storage. Our evaluation shows that TAP prediction with some representative state-of-the-art branch predictors can improve performance significantly over the baseline processor. Compared with those hardware-based indirect-branch predictors, the TAP-Perceptron scheme achieves performance improvement equivalent to that provided by an 8 K-entry TTC predictor, and also outperforms the VPC predictor. 相似文献
3.
4.
5.
6.
动态翻译系统每执行一次间接转移指令均需进行一次地址转换,该过程是翻译系统性能开销的主要来源之一.无特殊硬件支持的翻译系统常采用软件预测法来降低地址转换开销,而软件预测法的预测准确率较低,制约其对翻译系统整体性能的提升.低开销关联软件预测算法(low-overhead correlated software prediction, LOCSP)可利用代码副本区分待预测指令的不同转移场景,将到达该指令的多条动态执行路径分离为多个互不重合的代码缓存副本,并为各个副本提供独立的预测链.从而在不增加动态指令数的前提下实现关联预测,显著提升软件预测的预测准确率.同时,LOCSP算法基于动态剖析的结果,仅对部分难预测的热点间接转移指令进行关联软件预测,进一步降低预测开销.实验表明,相比软件预测法,LOCSP算法可将平均预测准确率从58.9%提升至82.2%,将翻译系统的整体性能开销平均降低19.3%,最高降低41.9%,而平均静态代码数量仅增加2.4%. 相似文献
7.
8.
面向按序执行处理器开展预执行机制的设计空间探索,并对预执行机制的优化效果随Cache容量和访存延时的变化趋势进行了量化分析.实验结果表明,对于按序执行处理器,保存并复用预执行期间的有效结果和在预执行访存指令之间进行数据传递都能够有效地提升处理器性能,前者还能够有效地降低能耗开销.将两者相结合使用,在平均情况下将基础处理器的性能提升24.07%,而能耗仅增加4.93%.进一步发现,在Cache容量较大的情况下,预执行仍然能够带来较大幅度的性能提升.并且,随着访存延时的增加,预执行在提高按序执行处理器性能和能效性方面的优势都将更加显著. 相似文献
9.
在许多MIMD和SIMD计算机中,采用多级互联网络实现处理机之间的数据通信。各种互联网络的区别在于所用的开关模块和级间连接(ISC)模块以及控制方式不同,这种判别导致计算机所适用的应用的不同。提出一种基于FPID的动态可重构的多级互联网络,将开关模块相同的多种多级互联网络的功能合并,以适应多种应用算法。 相似文献
10.
Active Store Window: Enabling Far Store-Load Forwarding with Scalability and Complexity-Efficiency 下载免费PDF全文
Conventional dynamically scheduled processors often use fully associative structures named load/store queue (LSQ) to implement the value communication between loads and the older in-flight stores and to detect the store-load order violation.But this in-flight forwarding only occupies about 15% of all store-load communications,which makes the CAM-based micro-architecture the major bottleneck to scale store-load communication further.This paper presents a new micro-architecture named ASW (short for active store window).It provides a new structure named speculative active store window to implement more aggressively speculative store-load forwarding than conventional LSQ.This structure could forward the data of committed stores to the executing loads without accessing to L1 data cache,which is referred to as far forwarding in this paper.At the back-end of the pipeline,it uses in-order load re-execution filtered by the tagged SSBF (short for store sequence bloom filter) to verify the correctness of the store-load forwarding.The speculative active store window and tagged store sequence bloom filter are all set-associate structures that are more efficient and scalable than fully associative structures.Experiments show that this simpler and faster design outperforms a conventional load/store queue based design and the NoSQ design on most benchmarks by 10.22% and 8.71% respectively. 相似文献