Similar Articles
20 similar articles found
1.
Emerging video-mining applications such as image and video retrieval and indexing will require real-time processing capabilities. A many-core architecture with 64 small, in-order, general-purpose cores as the accelerator can help meet these performance requirements. The key video-mining modules achieve parallel speedups of 19× to 62× on 64 cores and gain an additional 2.3× speedup from 128-bit SIMD vectorization on the proposed architecture.

2.
IEEE Micro, 2008, 28(3): 26-41
CPUs consume too much power. Modern complex cores sometimes waste power on functions that are not useful for the code they run. In particular, operating system kernels do not benefit from many power-consuming features intended to improve application performance. We advocate asymmetric single-ISA multicore systems, in which some cores are optimized to run OS code at greatly improved energy efficiency.

3.
张珩  崔强  侯朋朋  武延军  赵琛 《软件学报》2020,31(4):1225-1239
In complex network theory, core decomposition is one of the most fundamental techniques for measuring the "importance" of network nodes and analyzing core subgraphs. It is widely used in applications such as user behavior analysis in social networks, visualization of complex networks, and static analysis of large-scale software code. As the scale and complexity of complex-network graph data grow, existing work, which designs parallel core-decomposition algorithms for multi-core CPU environments, can no longer meet the high-performance computing demands of large data volumes because of the limited number of CPU cores and limited memory bandwidth, which seriously constrains analysis applications on complex networks. General-purpose GPUs offer highly parallel computation with more than 10,000 threads and memory bandwidth above 100 GB/s, and have been widely used for efficient parallel analysis of large-scale graph data, such as breadth-first traversal and shortest-path algorithms. To achieve more efficient core decomposition, this paper proposes two parallel strategies for core decomposition of complex networks on the GPU platform. The first, RLCore, is based on graph traversal: it exploits the GPU's highly concurrent computing power to traverse the network graph bottom-up and iteratively assign each node its core level. The second, ESCore, is based on local convergence: each node repeatedly aggregates the current values of its neighbors and updates its own value until convergence. Compared with RLCore, ESCore greatly reduces the synchronization overhead caused by multiple GPU threads updating the same node during traversal, while its number of iterations depends on the convergence rate. Experimental results on real network graph data show that the proposed two…
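As a point of reference for the quantity that RLCore and ESCore compute in parallel, the sketch below shows the classic sequential peeling algorithm for core decomposition; the bucket-based bookkeeping and the `adj` adjacency-dict layout are assumptions of this illustration, not the paper's GPU kernels.

```python
from collections import defaultdict

def core_decomposition(adj):
    """Sequential peeling baseline: adj maps node -> set of neighbours."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    buckets = defaultdict(set)               # current degree -> nodes
    for v, d in degree.items():
        buckets[d].add(v)
    core, removed, d = {}, set(), 0
    while len(core) < len(adj):
        while not buckets[d]:                # advance to the lowest non-empty bucket
            d += 1
        v = buckets[d].pop()                 # peel a minimum-degree node
        core[v] = d                          # its core number is its degree at removal
        removed.add(v)
        for u in adj[v]:
            if u not in removed and degree[u] > d:
                buckets[degree[u]].discard(u)
                degree[u] -= 1               # neighbour loses one edge, never dropping below d
                buckets[degree[u]].add(u)
    return core
```

RLCore parallelizes this bottom-up peeling across GPU threads, while ESCore instead iterates a per-node update over neighbour values until convergence.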

4.
In this paper, a processor allocation mechanism for NoC-based chip multiprocessors is presented. Processor allocation is a well-known problem in parallel computer systems and aims to allocate the processing nodes of a multiprocessor to different tasks of an input application at run time. The proposed mechanism targets optimizing the on-chip communication power/latency and relies on two procedures: processor allocation and task migration. Allocation is done by a fast heuristic algorithm that allocates free processors to the tasks of an incoming application when a new application begins execution. The task-migration algorithm is activated when an application completes execution and frees up the allocated resources. Task migration uses the recently deallocated processors and tries to rearrange the current tasks in order to find a better mapping for them. The proposed method can also capture the dynamic traffic pattern of the network and perform task migration based on the current communication demands of the tasks. Consequently, task migration adapts the task mapping to the current network status. We adopt a non-contiguous processor allocation strategy in which the tasks of the input application are allowed to be mapped onto disjoint regions (groups of processors) of the network. We then use virtual point-to-point circuits, a state-of-the-art fast on-chip connection designed for networks-on-chip, to virtually connect the disjoint regions and make the communication latency/power closer to the values offered by contiguous allocation schemes. The experimental results show considerable improvement over existing allocation mechanisms.
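The abstract does not spell out the heuristic itself; as an illustration of the kind of communication-aware placement it describes, the hypothetical greedy sketch below places the most communication-heavy tasks first on the free mesh nodes that minimize hop-weighted traffic to the tasks already placed. The names and the cost model are assumptions, not the paper's algorithm.

```python
def manhattan(a, b):
    """Hop distance between two mesh coordinates (x, y)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def allocate(free_nodes, tasks, traffic):
    """free_nodes: list of (x, y) idle processors (len >= len(tasks)).
    tasks: task identifiers; traffic: dict (task_a, task_b) -> volume."""
    weight = {t: 0 for t in tasks}
    for (a, b), vol in traffic.items():
        if a in weight:
            weight[a] += vol
        if b in weight:
            weight[b] += vol
    placement, available = {}, set(free_nodes)
    # Place the heaviest communicators first so they grab the best positions.
    for t in sorted(tasks, key=lambda t: -weight[t]):
        best, best_cost = None, None
        for node in available:
            cost = sum((traffic.get((t, u), 0) + traffic.get((u, t), 0))
                       * manhattan(node, pos)
                       for u, pos in placement.items())
            if best_cost is None or cost < best_cost:
                best, best_cost = node, cost
        placement[t] = best
        available.discard(best)
    return placement
```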

5.
The adoption of multithreaded processors is limited by applications: most current applications, especially desktop applications, are single-threaded and cannot exploit the multiple hardware contexts a multithreaded processor provides to run in parallel for higher speed. Using idle contexts to accelerate single-threaded applications is therefore an active research topic, focused mainly on improving the efficiency of memory accesses and the accuracy of branch prediction for conventional sequential applications. In thread-based data prefetching (TDP), data-prefetching threads are extracted from the execution trace of the main thread; they run on idle contexts in parallel with the main thread. Because the prefetching threads contain only the instructions related to prefetching, they run faster than the main thread and can bring data into memory levels closer to the processor before the main thread needs it. Thread-based data prefetching can effectively handle many problems that are difficult for conventional prefetching, such as irregular memory access patterns. This paper studies the influence of control dependences on TDP and analyzes a data prefetching method that uses wrong-path speculation: branch instructions are inserted into the prefetching threads and used to control their execution. The study finds that, in some cases, continuing to execute a prefetching thread even after its control speculation has been proven wrong yields better prefetching results. Simulation results show that using wrong-path speculation provides a 5% performance improvement.

6.
Accelerating ray tracing with intermediate surfaces
This paper presents a new ray tracing method that improves the efficiency with which rays find the patches they intersect. It generates a number of large, regular intermediate surfaces in the scene and builds, for each point on an intermediate surface, a field that records which scene patch a ray arriving at that point from each direction will intersect. During ray tracing, a ray can then easily find the intermediate surface it crosses and, by looking up the field stored on that surface, obtain the patch it intersects. Compared with existing methods, the new approach accelerates not only primary and shadow rays but also secondary rays such as reflection and refraction rays. It can be conveniently implemented on the GPU and handles dynamic scenes effectively.

7.
Taking the design of coherence protocols for tile-based many-core processors as its main thread, this paper surveys recent domestic and international research on cache coherence for many-core processors. It introduces how different NUCA organizations affect coherence protocols, analyzes and compares the characteristics and problems of several traditional directory-based coherence protocols, and summarizes the design ideas and features of several recent coherence protocols targeted at many-core architectures. Finally, it points out several key design directions for building cache coherence protocols that are both application-adaptive and scalable.

8.
赵洁  张恺航  董振宁  梁俊杰  徐克付 《计算机科学》2017,44(1):226-234, 258
This paper proposes a new incremental core-computation algorithm. It first introduces the concept of rough equivalence classes on top of global equivalence classes, analyzes their properties, and studies core computation and attribute reduction under rough equivalence classes. It then examines in depth the intrinsic relationship between three types of rough equivalence classes and core attributes, and designs an equivalent method for judging core attributes under rough equivalence classes together with an incremental core-computation method; this method can identify multiple non-core attributes in a single incremental step, which enables a bidirectional pruning strategy. The computation domain can be shrunk from both the attribute side and the object side, so not all attributes and objects need to be traversed, and the pruning strategy remains effective even when the core is empty. A multi-pass hashing algorithm for incremental attribute partitioning is designed to carry out this incremental computation, and on top of it the complete incremental core-computation algorithm is given. Finally, the algorithm is validated from multiple angles on 20 UCI decision tables and on three massive and ultra-high-dimensional data sets. The experimental results demonstrate the effectiveness and efficiency of the proposed algorithm; it is especially suitable for large decision tables and outperforms existing algorithms in most cases. The algorithm can further serve as the basis for new reduction and optimization algorithms.
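For readers unfamiliar with core computation in rough set theory, the sketch below shows the standard non-incremental baseline: an attribute belongs to the core exactly when dropping it shrinks the positive region. It is only a baseline under the usual definitions, not the paper's rough-equivalence-class or incremental method, and the decision-table layout (rows as dicts keyed by attribute name) is an assumption.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group object indices by their values on the given condition attributes."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return blocks.values()

def positive_region_size(rows, cond_attrs, decision):
    """Count objects lying in condition classes with a unique decision value."""
    size = 0
    for block in partition(rows, cond_attrs):
        if len({rows[i][decision] for i in block}) == 1:
            size += len(block)
    return size

def core_attributes(rows, cond_attrs, decision):
    """An attribute is in the core iff removing it shrinks the positive region."""
    full = positive_region_size(rows, cond_attrs, decision)
    return [a for a in cond_attrs
            if positive_region_size(rows, [b for b in cond_attrs if b != a],
                                    decision) < full]
```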

9.
Mining sequential patterns using a condensed sequential-pattern base
Traditional sequential-pattern mining methods perform well when mining databases composed of short frequent sequential patterns. However, when mining long sequential patterns or when the support threshold is very low, these methods may run into inherent difficulty, because the number of frequent sequential patterns generated is often too large. In many cases users only need the long patterns that cover many short ones. Moreover, in many applications it is sufficient to obtain approximate supports of the generated frequent sequential patterns rather than their exact supports. This paper introduces the concept of a condensed base of frequent sequential patterns that keeps the support error within a guaranteed bound, and develops an algorithm for mining such a condensed base. Experimental results show that computing condensed bases of frequent sequential patterns is promising.
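As a sketch of what such a condensed base looks like, the hypothetical greedy selection below keeps a pattern only if no already-kept super-sequence approximates its support within the error bound delta; the data layout and the selection order are assumptions, not the paper's algorithm.

```python
def is_subsequence(p, q):
    """True if pattern p occurs as a (not necessarily contiguous) subsequence of q."""
    it = iter(q)
    return all(item in it for item in p)

def condensed_base(patterns, delta):
    """patterns: dict mapping pattern (tuple of items) -> support in [0, 1].
    Returns a subset whose members approximate every pattern's support,
    within delta, via some covering super-sequence."""
    base = []
    # Longer, lower-support patterns first, so they can cover many short ones.
    for p in sorted(patterns, key=lambda x: (-len(x), patterns[x])):
        covered = any(is_subsequence(p, q) and patterns[p] - patterns[q] <= delta
                      for q in base)
        if not covered:
            base.append(p)
    return base
```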

10.
The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to the off-chip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has eliminated the effectiveness of replacement policies because banks operate independently of each other, and hence their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.

11.
郭成 《信息与电脑》2005,(11):38-41
Since the beginning of the 21st century, with the revolutionary development of remote data acquisition, processing, and transmission technology and of computer communication technology, logistics information can be transmitted promptly and rapidly on a global scale and can be stored and processed safely and reliably. Logistics generates information flows, and information flows control logistics; informatization has become the soul of modern logistics development. Without informatization, …

12.
A low-power-oriented shared cache partitioning scheme for multi-core processors
As multi-core processors evolve, on-chip cache capacity keeps growing, and its share of total chip power consumption keeps rising. How to reduce cache power consumption has therefore become a hot topic in cache design. This paper studies a low-power-oriented shared cache partitioning technique for multi-core processors (LP-CP). It proposes a cache partitioning framework in which miss-rate monitors added to the processor dynamically collect each program's miss rates; a low-power-oriented shared cache partitioning algorithm then computes a partitioning strategy that stays within a given performance-loss threshold. On a dual-core system sharing an L2 cache, we evaluated the low-power-oriented partitioning with multiprogrammed workloads: with performance-loss thresholds of 1% and 3%, the cache shutdown ratios of the system reached 20.8% and 36.9%, respectively.
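The abstract only outlines LP-CP; the hypothetical sketch below illustrates one way a partitioning algorithm of this flavour could use per-core miss curves: assign ways greedily by marginal miss reduction, then power off the trailing ways while total misses stay within the performance-loss threshold. The miss-curve interface and the greedy rule are assumptions of the illustration.

```python
def partition_ways(miss_curves, total_ways, loss_threshold):
    """miss_curves[c][w]: estimated misses of core c when it owns w ways
    (assumed non-increasing in w). Returns (per-core way counts, ways powered off)."""
    n_cores = len(miss_curves)
    alloc = [0] * n_cores
    snapshots = []                        # (allocation copy, total misses)
    for _ in range(total_ways):
        # Give the next way to the core with the largest marginal miss reduction.
        best = max(range(n_cores),
                   key=lambda c: miss_curves[c][alloc[c]] - miss_curves[c][alloc[c] + 1])
        alloc[best] += 1
        total = sum(miss_curves[c][alloc[c]] for c in range(n_cores))
        snapshots.append((alloc.copy(), total))
    baseline = snapshots[-1][1]           # misses with every way powered on
    for ways_on, (assignment, total) in enumerate(snapshots, start=1):
        if total <= baseline * (1 + loss_threshold):
            return assignment, total_ways - ways_on
```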

13.
Cache partitioning is an important way to resolve access conflicts in shared caches, but existing cache partitioning techniques suffer from high overhead and from the difficulty of deciding when to repartition. This paper proposes an application-oriented cache partitioning framework (ACP). Its advantage is that it can use programmer-supplied information about the bounds of an application's outermost loops to obtain better miss-rate information for the application, so the partitioning algorithm achieves higher accuracy, the partitioning frequency is reduced, and system performance improves. Experimental results show that ACP performs better than traditional fixed-interval cache partitioning.

14.
Among the various memory consistency models, the sequential consistency (SC) model is the most intuitive and enables programmers to reason about their parallel programs the best. Nevertheless, processor designers often choose to support relaxed memory consistency models because the weaker ordering constraints imposed by such models allow for more instructions to be reordered and enable higher performance. Programs running on machines supporting weaker consistency models can be transformed into ones in which SC is enforced. The compiler does this by computing a minimal set of memory access pairs whose ordering automatically guarantees SC. To ensure that these memory access pairs are not reordered, memory fences are inserted. Unfortunately, insertion of such memory fences can significantly slow down the program. We observe that the ordering of the minimal set of memory accesses that the compiler strives to enforce is typically already enforced in the normal course of program execution. A study we conducted on programs with compiler-inserted memory fences shows that only 8% of the executed instances of the memory fences are really necessary to ensure SC. Motivated by this study we propose the conditional fence mechanism, known as C-Fence, that utilizes compiler information to decide dynamically whether there is a need to stall at each fence, only stalling when necessary. Our experiments with SPLASH-2 benchmarks show that, with C-Fences and a centralized active table, programs can be transformed to enforce SC incurring only 12% slowdown, as opposed to 43% slowdown using normal fence instructions. Our approach requires very little hardware support (<350 bytes of on-chip storage) and it avoids the use of speculation and its associated costs. Furthermore, to ameliorate the contention in the centralized active table arising from the increasing number of processors, we also design a distributed active table, which further improves the performance of C-Fence for a larger number of processors.

15.
Accelerating Differential Evolution Using an Adaptive Local Search
We propose a crossover-based adaptive local search (LS) operation for enhancing the performance of the standard differential evolution (DE) algorithm. Incorporating LS heuristics is often very useful in designing an effective evolutionary algorithm for global optimization. However, determining a single LS length that can serve a wide range of problems is a critical issue. We present an LS technique that solves this problem by adaptively adjusting the length of the search, using a hill-climbing heuristic. The emphasis of this paper is to demonstrate how this LS scheme can improve the performance of DE. Experimenting with a wide range of benchmark functions, we show that the proposed new version of DE, with the adaptive LS, performs better than, or at least comparably to, the classic DE algorithm. Performance comparisons with other LS heuristics and with some other well-known evolutionary algorithms from the literature are also presented.
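A minimal sketch of the overall idea follows, assuming DE/rand/1/bin plus a simple hill-climbing rule that grows the recombination length while improvements keep coming and shrinks it otherwise; the parameter names and the exact adaptation rule are assumptions, not the paper's formulation.

```python
import random

def de_with_adaptive_ls(f, bounds, pop_size=30, F=0.5, CR=0.9, generations=200):
    """Minimize f over the box 'bounds' = [(lo, hi), ...]."""
    dim = len(bounds)
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [f(x) for x in pop]

    for _ in range(generations):
        # Standard DE/rand/1/bin generation step.
        for i in range(pop_size):
            a, b, c = random.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = random.randrange(dim)
            trial = [pop[a][k] + F * (pop[b][k] - pop[c][k])
                     if (random.random() < CR or k == j_rand) else pop[i][k]
                     for k in range(dim)]
            trial = [min(max(v, lo), hi) for v, (lo, hi) in zip(trial, bounds)]
            f_trial = f(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial

        # Crossover-based local search on the best individual: copy a random
        # subset of coordinates from a donor; expand the subset while it keeps
        # improving, shrink it otherwise (a crude stand-in for adaptive length).
        best = min(range(pop_size), key=lambda i: fit[i])
        length, budget = 1, 20
        while 1 <= length <= dim and budget > 0:
            budget -= 1
            donor = pop[random.randrange(pop_size)]
            candidate = pop[best][:]
            for k in random.sample(range(dim), length):
                candidate[k] = donor[k]
            f_cand = f(candidate)
            if f_cand < fit[best]:
                pop[best], fit[best] = candidate, f_cand
                length += 1
            else:
                length -= 1

    best = min(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]
```

For example, `de_with_adaptive_ls(lambda x: sum(v * v for v in x), [(-5, 5)] * 10)` minimizes a 10-dimensional sphere function.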

16.
Current integration scales make it possible to design chip multiprocessors with a large number of cores interconnected by a NoC. Unfortunately, they also bring process variation, posing a new burden on processor manufacturers. Regarding the NoC, variability causes the delays of links and routers to deviate from those initially established at design time. In this paper we analyze how variability affects the NoC by applying a new variability model to 100 instances of an 8 × 8 mesh NoC synthesized using 45 nm technology. We also show that GALS-based NoCs present communication bottlenecks due to the slower components of the network, which cause congestion and thus reduce performance. This performance reduction ultimately affects the applications being executed on the CMP because they may be mapped to slower areas of the chip. In this paper we show that using a mapping algorithm that considers variability data may improve application execution time by up to 50%.

17.
18.
As circuit scale keeps growing, pushing test technology to higher abstraction levels and improving test efficiency has become an inevitable requirement of digital system testing. This paper studies a test-vector generation method for sequential circuits based on ASM charts. The method constructs an ASM chart from the functional description of the circuit, converts it into a state diagram, and uses finite-state-machine theory to construct test vectors; the correctness of the test vectors is then verified through software simulation and hardware measurement. The test vectors generated by this method reflect the system's functionality and achieve high fault coverage.

19.
With the wide use of sequence data in practice, evaluating the quality of sequence data has become a popular research problem in academia, industry, and many other fields. The mainstream approach evaluates data quality with a probabilistic suffix tree model, but it has difficulty handling large-scale data. To address this, this paper proposes STALK (sequential data quality evaluation with Spark), a Spark-based algorithm for sequence data quality evaluation, together with an improved pruning strategy to raise efficiency. Specifically, on the Spark platform, a generative model is built efficiently from large-scale sequence data, and the data quality of query sequences is evaluated quickly against that model. Experiments on real sequence data sets verify the effectiveness, efficiency, and scalability of STALK.
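The paper builds its generative model with a probabilistic suffix tree on Spark; as a rough single-machine stand-in for the scoring step, the sketch below fits a variable-order Markov model from counts and scores a query sequence by its smoothed average log-likelihood. The function names, the smoothing, and the back-off rule are assumptions of this illustration, not STALK itself.

```python
import math
from collections import defaultdict

def build_counts(sequences, order=2):
    """Count (context, next-symbol) pairs for all context lengths up to 'order'."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, sym in enumerate(seq):
            for L in range(order + 1):
                if i - L < 0:
                    break
                counts[tuple(seq[i - L:i])][sym] += 1
    return counts

def quality_score(counts, query, order=2, alpha=1.0):
    """Average log-likelihood of 'query'; lower values suggest lower quality."""
    vocab_size = max(len({s for ctx in counts for s in counts[ctx]}), 1)
    total = 0.0
    for i, sym in enumerate(query):
        # Back off to the longest context seen in the training data.
        for L in range(min(order, i), -1, -1):
            context = tuple(query[i - L:i])
            if context in counts:
                break
        ctx_counts = counts.get(context, {})
        denom = sum(ctx_counts.values()) + alpha * vocab_size
        total += math.log((ctx_counts.get(sym, 0) + alpha) / denom)
    return total / max(len(query), 1)
```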

20.
A distinguishing sequential pattern (DSP) is a sequence that occurs frequently in a target class of sequences but infrequently in the non-target class. Distinguishing sequential patterns describe the differences between two sets of sequences and have wide applications, such as building sequence classifiers, identifying biological features in DNA sequences, and analyzing the behavior of specific groups of people. Compared with mining DSPs that satisfy a support threshold, mining the top-k DSPs by contrast avoids the need for users to set an inappropriate support threshold and is therefore easier to use. However, existing top-k DSP mining algorithms have difficulty handling large-scale sequence data. This paper therefore designs a Spark-based parallel algorithm for mining top-k distinguishing sequential patterns, called SP-kDSP-Miner. To improve its efficiency, a candidate-pattern generation strategy, several pruning strategies, and a parallel method for computing candidate contrast scores are designed for the characteristics of the Spark framework. Experiments on real and synthetic data sets verify the effectiveness, efficiency, and scalability of SP-kDSP-Miner.
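To illustrate only the parallel contrast computation (not SP-kDSP-Miner's candidate generation or pruning), here is a PySpark sketch that scores a given set of candidate patterns by the difference between their support in the target and non-target collections and returns the top k; the names and the contrast definition are assumptions of the sketch.

```python
from pyspark import SparkContext

def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)

def topk_contrast(sc, target_seqs, other_seqs, candidates, k):
    """candidates: list of patterns, each a tuple of items."""
    target = sc.parallelize(target_seqs)
    other = sc.parallelize(other_seqs)
    n_t, n_o = target.count(), other.count()
    cand = sc.broadcast(candidates)

    def supports(rdd):
        # For every sequence, emit the candidates it contains, then count them.
        return (rdd.flatMap(lambda s: [(p, 1) for p in cand.value
                                       if is_subsequence(p, s)])
                   .reduceByKey(lambda a, b: a + b)
                   .collectAsMap())

    sup_t, sup_o = supports(target), supports(other)
    scored = [(p, sup_t.get(p, 0) / n_t - sup_o.get(p, 0) / n_o)
              for p in candidates]
    return sorted(scored, key=lambda x: -x[1])[:k]
```

In the actual algorithm, candidate generation and contrast-based pruning would also run inside Spark transformations rather than over a precomputed candidate list.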
