1.

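欧阳一鸣 杨鑫 梁华国 易茂祥 高妍妍 《计算机辅助设计与图形学学报》 2015,27(3):533-541

When the network-on-chip (NoC) architecture is reused as the test access mechanism for testing cores embedded in the NoC, resource conflicts readily occur if multiple I/O ports test in parallel. To resolve resource conflicts in multi-region parallel testing, a link-conflict-free test scheduling method is proposed. First, a partitioning algorithm divides the network into multiple regions; next, a link allocation algorithm builds a path tree over the nodes to find connectable paths; finally, the connectable paths of all regions are considered globally when allocating links so that every region is connected, which avoids resource conflicts, reduces test time, and ensures test reliability. Experimental results show that the method reduces test time by 14.13% to 52.62% relative to the baseline and by up to 16.42% relative to existing algorithms, making it an effective conflict-free test scheduling method.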

2.

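Multi-level parallel processing technology for an IXP2400-based network test system  Total citations: 1; self-citations: 0; citations by others: 1

刘瑞东 吴素琴 安克 魏帅 《小型微型计算机系统》 2008,29(6):1126-1129

Multi-level parallel processing has long been an important problem in the design and application of computers and their networks. This paper presents an applied study of multiprocessor parallelization on the IXP2400, a multi-core programmable chip, and proposes a multi-level parallel technique that balances processing capability with development flexibility. Taking a "network-processor-based network test system" as the application example, it focuses on the microengine parallel scheme and a thread-level static scheduling algorithm, and validates both through WorkBench simulation and measured average maximum transmission rates for seven types of Ethernet frames. Finally, the prospects of the proposed technique are summarized and discussed.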

3.

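A low-cost, low-power core wrapper design method for SoC testing  Total citations: 1; self-citations: 1; citations by others: 0

王伟 韩银和 胡瑜 李晓维 张佑生 《计算机辅助设计与图形学学报》 2006,18(9):1397-1402

A novel parallel core wrapper design (pCWD) method for SoC testing is proposed. It exploits the overlap of scan slices to shorten the wrapper scan chains, thereby reducing test power and test time. To further reduce test time, an algorithm for partitioning test-vector scan slices and assigning their values is also proposed. Experimental results show that, for the d695 chip in the ITC2002 SoC benchmark set, applying the parallel wrapper method together with the slice partitioning and assignment algorithm reduces test time by 50% and test power by 95%.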

4.

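Joint ant colony optimization of the SoC test access mechanism and test wrappers  Total citations: 2; self-citations: 0; citations by others: 2

崔小乐 程伟 《计算机辅助设计与图形学学报》 2009,21(4)

To address test wrapper optimization for systems-on-chip (SoC) and test bus partitioning in the test access mechanism (TAM), an ant-colony-based joint SoC Wrapper/TAM optimization method is proposed. The ant colony algorithm first optimizes the test wrapper of each IP core, shortening the longest scan chain and reducing each core's test time; on this basis, it then optimizes the TAM structure, iterating toward the optimal partition of the test bus and thereby shortening SoC test time. Experiments on the ITC2002 benchmark SoC circuits show that the method solves the SoC test optimization problem effectively.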

5.

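An efficient and practical customized multicast routing algorithm for fat-tree topologies

陈淑平 李祎 何王全 漆锋滨 《计算机研究与发展》 2022,(12):2689-2707

In high-performance computing, multicast routing algorithms critically affect the performance of hardware collective operations. As system scale keeps growing, the number of multicast groups increases sharply and may exceed the number of multicast table entries supported by hardware, yet existing multicast routing algorithms either offer no solution or suffer from high time overhead and frequently changing multicast routes. To this end, the number of conflict-free multicast spanning trees in a fat tree is first studied quantitatively, and on that basis an efficient and practical customized multicast routing algorithm for fat trees is proposed (customized multicast routing for limited multicast forwarding table size, C-MR4LMS). When building a multicast tree, C-MR4LMS statically maps each multicast group to one spanning tree according to the group's MGID (multicast global identification), so the multicast tree is built quickly; when merging multicast trees, only multicast groups that use the same spanning tree need to be merged, and the routes of the merged groups do not change. Two methods for reducing multicast tree conflicts are then proposed: first, a hierarchical MGID allocation strategy, which prevents the same end node from joining multiple multicast groups with the same color; second, a mutually non-interfering job node allocation strategy, which ensures that the multicast groups of two jobs do not interfere with each other. Finally, on the ibsim simulator and the Sunway exascale ...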

6.

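Optimization of IP core test scheduling based on a genetic algorithm

邬毅松 谈恩民 《计算机系统应用》 2011,20(8):181-183

Test scheduling can substantially reduce test time and test cost. Through scheduling, as many IP cores in an SoC as possible can be tested in parallel; excessive parallel testing, however, drives power consumption too high and harms the SoC. To mitigate this problem, a genetic-algorithm-based IP core test scheduling scheme is proposed that takes the peak power limit into account and seeks the shortest test time. Simulation experiments on SoCs composed of ISCAS benchmark circuits verify that the ...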
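
The trade-off described in this abstract (more parallelism shortens the test, but the summed power of concurrently tested cores must stay under a peak limit) can be illustrated with a small cost function. The sketch below is not the authors' genetic algorithm: it only shows one plausible fitness evaluation that such a search could minimize, assuming a session-based model in which cores are packed into test sessions in a given order. The core names, test times, power figures, and the function name `schedule_length` are all hypothetical.

```python
from typing import List, Tuple

# (name, test_time, test_power): hypothetical values in arbitrary units
Core = Tuple[str, int, int]


def schedule_length(order: List[Core], peak_power: int) -> int:
    """Pack cores, in the given order, into parallel test sessions.

    Cores in one session are tested concurrently, so the session's power
    is the sum of its cores' powers and its length is the longest test
    in it. A new session starts when the next core would exceed the limit.
    """
    total_time = 0
    session_power = 0
    session_time = 0
    for name, time, power in order:
        if power > peak_power:
            raise ValueError(f"core {name} alone exceeds the power limit")
        if session_power + power > peak_power:
            # Close the current session and open a new one.
            total_time += session_time
            session_power = 0
            session_time = 0
        session_power += power
        session_time = max(session_time, time)
    return total_time + session_time


cores = [("c1", 40, 30), ("c2", 25, 50), ("c3", 60, 40), ("c4", 10, 20)]
print(schedule_length(cores, peak_power=100))  # 100

# A different ordering of the same cores gives a shorter schedule.
reordered = [cores[2], cores[0], cores[3], cores[1]]
print(schedule_length(reordered, peak_power=100))  # 85
```

Because the schedule length depends on the ordering, a permutation of cores is a natural chromosome for a search heuristic under this model; whether the paper encodes schedules this way is not stated in the abstract.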

7.

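A test scheduling optimization method for multi-layer and multi-core parallel testing of 3D chips

陈田 汪加伟 安鑫 任福继 《计算机应用》 2018,38(6):1795-1800

To address the excessive cost of testing in three-dimensional (3D) chip manufacturing, a scheduling method based on time-division multiplexing (TDM) is proposed that co-optimizes test resources between layers and between layers and cores. First, shift registers are placed on each layer of the 3D chip; by controlling the input data through the shift register groups, the test frequency is divided sensibly among layers and among the cores on the same layer, so that cores at different locations can be tested in parallel. Second, a greedy algorithm optimizes register allocation, reducing idle cycles in parallel core testing. Finally, a discrete binary particle swarm optimization (DBPSO) algorithm finds the optimal 3D stacking layout, so as to make full use of the transmission capacity of through-silicon vias (TSVs), improve parallel test efficiency, and reduce test time. Experimental results show that, under power constraints, the utilization of the whole test access mechanism (TAM) rises by 16.28% on average after optimization, while the test time of the 3D stack falls by 13.98% on average. The proposed method reduces test time and lowers test cost.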

8.

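Parallel test task scheduling based on GSPN and an artificial immune algorithm

管晗 李文海 王怡苹 《测控技术》 2017,36(12):67-70

To address the complexity and difficulty of optimizing parallel test task scheduling in ATS, a task scheduling optimization algorithm combining generalized stochastic Petri nets with an artificial immune algorithm is proposed. A generalized stochastic Petri net (GSPN) model of the parallel test system is built first, and the sets of fired transition sequences are taken as the parallel test task scheduling paths. The immune clonal selection algorithm (ICSA) is then applied to the task scheduling problem of the parallel test system, and an adaptive clonal selection operator is proposed to search for the optimal scheduling path, yielding the optimal task schedule with minimum test time as the objective. The algorithm is validated by simulation on the parallel test system of a certain type of radar receiver; the results show that, compared with the improved hybrid genetic algorithm (IHGA), it obtains the optimal task scheduling sequence conveniently and achieves higher test efficiency.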

9.

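A BIST scheme with test data shared among multiple cores

方祥圣 梁华国 沈祝财 《微机发展》 2006,16(5):214-216

A novel BIST scheme for SoC chips is proposed. Using compatibility and folding techniques, the scheme jointly optimizes, compresses, and generates the test data of multiple cores in an SoC, and supports parallel testing of those cores. It achieves a high compression ratio, above 94% on average, with a simple structure, easy decompression, and low hardware overhead; experiments show it to be a very good BIST scheme for SoC chips.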

10.

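A parallel algorithm for time-dependent Monte Carlo transport problems  Total citations: 1; self-citations: 0; citations by others: 1

刘杰 邓力 胡庆丰 袁国兴 李晓梅 《计算机学报》 2004,27(1):99-106

This paper presents a parallel algorithm for time-dependent Monte Carlo (hereafter MC) transport problems, and discusses and optimizes the loading and execution mode of the parallel program. A parallel random number generator algorithm that requires no communication in the ideal case is designed for parallel MC computation. Dynamic MC transport problems involve a large number of I/O operations; reading the residual particle data files in particular takes considerable I/O time, so three parallel I/O algorithms are proposed to address the I/O problem. Finally, performance test results for the parallel algorithm are given: compared with the serial computing time, the parallel computing time on 64 processors is reduced by a factor of 30.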
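
The abstract does not say how its communication-free parallel random number generator works. One widely used way to get that property, shown here purely as an illustration and not as the paper's algorithm, is block splitting of a linear congruential generator: each processor jumps ahead to its own disjoint stretch of a single sequence, so no messages are needed after startup. The constants below are the well-known MINSTD parameters; the block size and the function names are assumptions made for this sketch.

```python
from typing import List

M = 2**31 - 1  # modulus (MINSTD)
A = 48271      # multiplier (MINSTD)
C = 0          # increment; the jump-ahead below also works for C != 0


def lcg_next(x: int) -> int:
    """One step of the generator: x -> (A*x + C) mod M."""
    return (A * x + C) % M


def lcg_jump(x: int, k: int) -> int:
    """Advance k steps in O(log k) by repeated squaring of the affine map."""
    a_acc, c_acc = 1, 0   # identity map, accumulates the k-step map
    a_pow, c_pow = A, C   # map for 2**i steps, starting with i = 0
    while k > 0:
        if k & 1:
            # Compose the accumulated map with the current power-of-two map.
            a_acc, c_acc = (a_pow * a_acc) % M, (a_pow * c_acc + c_pow) % M
        # Square the map: applying it twice doubles the step count.
        a_pow, c_pow = (a_pow * a_pow) % M, (a_pow * c_pow + c_pow) % M
        k >>= 1
    return (a_acc * x + c_acc) % M


def stream_seeds(seed: int, num_procs: int, block: int) -> List[int]:
    """Starting state for each processor: processor p begins p*block steps in."""
    return [lcg_jump(seed, p * block) for p in range(num_procs)]


# Each of 4 hypothetical processors gets its own block of 1000 numbers.
seeds = stream_seeds(seed=12345, num_procs=4, block=1000)
```

Each processor then calls `lcg_next` locally from its own starting state. The streams stay disjoint only while no processor draws more than `block` numbers, which is the "ideal case" caveat for this particular scheme.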

11.

Efficient programming paradigm for video streaming processing on TILE64 platform

Xuan-Yi Lin Kuan-Chou Lai Kuan-Ching Li Yeh-Ching Chung 《The Journal of supercomputing》2013,65(2):823-847

Advances at an unprecedented rate in computer hardware and networking technologies have made many-core computing affordable and readily available in a matter of a few years. Nonetheless, building scalable parallel software remains a challenge for programmers. Optimization of parallel programs for a many-core platform is viewed as a multifaceted problem, where system and architectural factors should be taken into account. In this paper, we tackle this problem by implementing parallel programs with different available programming paradigms and evaluating application behavior on the TILE64 many-core platform. Specifically, we investigate a hybrid producer-write plus consumer-read shared memory programming paradigm for the implementation of a master–worker video decoder and encoder on this platform. Experimental results show that the proposed implementation achieves competitive speedup, scaling well with the number of available cores and delivering up to four times the performance of other implementations when decoding a sample 1080P video.

12.

A task migration mechanism for distributed many-core operating systems

Simon Holmbacka Mohammad Fattah Wictor Lund Amir-Mohammad Rahmani Sébastien Lafond Johan Lilius 《The Journal of supercomputing》2014,68(3):1141-1162

Spatial locality of task execution is becoming important in future hardware platforms since the number of cores is steadily increasing. The large number of cores requires an intelligent power manager, and the high chip and core density requires increased thermal awareness to avoid thermal hotspots on the chip. This paper presents a lightweight task migration mechanism designed explicitly for distributed operating systems running on many-core platforms. As the distributed OS runs one scheduler on each core, the tasks are migrated between OS kernels within the same shared memory platform. The benefits of task migration, such as performance and energy efficiency, are achieved by relocating running tasks to the most appropriate cores while keeping the overhead of the migration sufficiently low. We investigate the overhead of migrating tasks on a distributed OS running both on a bus-based platform and on a many-core NoC; with these measurements, we can predict the task migration overhead and pinpoint the emerging bottlenecks. With the presented task migration mechanism, we intend to improve the dynamism of power and performance characteristics in distributed many-core operating systems.

13.

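A multi-dimensional parallelism identification method for nested loops targeting many-core processors

李颖颖 庞建民 李雁冰 翟胜伟 《计算机应用研究》 2018,35(11)

Existing parallelism identification methods have shortcomings when applied to many-core processors: when the loop dimension selected for parallelization has few iterations, severe load imbalance may result. To address this, a multi-dimensional parallelism identification method for many-core processors is proposed. When existing methods cannot achieve good load balance, it selects multiple dimensions of a nested loop for parallelization and merges the iteration spaces of those dimensions before partitioning tasks, reducing the impact of load imbalance on parallel efficiency. The method has been implemented in the automatic parallelization system developed by the authors' research group, and in practice it improves the parallel execution efficiency of some applications on many-core processors.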
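
The effect of merging iteration spaces can be seen with a small calculation. The sketch below is an illustration under simple assumptions (a perfectly nested two-level loop, equal cost per innermost iteration, block partitioning), not the identification method of the paper; all function names are hypothetical. It compares the heaviest per-core load when only the outer dimension is parallelized with the load when both dimensions are collapsed into one iteration space first.

```python
from typing import List, Tuple


def block_sizes(n_iters: int, n_workers: int) -> List[int]:
    """Split n_iters iterations into n_workers blocks as evenly as possible."""
    base, extra = divmod(n_iters, n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]


def max_load_single_dim(n_outer: int, n_inner: int, n_workers: int) -> int:
    """Heaviest load when only the outer loop is split across workers.

    Every outer iteration carries a full inner loop of n_inner work units.
    """
    return max(block_sizes(n_outer, n_workers)) * n_inner


def max_load_collapsed(n_outer: int, n_inner: int, n_workers: int) -> int:
    """Heaviest load when the two loops are merged before partitioning."""
    return max(block_sizes(n_outer * n_inner, n_workers))


def unflatten(k: int, n_inner: int) -> Tuple[int, int]:
    """Recover the (outer, inner) indices from a collapsed index k."""
    return divmod(k, n_inner)


# A 6 x 1000 loop nest on 64 cores: the outer loop alone keeps only 6 cores busy.
print(max_load_single_dim(6, 1000, 64))  # 1000
print(max_load_collapsed(6, 1000, 64))   # 94
print(unflatten(2500, 1000))             # (2, 500)
```

In this example the collapsed partition cuts the critical-path load from 1000 to 94 work units, which is the kind of imbalance the abstract attributes to parallel dimensions with few iterations.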

14.

Towards Efficient Short-Range Pair Interaction on Sunway Many-Core Architecture

Jun-Shi Chen Hong An Wen-Ting Han Zeng Lin Xin Liu 《计算机科学技术学报》2021,36(1):123-139

The short-range pair interaction consumes most of the CPU time in molecular dynamics (MD) simulations. The inherent computation sparsity makes it challenging to achieve a high-performance kernel on emerging many-core architectures. In this paper, we present a highly efficient short-range force kernel on the Sunway, a novel many-core architecture with many unique features. The parallel efficiency of this algorithm on the Sunway many-core processor is strongly limited by poor data locality and write conflicts. To enhance data locality, we adopt a super-cluster-based neighbor list with an appropriate granularity that fits in the local memory of the computing cores. In the absence of a low-overhead locking mechanism, using a data-privatization force array is a more feasible method to avoid write conflicts, but it results in a large data reduction overhead. We adopt a dual-slice partitioning scheme for both hardware resources and computing tasks, which utilizes on-chip data communication to reduce data reduction overhead and provide load balancing. Moreover, we exploit single instruction multiple data (SIMD) parallelism and perform instruction reordering of the force kernel on this many-core processor. The experimental results show that the optimized force kernel obtains a performance speedup of 226x compared with the reference implementation and achieves 20% of the peak flop rate on the Sunway many-core processor.

15.

Master–worker model for MapReduce paradigm on the TILE64 many-core platform

《Future Generation Computer Systems》2014

MapReduce is a popular programming paradigm for processing big data. It uses the master–worker model, which is widely used on distributed and loosely coupled systems such as clusters, to solve large problems with task parallelism. With the ubiquity of many-core architectures in recent years and the foreseeable future, the many-core platform will be one of the main computing platforms for executing MapReduce programs. Therefore, it is essential to optimize MapReduce programs on many-core platforms. Optimization of parallel programs for a many-core platform is viewed as a multifaceted problem, where both system and architectural factors should be taken into account. In this paper, we look into the problem by constructing a master–worker model for the MapReduce paradigm on the TILE64 many-core platform. We investigate master share and worker share schemes for implementing a MapReduce library on the TILE64. The theoretical analysis shows that the worker share scheme is inherently better for implementing a MapReduce library on the TILE64 many-core platform.

16.

Heterogeneous parallel computing accelerated generalized likelihood uncertainty estimation (GLUE) method for fast hydrological model uncertainty analysis purpose

Kan Guangyuan He Xiaoyan Ding Liuqian Li Jiren Hong Yang Liang Ke 《Engineering with Computers》2020,36(1):75-96

The generalized likelihood uncertainty estimation (GLUE) is a famous and widely used sensitivity and uncertainty analysis method. It provides a new way to solve the “equifinality” problem encountered in hydrological model parameter estimation. In this research, we focused on the computational efficiency of the GLUE method. Inspired by emerging heterogeneous parallel computing technology, we parallelized GLUE at the algorithmic level and then implemented the parallel GLUE algorithm on a hybrid heterogeneous hardware system combining a multi-core CPU and a many-core GPU. The parallel GLUE was implemented using the OpenMP and CUDA software ecosystems for the multi-core CPU and many-core GPU systems, respectively. Applying the parallel GLUE to parameter sensitivity analysis of the Xinanjiang hydrological model showed much better computational efficiency than traditional serial computing, and its correctness was also verified. The heterogeneous parallel computing accelerated GLUE method has very good application prospects for theoretical analysis and real-world applications.

17.

Architecture-based design and optimization of genetic algorithms on multi- and many-core systems

《Future Generation Computer Systems》2014

A Genetic Algorithm (GA) is a heuristic to find exact or approximate solutions to optimization and search problems within an acceptable time. We discuss GAs from an architectural perspective, offering a general analysis of the performance of GAs on multi-core CPUs and on many-core GPUs. Based on the widely used Parallel GA (PGA) schemes, we propose the best one for each architecture. More specifically, the Asynchronous Island scheme, Island/Master–Slave Hierarchy PGA and Island/Cellular Hierarchy PGA are the best for multi-core, multi-socket multi-core and many-core architectures, respectively. Optimization approaches and rules based on a deep understanding of multi- and many-core architectures are also analyzed and proposed. Finally, the comparison of GA performance on multi-core and many-core architectures is discussed. Three real GA problems are used as benchmarks to evaluate our analysis and findings. There are three extra contributions compared to previous work. Firstly, our findings, based on a deep analysis of the architectures, can be applied to all GA problems and even to other parallel computing, not just to a particular GA problem. Secondly, our treatment of GA performance concerns not only execution speed but also solution quality, which has not been considered seriously enough. Thirdly, we propose theoretical performance and optimization models of PGA on multi-core and many-core architectures, finding a more practical result for the performance comparison of the GA on these architectures, so that the speedup presented in this work is more reasonable and is a better guide to practical decisions.

18.

Parallel algorithms for reducing derivation time of distinguishing experiments for nondeterministic finite state machines

Khaled El-Fakih Gerassimos Barlas Mustafa Ali Nina Yevtushenko 《International Journal of Parallel, Emergent and Distributed Systems》2018,33(2):197-210

Many approaches have been proposed for deriving tests from finite state machine (FSM) specifications with respect to some established coverage criteria. A fundamental core problem in FSM-based testing relates to the derivation of input sequences that can distinguish states of an FSM specification, also known as distinguishing sequences. A major effort in the construction of these sequences is based on the derivation of a successors search-tree labeled by sets of pairs of states of the given machine. We aim at reducing the time associated with such constructions through the use of state-of-the-art parallel technologies. Namely, we propose a parallel algorithm that we implement and evaluate on multicore CPUs and on many-core GPUs. We evaluate two alternative GPU implementations that use the CUDA and Thrust software platforms, and a solution based on a network of workstations. The latter features workload partitioning based on Divisible Load Theory. A rigorous set of experiments highlights the differences among the proposed implementations in terms of execution time and speedup.

19.

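Parallelization and optimization of a Saint-Venant equation solver on the Sunway many-core processor

丁哲昭 储根深 胡长军 李扬 《计算机工程与科学》 2021,43(5):820-829

The Saint-Venant equations describe the flow-routing process of unsteady open-channel flow; in large-scale hydrological simulation software, computing their numerical solution is the biggest bottleneck on run time. By analyzing the structure of the serial program and its computational hotspots, the parallelizability of the single-step simulation loop segments and instruction ordering in this compute-intensive program is exploited. A master-slave core asynchronous parallel scheme is designed for the heterogeneous many-core architecture of the Sunway TaihuLight supercomputer; the solver is ported, parallelized, and accelerated using MPI and the athread library; SIMD is used to vectorize the slave-core compute segments; and strategies such as double buffering are used to relieve the communication bottleneck. Tests show that the performance of the hotspot functions improves by more than a factor of 3 on average compared with before optimization; within a scale of one million control units, the speedup of the many-core-optimized parallel program maintains near-linear growth, and the program scales well across multiple Sunway nodes.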

20.

Analysis and performance results of computing betweenness centrality on IBM Cyclops64

Guangming Tan Vugranam C. Sreedhar Guang R. Gao 《The Journal of supercomputing》2011,56(1):1-24

This paper presents a joint study of application and architecture to improve the performance and scalability of an irregular application, computing betweenness centrality, on the many-core architecture IBM Cyclops64. The characteristics of unstructured parallelism, dynamically non-contiguous memory access, and low arithmetic intensity in betweenness centrality pose an obstacle to an efficient mapping of parallel algorithms onto such many-core architectures. By identifying several key architectural features, we propose and evaluate efficient strategies for achieving scalability on a massively multi-threaded many-core architecture. We demonstrate several optimization strategies including multi-grain parallelism, just-in-time locality with explicit memory hierarchy and non-preemptive thread execution, and fine-grain data synchronization. Compared with a conventional parallel algorithm, we get a 4X-50X improvement in performance and a 16X improvement in scalability on a 128-core IBM Cyclops64 simulator.