期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

方兴陈书明《计算机工程与科学》2009,31(4)

一些数字信号处理程序存在强数据相关性,在将这些数字信号处理程序划分到多核DSP上时,需要开发细粒度并行性,而细粒度并行性的开发需要快速的核间通信机制支持。本文提出了一种新的面向多核DSP的快速核间通信机制:标记式共享寄存器文件TSRF,TSRF由所有的DSP核共享,寄存器文件中的每个寄存器同一个有效标记位关联,该标记位提供了核间通信同步支持。本文构建了集成TSRF机制的多核DSP原型的周期精确模拟器,该多核DSP原型包含的处理器核数目为4个。通过详细模拟,我们使用数据相关性较强的数字信号处理算法:IIR滤波和ADPCM编解码,对TSRF机制的性能进行了测试,与单核DSP相比,TSDB机制性能提升分别为1.8、1.2和1.9左右。相似文献

2.

TMS320C6678多核DSP的核间通信方法 总被引：8，自引：3，他引：5

吴灏肖吉阳范红旗付强《电子技术应用》2012,38(9):11-13

嵌入式应用中采用多处理系统所面临的主要难题是多处理器内核之间的通信。对Key-Stone架构TMS320C6678处理器的多核间通信机制进行研究,利用处理器间中断和核间通信寄存器,设计并实现了多核之间的通信。从系统的角度出发,设计与仿真了两种多核通信拓扑结构,并分析对比了性能。对设计多核DSP处理器的核间通信有一定的指导价值。相似文献

3.

多态并行处理器的数据通信和路由器的设计 总被引：3，自引：1，他引：2

海虎李涛韩俊刚杨婷《电子技术应用》2014,(8)

随着多核技术的发展,核间通信问题面临新的挑战,核间通信性能决定了整个多核处理器的性能。通过分析多核处理器的数据通信需求,提出了一种适用于多态并行处理器的数据通信结构。该结构采用邻接共享寄存器实现的核间近邻通信和路由器硬件加速结构实现的远程通信两种数据通信方式,远程通信机制的路由器使用输入缓存机制实现,采用经典的确定性路由算法——XY路由算法实现了路由计算,加入多播和容错技术,采用专用的仲裁机制简化了设计复杂度。这些改进降低了处理器的核间通信延迟和功耗,提高了多态并行处理器的性能。相似文献

4.

基于扩展寄存器与片上网络的运算阵列设计

张家杰欧鹏俞政于学球虞志益《计算机工程》2013,39(7)

为提高多核处理器性能,在传统硬件加速部件的基础上,提出一种新型的运算阵列设计方案.将运算阵列与多核处理器的通信端口映射在扩展寄存器地址空间上,实现阵列与多核处理器的紧密耦合.通过片上网络连接各个运算单元,实现运算阵列的灵活配置和高度共享.在实验系统上实现1 024点快速傅里叶变换和H.264解码器,结果表明,与纯软件实现相比,该方案能使处理器性能和功耗都有所改善. 相似文献

5.

通用非对称多核方案设计 总被引：1，自引：0，他引：1

陈彬《计算机系统应用》2021,30(7):277-282

多核处理器是目前处理器发展的主流方向, 但硬实时保证方面存在诸多挑战. 通过研究分析实时应用需求和多核处理器的应用现状, 提出一种基于通用处理器的非对称多核方案. 重点讨论了方案的软件总体设计、共享资源管理、非对称多核模式下的从核镜像加载、启动和核间通信的设计. 采用通用非对称多核方案研制的低压保护测控装置, 其现场运行情况表明方案满足电力二次设备的实时性能要求. 相似文献

6.

一种适应多核处理器核间通信机制的设计

李静梅王军锋张岐《电脑学习》2011,1(4)

随着单芯片上集成处理器内核数量的增加,在支持多核处理器的应用程序方面,核间通信变得更加重要.通过分析多核运行任务特点,根据处理核上运行任务功能的不同,将处理核分成两类:控制核和计算核.根据对核的分类,提出了一种新的核间通信模型,该模型提供了三种不同的通信通道.运用这三条通道,把应用程序的I/O部分从计算核迁移到控制核来提高多核的利用率,实验结果表明该方式有效提高核间协作以及核间通信的效率,提升处理器的利用率. 相似文献

7.

基于剖析信息和关键路径长度的软件扇出树生成算法

曾斌安虹王莉《计算机科学》2010,37(3):248-252

开发利用ILP(Inst ruction-level Parallelism)是现代高性能处理器取得高性能的关键要素之一。宽发射的超标量处理器、超长指令字处理器和数据流处理器只有在并行执行多条相邻的指令时才能获得较高的性能。数据流处理器的一个关键问题是如何把指令的计算结果高效地播送给目标指令而不用读写集中式寄存器文件。对于每条目标数大于指令所能编码的目标数的指令,编译程序都要插入一棵由MOV指令构成的软件扇出树来把计算结果播送给多条目标指令。为了暴露更多的ILP给硬件执行基底,提出了一种改进的软件扇出树生成算法,本算法根据目标指令的执行概率大小以及目标指令到该指令所在块的出口的关键路径长度来计算目标指令的权值,然后对各个叶子的优先权值进行排序,再根据优先权值的顺序来构造一棵软件扇出树,以便把指令的计算结果播送给多条目标指令。实验结果发现,本算法相对于传统的软件扇出树生成算法其性能有较大的提高。相似文献

8.

基于投机执行的两级退休机制

段凌霄孟建熠李晓明《计算机应用研究》2015,32(4)

针对超标量处理器中指令长时间占用重排序缓存引起指令退休缓慢的问题,提出了一种基于投机执行的两级退休机制.该方案根据指令有无异常和预测错误风险将指令分为有风险指令和无风险指令,对重排序缓存进行轻量化改进,只有存在异常和预测风险的指令才允许进重排序缓存,在确认风险消除后将指令快速退休.重命名寄存器从重排序缓存分离,负责寄存器重命名和结果乱序回写.实验结果表明,在硬件资源相同的情况下,基于该方案的处理器比传统的按序退休处理器的性能平均提高28.8％以上. 相似文献

9.

X86平台上Open64软件流水的设计与实现

刘家兵徐云《计算机工程》2013,(9)

由于缺乏相关硬件功能,Open64编译器的软件流水技术没有面向X86处理器的版本。为此,提出一种适用于X86平台的Open64软件流水实现框架。利用软件实现处理器的部分硬件行为,通过循环过滤方法剔除不适用的循环。针对缺乏循环寄存器文件的问题,设计寄存器分配算法达到使用通用寄存器的目的,并添加模变量扩展模块以保证执行的正确性。实验结果表明,与循环展开方案相比,该框架可使系统平均获得9%的性能提升。相似文献

10.

分簇式VLIW密码专用处理器的编译器后端优化研究

吴艾青李伟别梦妮南龙梅陈韬《小型微型计算机系统》2023,(10):2346-2352

密码专用处理器常采用分簇式超长指令字(Very Long Instruction Word, VLIW)架构,其性能的发挥依赖于编译器的实现.当前对于通用VLIW架构的编译后端优化方案,在密码专用处理器上都有一定的不适应性.为此,本文提出了一种面向密码专用处理器的、同时进行簇指派、指令调度和寄存器分配的编译器后端优化方法.构造“定值-引用”链,求解变量的候选寄存器类型集合交集,确定其寄存器类型;实时评估可用资源,进行基于优先级的指令选择和基于平衡寄存器压力的簇指派;改进线性扫描算法,基于变量的“待引用次数”列表进行实时的寄存器分配.实验结果表明,本方法能够提升生成代码的性能,且算法是非启发式的,减小了编译所需的时间. 相似文献

11.

Compiler-assisted power optimization for clustered VLIW architectures

Rahul Nagpal Y.N. Srikant 《Parallel Computing》2011,37(1):42-59

Clustered VLIW architectures solve the scalability problem associated with flat VLIW architectures by partitioning the register file and connecting only a subset of the functional units to a register file. However, inter-cluster communication in clustered architectures leads to increased leakage in functional components and a high number of register accesses. In this paper, we propose compiler scheduling algorithms targeting two previously ignored power-hungry components in clustered VLIW architectures, viz., instruction decoder and register file.We consider a split decoder design and propose a new energy-aware instruction scheduling algorithm that provides 14.5% and 17.3% benefit in the decoder power consumption on an average over a purely hardware based scheme in the context of 2-clustered and 4-clustered VLIW machines. In the case of register files, we propose two new scheduling algorithms that exploit limited register snooping capability to reduce extra register file accesses. The proposed algorithms reduce register file power consumption on an average by 6.85% and 11.90% (10.39% and 17.78%), respectively, along with performance improvement of 4.81% and 5.34% (9.39% and 11.16%) over a traditional greedy algorithm for 2-clustered (4-clustered) VLIW machine. 相似文献

12.

On reducing load/store latencies of cache accesses

Yuan-Shin Hwang Jia-Jhe Li 《Journal of Systems Architecture》2010,56(1):1-15

Effective address calculations for load and store instructions need to compete for ALU with other instructions and hence extra latencies might be incurred to data cache accesses. Fast address generation is an approach proposed to reduce cache access latencies. This paper presents a fast address generator that can eliminate most of the effective address computations by storing computed effective addresses of previous load/store instructions in a dummy register file. Experimental results show that this fast address generator can reduce effective address computations of load and store instructions by about 74% on average for SPECint2000 benchmarks and cut the execution times by 8.5%. Furthermore, when multiple dummy register files are deployed, this fast address generator eliminates over 90% of effective address computations of load and store instructions and improves the average execution times by 9.3%. 相似文献

13.

Register port complexity reduction in wide-issue processors with selective instruction execution

《Microprocessors and Microsystems》2007,31(1):51-62

As the width of the processor grows, complexity of a register file (RF) with multiple ports grows more than linearly and leads to larger register access time and higher power consumption. Analysis of SPEC2000 programs reveals that only a small portion of the instructions in a program (16% in integer and 38% in floating-point) require both the source operands. Also, when the programs are executed in an 8-wide processor only a very few (two or less) two-source instructions are executed in a cycle for a significant portion of time (more than 98% for integer and 93% for floating-point), leading to a significant under-utilization of register port bandwidth. In this paper, we propose a novel technique to significantly reduce the number of register ports, with a very minor modification in the select logic to issue only a limited number of two-source instructions each cycle. This is achieved with no significant impact on processor’s overall performance. The novelty of the technique is that it is easy to implement and succeeds in reducing the access time, power, and area of the register file, without aggravating these factors in any other logic on the chip. With this technique in an 8-wide processor, as compared to a conventional 128-entry RF with 16 read ports, for integer programs a register file can be designed with 11 or 10 read ports as these configurations result in instructions per cycle (IPC) degradation of only 0.929% and 3.38%, respectively. This significantly low degradation in IPC is achieved while reducing the register access time by 9% and 12%, respectively, and reducing power by 35% and 50%, respectively. For FP programs, a register file can be designed with 12 read ports (1.16% IPC loss, 8% less access time, and 28% less power) or with 11 read ports (3.5% IPC loss, 9% less access time, and 35% less power). The paper analyzes the performance of all the possible flavors of the proposed technique for register file in both 4-wide and 8-wide processors, and presents a choice of the performance and register port complexity combination to the designer. 相似文献

14.

Software-directed register deallocation for simultaneousmultithreaded processors

Lo J.L. Parekh S.S. Eggers S.J. Levy H.M. Tullsen D.M. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(9):922-933

This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better interthread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deal location. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to: 1) free registers immediately upon their last use, and 2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs 相似文献

15.

The design space of register renaming techniques

Sima D. 《Micro, IEEE》2000,20(5):70-83

Register renaming is a technique to remove false data dependencie-write after read (WAR) and write after write (WAW)-that occur in straight line code between register operands of subsequent instructions. By eliminating related precedence requirements in the execution sequence of the instructions, renaming increases the average number of instructions that are available for parallel execution per cycle. This results in increased IPC (number of instructions executed per cycle). The identification and exploration of the design space of register-renaming lead to a comprehensive understanding of this intricate technique. As this article shows, the design space of register renaming is spanned by four main dimensions: the scope of register renaming, the layout of the rename buffers, the method of register mapping, and the rename rate. Relevant aspects of the design space give rise to eight basic alternatives for register-renaming. In addition, the kind of operand fetch policy significantly affects how the processor carries out the rename process, which duplicates the eight basic alternatives to 16 possible implementation schemes. The article indicates which basic implementation scheme is used in relevant superscalar processors. As register renaming is usually implemented in conjunction with shelving, the underlying microarchitecture is assumed to employ shelving 相似文献

16.

Register spilling via transformed interference equations for PAC DSP architecture

Chung‐Ju Wu Chia‐Han Lu Jenq Kuen Lee 《Concurrency and Computation》2014,26(3):779-799

Digital signal processors (DSPs) with very long instruction word (VLIW) data‐path architectures are increasingly being deployed on embedded devices for multimedia processing applications. To reduce the power consumption and design cost of VLIW DSP processors, distributed register files and multibank register architectures are being adopted to reduce the number of read and write ports associated with register files, which presents new challenges for devising compiler optimization schemes. This paper addresses the issues of reducing the spill code for a VLIW DSP with distributed register files. Spill code produced by register allocation is traditionally handled by memory spills, but the multibank register‐file architecture provides the opportunity to spill‐out register values onto different register banks. We present a conceptual framework based on the universal and the proxy interference graphs to model the live ranges of registers for spilling codes to different register banks. Heuristic algorithms are then developed on the basis of this concept. By heuristically estimating the register pressure for each register file, we treat different register banks as optional spilling locations in addition to traditional spilling to memory. Experiments were performed on the parallel architecture core VLIW DSP with distributed register files by incorporating our proposed optimization schemes into an Open64‐based compiler. The experimental results show that our approach can improve the performances on average for DSPStone and MiBench benchmarks with spilling cases by 7.1% and 21.6%, respectively, compared with the one always handling spill code in memory. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

17.

On the Boosting of Instruction Scheduling by Renaming

Wang L. Yang Ted C. 《The Journal of supercomputing》2001,19(2):173-197

Speculative execution is the execution of instructions before it is known whether these instructions should be executed. In the speculative execution for instruction level parallelism (ILP) processors, the concept of shadow register provides a hardware solution to maintain semantics of a program from the pollution of boosted instructions that are incorrectly predicted. In a recent study, Chang and Lai proposed a special register file based on shadow register, named conjugate register file (CRF), to support multilevel boosting in speculative execution. They also proposed a scheduling heuristic named frequency-driven scheduling to incorporate with CRF for execution. However, the ability of boosting is still constrained since the concept of register pair will force the results produced speculatively be stored in dedicated locations. Moreover, when the parallelism potential increases to tens through the advancement of hardware techniques, the heavy demand on register usage and the complexity of register file may well become a serious bottleneck for the exploitation of ILP.In this paper, the algorithm of frequency-driven scheduling is modified by replacing the function of hardware CRF with the technique of variable renaming during compilation. The new scheduling technique, named LESS, can exploit the parallelism efficiently with limited number of registers. Moreover, since the technique can benefit ILP without any special hardware support, it can be incorporated with any other ILP architecture without changing its instruction set architecture (ISA).Simulation results show that the performance achievable by LESS is better than other existing methods. For example, under the ILP model with an issue rate of 8, the speculative execution can achieve an increase of 34% in parallelism, as compared to 18% in CRF scheme. 相似文献

18.

V-PIM中低功耗分体多端口向量寄存器文件设计

温璞杨学军《计算机工程与应用》2006,42(4):24-26

向量处理逻辑与DRAM相结合形成向量PI(MV-PIM)结构,可充分利用PIM结构的高带宽特性。向量寄存器文件是V-PIM的关键资源,其端口数和容量大小直接影响着向量处理器的频率和功耗。设计一个低功耗、高速、多端口的向量寄存器文件是向量处理器数据通路设计的重要任务之一。文章描述了采用多个端口数较少的寄存器体通过交叉互连构成多端口向量寄存器文件的设计方案,实验表明多体交叉结构的向量寄存器文件在功耗、面积等方面比单一的多端口结构具有明显优势。相似文献

19.

基于Hadoop的访问热点副本迁移技术

冯钧王纯朱康康魏童童《计算机与现代化》2016,(1):108

提出一种云环境下的访问热点负载均衡模型：基于节点的吞吐量与响应时间等主要参考指标,构建节点负载判定模块;文件在HDFS存储的过程中,将文件对应的数据块编号与存储路径相结合,设计存放在数据节点中的数据块到文件目录映射表;提出一种基于节点负载以及节点的存储空间的迁移源节点和目标节点选择方法;基于机架感知的机制,制定一种动态副本迁移方案。最后利用执行器下发指令给相应的数据节点,执行具体的迁移任务以及完善迁移后副本因子等参数信息的调整。通过迅速扩散副本的方式,来增加热点文件的副本数量,使得系统能够对外提供更大的吞吐量,缩短系统反应时间。  相似文献

20.

Achieving spilling-friendly register file assignment for highly distributed register files

Chia-Han Lu Wen-Li Shih Chung-Ju Wu Jenq Kuen Lee 《The Journal of supercomputing》2014,69(3):1342-1362

Distributed register file architectures divide registers into multiple sets, and it follows that the register files could be small. This can increase the frequency of spilling if register allocation encounters high register pressure, which will reduce the performance. That is, there is extra spilling to handle the pressure and results in performance decline. One of the factors that can produce high pressure is improper register file assignment. Register file assignment is a phase that assigns virtual registers to suitable register files and avoids communication costs. To reduce spilling in the phase of register file assignment, this paper proposes the SPIlling-FRiendly (SPIFR) method, which attempts to improve spilling by estimating the spilling cost from two aspects: assignment and spilling. We used MiBench and EEMBC benchmarks in experiments performed with the Open64-based compiler and a cycle-accurate instruction set simulator. The MiBench experimental results show that the SPIFR method improved the average cycle counts of the benchmarks by 6.0 %. For the kernels of the benchmarks, the method improved the average cycle counts by 20.5 % and reduced the average spilling ratio by 19.0 %. The results on the EEMBC benchmarks indicate that the method improved the cycle counts with the average speedup of 7.0 %, the speedup average of the kernel functions was 11.3 %, and the average reduction in the spilling ratio was 11.7 %, respectively. We conclude that the SPIFR method can reduce spilling and increase the performance. 相似文献