Similar literature
 Found 20 similar articles; search took 31 ms
1.
With the variety of computer architectures available today, it is often difficult to determine which type of architecture will provide the best performance on a given application program. In fact, one type of architecture may be well suited to executing one section of a program while another architecture may be better suited to executing another section of the same program. One potentially promising approach for exploiting the best features of different computer architectures is to partition an application program to execute simultaneously on two or more types of machines interconnected with a high-speed communication network. A fundamental difficulty with heterogeneous computing, however, is determining how to partition the application program across the interconnected machines. The goal of this paper is to show how a programmer or a compiler can use a model of a heterogeneous system to determine the machine on which each subtask should be executed. This technique is illustrated with a simple model that relates the relative performance of two heterogeneous machines to the communication time required to transfer partial results across their interconnection network. Experiments with a Connection Machine CM-200 demonstrate how to apply this model to partition two different application programs across the sequential front-end processor and the parallel back-end array.
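The partitioning rule described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's actual model: it assumes each subtask can be judged independently, that its data starts on the front end, and that the three timing estimates are already known. The names `Subtask`, `choose_machine`, and `partition` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    t_front: float  # assumed run time on the sequential front end (seconds)
    t_back: float   # assumed run time on the parallel back end (seconds)
    t_comm: float   # assumed time to move operands and partial results (seconds)


def choose_machine(task: Subtask) -> str:
    """Offload only when back-end time plus communication beats the front end."""
    return "back" if task.t_back + task.t_comm < task.t_front else "front"


def partition(tasks):
    """Map each subtask name to the machine chosen for it."""
    return {t.name: choose_machine(t) for t in tasks}
```

Under these assumptions, a subtask with a large speedup on the back end is still kept on the front end whenever the transfer time exceeds the time saved.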

2.
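A user-level simulator for a tiled multi-core processor   Cited by: 1 (self-citations: 0, citations by others: 1)
黄琨, 马可, 曾洪博, 张戈, 章隆兵. Journal of Software (《软件学报》), 2008, 19(4): 1069-1080
With the growth of on-chip transistor resources and the increase in interconnect wire delay, tiled multi-core microprocessors have become a new direction in multi-core processor design. To support in-depth architectural research and design-space exploration for this new type of processor, a user-level multi-core performance simulator for tiled multi-core processors was designed and implemented. Built on the Godson-2 (龙芯2号) uniprocessor core, the simulator fully models a directory-based cache coherence protocol and a store-and-forward on-chip interconnection network, and captures in detail the timing behavior caused by out-of-order handling of requests and replies and by conflicts among requests. It can evaluate the key performance metrics of a multi-core processor by running a variety of serial or parallel workloads, providing a fast, flexible, and efficient research platform for multi-core architecture design.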

3.
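As a key component of high-performance computing, matrix multiplication is a typical application that is both compute-intensive and memory-intensive, so optimizing its performance matters greatly for general-purpose processors. To improve matrix multiplication performance, this paper proposes a performance model that predicts the execution time of matrix multiplication on a general-purpose processor. The model captures the relationship between execution time and architectural parameters such as the functional units, memory bandwidth, and number of registers, and can guide architectural optimization to balance compute and memory-access capability and raise execution speed. Based on the model, the paper derives the theoretical lower bounds that the number of registers and the memory bandwidth should satisfy in an optimized general-purpose processor architecture. The model was validated on the Godson-3B processor platform; experimental results show that the prediction accuracy for matrix multiplication execution time exceeds 95%. Based on the model, the paper also proposes a method for optimizing the Godson-3B architecture that reduces matrix multiplication execution time by about 50%.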

4.
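A short-time Fourier transform (STFT) processor for time-frequency analysis is proposed. To overcome the shortcomings of existing discrete STFT algorithms and architectures, a new architecture based on an array of fast Fourier transform (FFT) units is presented. To meet practical needs, a new SDF (Single-path Delay Feedback) FFT unit with high frequency-domain resolution is proposed; compared with a conventional SDF FFT unit, both the depth of the feedback FIFO and the number of butterfly units are reduced. Combined with exploiting the symmetry of the window function and appropriately merging hardware resources, this lowers the processor's power consumption by 20% relative to the original design. Implemented in SMIC (中芯国际) 0.18 µm technology, the system clock can reach 200 MHz, so the processor can perform real-time time-frequency analysis of signals sampled at the same rate.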

5.
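A clustered integer linear programming (ILP) formal model is proposed that uses the functional dependences among instructions to address the exponential search-space problem in automatic instruction-set customization for application-specific instruction-set processors. On this basis, corresponding instruction-customization strategies are proposed for the front end and the back end respectively. Experimental results show that the method effectively achieves automatic design of an application-specific instruction set and optimizes the computational performance of the resulting processor.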

6.
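许彤, 张仕健, 吕涛. Computer Engineering (《计算机工程》), 2010, 36(20): 19-21
To improve the efficiency of processor-core simulation models, behavioral modeling of a virtual processor model of the Godson-1 (龙芯1号) processor based on the SimpleScalar framework is proposed; the average IPC error is 2.3% and the speed reaches 1,000,000 instructions per second. A bus functional model built on a controllable random event mechanism provides active stimulus generation and on-chip interconnect verification for system-on-chip (SoC) design. Experimental results show that the method is generally applicable to simulation modeling of processor IP and can be integrated seamlessly into the SoC flow.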

7.
8.
Radio frequency identification (RFID) tag delegation enables a centralized back-end server to delegate the right to identify and authenticate a tag to specified readers. This can be used to reduce the computational load on the server side and to address latency and dependency on network connectivity. In this study, we describe a basic RFID delegation architecture and then, under this model, investigate the security of an RFID delegation protocol, Song Mitchell delegation (SMD), recently proposed by Song and Mitchell. We point out security flaws that have gone unnoticed in the design and present two attacks against it: a tag impersonation attack and a desynchronization attack. We also discover a subtle flaw by which a delegated entity can keep its delegation rights after they expire, which violates the security policy of the scheme. More precisely, we show that the protocol remains vulnerable to the attacks above even if the back-end server ends the delegation right of a delegated reader and updates the secrets of the delegated tags. To counteract these flaws, we improve the SMD protocol with a stateful variant so that it provides the claimed security properties.

9.
10.
11.
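As a computer with a new architecture, the ternary optical computer (TOC) features a large and easily extended number of processor bits, reconfigurable bit functions, and bits that can be grouped for use, giving it great potential for fast processing of massive or complex data; however, exploration of its application development is still at an early stage. To broaden its range of applications, a design for TOCSim, a simulator based on the TOC architecture, is proposed. The design simulates the operation of a TOC in software, and a prototype is implemented on an ordinary PC. TOCSim mainly simulates the processor reconfiguration strategy, the processor bit allocation strategy, the decoding of intermediate results, and the rendering of computation results. Comparing the simulator's output images with the optical result images produced on a TOC prototype shows that the design is correct and feasible.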

12.
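To address the trade-off between accuracy and performance in mixed-signal (analog-digital) circuit simulation, and the shortage of mainstream chips and circuit modules in simulation tools when simulating industrial-grade complex mixed-signal circuits, a glue-mode architecture for a mixed-signal simulation platform is proposed. Based on this architecture, a Simulink-based mixed-signal circuit simulation platform is designed and implemented; by embedding mainstream simulation engines for digital and analog circuits, it gains ample support for mainstream chips and circuit modules. A simulation control scheme that incorporates a topological sorting algorithm is designed, enabling process-oriented, modular mixed-signal simulation of industrial-grade complex circuits. Finally, the effectiveness of the platform is verified by simulating a typical mixed-signal circuit that can be logically decomposed along its timing.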
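How a topological sort can drive simulation control is sketched below. The abstract does not describe the platform's actual scheme, so this only shows the generic idea (Kahn's algorithm): each block is evaluated after every block whose output it consumes. The function name `simulation_order` and the block names in the usage note are hypothetical.

```python
from collections import deque


def simulation_order(deps):
    """Return an evaluation order for blocks.

    deps maps each block to the list of blocks whose outputs it consumes.
    Raises ValueError if the dependencies contain a cycle.
    """
    blocks = set(deps) | {src for inputs in deps.values() for src in inputs}
    indegree = {b: 0 for b in blocks}
    consumers = {b: [] for b in blocks}
    for block, inputs in deps.items():
        for src in inputs:
            indegree[block] += 1
            consumers[src].append(block)

    # Blocks with no pending inputs can be simulated immediately.
    ready = deque(sorted(b for b in blocks if indegree[b] == 0))
    order = []
    while ready:
        block = ready.popleft()
        order.append(block)
        for consumer in sorted(consumers[block]):
            indegree[consumer] -= 1
            if indegree[consumer] == 0:
                ready.append(consumer)

    if len(order) != len(blocks):
        raise ValueError("dependency cycle: blocks cannot be ordered")
    return order
```

For example, `simulation_order({"adc": ["sensor"], "fsm": ["adc"], "dac": ["fsm"]})` places `sensor` first and `dac` last. A feedback loop between blocks has no such order, which is why the sketch rejects cycles rather than guessing one.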

13.
Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research of the past decade has revealed several fundamental problems. Synchronization at the instruction level, token matching, coloring, and re-labeling operations have a negative impact on performance by significantly increasing the number of non-compute "overhead" cycles. Recently, many novel hybrid von-Neumann data driven machines have been proposed to alleviate some of these problems. The major objective has been to reduce or eliminate unnecessary synchronization costs through simplified operand matching schemes and increased task granularity. Moreover, the results from recent studies quantifying locality suggest sufficient spatial and temporal locality is present in dataflow execution to merit its exploitation. In this paper we present a data structure for exploiting locality in a data driven environment: the vector cell. A vector cell consists of a number of fixed length chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness to data structure access. We describe the semantics of the model, processor architecture and instruction set as well as a Sisal to dataflow vectorizing compiler back-end. The vector cell model is evaluated by comparing its performance to those of both a classical fine-grain dataflow processor employing I-structures and a conventional pipelined vector processor. Results indicate that the model is surprisingly resilient to long memory and communication latencies and is able to dynamically exploit the underlying parallelism across multiple processing elements at run time.
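The vector cell can be pictured with a small software analogue. This is a sketch under stated assumptions, not the paper's hardware design: it assumes a chunk is written once and as a whole, and that a read of an absent chunk is simply reported to the caller (a real data-driven machine would presumably defer it). The names `VectorCell` and `ChunkNotReady` are hypothetical.

```python
class ChunkNotReady(Exception):
    """Raised when a read targets a chunk whose presence bit is still clear."""


class VectorCell:
    def __init__(self, n_chunks, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [None] * n_chunks
        self.present = [False] * n_chunks  # one presence bit per chunk

    def write_chunk(self, index, values):
        # Intra-chunk strictness: all elements of a chunk arrive together.
        if len(values) != self.chunk_size:
            raise ValueError("a chunk must be written as a whole")
        if self.present[index]:
            raise ValueError("chunk already written")  # single assignment (assumed)
        self.chunks[index] = list(values)
        self.present[index] = True

    def read(self, i):
        # Inter-chunk non-strictness: only the chunk holding element i must be present.
        chunk, offset = divmod(i, self.chunk_size)
        if not self.present[chunk]:
            raise ChunkNotReady(f"chunk {chunk} not yet produced")
        return self.chunks[chunk][offset]
```

With two chunks of four elements, writing chunk 1 makes elements 4 to 7 readable while a read of element 0 still fails, which is the behavior one presence bit per chunk is meant to give.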

14.
This paper presents a scalable and partitionable asynchronous bus arbiter for use with chip multiprocessors and its corresponding pre-layout simulation results using VHDL. The arbiter exploits the advantage of a concurrency control instruction (Brk) provided by the micro-threaded microprocessor model to set the priority processor and move the circulated arbitration token to the most likely processor to issue the create instruction. This mechanism provides latency hiding during token circulation by decoupling the micro-threaded processor from the ring’s timing. The arbiter provides a very simple arbitration mechanism and can be used for chip multiprocessor arbitration purposes.

15.
The evolution of robust speech recognition systems that maintain a high level of recognition accuracy in difficult and dynamically varying acoustical environments is becoming increasingly important as speech recognition technology becomes a more integral part of mobile applications. In a distributed speech recognition (DSR) architecture, the recogniser's front-end is located in the terminal and is connected over a data network to a remote back-end recognition server. The terminal performs the feature parameter extraction, that is, the front-end of the speech recognition system. These features are transmitted over a data channel to the remote back-end recogniser. DSR provides particular benefits for mobile-device applications, such as improved recognition performance compared to using the voice channel and ubiquitous access from different networks with a guaranteed level of recognition performance. A feature extraction algorithm integrated into the DSR system is required to operate in real time and at the lowest possible computational cost. In this paper, two innovative front-end processing techniques for noise-robust speech recognition are presented and compared: time-domain based frame-attenuation (TD-FrAtt) and frequency-domain based frame-attenuation (FD-FrAtt). These techniques include different forms of frame-attenuation, improvement of spectral subtraction based on minimum statistics, as well as a mel-cepstrum feature extraction procedure. Tests are performed using the Slovenian SpeechDat II fixed telephone database and the Aurora 2 database together with the HTK speech recognition toolkit. The results obtained are especially encouraging for mobile DSR systems with limited memory and processing power.

16.
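As computer application domains keep expanding, streaming-media applications and scientific computing are becoming important workloads for microprocessors. Streaming-media applications are characterized by abundant data parallelism, little data reuse, and a large amount of computation per memory access. Because of bandwidth limits, conventional microprocessor architectures struggle to serve these characteristics. The X processor is a stream processor; targeting the characteristics of stream applications, it adopts a new three-level stream memory hierarchy consisting of local register files, a stream register file, and off-chip memory, which effectively solves the bandwidth problem. This paper tests the design on a simulation platform with two methods (RS code and test programs), verifying that the stream memory hierarchy effectively relieves the bandwidth bottleneck and confirming the correctness of the design.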

17.
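The MIPS processor is an important representative of reduced instruction set computer (RISC) processors and is commonly used in embedded systems. In recent years, as MIPS processor performance has improved substantially, its use has gradually extended to high-performance servers. The Godson-3 (龙芯3号) processor is a typical representative of the MIPS architecture. In current server research, multi-core technology is one important technical indicator and virtualization is another. Although virtualization has developed rapidly, few results exist for virtualization on the Godson-3 processor. Multi-core virtualization on the Godson-3 faces many problems: the complexity of the virtual multi-core architecture and the difficulty of emulating inter-core communication both make it hard. This paper analyzes the hardware structure of the multi-core Godson-3 processor and the inter-core interrupt communication of the physical cores, and on that basis introduces the key techniques for multi-core virtualization on the Godson-3. The discussion covers the overall architecture of multi-core processor virtualization, the design of the virtual multi-core structure, and inter-core communication among virtual cores. Experimental results show that the multi-core virtualization method performs well on the Godson-3 processor.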

18.
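Simultaneous multithreading (SMT) is a technique that lets multiple independent threads issue multiple instructions each cycle; it exploits both the instruction-level and the thread-level parallelism that may be available and raises the utilization of limited resources. Starting from "龙腾R2", a 32-bit superscalar processor developed in-house by the Aviation Microelectronics Center of Northwestern Polytechnical University, this article introduces SMT under the constraints of leaving the size of the internal structures essentially unchanged, adding no execution functional units, and making only the necessary modifications. Performance is analyzed by simulating different thread counts and various thread combinations. Although some factors limit the gain, introducing SMT still yields a performance increase of up to about 50%.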

19.
The development of multi-core systems (MCS) has considerably improved the existing technologies in the field of computer architecture. An MCS comprises several processors that are heterogeneous in resource capacity, working environment, topology, and so on. Existing multi-core technology opens additional research opportunities for energy minimization through effective task scheduling, yet task scheduling in multi-core systems remains underexplored. This paper presents a new hybrid genetic algorithm (GA) with a krill herd (KH) based energy-efficient scheduling technique for multi-core systems (GAKH-SMCS). The goal of the GAKH-SMCS technique is to schedule tasks so as to achieve faster completion time and minimum energy dissipation. The GAKH-SMCS model uses a multi-objective fitness function with four parameters (makespan, processor utilization, speedup, and energy consumption) to schedule tasks. The performance of the GAKH-SMCS model has been validated on two datasets, a random dataset and a benchmark dataset. The experimental outcome confirmed the effectiveness of the GAKH-SMCS model in terms of makespan, processor utilization, speedup, and energy consumption. The overall simulation results show that the GAKH-SMCS model achieves energy efficiency through an optimal task scheduling process in MCS.
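A fitness function over the four quantities named in this abstract might look like the sketch below. The abstract gives no formula, so everything here is an assumption made for illustration: tasks are independent, idle cores draw no power, speedup is measured against the best single core, and the four terms are combined as a plain weighted sum. The names `schedule_metrics` and `fitness` are hypothetical.

```python
def schedule_metrics(assignment, exec_time, power):
    """Compute (makespan, utilization, speedup, energy) for one schedule.

    assignment[i] is the core chosen for task i,
    exec_time[i][c] is the run time of task i on core c,
    power[c] is the active power of core c.
    """
    n_cores = len(power)
    busy = [0.0] * n_cores
    for task, core in enumerate(assignment):
        busy[core] += exec_time[task][core]
    makespan = max(busy)
    utilization = sum(busy) / (n_cores * makespan)
    # Baseline: all tasks run back to back on the single fastest core.
    best_serial = min(sum(row[c] for row in exec_time) for c in range(n_cores))
    speedup = best_serial / makespan
    energy = sum(b * p for b, p in zip(busy, power))
    return makespan, utilization, speedup, energy


def fitness(assignment, exec_time, power, weights=(1.0, 1.0, 1.0, 1.0)):
    """Higher is better: reward utilization and speedup, penalize makespan and energy."""
    makespan, utilization, speedup, energy = schedule_metrics(assignment, exec_time, power)
    w_m, w_u, w_s, w_e = weights
    return w_u * utilization + w_s * speedup - w_m * makespan - w_e * energy
```

A GA or KH search would then rank candidate `assignment` vectors by this score. The weights carry the units and the trade-off between the objectives, so in practice they would need tuning.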

20.
In this contribution the concept of functional-level power analysis (FLPA) for power estimation of programmable processors is extended in order to model embedded as well as heterogeneous processor architectures featuring different embedded processor cores. The basic FLPA approach separates the processor architecture into functional blocks such as the processing unit, clock network, and internal memory. The power consumption of these blocks is described by parameterized arithmetic models. A parser-based automated analysis of assembler code computes the input parameters of the arithmetic functions, such as the achieved degree of parallelism or the kind and number of memory accesses. To model an embedded general-purpose processor (here, an ARM940T) with good accuracy, the basic FLPA concept had to be extended to a so-called hybrid functional-level and instruction-level (FLPA/ILPA) model. To show the applicability of this approach, a heterogeneous processor architecture (OMAP5912) featuring an ARM926EJ-S core and a C55x DSP core has also been modeled using the hybrid FLPA/ILPA technique. The approach is demonstrated and evaluated on a variety of digital signal processing tasks ranging from basic filters to complete audio decoders and classical benchmark suites. Estimated power figures for the inspected tasks are compared to physically measured values for both processor architectures. The resulting maximum estimation error is 9% for the ARM940T and less than 4% for the OMAP5912.
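The structure of an FLPA estimate, a sum of per-block parameterized models, can be sketched as below. The block names follow the abstract, but the model shapes, the parameter names, and every coefficient are placeholders invented for illustration; none of them are measured values for any processor.

```python
def estimate_power(params, block_models):
    """Sum the power predicted by each functional block's arithmetic model."""
    return sum(model(params) for model in block_models.values())


# Placeholder models in arbitrary power units; the coefficients are NOT measured data.
block_models = {
    "clock_network": lambda p: 10.0,
    "processing_unit": lambda p: 5.0 + 20.0 * p["parallelism"],
    "internal_memory": lambda p: 15.0 * p["mem_access_rate"],
}

# In FLPA these parameters would come from parsing the assembler code.
params = {"parallelism": 0.5, "mem_access_rate": 0.25}
total = estimate_power(params, block_models)
```

Keeping one model per block means a block can be recalibrated or replaced without touching the others. The hybrid FLPA/ILPA variant mentioned in the abstract would presumably refine this further at the instruction level.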
