Similar Literature
19 similar documents retrieved; search time: 765 ms
1.
Inter-core communication methods for the TMS320C6678 multi-core DSP   Cited in total: 8 (self-citations: 3, citations by others: 5)
The main difficulty in adopting multiprocessing systems for embedded applications is communication between processor cores. This work studies the inter-core communication mechanism of the KeyStone-architecture TMS320C6678 processor and, using inter-processor interrupts and inter-core communication registers, designs and implements communication among the cores. From a system perspective, two multi-core communication topologies are designed and simulated, and their performance is analyzed and compared. The results offer guidance for designing inter-core communication on multi-core DSPs.
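Below is a minimal C sketch of the notification path this abstract describes: one core raises an inter-processor interrupt to another by writing that core's IPC generation register, setting a source flag so the receiver knows which core signalled. The register address and bit layout are assumptions based on typical KeyStone documentation, not taken from the paper; verify them against the device data manual before use.

```c
/* Hedged sketch: notifying another core on a KeyStone C6678 via the IPC
 * generation registers. Addresses and bit positions are assumptions. */
#include <stdint.h>

#define IPCGR_BASE   0x02620240u              /* assumed: IPCGR0 in the chip-config space */
#define IPCGR(core)  (*(volatile uint32_t *)(IPCGR_BASE + 4u * (core)))

/* Raise an inter-processor interrupt to 'dst_core' and record 'src_core'
 * in the source-flag bits so the receiver knows who sent it. */
void ipc_notify(unsigned dst_core, unsigned src_core)
{
    /* bit 0 = generate interrupt; upper bits = per-source flags (assumed layout) */
    IPCGR(dst_core) = (1u << (src_core + 4)) | 1u;
}
```

On the receiving side, the interrupt service routine would typically read and clear the corresponding IPC acknowledgment register before handling the message.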

2.
肖瑞瑾  权衡  张家杰  尤凯迪  英彦  虞志益 《计算机工程》2012,38(15):283-285,289
To address the limited number of usable registers in a processor, an extended register file design for multi-core processors is proposed. The hardware uses a multi-bank structure and maps communication ports into the extended register address space, enabling a register-addressed inter-core communication mechanism; a hybrid software configuration scheme combining low-level instructions with high-level wrappers is introduced, and the software compilation flow is improved accordingly. Evaluation shows that the scheme doubles the number of usable register files, reduces the number of inter-core communication instructions by 50%, and improves system throughput.

3.
HR-2 (Huarui-2) is a high-performance digital signal processor for radar applications developed under the national "Core Electronics, High-end Chips and Basic Software" major project. To provide a simulator for HR-2 DSP core development that supports performance evaluation, optimization guidance, and early exploration of multi-core architectures, an efficient cycle-accurate software simulator modeling method is proposed. The processor's pipeline structure, dynamic instruction execution, and branch prediction mechanism are first analyzed; its pipeline, instruction set, and register renaming are then designed and implemented in the LISA language using the Processor Designer (PD) tool, yielding a cycle-accurate simulator model of the HR-2 DSP core. Experimental results show that the simulator's cycle-accuracy error is within 10%, enabling high-precision processor performance evaluation and architecture exploration under various configurations.

4.
In software design for multi-core processors, the inter-core communication mechanism is the key: effective and well-designed inter-core communication is what lets the processor's parallel processing capability be exploited. Interrupts and polling are the traditional means of inter-core communication, but they suffer from lost interrupts and low polling efficiency. To address this, multi-core processors provide a dedicated hardware semaphore mechanism for inter-core communication. Taking the multi-core DSP TMS320C6678 as an example, this paper describes the working principle and usage of hardware semaphores and the structure and configuration of the semaphore module, and gives an example of communication between two cores.
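As a rough illustration of the direct-access usage pattern the abstract refers to, here is a hedged C sketch of acquiring and releasing one hardware semaphore through memory-mapped register access. The module base address, register offsets, and bit meanings below are assumptions for illustration only; the semaphore module user's guide gives the authoritative layout.

```c
/* Hedged sketch: direct-mode use of a hardware semaphore to protect a
 * shared resource between two cores. All addresses/offsets are assumed. */
#include <stdint.h>

#define SEM_BASE       0x02640000u                      /* assumed module base */
#define SEM_DIRECT(n)  (*(volatile uint32_t *)(SEM_BASE + 0x100u + 4u * (n)))
#define SEM_FREE_BIT   0x1u                             /* assumed: read returns 1 on successful acquire */

void sem_acquire(unsigned n)
{
    /* Direct mode: a read attempts the acquire; spin until it succeeds. */
    while ((SEM_DIRECT(n) & SEM_FREE_BIT) == 0u)
        ;
}

void sem_release(unsigned n)
{
    SEM_DIRECT(n) = 1u;    /* writing 1 releases the semaphore (assumed) */
}
```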

5.
Data communication and router design for a polymorphic parallel processor   Cited in total: 3 (self-citations: 1, citations by others: 2)
As multi-core technology advances, inter-core communication faces new challenges, and its performance determines the performance of the whole multi-core processor. Based on an analysis of the data communication requirements of multi-core processors, a data communication structure suited to a polymorphic parallel processor is proposed. It provides two communication modes: nearest-neighbor inter-core communication through adjacency shared registers, and long-distance communication through a hardware-accelerated router. The router is implemented with input buffering, performs route computation with the classic deterministic XY routing algorithm, adds multicast and fault-tolerance support, and uses a dedicated arbitration mechanism to simplify design complexity. These improvements reduce the processor's inter-core communication latency and power consumption and improve the performance of the polymorphic parallel processor.
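The route computation mentioned above follows the standard dimension-ordered XY scheme: a packet is first forwarded along the X dimension until its column matches the destination, then along Y. A minimal C sketch (port names and mesh coordinates are illustrative, not from the paper):

```c
/* Dimension-ordered (XY) route computation for a 2D mesh router. */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;    /* correct X first */
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;   /* then correct Y */
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;                      /* arrived: deliver to the local core */
}
```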

6.
A small-capacity, tightly coupled fast shared data pool for multi-core DSPs   Cited in total: 7 (self-citations: 0, citations by others: 7)
Drawing on the structural characteristics of on-chip scratchpad memory (SPM), this paper proposes a new small-capacity, tightly coupled shared memory structure for heterogeneous multi-core DSPs: the Fast Shared Data Pool (FSDP). The FSDP sits at the same level of the memory hierarchy as the L1 cache and can be accessed directly by memory instructions. It uses a multi-bank parallel structure, an interleaved access pattern, and an automatic synchronization mechanism based on hardware semaphores, supporting parallel access by multiple DSP cores and fast inter-core data exchange; transferring a single datum between two cores takes only 4 cycles. A simulation model of the FSDP is built, and an RTL-level design implementation and analysis are carried out. Verification with a variety of typical benchmarks shows that the FSDP is highly efficient for fine-grained inter-core data sharing: it improves program performance by 37% compared with the similar VS-SPM structure, and, combined with a conventional shared data cache, it improves the performance of a heterogeneous multi-core DSP by 13%.

7.
The heterogeneous dual-core ARM+DSP chip offers strong task management, human-machine interaction, and data processing capabilities, providing a new architecture for embedded image processing. To spare developers from designing low-level drivers, TI developed the SysLink driver for inter-core communication, including the Notify and MessageQ protocols; the MessageQ-based protocol is commonly used for inter-core image transfer, but it consumes considerable resources and has high latency. This paper studies the multi-core communication mechanisms of TI's DaVinci-architecture OMAP-L138 processor and uses inter-core interrupt registers together with a shared-memory queue storage mechanism for data exchange, implementing a high-throughput image processing platform.

8.
Research on on-chip synchronization mechanisms and evaluation methods for many-core processors   Cited in total: 1 (self-citations: 0, citations by others: 1)
Synchronization is key to correct execution and coordinated communication on on-chip multi-core/many-core processors, and its efficiency matters greatly for processor performance. For on-chip many-core architectures, two coarse-grained synchronization mechanisms and one fine-grained mechanism are proposed and implemented: a mechanism supported by dedicated on-chip hardware, a primitive-based on-chip mutual-exclusion mechanism, and a fine-grained mechanism based on full/empty flag bits. Evaluation criteria and methods for the coarse-grained mechanisms are proposed, and quantitative evaluation programs are designed. Using the simulator of the homogeneous on-chip many-core processor Godson-T and a commercial AMD Opteron multi-core processor as platforms, the proposed hardware-supported mechanism is evaluated against the primitive-based one. The results show that hardware support significantly improves the performance of on-chip synchronization; in the traditional primitive-based mechanisms, most of the performance loss is waiting time caused by load imbalance and by serialized operations at synchronization points.
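For context, a primitive-based (software) coarse-grained synchronization mechanism of the kind evaluated above can be as simple as a sense-reversing barrier built on atomic operations; the spin loop below is exactly where the reported waiting time from load imbalance and serialization accumulates. A minimal C11 sketch with illustrative names, not code from the paper:

```c
/* Sense-reversing software barrier built on C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  remaining;   /* threads still to arrive in this episode */
    atomic_bool sense;       /* flips each time the barrier opens */
    int         nthreads;
} sw_barrier_t;

void barrier_init(sw_barrier_t *b, int nthreads)
{
    atomic_init(&b->remaining, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

void barrier_wait(sw_barrier_t *b)
{
    bool my_sense = !atomic_load(&b->sense);
    if (atomic_fetch_sub(&b->remaining, 1) == 1) {
        /* last arrival: reset the count and release everyone */
        atomic_store(&b->remaining, b->nthreads);
        atomic_store(&b->sense, my_sense);
    } else {
        /* waiting time from load imbalance accumulates in this spin loop */
        while (atomic_load(&b->sense) != my_sense)
            ;
    }
}
```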

9.
To optimize the implementation of inter-core communication for multi-task cryptographic processing on a multi-core processor, a hybrid communication mechanism is designed. Based on an analysis of the multi-core architecture and the characteristics of inter-core communication, intra-cluster shared-memory communication and inter-cluster NoC communication are combined, and a DMA communication mechanism is introduced, yielding a hybrid mechanism that further improves communication efficiency. An optimized implementation of the inter-core synchronization mechanism is then given, resolving synchronization and memory-consistency conflicts. Finally, the design is evaluated experimentally with Design Compiler. The results show that, compared with other schemes, this design has lower resource cost and better performance metrics, achieving a satisfactory communication throughput.

10.
In existing IEEE 802.11a receiver implementations on heterogeneous multi-core DSPs, the DSP cores spend long periods idling, so the high computational capability of the multi-core DSP is not fully exploited. Taking the characteristics of multi-core DSPs into account, this work improves execution efficiency through fine-grained pipelining within each core and coarse-grained pipelining across cores, and implements the complete IEEE 802.11a receiver baseband processing on the target heterogeneous multi-core DSP. Experimental results show that the method not only meets the system's throughput and real-time requirements but also maintains a higher average DSP core utilization than comparable work.

11.
As chip multiprocessors with simultaneous multithreaded cores are becoming commonplace, there is a need for simple approaches to exploit thread-level parallelism. In this paper, we consider thread-level speculation as a means to reap thread-level parallelism out of application binaries. We first investigate the tradeoffs between scheduling speculative threads on the same core and on different cores. While threads contend for the same resources using the former approach, the latter approach is plagued by the overhead of inter-core communication. Despite the impact of resource contention, our detailed simulations show that the first approach provides the best performance due to lower inter-thread communication cost. The key contribution of the paper is the proposed design and evaluation of the dual-thread speculation system. This design point has very low complexity and reaps most of the gains of such a system. The work was carried out while Fredrik Warg was a graduate student at Chalmers University of Technology.

12.
As the application scenarios of embedded devices grow more complex, heterogeneous multi-core architectures are becoming the mainstream for embedded processors. The single-operating-system model that multi-core processors currently rely on has many limitations in practice. To fully exploit the multi-core nature of a heterogeneous processor, it is essential to deploy an appropriate operating system on each kind of core and to make these operating systems cooperate. This paper studies operating systems for heterogeneous multi-core (ARM+DSP) processors and successfully ports embedded Linux and the domestic DSP real-time operating system ReWorks to a heterogeneous multi-core platform. To let ReWorks and Linux cooperate, the key techniques of inter-core communication are analyzed, and a set of heterogeneous multi-core communication components is designed, with TI's AM5718 as the example platform. Testing shows that the designed components support dynamically loading the ReWorks operating system and applications onto the DSP core from the ARM side, exchanging messages between Linux and ReWorks across cores, and cooperative computation between Linux and ReWorks.

13.
In order to exploit the computing power of many integrated cores (MIC) on a heterogeneous cluster, a multi-level, multi-granularity collaborative parallel computing method is proposed for finite element structural mechanics analysis. Computing tasks are divided into three levels: inter-node parallelism, inter-device parallelism, and inter-core parallelism. By mapping decomposable computing jobs to the different hardware layers of the heterogeneous MIC system, the proposed method not only effectively resolves the load-balancing problem between CPU and MIC devices but also significantly reduces the system's communication overhead. Large-scale engineering simulation experiments were conducted on the "Tianhe-2" supercomputer, employing up to 39,000 CPU+MIC cores on finite element models of more than 100 million elements. Test results show that the proposed method achieves good speedup and parallel efficiency in large-scale parallel finite element structural analysis. An optimized match between finite element structural analysis and the heterogeneous MIC computing platform is realized, providing a reference for parallel porting and performance optimization of similar applications.

14.
This work analyses the effects of sequential-to-parallel synchronization and inter-core communication on multicore performance, speedup, and scaling from the perspective of Amdahl's law. Analytical modeling supported by simulation leads to a modification of Amdahl's law, reflecting lower speedup than originally predicted due to these effects. In applications with a high degree of data sharing, which imposes intense inter-core connectivity requirements, the workload should be executed on a smaller number of larger cores. Applications requiring intense sequential-to-parallel synchronization, even highly parallelizable ones, may be better executed by the sequential core. To improve the scalability and speedup of a multicore, addressing the synchronization and connectivity intensities of parallel algorithms is as important as addressing their parallelization factor.
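The correction described here can be pictured as adding synchronization and communication terms to the denominator of Amdahl's law. The form below is only a sketch of that idea, not the paper's exact model: f is the parallelizable fraction, n the core count, and t_sync(n), t_comm(n) the normalized sequential-to-parallel synchronization and inter-core communication overheads.

```latex
% Classical Amdahl speedup:
%   S(n) = 1 / ((1 - f) + f/n)
% Sketch of a corrected form charging synchronization and communication
% costs (illustrative; the paper's model may differ):
S(n) = \frac{1}{(1 - f) + \frac{f}{n} + t_{\mathrm{sync}}(n) + t_{\mathrm{comm}}(n)}
```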

15.
New and proposed communication systems are entirely digital, including Voice over Internet Protocol tasks as well as traditional data packets, fax, etc. Numerous digital signal processing (DSP) algorithms are available to encode and decode these signals, each having different requirements for data memory, program memory, and processor speed. A DSP multiprocessor having numerous DSP cores receives a stream of requests for encoding and decoding tasks. A service request is "blocked" if no DSP core can handle the task when it arrives. We present algorithms for assigning DSP tasks to cores in order to minimize the number of blocked tasks. This is similar to an online bin-packing problem with the important difference that the program memory can be shared between simultaneous service requests for the same DSP algorithm. Since bin-packing is known to be NP-complete, we develop fast heuristic online methods for this problem.
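A first-fit flavor of such an online assignment might look like the hedged C sketch below: program memory is charged only when a core does not already have the requested algorithm loaded, while data memory is charged per task (processor speed is omitted for brevity). All names and fields are illustrative, not the paper's heuristics.

```c
/* Online first-fit assignment of DSP tasks to cores where program memory is
 * shared among simultaneous tasks of the same algorithm on a core. */
#define NCORES  8
#define NALGS   16

typedef struct {
    int prog_mem_free;            /* remaining program memory on this core */
    int data_mem_free;            /* remaining data memory on this core */
    int tasks_of_alg[NALGS];      /* running tasks per algorithm */
} core_t;

typedef struct { int alg, prog_mem, data_mem; } task_t;

/* Returns the chosen core index, or -1 if the request must be blocked. */
int assign_task(core_t cores[NCORES], const task_t *t)
{
    for (int c = 0; c < NCORES; c++) {
        core_t *k = &cores[c];
        /* Program memory is needed only if this algorithm is not already loaded. */
        int prog_needed = (k->tasks_of_alg[t->alg] > 0) ? 0 : t->prog_mem;
        if (k->prog_mem_free >= prog_needed && k->data_mem_free >= t->data_mem) {
            k->prog_mem_free -= prog_needed;
            k->data_mem_free -= t->data_mem;
            k->tasks_of_alg[t->alg]++;
            return c;
        }
    }
    return -1;   /* blocked: no core can host this task right now */
}
```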

16.
An experimental approach is chosen to investigate the performance of a fine-grained dataflow architecture for numerically intensive digital signal processing (DSP) applications. The focus is on the behavior of pipelined data-parallel algorithms. However, the granularity of the high-level language programming blocks is not explicitly optimized to balance computation and communication; a natural and logical fine-grained decomposition of problems is used instead. The authors interpret their empirical data by means of parameters such as the number of instructions per generic unit of computation, the density of precedence relations, and the serial fraction. The performance and limitations of fine-grained general-purpose dataflow computing are discussed.

17.
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to the destination's cache before it is needed, eliminating cache misses in the destination's cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors.

18.
This paper proposes DCR, a dual-core redundant soft-error recovery execution model based on context save and restore. The model executes the same thread on two redundant cores and compares their store instructions. A hardware-implemented context save and restore mechanism is added to each core, so that after a soft error is detected execution can resume from the last saved context. Experimental results show that, compared with the traditional soft-error recovery execution model CRTR, DCR reduces the demand for inter-core communication bandwidth by 57.5%, and when soft errors occur it recovers 99.69% of them.

19.
Dataflow programming languages simplify programming in their application domains and cleanly separate task computation from data communication, giving applications parallelism at both the task level and the data level. Targeting the abundant data parallelism, task parallelism, and pipeline parallelism in GPU/CPU hybrid architectures, a task partitioning method and a multi-granularity scheduling strategy for dataflow programs on GPU/CPU hybrid architectures are proposed and implemented, comprising task classification, horizontal splitting of GPU-side tasks, and balancing of discrete CPU-side tasks; software pipelined scheduling is constructed, and OpenCL target code is generated after compiler optimization. Task classification assigns each task to a suitable computing platform according to its computational characteristics and the volume of communication between tasks; horizontal splitting of GPU-side tasks exploits their parallelism to split them evenly across the GPUs, so that high inter-GPU communication overhead does not hurt overall execution performance; balancing of discrete CPU-side tasks distributes the CPU-side tasks evenly to appropriately chosen CPU cores, ensuring load balance and raising core utilization. Experiments were run on a hybrid platform of multiple NVIDIA Tesla C2050 GPUs and multi-core CPUs, with typical multimedia algorithms as test programs; the results demonstrate the effectiveness of the partitioning method and scheduling strategy.


