首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this paper a dataflow architecture is introduced that maps efficiently onto multi-FPGA platforms and is composed of communication channels which can be dynamically adapted to the dataflow of the algorithm. The reconfiguration of the topology can be accomplished within a single clock cycle while DSP operations are in progress. Finally, the programmability and scalability of the proposed architecture is demonstrated by a high-performance parallel FFT implementation.  相似文献   

2.
3.
Performance heterogeneous multicore processors (HMP for brevity) consisting of multiple cores with the same instruction set but different performance characteristics (e.g., clock speed, issue width), are of great concern since they are able to deliver higher performance per watt and area for programs with diverse architectural requirements than comparable homogeneous ones. However, such power and area efficiencies of performance heterogeneous multicore systems can only be achieved when workloads are matched with cores according to both the properties of the workload and the features of the cores.  相似文献   

4.
5.
HPRC (High-Performance Reconfigurable Computing) systems include multicore processors and reconfigurable devices acting as custom coprocessors. Due to economic constraints, the number of reconfigurable devices is usually smaller than the number of processor cores, thus preventing that a 1:1 mapping between cores and coprocessors could be achieved. This paper presents a solution to this problem, based on the virtualization of reconfigurable coprocessors. A Virtual Coprocessor Monitor (VCM) has been devised for the XtremeData XD2000i In-Socket Accelerator, and a thread-safe API is available for user applications to communicate with the VCM. Two reference applications, an IDEA cipher and an Euler CFD solver, have been implemented in order to validate the proposed architecture and execution model. Results show that the benefits arising from coprocessor virtualization outperform its overhead, specially when code has a significant software weight.  相似文献   

6.
<正> 飞思卡尔半导体推出 QorIQ P4080多核处理器——一个旨在为嵌入式多核空间中的性能、功效和编程性设定新标准的非常先进的八核通信处理器。P4080多核处理器是飞思卡尔新 QorIQ 平台的标志性成员,基于45纳米处理技术。它集成了增强的 PowerArchitecture~(TM)内核、三级缓存分层、创新 CoreNet~(TM)片上结构和数据路径加速,可在最大30W 的功率电路内提供  相似文献   

7.
The availability of multicore processors and programmable NICs, such as TOEs (TCP/IP Offloading Engines), provides new opportunities for designing efficient network interfaces to cope with the gap between the improvement rates of link bandwidths and microprocessor performance. This gap poses important challenges related with the high computational requirements associated to the traffic volumes and wider functionality that the network interface has to support. This way, taking into account the rate of link bandwidth improvement and the ever changing and increasing application demands, efficient network interface architectures require scalability and flexibility. An opportunity to reach these goals comes from the exploitation of the parallelism in the communication path by distributing the protocol processing work across processors which are available in the computer, i.e. multicore microprocessors and programmable NICs.Thus, after a brief review of the different solutions that have been previously proposed for speeding up network interfaces, this paper analyzes the onloading and offloading alternatives. Both strategies try to release host CPU cycles by taking advantage of the communication workload execution in other processors present in the node. Nevertheless, whereas onloading uses another general-purpose processor, either included in a chip multiprocessor (CMP) or in a symmetric multiprocessor (SMP), offloading takes advantage of processors in programmable network interface cards (NICs). From our experiments, implemented by using a full-system simulator, we provide a fair and more complete comparison between onloading and offloading. Thus, it is shown that the relative improvement on peak throughput offered by offloading and onloading depends on the rate of application workload to communication overhead, the message sizes, and on the characteristics of the system architecture, more specifically the bandwidth of the buses and the way the NIC is connected to the system processor and memory. In our implementations, offloading provides lower latencies than onloading, although the CPU utilization and interrupts are lower for onloading. Taking into account the conclusions of our experimental results, we propose a hybrid network interface that can take advantage of both, programmable NICs and multicore processors.  相似文献   

8.
Power-efficiency has been a key issue for today’s application and system design, ranging from embedded systems to data centers. While application-specific designs and optimizations may improve the power efficiency, it requires significant efforts to co-design the hardware and software, which are difficult to re-use. On the hardware front, the trend of heterogeneous computing enables custom designs for specific applications by integrating different types of processors and reconfigurable hardware to handle compute-intensive tasks. However, what is still missing is an elegant application framework, i.e., a programming environment and a runtime system, to develop portable applications which can offload tasks or be reconfigured dynamically to run on a variety of systems efficiently.Our ongoing work, MobileFBP, provides an application framework which aims to support heterogeneous and reconfigurable systems. Using the framework, the developers build portable applications with a dataflow programming paradigm, and the MobileFBP runtime system dynamically schedules the task components to run on available computing resources locally or remotely based on the application profiles. We hope that this ability produces high-level portable applications and reduces the efforts and skills needed for the developers to optimize their applications on a range of systems. This paper describes this work and presents our preliminary results.  相似文献   

9.
Systems engineering aims to produce reliable systems which function according to specification. In this paper we follow a systems engineering approach to design a biomedical signal processing system. We discuss requirements capturing, specification definition, implementation and testing of a classification system. These steps are executed as formal as possible. The requirements, which motivate the system design, are based on diabetes research. The main requirement for the classification system is to be a reliable component of a machine which controls diabetes. Reliability is very important, because uncontrolled diabetes may lead to hyperglycaemia (raised blood sugar) and over a period of time may cause serious damage to many of the body systems, especially the nerves and blood vessels. In a second step, these requirements are refined into a formal CSP‖ B model. The formal model expresses the system functionality in a clear and semantically strong way. Subsequently, the proven system model was translated into an implementation. This implementation was tested with use cases and failure cases.Formal modeling and automated model checking gave us deep insight in the system functionality. This insight enabled us to create a reliable and trustworthy implementation. With extensive tests we established trust in the reliability of the implementation.  相似文献   

10.
针对联机分析处理(OLAP)中事实表与多个维表之间的星形连接执行代价较高的问题,提出了一种在先进的多核中央处理器(CPU)和图形处理器(GPU)上的星形连接优化方法.首先,对于多核CPU和GPU平台的星形连接中的物化代价问题,提出了基于向量索引的CPU和GPU平台上的向量化星形连接算法;然后,通过面向CPU cache...  相似文献   

11.
High-level language abstraction for reconfigurable computing   总被引:1,自引:0,他引:1  
  相似文献   

12.
MES是一个常驻工厂层的信息系统,由于该系统分布不集中且本身存在一些漏洞,使得维护人员很难及时发现系统故障并处理.MES系统运行报警及处理平台可以自动对MES系统容易出现的故障进行监视,可以对一些故障进行自动处理,将不能自动处理的故障以短信形式通知维护人员,并提供日报、月报等日志形式便于相关人员统计与分析运行状况.  相似文献   

13.
RISPs (Reconfigurable Instruction Set Processors) are increasingly becoming popular as they can be customized to meet design constraints. However, existing instruction set customization methodologies do not lend well for mapping custom instructions on to commercial FPGA architectures. In this paper, we propose a design exploration framework that provides for rapid identification of a reduced set of profitable custom instructions and their area costs on commercial architectures without the need for time consuming hardware synthesis process. A novel clustering strategy is used to estimate the utilization of the LUT (Look-Up Table) based FPGAs for the chosen custom instructions. Our investigations show that the area costs computations using the proposed hardware estimation technique on 20 custom instructions are shown to be within 8% of those obtained using hardware synthesis. A systematic approach has been adopted to select the most profitable custom instruction candidates. Our investigations show that this leads to notable reduction in the number of custom instructions with only marginal degradation in performance. Simulations based on domain-specific application sets from the MiBench and MediaBench benchmark suites show that on average, more than 25% area utilization efficiency (performance/area) can be achieved with the proposed technique.  相似文献   

14.
Previously we have proposed a Layered Self-Scheduling (LSS) approach that is a hybrid MPI and OpenMP based loop self-scheduling approach for dealing with the heterogeneity problem on a cluster system consisting of multi-core compute nodes, where the allocation functions of several well-known schemes have been modified for better performance. Though LSS provides better performance than the conventional self-scheduling schemes, we found the performance can be improved further after our comprehensive experiments and analyses. The newly proposed task scheduling strategy, called Enhanced Layered Self-Scheduling (ELSS), aims at how to utilize the compute powers of multiple processor cores more efficiently in the master compute node and how to schedule tasks to have more stable performance improvements. We have evaluated the new task scheduling strategy by three benchmark applications: Matrix Multiplication, Monte Carlo Integration, and Mandelbrot Set Computation. It is recommended that the global scheduler adopts Guided Self-Scheduling (GSS) for all, and the local scheduler adopts the static scheme for applications with regular workload distribution but any scheme for applications with irregular workload distribution. Experimental results show the best speedups obtained by ELSS for the three benchmark programs are 1.373, 13.34 and 2.4, respectively, compared with that scheduled by LSS.  相似文献   

15.
This paper is concerned with the design of reconfigurable control laws for fault-tolerant systems. From the control design point of view the issue of performance robustness during the system normal operation versus failure sensitivity at the time of a system component failure is addressed. To achieve a balanced design, a parametrized set of controllers satisfying a prescribed performance level is first constructed, and then the parameter is optimized under a detection criterion that sensitizes the measurements to some specific failures. An example involving aircraft surface impairment is given to explain the procedure for obtaining the optimized design.  相似文献   

16.
The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013)  [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.  相似文献   

17.
In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints.In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows to perform an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.  相似文献   

18.
A reconfigurable machining system is usually a modularized system, and its configuration design concerns the selections of modules and the determination of geometric dimensions in some specific modules. All of its design perspectives from kinematics, dynamics, and control have to be taken into considerations simultaneously, and a multidisciplinary design optimization (MDO) tool is required to support the configuration design process. This paper presents a new MDO tool for reconfigurable machining systems, and it includes the following works: (i) the literatures on the computer-aided design of reconfigurable parallel machining systems have been reviewed with a conclusion that the multidisciplinary design optimization is essential, but no comprehensive design tool is available to reconfigurable parallel machining systems; (ii) a class of reconfigurable systems called reconfigurable tripod-based machining system has been introduced, its reconfiguration problem is identified, and the corresponding design criteria have been discussed; (iii) design analysis in all of the disciplines including kinematics, dynamics, and control have been taken into considerations, and design models have been developed to evaluate various design candidates; in particular, the innovative solutions to direct kinematics, stiffness analysis for the design configurations of tripod-based machines with a passive leg, and concise dynamic modelling have been provided; and (iv) A design optimization approach is proposed to determine the best solution from all possible configurations. Based on the works presented in this paper, a computer-aided design and control tool have been implemented to support the system reconfiguration design and control processes. Some issues relevant to the practical implementation have also been discussed.  相似文献   

19.
The next-generation of partially and fully autonomous cars will be powered by embedded many-core platforms. Technologies for Advanced Driver Assistance Systems (ADAS) need to process an unprecedented amount of data within tight power budgets, making those platform the ideal candidate architecture. Integrating tens-to-hundreds of computing elements that run at lower frequencies allows obtaining impressive performance capabilities at a reduced power consumption, that meets the size, weight and power (SWaP) budget of automotive systems. Unfortunately, the inherent architectural complexity of many-core platforms makes it almost impossible to derive real-time guarantees using “traditional” state-of-the-art techniques, ultimately preventing their adoption in real industrial settings. Having impressive average performances with no guaranteed bounds on the response times of the critical computing activities is of little if no use in safety-critical applications. Project Hercules will address this issue, and provide the required technological infrastructure to exploit the tremendous potential of embedded many-cores for the next generation of automotive systems. This work gives an overview of the integrated Hercules software framework, which allows achieving an order-of-magnitude of predictable performance on top of cutting-edge Commercial-Off-The-Shelf components (COTS). The proposed software stack will let both real-time and non real-time application coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees.  相似文献   

20.
利用RapidIO技术搭建的可重构信号处理平台   总被引:1,自引:0,他引:1  
军事领域常选择ADI公司的TS201芯片用于信号处理平台,但由于其采用基于电路交换的LINK口进行连接,难以实现军方对电子系统设计提出的可重构性的需求。FPGA可以用来实现接口转换功能,如果利用FPGA将基于电路交换的LINK口转换成基于包交换的其他形式的接口,就能在不改变硬件连接的基础上,实现DSP系统的重构。本文介绍了一种基于串行RapidIO技术的可重构的信号处理平台,并对其中核心的FPGA的逻辑设计进行了讨论。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号