Similar Literature
1.
In this paper we propose a scheme for mapping two important artificial neural network (ANN) models on the popular k-ary n-cube parallel architectures (KNCs). The scheme is based on generalizing the mapping of a bipartite graph onto the KNC architecture and thus can be adapted to any model whose computations can be represented by a bipartite task graph. Our approach is the first to adjust the granularity of parallelism so as to achieve the best possible performance based on properties of the computational model and the target architecture. We first introduce a methodology for optimal implementation of multi-layer feedforward artificial neural networks (FFANNs) trained with the backpropagation algorithm on KNCs. We prove that our mapping methodology is time-optimal and that it provides for maximum processor utilization regardless of the structure of the FFANN. We show that the same methodology can be utilized for efficient mapping of Radial Basis Function neural networks (RBFs) on KNCs.
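The bipartite-graph view treats each pair of adjacent FFANN layers as a task graph whose edges (weights) are distributed over the processors of the cube. Below is a minimal sketch of that idea, assuming a simple column-block distribution of the weight matrix over P abstract processors followed by a reduction; the processor count, layer sizes, and partitioning rule are illustrative, not the paper's exact scheme.

```python
import numpy as np

def forward_layer_parallel(W, x, P):
    """Bipartite-graph view of one FFANN layer y = f(W @ x):
    split the source neurons (columns of W) into P blocks, let each
    'processor' compute its partial product, then reduce the partials."""
    n_out, n_in = W.shape
    blocks = np.array_split(np.arange(n_in), P)    # neurons owned by each processor
    partials = [W[:, b] @ x[b] for b in blocks]    # local compute on each processor
    y = np.sum(partials, axis=0)                   # reduction step (e.g., over the k-ary n-cube)
    return np.tanh(y)                              # any activation; tanh used for illustration

# toy usage: a 4-processor mapping of a 6 -> 3 layer
rng = np.random.default_rng(0)
W, x = rng.standard_normal((3, 6)), rng.standard_normal(6)
assert np.allclose(forward_layer_parallel(W, x, 4), np.tanh(W @ x))
```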

2.
An embedded system is called a multi-mode embedded system if it performs multiple applications by dynamically reconfiguring the system functionality. Further, the embedded system is called a multi-mode multi-task embedded system if it additionally supports multiple tasks to be executed in a mode. In this paper, we address an important HW/SW partitioning problem, namely the HW/SW partitioning of multi-mode multi-task embedded applications with timing constraints on tasks. The objective of the optimization problem is to find a minimal total system cost for the allocation/mapping of processing resources to functional modules in tasks, together with a schedule that satisfies the timing constraints. Success in solving the problem depends closely on how much of the potential parallelism among module executions can be exploited. However, due to the inherently large search space of the parallelism, and in order to keep schedulability analysis simple, prior HW/SW partitioning methods have not been able to fully exploit the potential parallel execution of modules. To overcome this limitation, we propose a set of comprehensive HW/SW partitioning techniques which solve the three subproblems of the partitioning problem simultaneously: (1) allocation of processing resources, (2) mapping of the processing resources to the modules in tasks, and (3) determination of an execution schedule of the modules. Specifically, based on a precise measurement of the parallel execution and schedulability of modules, we develop a stepwise-refinement partitioning technique for single-mode multi-task applications, which aims to solve subproblems 1, 2 and 3 effectively in an integrated fashion. The proposed technique is then extended to solve the HW/SW partitioning problem of multi-mode multi-task applications (i.e., to find a globally optimized allocation/mapping of processing resources with a feasible execution schedule of modules). Experiments with a set of real-life applications show that the proposed techniques reduce the implementation cost by 19.0% and 17.0% for single- and multi-mode multi-task applications, respectively, compared with the conventional method.
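The interplay of allocation, mapping, and scheduling can be illustrated with a deliberately simplified greedy heuristic: start all-software and move the module with the best deadline-gain-per-hardware-cost ratio to hardware until the timing constraint holds. The module data, serial-schedule model, and cost figures below are invented for illustration and ignore the inter-module parallelism that the paper's technique exploits.

```python
def greedy_hw_sw_partition(modules, deadline):
    """modules: dict name -> (sw_time, hw_time, hw_cost).
    Returns (hw_set, total_hw_cost) or None if the deadline is unreachable.
    Schedule model: purely serial execution (sum of chosen times)."""
    hw = set()
    def makespan():
        return sum(h if m in hw else s for m, (s, h, _) in modules.items())
    while makespan() > deadline:
        candidates = [(m, (s - h) / c) for m, (s, h, c) in modules.items()
                      if m not in hw and s > h]
        if not candidates:
            return None                      # no remaining move can help
        best, _ = max(candidates, key=lambda t: t[1])
        hw.add(best)                         # move best gain-per-cost module to hardware
    return hw, sum(modules[m][2] for m in hw)

# toy usage: three modules, deadline of 10 time units
mods = {"fft": (8, 2, 5), "filt": (5, 3, 4), "ctl": (3, 3, 6)}
print(greedy_hw_sw_partition(mods, deadline=10))   # -> ({'fft'}, 5)
```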

3.
An Algorithm-Hardware-System Approach to VLIW Multimedia Processors
Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware and system based point of view. VLIW processors show high flexibility and processing power, as well as a good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread-level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can efficiently be used to further increase performance of typical multimedia workloads.

4.
Reconfigurable computing systems have become an important option for accelerating compute-intensive applications. Among the many compute-intensive problems receiving attention, matrix triangularization, as a typical foundational kernel, remains at the core of research and is of great value in scientific and engineering problems such as solving systems of linear equations and computing matrix eigenvalues. Targeting the triangularization process shared by matrix triangular decompositions, this paper analyzes its linear computation pattern and proposes a unified submatrix-update algorithm suitable for parallel hardware implementation, together with a parallel FPGA (Field Programmable Gate Array) architecture for matrix triangularization. High-performance implementation and optimization of LU decomposition on the parallel architecture template are then studied. Theoretical analysis shows that the algorithm exposes higher data parallelism and pipeline parallelism in the triangularization process; experimental results show that, compared with a software implementation on a general-purpose processor, the FPGA parallel implementation of matrix triangularization based on this algorithm achieves a speedup of more than 10x on the key computations.
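One way to see the "unified submatrix update" idea is that each elimination step applies the same rank-1 update to the entire trailing submatrix, which is the regular pattern that maps well onto a hardware array. Below is a minimal right-looking LU sketch in software (no pivoting, dense NumPy); the FPGA structure itself is not modeled here.

```python
import numpy as np

def lu_right_looking(A):
    """Right-looking LU without pivoting: at step k the whole trailing
    submatrix receives one uniform rank-1 update, the pattern that a
    hardware-parallel 'unified submatrix update' exploits."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                               # column of L multipliers
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])   # uniform trailing-submatrix update
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U

A = np.array([[4., 3., 2.], [2., 4., 1.], [1., 2., 3.]])
L, U = lu_right_looking(A)
assert np.allclose(L @ U, A)
```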

5.
MapReduce has emerged as a popular computing model used in datacenters to process large amounts of data. In the map phase, hash partitioning is employed to distribute data sharing the same key across the nodes of datacenter-scale clusters. However, we observe that this approach can lead to uneven data distribution, which can result in skewed loads among reduce tasks and thus hamper the performance of MapReduce systems. Moreover, worker nodes in MapReduce systems may differ in computing capability due to (1) multiple generations of hardware in non-virtualized data centers, or (2) co-location of virtual machines in virtualized data centers. The heterogeneity among cluster nodes exacerbates the negative effects of uneven data distribution. To improve MapReduce performance in heterogeneous clusters, we propose a novel load-balancing approach in the reduce phase. This approach consists of two components: (1) performance prediction for reducers that run on heterogeneous nodes, based on support vector machine models, and (2) heterogeneity-aware partitioning (HAP), which balances skewed data for reduce tasks. We implement this approach as a plug-in in a current MapReduce system. Experimental results demonstrate that our proposed approach distributes work evenly among reduce tasks and improves MapReduce performance with little overhead.
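The heterogeneity-aware partitioning component can be pictured as replacing plain hash partitioning with an assignment that weights each reducer by its predicted speed. The sketch below uses given per-reducer speed predictions (standing in for the SVM-based model) and a greedy least-relative-load assignment of key groups; the data structures and the greedy rule are illustrative assumptions, not the paper's exact algorithm.

```python
def heterogeneity_aware_partition(key_sizes, reducer_speeds):
    """key_sizes: dict key -> number of records for that key.
    reducer_speeds: predicted relative capability of each reducer
    (e.g., output of a performance model).  Assign the heaviest keys
    first to the reducer with the lowest load/speed ratio."""
    load = [0.0] * len(reducer_speeds)
    assignment = {}
    for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
        r = min(range(len(reducer_speeds)),
                key=lambda i: (load[i] + size) / reducer_speeds[i])
        assignment[key] = r
        load[r] += size
    return assignment, load

# toy usage: skewed keys, two fast reducers and one slow reducer
keys = {"a": 900, "b": 400, "c": 300, "d": 100}
print(heterogeneity_aware_partition(keys, reducer_speeds=[2.0, 2.0, 1.0]))
```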

6.
In the embedded computer system domain, MPSoC systems have become increasingly popular due to the ever-increasing performance demands of modern embedded applications. The number of processing elements in these MPSoCs also steadily increases. Whereas current MPSoCs still contain a limited number of processing elements, future MPSoCs will feature tens up to hundreds of (heterogeneous) processing elements, all integrated on a single chip. On these future large-scale MPSoC systems, the mapping of applications onto the hardware resources plays an important role in fully exploiting the parallelism of applications. In this article, a hierarchical run-time adaptive resource allocation framework which uses an intelligent task remapping approach is proposed to improve the system performance of large-scale MPSoCs.
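A toy version of run-time adaptive remapping: monitor per-PE load and migrate the heaviest task from the most loaded processing element to the least loaded one whenever the imbalance crosses a threshold. The task loads, threshold, and single-migration policy are assumptions for illustration; the paper's hierarchical framework is considerably more elaborate.

```python
def remap_if_imbalanced(mapping, task_load, threshold=1.5):
    """mapping: dict task -> PE id; task_load: dict task -> load estimate.
    If the max PE load exceeds threshold * min PE load, move the heaviest
    task of the hottest PE to the coolest PE.  Returns the updated mapping."""
    pes = set(mapping.values())
    pe_load = {pe: sum(l for t, l in task_load.items() if mapping[t] == pe) for pe in pes}
    hot = max(pe_load, key=pe_load.get)
    cold = min(pe_load, key=pe_load.get)
    if pe_load[hot] > threshold * max(pe_load[cold], 1e-9):
        movable = [t for t in mapping if mapping[t] == hot]
        heaviest = max(movable, key=lambda t: task_load[t])
        mapping = dict(mapping, **{heaviest: cold})   # migrate one task at run time
    return mapping

# toy usage: PE0 is overloaded relative to PE1
m = {"t0": 0, "t1": 0, "t2": 1}
print(remap_if_imbalanced(m, {"t0": 8, "t1": 5, "t2": 3}))   # t0 moves to PE1
```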

7.
Complex network protocols and various network services require significant processing capability in modern network applications. One of the important features in modern networks is differentiated service. Along with differentiated service, rapidly changing network environments result in congestion problems. In this paper, we analyze the characteristics of representative congestion-control applications, namely scheduling and queue-management algorithms, and we propose application-specific acceleration techniques that use instruction-level parallelism (ILP) and packet-level parallelism (PLP) in these applications. From the PLP perspective, we propose a hardware acceleration model based on a detailed analysis of congestion-control applications. In order to obtain large throughputs, a large number of processing elements (PEs) and a parallel comparator are designed. Such hardware accelerators provide parallelism proportional to the number of processing elements added. A 32-PE enhancement yields a 24x speedup for weighted fair queueing (WFQ) and a 27x speedup for random early detection (RED). For ILP, new instruction-set extensions for fast conditional operations are applied to congestion-control applications. In our experiments, the proposed instruction-set extensions show a 10%-12% improvement in performance. As the performance of general-purpose processors rapidly increases, defining architectural extensions (e.g., multimedia extensions (MMX) as in multimedia applications) for general-purpose processors could be an alternative solution for a wide range of network applications.
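Of the two queue-management kernels mentioned, random early detection (RED) is compact enough to sketch in full. The code below is a simplified, textbook-style RED drop decision (EWMA of the queue length and a linearly increasing drop probability between two thresholds); the count-based and idle-time refinements of full RED are omitted, the parameter values are illustrative, and this is not the paper's hardware-accelerated version.

```python
import random

class RedQueue:
    """Simplified RED: the drop probability grows linearly from 0 to max_p
    as the EWMA of the queue length moves from min_th to max_th."""
    def __init__(self, min_th=5, max_th=15, max_p=0.1, w=0.002):
        self.min_th, self.max_th, self.max_p, self.w = min_th, max_th, max_p, w
        self.avg = 0.0

    def should_drop(self, queue_len):
        self.avg = (1 - self.w) * self.avg + self.w * queue_len   # EWMA update
        if self.avg < self.min_th:
            return False
        if self.avg >= self.max_th:
            return True
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return random.random() < p

# toy usage: queue length cycling between 0 and 24 packets
red = RedQueue()
decisions = [red.should_drop(q % 25) for q in range(10000)]
print(f"drop rate: {sum(decisions) / len(decisions):.3f}")
```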

8.
The optimum architecture design and mapping of QRD-RLS adaptive filters can be achieved through filter architecture selection, look-ahead transformations, and hierarchical pipelining/folding transformations. In this paper, a relaxed annihilation-reordering look-ahead (RARL) architecture is proposed and shown to be more power- and area-efficient than the pipelined processing architecture, which was previously considered the most area-efficient. Filters with this architecture are based on a relaxed weight update through filtering approximation, where a filter tap weight is updated upon arrival of every block of input data, and are sped up with the annihilation-reordering look-ahead transformation. As a result of the reduction in computational complexity, this architecture does not change the iteration bound or the filter clock frequency, and it leads to speedup with a linear increase in power consumption, whereas pipelined processing architectures achieve speedup with a quadratic increase in power consumption. Upon hardware mapping, this architecture is also more advantageous for achieving low-area designs. Two design examples are presented to illustrate mapping optimization using the above transformations. These results are important for mapping designs onto ASICs, FPGAs, or parallel computing machines. The results show significant improvements in throughput, power consumption, and hardware requirements. It is also shown through analysis and simulations that the RARL QRD-RLS filters have no performance degradation in terms of convergence rate.
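The annihilation step at the heart of QRD-RLS is a Givens rotation that zeroes one element of the incoming data row against the stored triangular matrix. The sketch below shows only that basic annihilation (a plain QR row update), not the relaxed look-ahead or block weight-update transformation proposed in the paper; it is a generic illustration under that assumption.

```python
import numpy as np

def annihilate(R, x):
    """Update an upper-triangular R with a new data row x using Givens
    rotations: each rotation annihilates one entry of x against R."""
    R, x = R.copy().astype(float), x.copy().astype(float)
    n = len(x)
    for i in range(n):
        r = np.hypot(R[i, i], x[i])
        if r == 0.0:
            continue
        c, s = R[i, i] / r, x[i] / r
        Ri, xi = R[i, i:].copy(), x[i:].copy()
        R[i, i:] = c * Ri + s * xi          # rotate the stored row
        x[i:] = -s * Ri + c * xi            # entry x[i] becomes (numerically) zero
    return R

# toy usage: accumulate two data rows into R and compare with a batch QR
rows = np.array([[3., 1., 2.], [1., 4., 0.]])
R = np.zeros((3, 3))
for row in rows:
    R = annihilate(R, row)
Rb = np.linalg.qr(rows, mode="r")
assert np.allclose(np.abs(R[:2]), np.abs(Rb))   # agree up to row signs
```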

9.
Research on the scheduling and fault-tolerance mechanisms of the MapReduce model
MapReduce is a parallel programming model that can be used to process and generate large datasets. Its scheduling and fault-tolerance mechanisms are an important part of the model. By analyzing the execution process of the MapReduce model, we extract its scheduling and fault-tolerance models, apply scheduling ideas commonly used in P2P models to the MapReduce scheduling model, and modify the original scheduling and fault-tolerance mechanisms accordingly.

10.
雷元武  窦勇  倪时策  周杰 《电子学报》2012,40(9):1715-1722
Considering that elementary functions in scientific applications are numerous, complex to implement, and individually used infrequently, this paper proposes a custom VLIW quad-precision floating-point elementary-function coprocessor (QPC-Processor). The architecture exploits the parallelism of elementary-function algorithms through explicit parallelism and computes multiple elementary functions on the same hardware platform through different combinations of meta-operations. The paper also proposes an algorithm for mapping elementary-function meta-operation sequences onto custom VLIW instructions, which guides the design of the elementary functions. Finally, the design is verified on an FPGA platform. Experimental results show that, compared with a software implementation, a single QPC-Processor achieves a speedup of more than 6x; moreover, by implementing multiple types of algorithms on the same hardware platform, the QPC-Processor compensates for the limitations of any single algorithm and achieves high hardware resource utilization.
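The mapping of a meta-operation sequence onto VLIW instruction words can be illustrated with plain list scheduling: respect data dependences and pack at most issue_width ready meta-operations into each instruction word. The dependence representation, single-cycle-latency assumption, and greedy packing rule below are illustrative assumptions, not the paper's mapping algorithm.

```python
def schedule_vliw(ops, deps, issue_width=4):
    """ops: list of meta-operation names; deps: dict op -> set of ops it
    depends on.  Greedy list scheduling: every cycle, issue up to
    issue_width operations whose predecessors have already been issued
    (operations are assumed to complete in one cycle)."""
    done, words = set(), []
    remaining = list(ops)
    while remaining:
        ready = [o for o in remaining if deps.get(o, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependence")
        word = ready[:issue_width]            # one VLIW instruction word
        words.append(word)
        done.update(word)
        remaining = [o for o in remaining if o not in done]
    return words

# toy usage: a small dependence graph for an elementary-function kernel
ops  = ["ld_x", "mul1", "add1", "mul2", "add2", "st_y"]
deps = {"mul1": {"ld_x"}, "add1": {"mul1"}, "mul2": {"ld_x"},
        "add2": {"add1", "mul2"}, "st_y": {"add2"}}
for i, w in enumerate(schedule_vliw(ops, deps, issue_width=2)):
    print(f"cycle {i}: {w}")
```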

11.
To address the high redundancy and low extraction efficiency of existing job-posting information extraction methods, which lack adaptivity and parallelism, a CSS-template-based parallel job-information extraction method is proposed. Based on the characteristics of job-posting pages, the method uses CSS-path extraction and defines extraction templates to ensure accuracy and adaptivity, and it uses the MapReduce programming model to parallelize the extraction. The MD5 values of the extracted job records are computed with the MD5 algorithm and, combined with the characteristics of the MapReduce parallel programming model, used to deduplicate the records; the deduplicated job information is finally stored in the distributed database HBase. Experimental results show that, compared with the traditional non-parallel programming model, parallel computation clearly improves both processing time and the amount of job information collected.
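The MD5-based deduplication step is straightforward to sketch: key each extracted record by the MD5 digest of its normalized content so that identical postings collapse onto one key. The record fields and normalization below are invented for illustration; in the paper this runs inside MapReduce with the results written to HBase.

```python
import hashlib

def md5_key(record: dict) -> str:
    """Map side: emit (md5(normalized record), record) so duplicates share a key."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def dedup(records):
    """Reduce side: keep one record per MD5 key (stand-in for the HBase write)."""
    unique = {}
    for rec in records:
        unique.setdefault(md5_key(rec), rec)
    return list(unique.values())

# toy usage: the second posting is an exact duplicate of the first
postings = [{"title": "Java Engineer", "company": "Acme", "city": "Beijing"},
            {"title": "Java Engineer", "company": "Acme", "city": "Beijing"},
            {"title": "C++ Engineer",  "company": "Acme", "city": "Shanghai"}]
print(len(dedup(postings)))   # -> 2
```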

12.
Modeling applications and architectures at various levels of abstraction is becoming more and more an accepted approach in embedded system design. When looking at the modeling of applications in the domain of video, audio, and graphics applications, we notice that they exhibit a high degree of task parallelism and operate on streams of data. Models that we can use to specify such stream-based applications on a high level of abstraction are the dataflow models and process network models. Each of these models has its own merits. Therefore, an alternative approach is to introduce a model of computation that combines the semantics of both models of computation. In this article, we introduce such a model of computation, which we call the Stream-Based Functions (SBF) model of computation and show an example. Furthermore, we discuss the composition and decomposition of SBF objects and put the SBF model of computation in the context of relevant dataflow models and process network models.

13.
We evaluate the validity of the fundamental assumption behind application-specific programmable processors: that applications differ from each other in key exploitable parameters, such as the available instruction-level parallelism (ILP), the demand on various hardware resources, and the desired mix of function units. Following the tradition of the CAD community, we develop an accurate chip-area estimate and a set of aggressive hardware optimization algorithms. We follow the tradition of the architecture community by using comprehensive real-life benchmarks and production-quality tools. This combination enables us to build a unique framework for system-level synthesis and to gain valuable insights about the design and use of application-specific programmable processors for modern applications. We explore the application-specific programmable processor (ASPP) design space to understand the relationship between performance and area. The architecture model we used is the Hewlett-Packard PA-RISC with single-level caches. The system, including all memory and bus latencies, is simulated, and no other specialized ALU or memory structures are used. The experimental results reveal a number of important characteristics of the ASPP design space. For example, we found that in most cases a single programmable architecture performs similarly to a set of architectures tuned to individual applications. A notable exception is highly cost-sensitive designs, which we observe need a small number of specialized architectures that require smaller areas. Also, it is clear that there is enough parallelism in typical media and communication applications to justify the use of a high number of function units. We found that the framework introduced in this paper can be very valuable in making early design decisions such as area versus architectural configuration tradeoffs, cache versus issue-width tradeoffs under area constraints, and the number of branch units and issue width.

14.
Modular arithmetic is a building block for a variety of applications potentially supported on embedded systems. One approach to making modular arithmetic more efficient is to identify algorithmic modifications that enhance the parallelization of the target arithmetic, in order to exploit the properties of parallel devices and platforms. The Residue Number System (RNS) introduces data-level parallelism, enabling parallelization even for algorithms based on modular arithmetic with several data dependencies. However, mapping generic algorithms to full RNS-based implementations can be complex, and suitable hardware architectures that are scalable and adaptable to different demands are required. This paper proposes and discusses an architecture with scalability features for the parallel implementation of algorithms relying on modular arithmetic fully supported by RNS. The systematic mapping of a generic modular-arithmetic algorithm to the architecture is presented. It can be applied as a high-level synthesis step in an Application-Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA) design flow targeting modular-arithmetic algorithms. An implementation with the Xilinx Virtex 4 and Altera Stratix II FPGA technologies of modular exponentiation and Elliptic Curve (EC) point multiplication, used in the Rivest-Shamir-Adleman (RSA) and EC cryptographic algorithms, suggests latency results in the same order of magnitude as the fastest hardware implementations of these operations known to date.
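The data-level parallelism that RNS introduces is visible even in a few lines: with pairwise-coprime moduli, addition and multiplication decompose into independent per-channel operations, and the result is recovered with the Chinese Remainder Theorem. The moduli set below is a small illustrative choice, unrelated to the cryptographic parameter sizes in the paper.

```python
from math import prod

MODULI = (7, 11, 13, 15)            # pairwise coprime; dynamic range M = 15015

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # each residue channel works independently -> natural hardware parallelism
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r):
    """Chinese Remainder Theorem reconstruction."""
    M = prod(MODULI)
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)   # pow(..., -1, m): modular inverse (Python 3.8+)
    return x % M

a, b = 123, 91
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % prod(MODULI)
```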

15.
Automatic instruction-set extension for embedded applications
Extending the instruction set for a specific application and implementing the extended instructions in custom hardware can greatly improve the performance of embedded processors. This paper proposes a fully automatic application-specific instruction-set extension flow that estimates the speedup and hardware cost of extended instructions with good accuracy and performs instruction template matching efficiently. Experimental results show that, under a given hardware cost budget, the extended instructions generated by this method significantly improve the performance of embedded applications.
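The selection step of such a flow can be pictured as a knapsack-style choice: each candidate extended instruction carries an estimated cycle saving and a hardware cost, and candidates are picked greedily by saving-per-cost until the area budget is exhausted. The candidate list, the estimates, and the greedy rule are illustrative assumptions rather than the paper's algorithm.

```python
def select_extensions(candidates, area_budget):
    """candidates: dict name -> (estimated total cycles saved, hw_area).
    Greedy knapsack by benefit/area ratio under the area budget."""
    chosen, used, saved = [], 0.0, 0.0
    ranked = sorted(candidates.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    for name, (benefit, area) in ranked:
        if used + area <= area_budget:
            chosen.append(name)
            used += area
            saved += benefit
    return chosen, used, saved

# toy usage: three candidate custom instructions, budget of 10 area units
cands = {"mac3": (12000, 6), "bitrev": (3000, 2), "satadd": (5000, 5)}
print(select_extensions(cands, area_budget=10))   # -> (['mac3', 'bitrev'], 8.0, 15000.0)
```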

16.
The main problem with the hardware implementation of turbo codes is the lack of parallelism in the MAP-based decoding algorithm. This paper proposes to overcome this problem by using a new family of turbo codes called Multiple Slice Turbo Codes. This family is based on two ideas: the encoding of each dimension with P independent tail-biting codes, and a constrained interleaver structure that allows the parallel decoding of the P independent codewords in each dimension. The optimization of the interleaver is described. A high degree of parallelism is obtained with performance equivalent to or better than the DVB-RCS turbo code. For very high throughput applications, the parallel architecture decreases both decoding latency and hardware complexity compared to the classical serial architecture, which requires memory duplication.
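The constrained-interleaver idea can be illustrated by the memory-access condition it has to satisfy: if P decoders work in parallel, each on its own slice, then at every time step the P interleaved addresses must fall into P distinct memory banks. The bank mapping and slice layout below are generic assumptions used only to show the check, not the paper's interleaver construction.

```python
def is_conflict_free(perm, P):
    """perm: interleaver permutation over N = P * M positions.
    Decoder j processes slice positions j*M .. j*M+M-1 in order, so at
    step t the P decoders read perm[j*M + t].  With the (assumed) bank
    mapping bank(addr) = addr // M, the P reads of every step must hit
    P distinct banks for conflict-free parallel decoding."""
    N = len(perm)
    assert N % P == 0
    M = N // P
    for t in range(M):
        banks = {perm[j * M + t] // M for j in range(P)}
        if len(banks) != P:
            return False              # two decoders would access the same bank
    return True

# toy usage with P = 4 decoders and slices of length M = 4
P, M = 4, 4
rotated = [((j + t) % P) * M + t for j in range(P) for t in range(M)]   # row-rotation permutation
columns = [t * P + j for j in range(P) for t in range(M)]               # sends every step to one bank
print(is_conflict_free(rotated, P), is_conflict_free(columns, P))       # -> True False
```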

17.
Interconnection of components in a VLSI chip is becoming an increasingly complex problem. In this paper we examine the complexity of the wire routing process and discuss several new approaches to solving the problem using a parallel system architecture. The machines discussed range from compact systems for highly specialized applications to more general designs suited for broader applications. The process speedup due to parallelism and the cost advantage due to the use of large numbers of identical VLSI parts make these new machines practical today.

18.
The introduction of high-performance applications such as multimedia into SoCs has led manufacturers to provide embedded SoCs that offer substantial computing power, making it possible to meet the increasing requirements of future evolutions of these applications. One of the adopted solutions is the use of multiprocessor SoCs. In this paper, we present a joint SW/HW design exploration methodology for multiprocessor SoCs. The system model relies on transaction-level component-based models for modeling parallel software and multiprocessor hardware. Our proposal comprises two original points. First, we propose a composable software-level scheduler constraint synthesis technique. Second, we present combined software-level and exploratory hardware-level schedulers. The methodology has the advantage of combining the real-time requirements of software with effective exploitation of multiprocessor hardware. We describe and apply the methodology to synthesize a scheduler for a slice-based MPEG-4 video encoder on the multiprocessor Cake SoC.

19.
Today's communication systems, especially in the field of wireless communications, rely on many different algorithms to provide applications with constantly increasing data rates and higher quality. This development, combined with the wireless channel characteristics and the invention of turbo codes, has particularly increased the importance of interleaver algorithms. In this paper, we demonstrate the feasibility of exploiting hardware parallelism in order to accelerate the interleaving procedure. Based on a heuristic algorithm, the possible speedup for different interleavers as a function of the degree of parallelism of the hardware is presented. The parallelization is generic in the sense that the assumed underlying hardware is based on a parallel-datapath DSP architecture and therefore provides the flexibility of software solutions.

20.
In areas of signal processing and communications such as antenna-array beamforming, adaptive filtering, multiuser and multiple-input multiple-output (MIMO) detection, channel estimation and equalization, echo and interference cancellation, and others, solving linear systems of equations often provides optimal performance. However, this is also a very complicated operation that designers try to avoid by proposing different suboptimal techniques. The dichotomous coordinate descent (DCD) algorithm allows linear systems of equations to be solved with high computational efficiency. In this paper, we present architectures and field-programmable gate-array (FPGA) designs of two variants of the DCD algorithm, known as the cyclic and leading DCD algorithms. For each of these techniques, we present serial designs, group-2 and group-4 designs, as well as a design with parallel update of the residual vector for the cyclic DCD algorithm. These designs have different degrees of parallelism, thus enabling a tradeoff between FPGA resources and computation time. The serial designs require the smallest FPGA resources; they are well suited for applications where many parallel solvers are required, e.g., for detection in MIMO orthogonal-frequency-division-multiplexing communication systems. The parallelism introduced in the proposed group-2 and group-4 designs allows faster convergence to the true solution at the expense of an increase in FPGA resources. The design with parallel update of the residual vector provides the fastest convergence speed; however, if the system size is large, it may result in a significant increase in FPGA resources. The proposed fixed-point designs provide accuracy very close to that of their floating-point counterparts and require significantly lower FPGA resources than techniques based on QR decomposition.
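The cyclic DCD iteration itself is short enough to sketch: it solves R h = b by updating one coordinate of h at a time with power-of-two step sizes, which is why the hardware needs only additions and bit shifts. The code below is a floating-point restatement of the generally published cyclic DCD iteration, with illustrative parameters (step range H, bit count Mb, update budget Nu); it is not the paper's fixed-point FPGA design.

```python
import numpy as np

def cyclic_dcd(R, b, H=1.0, Mb=15, Nu=200):
    """Solve R h = b (R symmetric positive definite) with cyclic DCD.
    Each successful update changes one coordinate of h by the current
    step size d and refreshes the residual r = b - R h; d is halved
    whenever a full pass over the coordinates makes no update."""
    n = len(b)
    h = np.zeros(n)
    r = b.astype(float).copy()
    d = H
    updates = 0
    for _ in range(Mb):
        improved = True
        while improved and updates < Nu:
            improved = False
            for p in range(n):
                if abs(r[p]) > (d / 2.0) * R[p, p]:
                    s = np.sign(r[p])
                    h[p] += d * s
                    r -= d * s * R[:, p]
                    updates += 1
                    improved = True
        d /= 2.0                    # halve the step size ("use one more bit")
    return h

# toy usage on a small symmetric positive definite system
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)); R = A @ A.T + 4 * np.eye(4)
x = rng.standard_normal(4); b = R @ x
print(np.max(np.abs(cyclic_dcd(R, b) - x)))   # max absolute solution error (small)
```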
