Similar literature
1.
This work proposes a new FPGA architecture to meet the signal processing and testing requirements of current system-on-chip designs. The proposed architecture provides the hardware reuse and reconfigurability advantages of an FPGA, not only for the system functionality but also for system testing, while keeping the performance level required by current signal processing applications. This paper presents the new FPGA model, along with preliminary experimental results that clearly show the possible system-level advantages of merging design and test in a reconfigurable device.

2.
Shrinking technology nodes combined with the need for higher clock speeds have made it increasingly difficult to distribute a single global clock across a chip while meeting the power requirements of the design. The globally asynchronous, locally synchronous (GALS) design style can help achieve low power consumption and design modularity while greatly reducing the number of global interconnects. Such multiple-clock-domain architectures can benefit from having frequency/voltage values assigned to each domain based on workload requirements. The work presented in this paper proposes a new hardware-based approach to dynamically change the frequencies, and potentially the voltages, of a voltage-frequency island (VFI) system driven by a dynamic workload. The technique adjusts the frequency of a synchronous island so that it uses power efficiently while satisfying performance constraints. In recent years, there have been major developments, both in industry and academia, in the field of multiprocessor systems. Such multiprocessor systems are very good candidates for a VFI design style, where one or more processors can be part of a single VFI. To demonstrate the feasibility of our proposed method, we have implemented a multiprocessor system on a field-programmable gate array (FPGA) platform that uses independently generated clocks for each processor. The results from the FPGA platform confirm the claim that the power consumption of a system can potentially be reduced while maintaining the performance of many applications. Our work concentrates primarily on embedded systems, but the idea can be explored for general-purpose computing as well.

3.
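This paper proposes an FPGA-based application-specific processor design. It is an ultra-small-area design for the Advanced Encryption Standard (AES) that supports key expansion (currently designed for 128-bit keys), encryption, and decryption. The design uses a full 8-bit datapath width, a novel byte-substitution circuit, and a multiply-accumulator structure, and implements a dedicated AES processor on the smallest Xilinx Spartan II FPGA, the XC2S15, using less than 60% of its resources. At a 70 MHz clock, it achieves an average encryption/decryption throughput of 2.1 Mb/s. It is mainly intended for applications where low resource usage and low power consumption are the priorities.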

4.
This paper presents a novel unified and programmable 2-D Discrete Wavelet Transform (DWT) system architecture, which was implemented using a Field Programmable Gate Array (FPGA)-based Nios II soft-core processor working in combination with custom hardware accelerators generated through high-level synthesis. The proposed system architecture, synthesized on an Altera DE3 Stratix III FPGA board, was developed through an iterative design space exploration methodology using Altera's C2H compiler. Experimental results show that the proposed system architecture is capable of real-time video processing performance for grayscale image resolutions of up to 1920 × 1080 (1080p) when run on the Altera DE3 board, and it outperforms the existing 2-D DWT architecture implementations known in the literature by a considerable margin in terms of throughput. While the proposed 2-D DWT system architecture satisfies real-time performance constraints, it can also perform both forward and inverse DWT, support a number of popular DWT filters used for image and video compression, and provide architecture programmability in terms of the number of decomposition levels as well as image width and height. Based on the design principles used to implement the proposed 2-D DWT system architecture, a system design guideline can be formulated for SoC designs that plan to incorporate dedicated 2-D DWT hardware acceleration.

5.
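Based on the GALS (Globally Asynchronous Locally Synchronous) design concept, this paper proposes an asynchronous interface design model for a core. It consists of an asynchronous control-signal processing model, built from a clock-gating mechanism that stops the core, a handshake mechanism, and level-to-pulse conversion logic, and an asynchronous data processing model, built from an asynchronous FIFO and a dual-FIFO structure. This structure allows the core and the bus system to operate in fully asynchronous clock domains. FPGA verification results show that the model correctly synchronizes signals between the two and meets the bandwidth requirements of the specific application.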

6.
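This paper designs an asynchronous LDPC decoder datapath, using asynchronous circuits to reduce the glitches caused by mismatched signal arrival times and the power consumed by the clock. The main arithmetic units in the datapath are designed around the statistical properties of the input data, which reduces redundant computation. A synchronous datapath and a clock-gated datapath are also implemented for comparison. The three designs use similar architectures and implement the same function in a 0.18 μm CMOS process. Simulation results show that the proposed asynchronous design consumes the least power, saving 42.0% and 32.6% compared with the synchronous and clock-gated designs, respectively. Its performance is slightly below that of the synchronous design but better than that of the clock-gated design: the delay is 1.09 ns for the synchronous design, 1.61 ns for the clock-gated design, and 1.20 ns for the asynchronous design.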

7.
Reconfigurable hardware is ideal for use in systems-on-a-chip (SoCs), as it provides both hardware-level performance and post-fabrication flexibility. However, any one architecture is rarely equally optimized for all applications. SoCs targeting a specific set of applications can greatly benefit from incorporating customized reconfigurable logic instead of generic field-programmable gate array (FPGA) logic. Unfortunately, manually designing a domain-specific architecture for every SoC would require significant design time. Instead, this paper discusses our initial efforts towards creating a reconfigurable hardware generator capable of automatically creating flexible, yet domain-specific, designs. Our tests indicate that our generated architectures are more than 5 times smaller than equivalent FPGA implementations and nearly as area-efficient as standard cell designs. We also use a novel technique employing synthetic circuit generation to demonstrate the flexibility of our architecture generation techniques.

8.
This paper presents a novel type of high-speed and area-efficient register-based transpose memory architecture enabled by reporting on both edges of the clock. The proposed new architecture, by using double-edge-triggered registers, doubles the throughput and increases the maximum frequency by avoiding some of the combinational circuitry used in prior work. The proposed design is evaluated with both FPGA and ASIC flows in 28/32 nm technology. The experimental results show that the proposed memory achieves almost a 4× improvement in throughput while consuming 46% less area with the FPGA implementations compared to prior work. For ASIC implementations, it achieves more than 60% area reduction and at least a 2× performance improvement while burning 60% less power compared to other register-based designs implemented with the same flow. As an example, a proposed 8×8 transpose memory with 12-bit input/output resolution is able to achieve a throughput of 107.83 Gbps at 647 MHz by taking only 140 slices on a Xilinx Virtex-7 FPGA platform, and a throughput of 88.2 Gbps at 529 MHz by taking 0.024 mm² of silicon area for ASIC. The proposed transpose memory is integrated in both 2D-DCT and 2D-IDCT blocks for signal processing applications on the same FPGA platform. The new architecture allows a 3.5× speed-up in performance for the 2D-DCT algorithm compared to the previous work while consuming 28% less area, and the 2D-IDCT achieves a 3× speed-up while consuming 20% less area.

9.
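This paper studies an image fusion method that combines the IHS transform with the wavelet transform. The design is described at a fully synthesizable RTL level using Verilog HDL and a synchronous design methodology, and a hardware implementation based on the Cyclone II EP2C50 is presented. Each module of the design was simulated and implemented with Altera's FPGA development software, Quartus II 6.0. The results show that the design performs image fusion well.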

10.
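This paper proposes a GALS interface design method based on a handshake protocol. The interface uses an asynchronous FIFO as its input buffer, which effectively reduces data transfer latency, and manages the buffer as a ring buffer, which makes the interface scalable. FPGA verification results show that the interface ensures accurate asynchronous transfer between the adaptation unit and the network router. A 4-channel interface occupies 405 ALUTs (Adaptive Look-Up Tables) and supports a clock frequency of 211 MHz.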

11.
On-FPGA communication is becoming more problematic as the performance of long interconnects deteriorates with technology scaling. In this paper, we address this issue by proposing a novel wave-pipelined signaling scheme to achieve substantial throughput improvement in FPGAs. A new analytical model capturing the electrical characteristics of FPGA interconnects is presented. Based on the model, the throughput and power consumption of a wave-pipelined link are derived analytically and compared to those of conventional synchronous links. Two circuit designs are proposed to realize a wave-pipelined link using FPGA fabric. The proposed approaches are also compared with conventional synchronous and asynchronous pipelining techniques. It is shown that the wave-pipelined approach can achieve up to a 5.7 times improvement in throughput and a 13% improvement in power consumption versus conventional delay-based on-chip communication schemes. Trade-offs among power, throughput, and area for the proposed and conventional designs are also studied. The wave-pipelining approach provides a new alternative for on-FPGA communication and can potentially become a promising solution to mitigate the future interconnect scaling challenge.

12.
Chip multiprocessors with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to use clock and voltage scaling jointly in system submodules to achieve high energy efficiencies, and can also result in easily scalable clocking systems. However, its use typically also introduces performance penalties due to additional communication latency between clock domains. We show that GALS chip multiprocessors (CMPs) with large inter-processor first-in, first-out (FIFO) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show it has a less than 1% performance (throughput) reduction on average compared to the corresponding synchronous system for many DSP workloads. Furthermore, adaptive clock and voltage scaling for each processor provides approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which, compared to the corresponding synchronous uniprocessor, has a reported performance (throughput) reduction of greater than 10% and an energy savings of approximately 25% using dynamic clock and voltage scaling for many general-purpose applications.

13.
Partial Reconfiguration (PR) is a method for Field Programmable Gate Array (FPGA) designs that allows multiple applications to time-share a portion of an FPGA while the rest of the device continues to operate unaffected. Using this strategy, the physical layer processing architecture in Software Defined Radio (SDR) systems can benefit from reduced complexity and increased design flexibility, as different waveform applications can be grouped into one part of a single FPGA. Waveform switching often means changing not only the functionality but also the FPGA clock frequency. However, that is beyond the current functionality of PR processes, as the clock components (such as Digital Clock Managers (DCMs)) are excluded from the process of partial reconfiguration. In this paper, we present a novel architecture that combines another reconfigurable technology, the Dynamic Reconfigurable Port (DRP), with PR on a single FPGA in order to dynamically change both the functionality and the clock frequency. The architecture is demonstrated to reduce hardware utilization significantly compared with a standard, static FPGA design.

14.
15.
This paper presents an architecture for a reconfigurable device that is specifically optimized for floating-point applications. Fine-grained units are used for implementing control logic and bit-oriented operations, while parameterized and reconfigurable word-based coarse-grained units incorporating word-oriented lookup tables and floating-point operations are used to implement datapaths. In order to facilitate comparison with existing FPGA devices, the virtual embedded block scheme is proposed to model embedded blocks using existing field-programmable gate array (FPGA) tools. This methodology involves adopting existing FPGA resources to model the size, position, and delay of the embedded elements. The standard design flow offered by FPGA and computer-aided design vendors is then applied, and static timing analysis can be used to estimate the performance of the FPGA with the embedded blocks. On selected floating-point benchmark circuits, our results indicate that the proposed architecture can achieve a fourfold improvement in speed and a 25-fold reduction in area compared with a traditional FPGA device.

16.
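To improve the performance of the SoC internal bus and optimize the bus architecture, this paper proposes a novel LotteryBus bus mechanism. Its characteristics and arbitration mechanism are introduced by comparing it with static-priority and time-division-multiplexed buses. A 4-master LotteryBus was designed and implemented to improve the internal high-speed bus of the Loongson (龙芯) SoC. Functional simulation and FPGA verification demonstrate the feasibility and correctness of this bus mechanism.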
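The abstract above does not spell out the arbitration rule, so the following is only a minimal sketch of the general lottery idea, under the assumption that each bus master holds a fixed number of tickets and that the grant goes to the winner of a random draw over the tickets of the masters currently requesting the bus. The function name `lottery_arbitrate`, the ticket counts, and the software random source are all hypothetical; a hardware arbiter would more plausibly use something like an LFSR.

```python
import random


def lottery_arbitrate(requests, tickets, rng):
    """Grant the bus to one requesting master, chosen with probability
    proportional to its ticket count (illustrative assumption, not the
    paper's verified implementation).

    requests: list of bools, one per master (True = requesting the bus)
    tickets:  list of positive ints, one per master
    rng:      a random.Random instance
    Returns the index of the winning master, or None if nobody requests.
    """
    contenders = [m for m, req in enumerate(requests) if req]
    if not contenders:
        return None
    total = sum(tickets[m] for m in contenders)
    draw = rng.randrange(total)  # uniform integer in [0, total)
    running = 0
    for m in contenders:
        running += tickets[m]  # each master owns a slice of the ticket range
        if draw < running:
            return m


# Hypothetical 4-master configuration: master 3 holds the most tickets.
rng = random.Random(0)
winner = lottery_arbitrate([True, False, True, True], [1, 2, 3, 4], rng)
```

Unlike a static-priority scheme, a low-ticket master in this sketch is never starved outright: it simply wins less often.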

17.
This paper presents a flexible 2×2 matrix multiplier architecture. The architecture is based on word-width decomposition for flexible but high-speed operation. The elements in the matrices are successively decomposed so that a set of small multipliers and simple adders are used to generate partial results, which are combined to generate the final results. An energy reduction mechanism is incorporated in the architecture to minimize the power dissipation due to unnecessary switching of logic. Two types of decomposition schemes are discussed, which support 2's complement inputs, and its overall functionality is verified and designed with a field-programmable gate array (FPGA). The architecture can be easily extended to a reconfigurable matrix multiplier. We provide results on performance of the proposed architecture from FPGA post-synthesis results. We summarize design factors influencing the overall execution speed and complexity.
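Word-width decomposition can be illustrated with a short sketch. It assumes unsigned operands for simplicity (the paper's schemes handle 2's complement inputs, which is not reproduced here), splits each operand into a high and a low half of `k` bits, and rebuilds the full product from four narrow products plus shifts and adds. The names `split`, `mul_decomposed`, and `matmul2x2` are hypothetical.

```python
def split(x, k):
    """Split an unsigned integer into (high part, low k bits)."""
    return x >> k, x & ((1 << k) - 1)


def mul_decomposed(a, b, k=4):
    """Multiply two unsigned values using only products of their halves.

    With a = a_hi * 2^k + a_lo and b = b_hi * 2^k + b_lo:
    a * b = a_hi*b_hi * 2^(2k) + (a_hi*b_lo + a_lo*b_hi) * 2^k + a_lo*b_lo
    """
    a_hi, a_lo = split(a, k)
    b_hi, b_lo = split(b, k)
    return (
        ((a_hi * b_hi) << (2 * k))
        + ((a_hi * b_lo + a_lo * b_hi) << k)
        + a_lo * b_lo
    )


def matmul2x2(A, B, k=4):
    """2x2 matrix product whose element multiplies use the decomposition."""
    return [
        [sum(mul_decomposed(A[i][t], B[t][j], k) for t in range(2))
         for j in range(2)]
        for i in range(2)
    ]


# Hypothetical 8-bit operands, decomposed into 4-bit halves.
C = matmul2x2([[200, 3], [17, 255]], [[9, 128], [64, 1]])
```

In hardware terms, each of the four narrow products maps to a small multiplier and the shifts and sums to simple adders, which is the trade the abstract describes.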

18.
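To address the shortcomings of existing all-digital synchronous frequency multiplier structures based on PLLs/DLLs, this paper proposes an all-digital synchronous frequency multiplier based on a dual-loop structure. It consists of a delay-locked loop and a frequency-locked loop that share a common reference clock signal (FREF), and it requires no analog components. It can be designed in Verilog HDL and synthesized on an Altera DE2-70 development board, and it can easily be adapted to different FPGA families or implemented as an integrated circuit. It can also provide a clock reference for distributed digital processing systems and for intra-chip/inter-chip communication in systems-on-chip. Experimental results show that, compared with existing structures, the proposed structure achieves a higher-frequency output clock signal, better frequency resolution, better jitter performance, and a high multiplication factor.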

19.
An architecture is presented for real-time continuous speech recognition based on a modified hidden Markov model. The algorithm is adapted to the needs of continuous speech recognition by efficient encoding of the state space, and logarithmic encoding of the weights so that products can be computed as sums. The paper presents the algorithm and its application-related modifications, the mapping of the algorithm to a special-purpose architecture, and the detailed design of this architecture using configurable logic. Emphasis is placed on how the attributes of the algorithm are exploited in a configurable-logic-based design. A concrete design example is presented with a coprocessor engine having one large FPGA, 64 Mbytes of synchronous DRAM (SDRAM), a small FPGA as an SDRAM controller, and 2 Mbytes of SRAM. This engine, operating at 66 MHz, performs roughly nine times as fast as a high-end personal computer running a fully optimized version of the same algorithm.
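The point about logarithmic encoding can be shown with a small sketch: if each weight is stored as its logarithm, the product of weights along a path becomes a sum, which needs only adders. The abstract does not give the number format or log base used in the hardware, so the floating-point natural logarithm and the names `path_log_score` and `best_path` below are assumptions for illustration only.

```python
import math


def path_log_score(probs):
    """Score a path as the sum of log-weights.

    exp(path_log_score(p)) equals the product of the weights in p,
    so comparing sums ranks paths exactly as comparing products would.
    """
    return sum(math.log(p) for p in probs)


def best_path(paths):
    """Return the index of the path with the highest log-domain score."""
    return max(range(len(paths)), key=lambda i: path_log_score(paths[i]))


# Hypothetical transition/emission weights along three candidate paths.
candidates = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.6]]
chosen = best_path(candidates)
```

Working in the log domain also avoids the underflow that long products of small probabilities would otherwise cause.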

20.
The maximum a posteriori probability (MAP) algorithm has been widely used in Turbo decoding for its outstanding performance. However, it is very challenging to design high-speed MAP decoders because of inherent recursive computations. This paper presents two novel high-speed recursion architectures for MAP-based Turbo decoders. Algorithmic transformation, approximation, and architectural optimization are incorporated in the proposed designs to reduce the critical path. Simulations show that neither of the proposed designs has observable decoding performance loss compared to the true MAP algorithm when applied in Turbo decoding. Synthesis results show that the proposed Radix-2 recursion architecture can achieve processing speed comparable to that of the state-of-the-art recursion (Radix-4) architecture with significantly lower complexity, while the proposed Radix-4 architecture is 32% faster than the best existing design.
