Similar Documents
 20 similar documents found (search time: 31 ms)
1.
This paper presents a novel multi-Gb/s multi-mode LDPC decoder architecture and efficient design techniques for gigabit wireless communications. An efficient dynamic and fixed column-shifting scheme is presented for multi-mode architectures, and a novel low-complexity local switch is proposed to implement it. Furthermore, an efficient quantization method and the use of a one's-complement scheme instead of a two's-complement scheme are explored. The proposed decoder achieves very high throughput with minimal area overhead. Post-layout results using TSMC 65-nm CMOS technology show much better throughput, as well as better area and energy efficiency, compared to other multi-mode LDPC decoders.
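The abstract does not detail the quantization, so the following is only a rough Python sketch (bit width and function names are illustrative, not taken from the paper) of why a one's-complement representation can simplify fixed-point sign/magnitude handling in a decoder datapath: negating a value needs no carry chain, and the magnitude is recovered by a plain bit inversion.

```python
def to_ones_complement(x, bits=6):
    """Two's-complement integer -> one's-complement bit pattern (sketch).

    Negation in one's-complement is a pure bit inversion, so no carry
    chain is needed on the path that extracts sign and magnitude.
    """
    mask = (1 << bits) - 1
    return (~(-x)) & mask if x < 0 else x & mask

def ones_complement_magnitude(y, bits=6):
    """Magnitude of a one's-complement word: invert all bits if the sign bit is set."""
    mask = (1 << bits) - 1
    return (~y) & mask if (y >> (bits - 1)) & 1 else y
```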

2.
High-speed and low-area hardware architectures of the Whirlpool hash function are presented in this paper. A full look-up table (LUT) based design is shown to be the fastest way, in terms of logic, to implement the non-linear layer of the algorithm. An unrolled Whirlpool architecture implemented on the Virtex XC4VLX100 device achieves a throughput of 4.9 Gbps. This is faster than a SHA-512 design implemented on the same device and other previously reported hash function architectures. A low-area iterative architecture, which utilises 64-bit operations as opposed to full 512-bit operations, is also described. It runs at 430 Mbps and occupies 709 slices on a Virtex XC4VLX15, making it one of the smallest 512-bit hash function architectures currently available.
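As context for the LUT-versus-logic trade-off, here is a minimal behavioural sketch of the non-linear layer; the S-box contents below are placeholders rather than the real Whirlpool constants, and the point is only that a full-LUT realization is one 256-entry lookup per state byte instead of logic built from smaller mini-boxes.

```python
# Placeholder 256-entry table standing in for the real Whirlpool S-box
# (the actual constants are defined in the Whirlpool specification).
SBOX = list(range(256))

def nonlinear_layer(state: bytes) -> bytes:
    """Apply the byte-wise S-box to all 64 bytes of the 512-bit state."""
    assert len(state) == 64
    return bytes(SBOX[b] for b in state)
```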


4.
Two low-memory and high-performance architectures for the CCSDS 122.0-B-1 standard are proposed. They use novel memory organizations to reduce the total memory requirements so that the design fits in a single FPGA device. The architectures were implemented in radiation-hardened and commercial FPGA devices. Based on experimental results for the Virtex-5QV radiation-hardened device, the throughput is 135 MSamples/sec for images with 12 bits/pixel and horizontal resolution up to 8192 pixels. The proposed architectures also outperform the existing design in terms of memory requirements and area.

5.
Exploiting specific properties of the algorithm, a high-throughput pipelined architecture is introduced to implement the H.264/AVC deblocking filter. The architecture was synthesized in 0.18 μm technology; the clock frequency and area are 400 MHz and 16.8 Kgates, respectively. It is able to filter 217 and 55 frames per second (fps) for Full-HD and Ultra-HD video, respectively. The introduced architecture outperforms similar ones in terms of frequency (1.8× to 4×), throughput (1.5× to 3.8×), and frame rate. Moreover, extensions to support different sample bit-depths and chroma formats are included, and experimental results for different FPGA families are provided.

6.
The Secure Hash Algorithm is the most popular hash function currently used in many security protocols such as SSL and IPSec. Like other cryptographic algorithms, hardware implementation of hash functions is of great importance for high-speed applications. Because of the iterative structure of hash functions, a single error in their hardware implementation can result in a large number of errors in the final hash value. In this paper, we propose a novel time-redundancy-based fault diagnostic scheme for the SHA-1 and SHA-512 round computations. This scheme can detect permanent as well as transient faults, as opposed to the traditional time-redundancy technique, which can only detect transient errors. The proposed design does not impose significant timing overhead on the original SHA-1 and SHA-512 round computations. We have implemented the proposed design for SHA-1 and SHA-512 on a Xilinx xc2p7 FPGA. For the proposed fault-detecting SHA-1 and SHA-512 round computations, throughput is reduced by 3% and 10%, respectively, with 58% and 30% area overhead compared to the original designs. Fault simulation of the implementation shows that almost 100% coverage of transient and permanent faults can be achieved with the proposed scheme.
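The abstract does not spell out the operand transformation used in the redundant pass. The classic way time redundancy is made to catch permanent faults is to recompute with transformed (for example, shifted) operands, so the two passes exercise different bit positions of the same hardware; the sketch below illustrates that general idea on a 32-bit modular adder, one of the building blocks of a SHA round, and is not the paper's exact scheme.

```python
MASK32 = 0xFFFFFFFF

def add32(a, b):
    """Normal pass: 32-bit modular addition, a basic SHA round operation."""
    return (a + b) & MASK32

def add32_shifted(a, b, k=2):
    """Redundant pass: recompute with both operands shifted left by k bits.

    A stuck-at fault on a given adder bit corrupts different result bits
    in the two passes, so comparing them also exposes permanent faults,
    unlike plain repeat-and-compare time redundancy.
    """
    wide_mask = (1 << (32 + k)) - 1        # models a (32+k)-bit adder
    return (((a << k) + (b << k)) & wide_mask) >> k

def checked_add32(a, b):
    r1, r2 = add32(a, b), add32_shifted(a, b)
    if r1 != r2:
        raise RuntimeError("fault detected in adder")
    return r1
```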

7.
This paper proposes dual-mode and tri-mode dynamically configurable multiplier architectures for quadruple-precision arithmetic. The proposed dual-mode QPdDP multiplier architectures can either compute on a pair of quadruple precision (QP) operands or provide SIMD support for two parallel (dual) sets of double precision (DP) operands. The proposed tri-mode QPdDPqSP multiplier architectures additionally include four-parallel (quad) single precision (SP) processing along with the dual-DP and QP modes. For the largest underlying sub-component, the mantissa multiplier, two methods are analyzed for the dual-mode/tri-mode architectures: one based on the Karatsuba method, and another a proposed dual-mode/tri-mode radix-4 modified Booth (MB) multiplier. The proposed dual-mode/tri-mode MB multiplier requires only a few extra 2:1 MUXes as overhead compared to a simple MB multiplier. To support dual-mode/tri-mode operation, the other important sub-components of FP multiplication are also re-designed for multi-mode support. The proposed architectures are synthesized using UMC 90-nm ASIC technology and compared against prior literature in terms of area, period, and a unified metric "Area (gate count) × Period (FO4) × Latency × Throughput (in cycles)". The dual-mode/tri-mode FP architectures with MB mantissa multipliers show better timing, while those with Karatsuba mantissa multipliers occupy less area.
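As a rough sketch (split width and names are illustrative, not the paper's exact datapath), the one-level Karatsuba decomposition below shows why it suits a dual-mode mantissa multiplier: the full-width product is assembled from three roughly half-width multiplications, and in a dual-DP mode the two "outer" half-width multipliers can each compute an independent DP mantissa product.

```python
def karatsuba_one_level(a, b, half_bits=57):
    """Full-width product from three ~half-width multiplications.

    half_bits=57 is an illustrative split of a ~113-bit QP mantissa; in a
    dual-DP mode the p_hi and p_lo multipliers can instead each process
    an independent 53-bit DP mantissa pair.
    """
    mask = (1 << half_bits) - 1
    a_hi, a_lo = a >> half_bits, a & mask
    b_hi, b_lo = b >> half_bits, b & mask
    p_hi = a_hi * b_hi                                   # half-width multiply 1
    p_lo = a_lo * b_lo                                   # half-width multiply 2
    p_mid = (a_hi + a_lo) * (b_hi + b_lo) - p_hi - p_lo  # half-width multiply 3
    return (p_hi << (2 * half_bits)) + (p_mid << half_bits) + p_lo
```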

8.
Random linear network coding is an efficient technique for disseminating information in networks, but it is highly susceptible to errors. Kötter–Kschischang (KK) codes and Mahdavifar–Vardy (MV) codes are two important families of subspace codes that provide error control in noncoherent random linear network coding. List decoding has been used to decode MV codes beyond half the minimum distance. Existing hardware implementations of the rank-metric decoder for KK codes suffer from limited throughput, long latency and high area complexity, and the interpolation-based list-decoding algorithm for MV codes still has high computational complexity whose feasibility for hardware implementation has not been investigated. In this paper we propose efficient decoder architectures for both KK and MV codes and present their hardware implementations. Two serial architectures are proposed for KK and MV codes, respectively, and an unfolded decoder architecture, which offers high throughput, is also proposed for KK codes. The synthesis results show that the proposed architectures for KK codes are much more efficient than existing rank-metric decoder architectures, and demonstrate that the proposed decoder architecture for MV codes has affordable complexity.

9.
This paper presents a new set of techniques for hardware implementations of Secure Hash Algorithm (SHA) hash functions. These techniques consist mostly of operation rescheduling and hardware reuse, significantly decreasing the critical path and the required area. Throughputs from 1.3 Gbit/s to 1.8 Gbit/s were obtained for the SHA implementations on a Xilinx Virtex-II Pro. Compared to commercial cores and previously published research, these figures correspond to an improvement in throughput/slice of 29% to 59% for SHA-1 and 54% to 100% for SHA-2. Experimental results on hybrid hardware/software implementations of the SHA cores show speedups of up to 150 times compared to pure software implementations.

10.
The block decoder (BD), an indispensable component of the JPEG 2000 image compression standard, has the highest computational complexity and determines the speed of the overall decoder. This paper proposes a high-throughput pass-parallel BD architecture that can decode more than one bit per clock cycle. In a BD, the dependency between the context-generation and arithmetic-decoding units introduces stalls and reduces decoding throughput. The proposed selective byte input and synchronous sample skipping techniques prevent stalling in the decoding process. The proposed architecture achieves 86% higher throughput at a 50% increase in hardware cost compared with the best available serial BD architecture; compared with the best available pass-parallel architecture, throughput improves almost 8.2 times at a 61% increase in hardware cost. The speed-up techniques incorporated in the design are the main reason for the higher hardware consumption. The figure of merit of the proposed design, defined as the ratio of throughput to hardware cost, exceeds that of the available BD architectures for a typical code block (CB) size of 32 × 32. The ASIC implementation of the proposed design consumes 66 mW at the maximum operating frequency.

11.
We present low-area and low-power semi-systolic array architectures for polynomial basis multiplication over GF(2^m) using the Progressive Multiplier Reduction (PMR) technique. These architectures are derived by applying linear and nonlinear scheduling techniques to the polynomial multiplication algorithm. The nonlinear techniques allow the designer to control the processor workload and reduce inter-processor communication. The resulting semi-systolic architectures have a simple structure with local communication. ASIC implementations of our designs and of comparable published designs show that the proposed scalable semi-systolic structures have lower area complexity (by 56.8–94.6%) and power consumption (by 55.2–84.2%), except against a scalable design published by the same authors; however, one of the proposed scalable designs outperforms that design in throughput by 73.8%. This makes the proposed designs suited to embedded applications that require low power consumption and moderate speed.
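As a behavioural reference for what the arrays compute (the actual PMR scheduling onto processing elements is more involved), the sketch below performs bit-serial polynomial-basis multiplication with reduction interleaved into the accumulation, so the intermediate degree never exceeds m.

```python
def gf2m_multiply(a, b, f, m):
    """Polynomial-basis multiplication in GF(2^m) with interleaved reduction.

    a, b : field elements encoded as integers (bit i = coefficient of x^i)
    f    : irreducible polynomial of degree m (bit m set)
    """
    r = 0
    for i in reversed(range(m)):   # process multiplier bits MSB-first
        r <<= 1                    # r <- r * x
        if (b >> i) & 1:
            r ^= a                 # accumulate partial product (XOR = GF(2) add)
        if (r >> m) & 1:
            r ^= f                 # reduce immediately; degree stays below m
    return r

# Example in GF(2^3) with f = x^3 + x + 1:
# gf2m_multiply(0b110, 0b011, 0b1011, 3) == 0b001
```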

12.
An SHA-1 Implementation Based on a Loop-Unrolled Structure
In information security, hash algorithms are mainly used for data-integrity verification and signature authentication. Based on an in-depth analysis of the SHA-1 algorithm, a hardware scheme for its fast implementation is proposed. The scheme restructures the iteration of the standard algorithm to reduce the number of clock cycles needed to process a message, thereby increasing throughput. Compared with other IP cores, the design shows clear advantages in area, frequency and throughput.
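The abstract does not give the unrolling factor; the sketch below (a behavioural model, not the paper's RTL) shows the standard SHA-1 round update and a 2x-unrolled step that performs two rounds per iteration, the kind of restructuring that trades a longer combinational path for fewer clock cycles per 512-bit block.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK32

def sha1_round(state, f, k, w):
    """One SHA-1 compression round; f is the round group's Boolean function."""
    a, b, c, d, e = state
    t = (rotl32(a, 5) + f(b, c, d) + e + k + w) & MASK32
    return (t, a, rotl32(b, 30), c, d)

def sha1_two_rounds(state, f, k, w0, w1):
    """2x-unrolled step: two rounds per iteration halves the cycle count of
    the 80-round loop at the cost of a longer critical path (f and k are
    constant within each group of 20 rounds)."""
    return sha1_round(sha1_round(state, f, k, w0), f, k, w1)
```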

13.

The Internet of Things (IoT) deploys a wide range of technologies including wireless sensor networks, machine-to-machine communication, robots, internet technologies, and smart devices. IoT is a novel phenomenon in the IT world wherein objects can transmit data and interact through internet or intranet networks. The most important and crucial issue in the IoT, however, is privacy and data security. The objective of this paper is to create a new encryption model for the data storage servers of IoT-based irrigation systems. A hybrid encryption algorithm based on Elliptic Curve Cryptography (ECC), RC4, and SHA-256 is therefore proposed to protect the sensitive data of such systems. The proposed model uses ECC to strengthen RC4: the RC4 XOR operation is performed with a key that has been encrypted by ECC and shifted right, and the resulting data are then processed with SHA-256 to ensure security. Simulation results indicate that encryption and decryption times in the proposed model are shorter than in other models such as XXTEA & ECC, XXTEA & RSA, ECC&3DES&SHA-256, RC4&3DES&SHA-256, AES&RC4&SHA-256, AES&3DES&SHA-256, RC4&AES&SHA-256, RC2&3DES&SHA-256, and ECC&RC2&SHA-256, by 43.39%, 66.03%, 45.28%, 54.71%, 50.94%, 33.96%, 33.62%, 24.52%, and 15.09%, respectively.
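The abstract leaves the ECC step and the exact key handling unspecified, so the sketch below assumes the ECC-protected session key is already available as bytes and shows only the RC4-encrypt-then-SHA-256 part of such a hybrid flow; the function names (rc4_keystream, encrypt_and_tag) are illustrative, not from the paper.

```python
import hashlib

def rc4_keystream(key: bytes, n: int) -> bytes:
    """Generate n bytes of RC4 keystream from the given key."""
    # Key-scheduling algorithm (KSA)
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm (PRGA)
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)

def encrypt_and_tag(session_key: bytes, plaintext: bytes):
    """RC4-encrypt the payload, then attach a SHA-256 digest of the ciphertext.

    session_key is assumed to have been protected/derived via ECC upstream.
    """
    keystream = rc4_keystream(session_key, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, keystream))
    tag = hashlib.sha256(ciphertext).digest()
    return ciphertext, tag
```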


14.
The design of Fast Fourier Transform (FFT) integrated architectures for System-on-Chip (SoC) telecom applications is addressed in this paper. After reviewing the FFT processing requirements of wireless and wired Orthogonal Frequency Division Multiplexing (OFDM) standards, including the emerging Multiple-Input Multiple-Output (MIMO) and OFDM Access (OFDMA) schemes, three FFT architectures are proposed: a fully parallel, a pipelined cascade, and an in-place variable-size architecture, which offer different trade-offs among flexibility, processing speed and complexity. Silicon implementation results and comparisons with the state of the art show that each macrocell outperforms known works for its target application. The fully parallel architecture is optimized for throughput requirements up to several GSamples/s, enabling Ultra-Wideband (UWB) communications using all channels foreseen in the standard. The pipelined cascade macrocell minimizes complexity for large FFT sizes while sustaining throughput up to 100 MSamples/s. The in-place variable-size FFT macrocell stands out for its flexibility, allowing the run-time reconfigurability required in OFDMA schemes while attaining the throughput needed to support MIMO communications. The three architectures are also compared on common case studies and a common target technology.
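The proposed macrocells implement far more elaborate mixed-radix datapaths; purely as a reference for the computation being mapped to hardware, here is a textbook radix-2 decimation-in-time FFT sketch.

```python
import cmath

def fft_radix2(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of two.

    Each recursion level corresponds to one butterfly stage; fully
    parallel, pipelined-cascade and in-place architectures differ in how
    these stages are laid out in hardware and reused over time.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + twiddled[k] for k in range(n // 2)] +
            [even[k] - twiddled[k] for k in range(n // 2)])
```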

15.
Low-density parity-check (LDPC) codes and convolutional Turbo codes are two of the most powerful error correcting codes that are widely used in modern communication systems. In a multi-mode baseband receiver, both LDPC and Turbo decoders may be required. However, the different decoding approaches for LDPC and Turbo codes usually lead to different hardware architectures. In this paper we propose a unified message-passing algorithm for LDPC and Turbo codes and introduce a flexible soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and Turbo code decoding, viewing the LDPC code as a concatenation of n super-codes, where each super-code has a simpler trellis structure so that the MAP algorithm can be easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of LDPC and Turbo codes with a low hardware overhead (about 15% area and timing overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder architecture to support LDPC/Turbo decoding. Multiple such SISO modules can be embedded into a parallel decoder for higher decoding throughput. As a case study, a flexible LDPC/Turbo decoder has been synthesized on a TSMC 90 nm CMOS technology with a core area of 3.2 mm². The decoder can support IEEE 802.16e LDPC codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Running at a 500 MHz clock frequency, the decoder can sustain up to 600 Mbps LDPC decoding or 450 Mbps Turbo decoding.

16.
A variation of the least mean squares (LMS) algorithm, called the delayed LMS (DLMS) algorithm, is ideally suited for highly pipelined, adaptive digital filter implementations. In this paper, we present an efficient method to determine the delays in the DLMS filter. Furthermore, in order to achieve fully pipelined circuit architectures for FPGA implementation, we transfer these delays using retiming. The method has been used to derive a series of retimed delayed LMS (RDLMS) architectures, which allow a 66.7% reduction in delays and 5 times faster convergence, thereby giving superior throughput compared to previous work. Three circuit architectures and three hardware-shared versions are presented, which have been implemented using Virtex-II FPGA technology, resulting in a throughput rate of 182 Msample/s.
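A minimal NumPy sketch of the DLMS recursion (parameter values are illustrative): the only difference from plain LMS is that the coefficient update uses the error and tap vector from `delay` samples earlier, and tolerating that delay is what allows the multiply-accumulate path to be deeply pipelined.

```python
import numpy as np

def dlms_filter(x, d, taps=8, mu=0.01, delay=4):
    """Delayed LMS adaptive filter: w(n+1) = w(n) + mu * e(n-D) * x(n-D)."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    w = np.zeros(taps)
    e = np.zeros(len(x))
    for n in range(taps - 1, len(x)):
        u = x[n - taps + 1:n + 1][::-1]         # current tap vector
        e[n] = d[n] - w @ u                     # filtering error at time n
        nd = n - delay
        if nd >= taps - 1:
            ud = x[nd - taps + 1:nd + 1][::-1]  # tap vector from `delay` samples ago
            w = w + mu * e[nd] * ud             # update with the delayed error/data
    return w, e
```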

17.
This paper analyses different VLSI architectures for 3GPP LTE/LTE-Advanced turbo decoders in terms of throughput and area requirements. Data flow graphs for the standard SISO MAP (maximum a posteriori) turbo decoder, the sliding-window (SW) SISO MAP turbo decoder, and the parallel-window (PW) SISO MAP turbo decoder are presented and their performance analysed. Two variants of the quadratic permutation polynomial (QPP) interleaver are proposed, which simplify the implementation of the 'mod' operator and provide the best compromise between area, delay and power dissipation; the implementation of a decoder using one of these variants is also discussed. A novel area-optimisation approach is proposed to reduce the number of interleavers required for the parallel-window turbo decoder, and multi-port memory is used for the parallel turbo decoder. To increase throughput without any effective increase in area, circuit-level pipelining and retiming are applied. The proposed architectures have been synthesised with Synopsys Design Compiler using 45-nm CMOS technology.
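The abstract does not define the two interleaver variants; for reference, the sketch below shows the standard LTE QPP permutation and the well-known recursive form in which the general 'mod' reduces to an addition plus a conditional subtraction, which is the kind of simplification such variants target.

```python
def qpp_interleave(K, f1, f2):
    """3GPP LTE QPP interleaver: pi(i) = (f1*i + f2*i^2) mod K."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]

def qpp_interleave_recursive(K, f1, f2):
    """Same permutation computed recursively.

    pi(i+1) = (pi(i) + g(i)) mod K with g(i+1) = (g(i) + 2*f2) mod K; since
    both operands are already below K, the modulo is just a conditional
    subtraction in hardware.
    """
    pi, g = 0, (f1 + f2) % K
    out = []
    for _ in range(K):
        out.append(pi)
        pi = (pi + g) % K
        g = (g + 2 * f2) % K
    return out

# Consistency check with an illustrative parameter set:
# assert qpp_interleave(40, 3, 10) == qpp_interleave_recursive(40, 3, 10)
```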

18.
Iterative decoders such as turbo decoders have become integral components of modern broadband communication systems because of their ability to provide substantial coding gains. A key computational kernel in iterative decoders is the maximum a posteriori probability (MAP) decoder. The MAP decoder is recursive and complex, which makes high-speed implementations extremely difficult to realize. In this paper, we present block-interleaved pipelining (BIP) as a new high-throughput technique for MAP decoders. An area-efficient symbol-based BIP MAP decoder architecture is proposed by combining BIP with the well-known look-ahead computation. These architectures are compared with conventional parallel architectures in terms of speed-up, memory and logic complexity, and area. Compared to the parallel architecture, the BIP architecture provides the same speed-up with a reduction in logic complexity by a factor of M, where M is the level of parallelism. The symbol-based architecture provides a speed-up in the range from 1 to 2 with a logic complexity that grows exponentially with M and a state metric storage requirement that is reduced by a factor of M as compared to a parallel architecture. The symbol-based BIP architecture provides speed-up in the range M to 2M with an exponentially higher logic complexity and a reduced memory complexity compared to a parallel architecture. These high-throughput architectures are synthesized in a 2.5-V 0.25-µm CMOS standard cell library and post-layout simulations are conducted. For turbo decoder applications, we find that the BIP architecture provides a throughput gain of 1.96 at the cost of 63% area overhead. For turbo equalizer applications, the symbol-based BIP architecture enables us to achieve a throughput gain of 1.79 with an area savings of 25%.

19.
In this work, novel results concerning Network-on-Chip-based turbo decoder architectures are presented. Building on previous publications, the work first concentrates on improving throughput by exploiting adaptive bandwidth-reduction techniques; in the best case this yields an improvement of more than 60 Mb/s. Moreover, double-binary turbo decoders are known to require more area than binary ones, which has the negative effect of increasing the data width of the network nodes. The second contribution of this work is therefore to reduce the network complexity needed to support double-binary codes by exploiting bit-level and pseudo-floating-point representations of the extrinsic information. These two techniques allow an area reduction of more than 40% in the best case, with a performance degradation of about 0.2 dB.

20.
This paper presents four novel area-efficient field-programmable gate array (FPGA) bit-parallel architectures for finite impulse response (FIR) filters that directly support symmetric signal extension when processing finite-length signals at their boundaries. The key is a careful use of variable-depth shift registers, which are implemented efficiently in Xilinx FPGAs as shift register logic (SRL) components. Comparisons with the conventional FIR filter architecture with symmetric boundary processing show considerable area savings, especially for long filters. For instance, our implementation of the 8-tap low-pass Daubechies-8 FIR filter achieves about a 30% reduction in area (in terms of slices) compared to the conventional architecture while maintaining the same throughput. Two of the four architectures are dedicated to the special case of symmetric FIR filters. The first is highly area-efficient but requires a clock frequency doubler; while this reduces the overall processing speed (by up to a factor of 2), it maintains a high throughput, and the speed penalty is cancelled in bi-phase filters, which are widely used in multirate architectures (e.g., wavelets). The second symmetric FIR filter architecture saves less logic than the first (e.g., 10% with the 9-tap low-pass Biorthogonal 9/7 symmetric filter instead of 37% with the first architecture) but avoids its speed penalty, matching the throughput of the conventional architecture.
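As a behavioural sketch of what symmetric signal extension at the boundaries means for the filter output (the FPGA architectures realize this with variable-depth shift registers rather than index arithmetic), the code below assumes whole-sample symmetry and an odd-length filter such as the Biorthogonal 9/7 low-pass filter; the exact extension type depends on the filter.

```python
def fir_symmetric_extension(x, h):
    """FIR filtering of a finite-length signal with whole-sample symmetric
    extension at both ends: x[-k] = x[k] and x[N-1+k] = x[N-1-k].

    Assumes len(h) is odd and much shorter than len(x), so a single
    reflection at each boundary is sufficient.
    """
    N, L = len(x), len(h)
    c = (L - 1) // 2                        # centre the filter on each sample

    def ext(i):
        if i < 0:
            i = -i                          # reflect at the left boundary
        if i >= N:
            i = 2 * (N - 1) - i             # reflect at the right boundary
        return x[i]

    return [sum(h[k] * ext(n + c - k) for k in range(L)) for n in range(N)]
```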
