1.

郭福顺李莲治《小型微型计算机系统》1996,17(4):25-30

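This paper studies how divide-and-conquer algorithms can be used to solve problems in parallel on distributed systems, and presents metrics and an analysis method for the effectiveness of parallel algorithms.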

2.

吴良晶曹云峰丁萌庄丽葵《单片机与嵌入式系统应用》2016,(11):58-62

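A hardware acceleration scheme based on an Altera SoC FPGA is proposed. The scheme provides a transfer channel that lets vision algorithms running under Linux on the ARM side be accelerated by the FPGA. First, image data on the ARM side is transferred to the SDRAM on the FPGA side; next, the relevant FPGA IP cores are directed to read the data from SDRAM; the vision-algorithm IP core then receives the image data and accelerates its processing; finally, the processed image data is returned to the Linux system through a dedicated IP core. Experiments verify the feasibility, reliability, and acceleration performance of the scheme.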

3.

Multi-core DSP implementation of normalized cross-correlation gray-level image matching

刘毅飞张旭明丁明跃《计算机应用》2011,31(12):3334-3336

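To meet the high processor-performance demands of image processing, the gray-level normalized cross-correlation (NCC) matching algorithm is taken as an example and implemented on a high-performance, low-power multi-core digital signal processor (DSP) system. Because NCC slides the template over the source image pixel by pixel and computes the correlation at each position, the search area is divided into six parts that the six cores of the TMS320C6472 search in parallel. The multi-core DSP NCC image matching algorithm was implemented with different image and template sizes and with the images stored in different memory locations. Experimental results show that the multi-core DSP retains the high-speed signal and image processing characteristics of a DSP while supporting multi-core parallel processing through inter-core task allocation suited to the algorithm. For NCC gray-level image matching, the six-core TMS320C6472 achieves nearly six times the performance of a single core, and for larger images it also shows some speedup relative to a PC.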
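The partitioning idea in this abstract can be sketched in a few lines. The code below is only an illustration, not the authors' DSP code: it assumes the zero-mean form of NCC, splits the candidate rows into six strips, and searches the strips one after another where the paper would assign one strip per core. The function names `ncc`, `search_strip`, and `match_partitioned` are hypothetical.

```python
import math

def ncc(image, template, top, left):
    """Zero-mean normalized cross-correlation of the template with one image patch."""
    h, w = len(template), len(template[0])
    patch = [image[top + r][left + c] for r in range(h) for c in range(w)]
    tmpl = [v for row in template for v in row]
    mean_p = sum(patch) / len(patch)
    mean_t = sum(tmpl) / len(tmpl)
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, tmpl))
    den = math.sqrt(sum((p - mean_p) ** 2 for p in patch)
                    * sum((t - mean_t) ** 2 for t in tmpl))
    return num / den if den > 0 else 0.0

def search_strip(image, template, row_start, row_end):
    """Best (score, top, left) over candidate rows row_start .. row_end - 1."""
    cols = len(image[0]) - len(template[0]) + 1
    best = (-2.0, -1, -1)  # any real NCC score is >= -1
    for top in range(row_start, row_end):
        for left in range(cols):
            score = ncc(image, template, top, left)
            if score > best[0]:
                best = (score, top, left)
    return best

def match_partitioned(image, template, parts=6):
    """Split the candidate rows into `parts` strips and keep the best strip result.

    Each strip is independent, so on a multi-core device each call to
    search_strip could run on its own core (assumed mapping, run serially here).
    """
    rows = len(image) - len(template) + 1
    bounds = [rows * k // parts for k in range(parts + 1)]
    results = [search_strip(image, template, bounds[k], bounds[k + 1])
               for k in range(parts)]
    return max(results, key=lambda r: r[0])
```

Because the strips do not overlap and share no state, the combined result equals that of a single full search, which is what makes the near-linear speedup reported above plausible.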

4.

Research on an FPGA-based variable M/T normalized speed measurement algorithm

李小闯于强《计算机应用与软件》2019,36(9)

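To address speed measurement for permanent magnet synchronous motors, a variable M/T normalized speed measurement algorithm for incremental encoders is proposed, and the algorithm is designed and implemented on a Xilinx FPGA board-level system. After analyzing the limitations of the M/T algorithm in the low-speed range, a speed prediction unit is added to guide the choice of the number of counted pulses in real time. To make the algorithm easier to port, a normalized design approach is introduced. Because the algorithm uses many multipliers, the Booth algorithm is introduced to reduce the overall design's dependence on the FPGA's on-chip DSP blocks. Simulation results show that the speed measurement error of the FPGA-based variable M/T normalized algorithm is within ±0.4 r/min and the response time does not exceed 3 ms.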
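For readers unfamiliar with the M/T method, the classic formula is easy to state in code. The sketch below shows only the textbook M/T calculation, not the paper's variable or normalized version. `pulses_for_gate` is a hypothetical helper that illustrates how a predicted speed might guide the pulse count; the abstract does not describe the actual rule.

```python
def mt_speed_rpm(m1, m2, clock_hz, pulses_per_rev):
    """Classic M/T speed estimate in r/min.

    m1: encoder pulses counted during the gate
    m2: high-frequency clock pulses counted over the same gate
    clock_hz: frequency of that clock
    pulses_per_rev: encoder pulses per mechanical revolution
    """
    if m2 <= 0 or pulses_per_rev <= 0:
        raise ValueError("m2 and pulses_per_rev must be positive")
    # Gate time is m2 / clock_hz seconds; revolutions are m1 / pulses_per_rev.
    return 60.0 * clock_hz * m1 / (pulses_per_rev * m2)

def pulses_for_gate(predicted_rpm, target_gate_s, pulses_per_rev):
    """Hypothetical rule: pick m1 so the gate lasts about target_gate_s seconds."""
    expected = predicted_rpm / 60.0 * pulses_per_rev * target_gate_s
    return max(1, round(expected))  # always wait for at least one encoder pulse
```

With a 1000-pulse encoder and a 1 MHz clock, 50 encoder pulses over 10000 clock pulses correspond to a 10 ms gate and 300 r/min. At low speed few encoder pulses arrive per gate, which is the limitation the abstract refers to.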

5.

A decorrelation variable step-size normalized LMS adaptive algorithm

龙剑友王梓展夏舜晖段正华《计算机工程与科学》2006,28(4):60-62

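This paper analyzes variable step-size (VSS) adaptive filtering algorithms. To address the performance degradation caused by slower convergence when the input signal is highly correlated, a decorrelation variable step-size normalized LMS adaptive algorithm is proposed; it introduces the decorrelation principle so that the algorithm maintains good convergence performance. Computer simulation results agree with the theoretical analysis.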

6.

A low-power MobileNetV2 recognition system with FPGA acceleration

孙小坚林瑞全方子卿马驰《计算机测量与控制》2023,31(5):221-227

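In recent years, convolutional neural networks (CNNs) have been widely applied in fields such as image recognition, speech recognition and translation, and autonomous driving because of their excellent performance. However, conventional CNNs have many parameters and a heavy computational load, and when deployed on CPUs and GPUs they suffer from slow inference and high power consumption. To address these problems, quantization-aware training (QAT) is used to compress the total network parameters to 1/4 of the original network while preserving image classification accuracy. All network weights are placed in on-chip FPGA resources, which overcomes the off-chip memory bandwidth limit and reduces the power spent on off-chip memory access. A cooperative pipeline structure is proposed within the layers of MobileNetV2 and between adjacent pointwise convolution layers, greatly improving the real-time performance of the network. A memory and data-read optimization strategy is proposed that adjusts the data storage layout and read order according to the degree of parallelism, further saving on-chip BRAM. Finally, a high-performance, low-power lightweight MobileNetV2 recognition system is implemented on a Xilinx Virtex-7 VC707 development board. With a 200 MHz clock it reaches a throughput of 170.06 GOP/s at only 6.13 W, for an energy efficiency of 27.74 GOP/s/W, 92 times that of the CPU and 25 times that of the GPU, a clear advantage over other implementations.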

7.

Design and FPGA implementation of a fast Burg algorithm with adaptive order selection

郭鸣晗陈立平张浩赵坤柏伟《电子技术应用》2021,47(11):62-67+72

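To meet the real-time requirements of signal spectrum analysis, a hardware acceleration circuit is designed for a fast Burg algorithm with adaptive order selection that suits short sequences. Experiments use an FPGA platform. Combining the fast Burg algorithm with the final prediction error (FPE) criterion allows the order of the autoregressive (AR) parameters to be determined adaptively. A flexibly controlled parallel two-stage pipeline structure and parallelized computing units are implemented, and the storage units are optimized to balance speed and area. Experimental results show that the algorithm accurately estimates signal frequency even for short sequences; compared with a hardware implementation of the Burg algorithm, it reduces computation time by 75% and also saves memory, meeting the design requirements.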

8.

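A normalized variable step-size LMS algorithm based on the exponential function (total citations: 1; self-citations: 0; citations by others: 1)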

杨逸曹祥玉杨群《计算机工程》2012,38(10):134-136

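Building on the normalized least mean square (NLMS) algorithm, a variable step-size LMS algorithm based on the exponential function is proposed. By establishing a functional relationship between the error and the step size, the step size is adjusted in real time, and the input signal is decorrelated in the time domain, resolving the conflict between steady-state misadjustment and convergence speed. Simulation results show that, compared with the conventional LMS, SVS_LMS, NLMS, and hyperbolic-tangent variable step-size LMS algorithms, the proposed algorithm converges faster and has a smaller steady-state misadjustment.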
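The abstract does not give the step-size function, so the following is only a sketch of the general idea: an NLMS update whose step size is an exponential function of the instantaneous error. The form `mu = beta * (1 - exp(-alpha * e^2))` and the values of `alpha` and `beta` are assumptions chosen for illustration, and the time-domain decorrelation step mentioned in the abstract is not included.

```python
import math

def vss_nlms(x, d, taps, alpha=50.0, beta=1.0, eps=1e-8):
    """NLMS with an error-dependent step size (assumed exponential form).

    x: input samples, d: desired samples, taps: filter length.
    Returns the final weight vector.
    """
    w = [0.0] * taps
    buf = [0.0] * taps                    # most recent input first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))
        e = dn - y
        # Large error gives a step near beta; small error gives a step near 0.
        mu = beta * (1.0 - math.exp(-alpha * e * e))
        norm = eps + sum(b * b for b in buf)   # NLMS input-power normalization
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
    return w
```

A step size that grows with the error moves quickly while the filter is far from the solution and shrinks near convergence, which is the trade-off between convergence speed and steady-state misadjustment that the abstract describes.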

9.

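Research progress on batch normalization and related algorithms in deep learning (total citations: 4; self-citations: 0; citations by others: 4)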

刘建伟赵会丹罗雄麟许鋆《自动化学报》2020,46(6):1090-1120

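Deep learning has been widely applied in fields such as computer vision and natural language processing, with results clearly better than those of earlier machine learning algorithms. With the rapid development of information technology, training data sets keep growing and deep neural networks keep getting larger, which makes training increasingly difficult and leaves both speed and accuracy in need of improvement. In 2015, Ioffe et al. pointed out a serious problem in training deep neural networks, internal covariate shift, which makes training sensitive to initial parameter values and slows convergence, and they proposed batch normalization (BN) to reduce internal covariate shift and speed up convergence. Many networks now use BN as an important means of accelerating training. Given the practical value of BN, this paper systematically reviews research progress on BN and related algorithms. It first analyzes the principle of BN in detail. Although simple and practical, BN has some problems, such as its dependence on the mini-batch size and the different handling of data during training and inference, so many researchers have proposed related structures and algorithms; the paper analyzes and summarizes their principles, advantages, and the main problems they solve. It then summarizes how BN is applied in various neural network areas and surveys other common techniques for improving training performance. Finally, it concludes and discusses future research directions for BN.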
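The training/inference difference mentioned in this abstract is easiest to see in code. The following is a minimal pure-Python sketch of the BN forward pass for a batch of feature vectors. The exponential running average with `momentum` is a common implementation convention assumed here, and the function names are illustrative.

```python
import math

def batch_norm_train(batch, gamma, beta, running, momentum=0.1, eps=1e-5):
    """Normalize each feature with mini-batch statistics, then scale and shift.

    batch: list of feature vectors; running: dict with "mean" and "var" lists
    that is updated in place for later use at inference time.
    """
    n, dim = len(batch), len(batch[0])
    mean = [sum(x[j] for x in batch) / n for j in range(dim)]
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / n for j in range(dim)]
    out = [[gamma[j] * (x[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
            for j in range(dim)] for x in batch]
    running["mean"] = [(1 - momentum) * r + momentum * m
                       for r, m in zip(running["mean"], mean)]
    running["var"] = [(1 - momentum) * r + momentum * v
                      for r, v in zip(running["var"], var)]
    return out

def batch_norm_infer(x, gamma, beta, running, eps=1e-5):
    """At inference, use the stored running statistics instead of batch statistics."""
    return [gamma[j] * (x[j] - running["mean"][j])
            / math.sqrt(running["var"][j] + eps) + beta[j]
            for j in range(len(x))]
```

The batch statistics become noisy when the mini-batch is small, which is the batch-size dependence the abstract lists as one of BN's problems.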

10.

Research on an iris texture normalization algorithm

穆伟斌金成陈大同《网络安全技术与应用》2014,(2):53-54

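An existing iris normalization algorithm is improved. Iris images of different sizes captured in the Cartesian coordinate system are transformed into polar-coordinate images of uniform size, and bilinear interpolation is used during the interpolation step to correct the normalized iris image, yielding a fairly good normalized iris image.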
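The Cartesian-to-polar mapping with bilinear interpolation can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' improved algorithm: the pupil and iris boundaries are taken to be concentric circles with known centre and radii, and the names and grid sizes are hypothetical.

```python
import math

def bilinear(image, x, y):
    """Sample a gray image (list of rows) at real-valued (x, y)."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)         # clamp to the image border
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def unwrap_iris(image, cx, cy, r_pupil, r_iris, radial=8, angular=32):
    """Map the ring between two concentric circles to a radial x angular grid.

    The output size depends only on `radial` and `angular`, so irises of
    different sizes are normalized to the same dimensions.
    """
    out = []
    for i in range(radial):
        r = r_pupil + (r_iris - r_pupil) * i / (radial - 1)
        row = []
        for j in range(angular):
            theta = 2.0 * math.pi * j / angular
            row.append(bilinear(image,
                                cx + r * math.cos(theta),
                                cy + r * math.sin(theta)))
        out.append(row)
    return out
```

Sample points rarely fall on integer pixel positions, so each output value blends the four surrounding pixels instead of taking the nearest one, which reduces jagged artifacts in the unwrapped strip.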

11.

Hardware acceleration design and FPGA implementation of YOLOv3-tiny

陈浩敏姚森敬席禹张凡辛文成王龙海任超《计算机工程与科学》2021,43(12):2139-2149

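YOLOv3-tiny has excellent object detection capability, but the model still requires considerable computing power, which makes it hard to use in embedded applications. A hardware acceleration method for YOLOv3-tiny is proposed and implemented on an FPGA platform. First, for fixed-point design, with data precision and resource consumption as the design metrics, different fixed-point strategies are proposed based on statistics of the data distribution in the model and a classification of data types. Second, for parallel design, based on an analysis of the computational characteristics of convolutional neural networks, a scalable architecture of common hardware computing units is designed using loop reordering, loop tiling, loop unrolling, and array partitioning. Third, for pipelined design, both inter-layer and intra-layer aspects are studied, and a flexible pipelined computing architecture is designed based on the inter-layer data flow direction and intra-layer task partitioning. Finally, experiments on the XILINX XC7Z020CLG400-1 platform show a speedup of up to 290.56 over a 667 MHz single-core ARM-A9 processor.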

12.

A hardware centric algorithm for the best matching unit searching stage of the SOM-based quantizer and its FPGA implementation

W. Kurdthongmee 《Journal of Real-Time Image Processing》2016,12(1):71-80

Parts of a self-organizing map (SOM)-based quantizer can be performed in parallel, i.e., distance calculation between an input pixel and a group of codewords or processing elements (PEs), and updating codewords. To search for the best matching unit (BMU) whose distance is the minimum, all distances must inevitably be compared with each other. A group of comparators and registers can be instantiated with equal size to the distances (which is equivalent to the number of PEs) and operated in a multistage manner to come up with the minimum distance and its index. In this way, the algorithm requires n = log₂ C clock cycles, where C is the number of PEs and \(\sum\nolimits_{k=0}^{n-1}{2^k}\) is the number of comparators and registers. In this paper, we propose a novel hardware centric algorithm with the objective to accelerate the BMU searching stage of the SOM-based quantizer. In a simple form, the algorithm relies on using a PE’s distance as an address of a memory to store its index. Simultaneously with storing indices of all PEs, the states of all ‘non-empty’ addresses within the memory are prepared. In this way, it can be stated that the position of the first non-empty state corresponds to the memory address whose content is the BMU index. The approach to find the first position of the non-empty state within a single clock cycle is also detailed. The algorithm is also adapted to make it more feasible to realize on an FPGA platform. The synthesis results compared with the conventional BMU searching indicate that the FPGA resource requirements of the algorithm are 1.8 and 1.57 times in terms of slices and LUT usages, respectively. In terms of acceleration, the algorithm outperforms the conventional ones by a factor of 1.8 for a test image of size 512 × 512 pixels.
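A software analogue of the address-based search described above might look like the sketch below. It assumes non-negative integer distances bounded by `max_distance` and keeps the first PE on ties; both are choices made here, since the abstract does not say how ties or the distance range are handled. The linear scan stands in for the single-cycle "first non-empty position" logic of the hardware.

```python
def bmu_by_address(distances, max_distance):
    """Return the index of the PE with the smallest distance.

    Each distance is used as a memory address, and the PE index is stored there.
    The first non-empty address then holds the BMU index.
    """
    memory = [None] * (max_distance + 1)   # one slot per possible distance
    for index, d in enumerate(distances):
        if memory[d] is None:              # keep the first PE on ties
            memory[d] = index
    for slot in memory:                    # hardware would do this in one cycle
        if slot is not None:
            return slot
    raise ValueError("no distances given")

def comparator_tree_cost(num_pes):
    """Stages and comparators of the conventional multistage minimum search."""
    stages = (num_pes - 1).bit_length()    # equals log2(C) when C is a power of two
    comparators = sum(2 ** k for k in range(stages))
    return stages, comparators
```

For C = 16 PEs the conventional tree needs 4 stages and 15 comparators, matching n = log₂ C and the sum of 2^k for k from 0 to n-1 quoted in the abstract.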

13.

A hardware oriented fuzzification algorithm and its VLSI implementation

Mohammad Haji Seyed Javadi Hamid Reza Mahdiani Esmaeil Zeinali Kh 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2013,17(4):683-690

An efficient fuzzification algorithm named Dynamic Precision Fuzzification (DPF), developed mainly for hardware implementation, is introduced in this paper. DPF, which can be used with any piecewise linear membership function, exploits an inherent capacity of the normal fuzzification algorithm to improve its efficiency when realized in a finite-precision implementation bed such as digital VLSI. The accuracy simulation results of the DPF and normal fuzzification method are presented and compared to show the superiority of the DPF. As the word-length is the most important parameter in a finite-precision implementation environment which determines the system cost-precision trade-off, the simulation results show that DPF provides suitable precision improvements with respect to traditional fuzzification without increasing the system word-length. The VLSI synthesis results of both methods are also presented to show that this considerable accuracy improvement is achieved by an acceptable increase in its VLSI implementation costs in terms of area, delay, and power consumption with respect to traditional methods.

14.

A parallel algorithm for constructing reduced visibility graph and its FPGA implementation

《Journal of Systems Architecture》2004,50(10):635-644

A central geometric structure in applications such as robotic path planning and hidden line elimination in computer graphics is the visibility graph. A new parallel algorithm to construct the reduced visibility graph in a convex polygonal environment is presented in this paper. The computational complexity is O(p²log(n/p)) where p is the number of objects and n is the total number of vertices. A key feature of the algorithm is that it supports easy mapping to hardware. The algorithm has been simulated (and verified) using C. Results of hardware implementation show that the design operates at high speed requiring only small space. In particular, the hardware implementation operates at approximately 53 MHz and accommodates the reduced visibility graph of an environment with 80 vertices in one XCV3200E device.

15.

Efficient algorithm for automatic road sign recognition and its hardware implementation

Chokri Souani Hassene Faiedh Kamel Besbes 《Journal of Real-Time Image Processing》2014,9(1):79-93

The automatic detection of road signs is an application that alerts the vehicle’s driver to the presence of signals and invites him to react in time in order to avoid potential traffic accidents. This application can thus improve the road safety of persons and vehicles traveling on the road. Several techniques and algorithms allowing automatic detection of road signs have been developed and implemented in software and do not allow embedded application. We propose in this work an efficient algorithm and its hardware implementation in an embedded system running in real time. In this paper we propose to implement the application of automatic recognition of road signs in real time by optimizing the techniques used in different phases of the recognition process. The system is implemented in a Virtex4 family FPGA which is connected to a camera mounted in the moving vehicle. The system can be integrated into the dashboard of the vehicle. The performance of the system shows a good compromise between speed and efficiency.

16.

Research on a two-dimensional ellipse hardware acceleration algorithm and its FPGA implementation

谢周标周毅龙斌《计算机工程与应用》2015,(3):45-49,60

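To address the missing or inadequate ellipse acceleration in existing embedded 2D graphics acceleration systems, a full-featured ellipse hardware acceleration unit that supports ellipse drawing and filling is proposed. Using a top-down design method, the overall structure and functional module partitioning of the unit are defined from the functional requirements; the internal functional units are pipelined, and shapes are decomposed into horizontal line segments for output. A hardware graphics algorithm suited to this design is proposed, and the logic of each module is written in Verilog HDL; after simulation, the design is synthesized and implemented on an FPGA. Simulation and debugging results show that the proposed graphics algorithm is practical and that the ellipse hardware acceleration unit correctly and quickly draws and fills ellipses for all combinations of ellipse parameter settings, meeting the needs of a 2D graphics acceleration system.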

17.

FPGA based hardware acceleration for elliptic curve public key cryptosystems

M. Ernst B. Henhapl S. Klupsch S. Huss 《Journal of Systems and Software》2004,70(3):299-313

This paper addresses public key cryptosystems based on elliptic curves, which are aimed at high-performance digital signature schemes. Elliptic curve algorithms are characterized by the fact that one can work with considerably shorter keys compared to the RSA approach at the same level of security. A general and highly efficient method for mapping the most time-critical operations to a configurable co-processor is proposed. By means of real-time measurements the resulting performance values are compared to previously published state-of-the-art hardware implementations.

A generator-based approach is advocated for that purpose which supports application-specific co-processor configurations in a flexible and straightforward way. Such a configurable CryptoProcessor has been integrated into a Java-based digital signature environment resulting in a considerable increase of its performance. The outlined approach combines in a unique way the advantages of mapping functionality to either hardware or software and it results in high-speed cryptosystems which are both portable and easy to update according to future security requirements.

18.

Timing estimation algorithm for dPMR receivers and its FPGA implementation

朱子文张涛关汉兴《电子技术应用》2019,45(5):27-30

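The accuracy of symbol timing synchronization strongly affects the demodulation performance of a digital communication system, and dPMR systems require receiver symbol synchronization with fast acquisition and good tracking. To meet this requirement, a timing estimation algorithm is proposed that combines the advantages of the preamble timing algorithm and the digital square-filter algorithm. It first captures the preamble of the burst and uses the preamble timing algorithm for fast, high-precision timing estimation; thereafter, at intervals of 384 symbols, it uses the digital square-filter algorithm to track and correct the timing estimate. A structurally simple FPGA implementation is also proposed; compared with the classic synchronization-waveform matched-filter timing algorithm, it both improves receiver demodulation performance and saves hardware resources.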

19.

Design and implementation of an FPGA-based correlation interferometer direction-finding algorithm

林泽龙曲英杰《计算机测量与控制》2024,32(5):318-324

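To cope with an increasingly complex electromagnetic countermeasure environment, and in contrast to conventional processors that run direction-finding algorithms on IQ data with a serial architecture, an optimized correlation interferometer direction-finding architecture is proposed and implemented on an FPGA so that phase differences can be computed faster and more stably. The FPGA direction-finding system architecture and computation flow are described in detail, and a stable, economical system architecture is adopted in light of practical conditions. To meet later application requirements, board-level tests were carried out on a carrier board that uses a Xilinx xc7vx690tffg1927-2 as the logic control unit; they verified the stability and efficiency of correlation interferometer direction finding on the FPGA. Compared with a conventional FPGA correlation interferometer direction-finding architecture, the proposed architecture reduces LUT utilization by 39.5% and increases direction-finding speed by 17%, handles a signal bandwidth of 80 MHz, and supports hop rates of up to 1200 hops/s.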

20.

A camera autofocus method and its hardware implementation (total citations: 1; self-citations: 1; citations by others: 0)

程永强黄英男谢克明《电子技术应用》2009,35(1)

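An autofocus method based on digital image processing is proposed, with an improved gray-level difference method serving as the evaluation function for whether the image is in focus. Image quality is compared, and based on the comparison a microcontroller drives the lens to the focus point, achieving autofocus.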
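A basic gray-level difference focus measure can be sketched as below. The abstract does not specify the improved evaluation function, so this shows only the plain form: the sum of absolute differences between neighbouring pixels. `capture` is a hypothetical callback standing in for moving the lens and grabbing a frame, and a full scan of positions replaces whatever search strategy the microcontroller actually uses.

```python
def gray_difference(image):
    """Sum of absolute gray-level differences between adjacent pixels.

    Sharper images have stronger edges and therefore a larger value.
    """
    h, w = len(image), len(image[0])
    total = 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                total += abs(image[r][c + 1] - image[r][c])   # horizontal neighbour
            if r + 1 < h:
                total += abs(image[r + 1][c] - image[r][c])   # vertical neighbour
    return total

def best_focus_position(capture, positions):
    """Return the lens position whose captured frame scores highest.

    capture(position) must return a gray image as a list of rows.
    """
    return max(positions, key=lambda p: gray_difference(capture(p)))
```

In use, the controller would step the lens through `positions`, score each frame, and return to the position with the highest score.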