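Research and Implementation of a KNN Classification Algorithm for Streaming Data Based on Storm
Cite this article:ZHOU Zhiyang, FENG Baiming, YANG Penglin, WEN Xianghui. Research and Implementation of a KNN Classification Algorithm for Streaming Data Based on Storm[J]. Computer Engineering and Applications, 2017, 53(19): 71-75.
Authors:ZHOU Zhiyang  FENG Baiming  YANG Penglin  WEN Xianghui
Affiliation:College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070
Abstract:KNN is a simple, effective, and easy-to-implement classification algorithm that can be applied to classification problems with large class domains. Recent research on KNN has focused on static big data sets, yet in a growing number of scenarios KNN must process streaming data online and in real time. Because streaming data is high-volume, continuous, fast, and hard to store and recover, and because the stream processing system Storm processes streaming data in real time and reliably, this paper proposes a Storm-based KNN classification algorithm for streaming data. The algorithm first partitions the whole sample set into multiple piece sets, then computes the K nearest neighbors of the to-be-classified vector on each piece set, and finally reduces the K nearest neighbors from all piece sets to the overall K nearest neighbors, from which the vector is classified. Experimental results show that the Storm-based KNN classification algorithm for streaming data meets the requirements of high throughput, scalability, real-time performance, and accuracy for streaming data classification in a big data setting.

Keywords:Storm  KNN algorithm  streaming data  big data  data partition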

Research and Implementation of KNN classification algorithm for streaming data based on Storm
ZHOU Zhiyang,FENG Baiming,YANG Penglin,WEN Xianghui.Research and Implementation of KNN classification algorithm for streaming data based on Storm[J].Computer Engineering and Applications,2017,53(19):71-75.
Authors:ZHOU Zhiyang  FENG Baiming  YANG Penglin  WEN Xianghui
Affiliation:College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
Abstract:The KNN (K-Nearest Neighbor) algorithm is a simple, effective, and easy-to-implement classification algorithm that can be applied to classification over large class domains. In recent years, research on KNN has concentrated on static big data sets; however, in more and more scenarios KNN has to process streaming data online. Streaming data is large, continuous, and fast, and is not easy to store or restore, while the stream processing system Storm offers real-time and reliable processing. Considering these characteristics, a modified KNN is proposed that is implemented on Storm to classify streaming data online. The algorithm first partitions the whole sample set into multiple piece sets, then computes the K nearest neighbors of each to-be-classified vector on each piece set, and finally reduces the per-piece K nearest neighbors to the overall K nearest neighbors, from which the vector is classified. Experimental results show that the proposed algorithm meets the requirements of high throughput, scalability, real-time performance, and accuracy for the classification of streaming data in a big data setting.
Keywords:Storm  K-Nearest Neighbor(KNN)  streaming data  big data  data partition  
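
The three steps named in the abstract (partition the sample set, find the K nearest neighbors on each piece set, reduce them to the overall K nearest neighbors) can be illustrated with a minimal single-process sketch. It is not the authors' Storm implementation: the function names `partition`, `local_knn`, `reduce_knn`, and `classify`, the round-robin partitioning, the Euclidean distance, and the majority vote are all assumptions made for illustration, and the per-piece loop only stands in for work that a Storm topology would presumably run in parallel.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def partition(samples, n_pieces):
    """Split the labelled sample set into n_pieces piece sets (round-robin; an assumption)."""
    return [samples[i::n_pieces] for i in range(n_pieces)]


def local_knn(piece, query, k):
    """Return the k nearest neighbours of query within one piece set as (distance, label) pairs."""
    return heapq.nsmallest(k, ((dist(vec, query), label) for vec, label in piece))


def reduce_knn(local_results, k):
    """Merge the per-piece candidates and keep the k globally nearest pairs."""
    return heapq.nsmallest(k, (pair for result in local_results for pair in result))


def classify(pieces, query, k):
    """Classify query by majority vote over the reduced k nearest neighbours."""
    neighbours = reduce_knn([local_knn(piece, query, k) for piece in pieces], k)
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes in the plane.
samples = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
pieces = partition(samples, 3)
label_near_origin = classify(pieces, (0.2, 0.1), 3)
label_far_corner = classify(pieces, (5.5, 5.5), 3)
```

The reduction is exact because every one of the overall K nearest neighbors is necessarily among the K nearest neighbors of the piece set that contains it, so merging the per-piece candidates and keeping the K smallest distances loses nothing.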