
Research on Web Log User Clustering Based on K-Means Algorithm
Cite this article:CHEN Zhou, LU Nan. Research on Web Log User Clustering Based on K-Means Algorithm[J]. Computer and Digital Engineering (计算机与数字工程), 2020, 48(3): 643-647.
Authors:CHEN Zhou (陈洲), LU Nan (陆南)
Affiliation:School of Electronic Information, Jiangsu University of Science and Technology, Zhenjiang 212003
Abstract:Web logs are the record files kept by servers and hold a website's most important information. As data volumes surge in the era of big data, there is an urgent need for a data mining algorithm that can cope with large data volumes and analyze log files more effectively. User clustering builds a user session sequence matrix from preprocessed log files and then performs cluster analysis on that matrix. To address the K-Means algorithm's problem in selecting initial center points, and the problem of isolated points that remain after the user session matrix is constructed, this paper proposes ICKM, an optimized algorithm that combines a density parameter with the KCR algorithm. ICKM takes the object with the largest density parameter as the first center point, deletes that object from the data set, and then uses the KCR algorithm to find the next center point. The algorithm runs on the MapReduce computing framework to speed up data processing in big-data environments. Experiments show that ICKM achieves high accuracy in finding initial center points and in clustering users, and that it runs comparatively fast on large data sets.

Keywords:user clustering; K-Means algorithm; KCR algorithm; MapReduce
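
Illustrative sketch (not from the paper): the abstract does not define the density parameter or the KCR step, so the Python below is only a rough picture of the seeding idea under stated assumptions. It assumes the density parameter is the number of neighbours within a hypothetical `radius`. In place of KCR it uses a simple stand-in score (density times distance to the nearest chosen center), which keeps isolated points from being picked as seeds. The `map_assign` and `reduce_mean` functions mimic, in a single process, how one K-Means iteration is commonly split into map and reduce phases. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def density(i, points, radius):
    """Assumed density parameter: count of other points within `radius`."""
    return sum(
        1
        for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= radius
    )


def choose_initial_centers(points, k, radius):
    """Pick k seeds: densest object first, then a stand-in rule (NOT KCR)."""
    remaining = list(points)  # copy so the caller's list is untouched
    # First center: the object with the largest density parameter.
    first = max(range(len(remaining)), key=lambda i: density(i, remaining, radius))
    centers = [remaining.pop(first)]  # delete it from the data set
    # Stand-in for KCR (assumption): favour dense points far from chosen centers.
    while len(centers) < k and remaining:
        def score(i):
            nearest = min(math.dist(remaining[i], c) for c in centers)
            return density(i, remaining, radius) * nearest
        best = max(range(len(remaining)), key=score)
        centers.append(remaining.pop(best))
    return centers


def map_assign(point, centers):
    """Map phase: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
    return idx, point


def reduce_mean(points):
    """Reduce phase: average all points that share a center index."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans_iteration(points, centers):
    """One K-Means update written as a map step followed by a reduce step."""
    groups = defaultdict(list)
    for p in points:
        idx, value = map_assign(p, centers)
        groups[idx].append(value)
    # Keep the old center if no point was assigned to it.
    return [
        reduce_mean(groups[i]) if groups[i] else centers[i]
        for i in range(len(centers))
    ]


if __name__ == "__main__":
    # Toy "session vectors": two tight groups plus one isolated point.
    sessions = [(0, 0), (0, 1), (1, 0), (1, 1),
                (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    seeds = choose_initial_centers(sessions, k=2, radius=1.5)
    print(seeds)
    print(kmeans_iteration(sessions, seeds))
```

In a real MapReduce job the map and reduce functions would run on separate workers over log partitions; here they are plain functions so the data flow is easy to follow.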
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).