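Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria
Cite this article: WAN Jing, WU Fan, HE Yunbin, LI Song. Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria[J]. Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2020, 14(1): 96-107.
Authors: WAN Jing  WU Fan  HE Yunbin  LI Song
Affiliation (all four authors): School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China
Funding: Science and Technology Research Project of Heilongjiang Provincial Education Department, Grant No. 12531z004 (黑龙江省教育厅科学技术研究项目); National Natural Science Foundation of China, Grant No. 61872105 (国家自然科学基金); Natural Science Foundation of Heilongjiang Province, Grant No. F201302 (黑龙江省自然科学基金)
Abstract: Clustering accuracy tends to drop when high-dimensional data are first reduced with principal component analysis (PCA) and then clustered, and PCA alone cannot address this. To solve the problem, a new concept of attribute space is proposed. By combining attribute space with information entropy, a dimensionality reduction criterion based on feature similarity is constructed, and a new dimensionality reduction algorithm, EN-PCA, is proposed. Because the features obtained after reduction are linear combinations of the original features, interpretability suffers and the input is not flexible enough; to address this, a sparse principal component algorithm based on ridge regression (ESPCA) is proposed. ESPCA takes the PCA reduction result as its input and does not need iteration to obtain a sparse result, which improves flexibility and solution speed. Finally, working on the reduced data and targeting problems such as the slow convergence of genetic-algorithm clustering, the initialization, selection, crossover, and mutation operations of the genetic algorithm are improved, and a new clustering algorithm, GKA++, is proposed. Experimental analysis shows that EN-PCA performs stably and that GKA++ performs well in clustering effectiveness and efficiency.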

Keywords: clustering  principal component analysis (PCA)  feature similarity  ridge regression  genetic algorithm
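
The abstract says the reduction criterion combines attribute space with information entropy but does not give the construction. As a reminder of the quantity involved only, the following minimal sketch computes the Shannon entropy of a single numeric feature after equal-width binning. The function name `shannon_entropy`, the binning scheme, and the default of 4 bins are assumptions made for illustration; this is not the paper's EN-PCA criterion.

```python
import math
from collections import Counter


def shannon_entropy(values, bins=4):
    """Shannon entropy (in bits) of a numeric feature after equal-width binning.

    Illustrative helper only: the binning scheme is an assumption,
    not something taken from the paper.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature carries no information
    width = (hi - lo) / bins
    # Map each value to a bin index; the maximum value falls into the last bin.
    labels = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A feature whose values spread evenly over the bins reaches the maximum entropy of log2(bins) bits, while a constant feature has entropy 0.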

Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria
WAN Jing, WU Fan, HE Yunbin, LI Song. Clustering Algorithm for High-Dimensional Data Under New Dimensionality Reduction Criteria[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(1): 96-107.
Authors:WAN Jing  WU Fan  HE Yunbin  LI Song
Affiliation:(School of Computer Science and Technology,Harbin University of Science and Technology,Harbin 150080,China)
Abstract: To address the drop in clustering accuracy that occurs when high-dimensional data are reduced with principal component analysis (PCA) before clustering, a new attribute space concept is proposed. By combining attribute space with information entropy, a dimensionality reduction criterion based on feature similarity is constructed, and a new dimensionality reduction algorithm (entropy-PCA, EN-PCA) is proposed. Because the features after reduction are linear combinations of the original features, which leads to poor interpretability and inflexible input, a sparse principal component algorithm based on ridge regression (ESPCA) is proposed. The input of the ESPCA algorithm is the PCA dimensionality reduction result. It does not require iteration to obtain sparse results, which increases flexibility and solution speed. Finally, on the basis of the reduced data, the initialization, selection, crossover, mutation, and other operations of the genetic algorithm are improved to address the slow convergence of genetic algorithm clustering, and a new clustering algorithm (genetic K-means algorithm ++, GKA++) is proposed. Experimental analysis shows that the EN-PCA algorithm is stable and that the GKA++ algorithm performs well in terms of clustering effectiveness and efficiency.
Keywords:clustering  principal component analysis(PCA)  feature similarity  ridge regression  genetic algorithm
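
The abstract names the genetic operators that GKA++ modifies (initialization, selection, crossover, mutation) without describing the modifications. To make the roles of those operators concrete, here is a minimal sketch of a plain genetic K-means loop. Everything in it is an assumption chosen for illustration: the function names, the tournament selection, the single-cut crossover, the random-point mutation, and the default parameters. It is a generic baseline, not the GKA++ algorithm from the paper.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeans_step(points, centers):
    """One Lloyd iteration: assign points, then move centers to cluster means."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        groups[nearest].append(p)
    # An empty cluster keeps its previous center.
    return [
        tuple(sum(col) / len(g) for col in zip(*g)) if g else c
        for g, c in zip(groups, centers)
    ]


def genetic_kmeans(points, k, pop_size=8, generations=20,
                   mutation_rate=0.1, seed=0):
    """Generic genetic K-means sketch (hypothetical baseline, not GKA++)."""
    rng = random.Random(seed)
    # Initialization: each chromosome is k centers sampled from the data.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    best = min(population, key=lambda ch: sse(points, ch))
    for _ in range(generations):
        # Elitism: the best chromosome survives, refined by one K-means step.
        children = [kmeans_step(points, best)]
        while len(children) < pop_size:
            # Selection: binary tournament on SSE (lower is fitter).
            a = min(rng.sample(population, 2), key=lambda ch: sse(points, ch))
            b = min(rng.sample(population, 2), key=lambda ch: sse(points, ch))
            # Crossover: single cut point over the list of centers.
            cut = rng.randrange(1, k) if k > 1 else 0
            child = list(a[:cut]) + list(b[cut:])
            # Mutation: occasionally replace a center with a random data point.
            child = [rng.choice(points) if rng.random() < mutation_rate else c
                     for c in child]
            # Local refinement: one K-means step on the child.
            children.append(kmeans_step(points, child))
        population = children
        best = min(population, key=lambda ch: sse(points, ch))
    return best, sse(points, best)
```

Calling `genetic_kmeans(points, k=2)` on a list of coordinate tuples returns the best set of centers found and its sum of squared errors. Because the elite chromosome is carried over and refined each generation, the best error never increases from one generation to the next.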
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).