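Imbalanced classification algorithm based on improved semi-supervised clustering
Cite this article:Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG. Imbalanced classification algorithm based on improved semi-supervised clustering[J]. Journal of Computer Applications, 2022, 42(12): 3750-3755.
Authors:Yu LU  Lingyun ZHAO  Binwen BAI  Zhen JIANG
Affiliation:School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
Funding:National Natural Science Foundation of China (61906077); Jiangsu University Undergraduate Practice and Innovation Training Program (202010299312X)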
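Abstract:Algorithms for imbalanced classification are one of the research hotspots in machine learning. Among them, oversampling increases the number of minority-class samples through repeated sampling or artificial synthesis in order to rebalance the dataset. However, most current oversampling methods operate on the original sample distribution and therefore struggle to reveal additional distribution characteristics of the dataset. To address this problem, first, an improved semi-supervised clustering algorithm is proposed to mine the distribution characteristics of the data. Second, based on the semi-supervised clustering results, high-confidence unlabeled data (pseudo-labeled samples) are selected from the clusters belonging to the minority class and added to the original training set; besides rebalancing the dataset, this allows the distribution characteristics obtained by semi-supervised clustering to assist imbalanced classification. Finally, the results of semi-supervised clustering and classification are fused to predict the final class labels, further improving imbalanced classification performance. With G-mean and area under the curve (AUC) as evaluation metrics, the proposed algorithm is compared with seven oversampling- or undersampling-based imbalanced classification algorithms, including TU and CDSMOTE, on 10 public datasets. The experimental results show that, compared with TU and CDSMOTE, the proposed algorithm improves AUC by 6.7% and 3.9% on average, respectively, and G-mean by 7.6% and 2.1% on average, respectively, and that it achieves the highest average result on both metrics among all compared algorithms. The proposed algorithm can thus effectively improve imbalanced classification performance.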

Keywords:imbalanced classification  semi-supervised clustering  pseudo-labeled sample  oversampling  fusion
Received:2021-10-28
Revised:2022-01-06
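
The abstract describes selecting high-confidence unlabeled samples from minority-class clusters and fusing clustering and classification outputs, but it does not give the improved clustering algorithm, the confidence measure, or the fusion rule. The following is therefore only a minimal sketch under stated assumptions, not the authors' method: it swaps in a simple nearest-centroid assignment seeded by the labeled data, uses a distance ratio as a stand-in confidence, and fuses two minority-class probabilities by a weighted average. All names (`select_pseudo_labeled`, `minority_confidence`, `fuse`, `threshold`, `budget`, `weight`) are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def minority_confidence(x, c_min, c_maj):
    """Assumed confidence: relative closeness to the minority centroid, in [0, 1]."""
    d_min = math.dist(x, c_min)
    d_maj = math.dist(x, c_maj)
    total = d_min + d_maj
    return 0.5 if total == 0 else d_maj / total


def select_pseudo_labeled(labeled_X, labeled_y, unlabeled_X,
                          minority=1, threshold=0.8, budget=None):
    """Pick unlabeled points that look confidently minority-class.

    Centroids are seeded from the labeled data (a stand-in for the paper's
    improved semi-supervised clustering). By default at most enough points
    are returned to close the gap between the two class sizes.
    """
    min_pts = [x for x, y in zip(labeled_X, labeled_y) if y == minority]
    maj_pts = [x for x, y in zip(labeled_X, labeled_y) if y != minority]
    c_min, c_maj = centroid(min_pts), centroid(maj_pts)
    scored = [(minority_confidence(x, c_min, c_maj), x) for x in unlabeled_X]
    confident = [s for s in scored if s[0] >= threshold]
    confident.sort(key=lambda s: s[0], reverse=True)  # most confident first
    if budget is None:
        budget = len(maj_pts) - len(min_pts)
    return [x for _, x in confident[:max(budget, 0)]]


def fuse(p_cluster, p_classifier, weight=0.5):
    """Assumed fusion rule: weighted average of two minority-class probabilities."""
    return weight * p_cluster + (1.0 - weight) * p_classifier
```

The selected points would be appended to the training set with the minority label before a classifier is trained; at prediction time, `fuse` combines the cluster-based and classifier-based scores. The threshold, the budget rule, and the equal fusion weight are illustrative choices only.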

Imbalanced classification algorithm based on improved semi-supervised clustering
Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG. Imbalanced classification algorithm based on improved semi-supervised clustering[J]. Journal of Computer Applications, 2022, 42(12): 3750-3755.
Authors:Yu LU  Lingyun ZHAO  Binwen BAI  Zhen JIANG
Affiliation:College of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
Abstract:Imbalanced classification is one of the research hotspots in machine learning. Within it, oversampling increases the number of minority-class samples through repeated sampling or artificial synthesis in order to rebalance the dataset. However, most existing oversampling methods are based on the original data distribution and struggle to reveal additional distribution characteristics of the dataset. To address this problem, firstly, an improved semi-supervised clustering algorithm was proposed to mine the distribution characteristics of the data. Secondly, based on the results of semi-supervised clustering, high-confidence unlabeled data (pseudo-labeled samples) were selected from minority-class clusters and added to the original training set. In this way, in addition to rebalancing the dataset, the distribution characteristics obtained by semi-supervised clustering could be used to assist imbalanced classification. Finally, the results of semi-supervised clustering and classification were fused to predict the final labels, which further improved imbalanced classification performance. With G-mean and Area Under Curve (AUC) selected as evaluation metrics, the proposed algorithm was compared on 10 public datasets with seven oversampling- or undersampling-based imbalanced classification algorithms, such as TU (Trainable Undersampling) and CDSMOTE (Class Decomposition Synthetic Minority Oversampling TEchnique). Experimental results show that, compared with TU and CDSMOTE, the proposed algorithm increases the average AUC by 6.7% and 3.9% respectively, and the average G-mean by 7.6% and 2.1% respectively. The proposed algorithm also achieves the highest average result on both evaluation metrics among all compared algorithms. These results indicate that the proposed algorithm can effectively improve imbalanced classification performance.
Keywords:imbalanced classification  semi-supervised clustering  pseudo-labeled sample  oversampling  fusion  
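
The two evaluation metrics named in the abstract can be illustrated for the binary case with a short dependency-free sketch. It uses the standard definitions: G-mean as the geometric mean of sensitivity and specificity, and AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties counted as one half). The function names and the `positive` parameter are assumptions for illustration; the paper's own evaluation code is not given in the abstract.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On a dataset with 2 minority and 8 majority samples, a classifier that always predicts the majority class reaches 80% accuracy but a G-mean of 0, because its sensitivity is 0. This is why G-mean and AUC, rather than accuracy, are the usual choices for imbalanced problems.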